Named entities are entities identified by name, such as people, places, and institutions. Bilingual named entity translation equivalence pairs, that is, pairs of named entities in two languages that are translations of each other, are an important resource in natural language processing. Comparable corpora are less precisely aligned than parallel corpora, but they are abundant and easy to obtain, and bilingual named entity pairs can be extracted from them. This paper studies the construction of a Chinese-English comparable corpus and the extraction of Chinese-English named entity pairs from it. Building on previous work, we propose a method for constructing a Chinese-English comparable corpus based on keyword similarity. The method normalizes the bilingual texts into one language using machine translation, extracts keywords from each text, and judges whether two texts are comparable from the similarity of their keywords. Experiments show that our keyword-based text similarity calculation offers advantages over the traditional dictionary-based method, achieving 73.67%, 90.26%, and 81.12% in precision, recall, and F-score, respectively. We also propose a multi-feature fusion method for extracting bilingual named entity pairs from comparable corpora. Taking into account the characteristics of Chinese and English, the method incorporates four features of named entities: transliteration information, translation information, word length information, and co-occurrence frequency information. We design a multi-feature model based on the maximum entropy model and optimize it to achieve 84.90%, 82.57%, and 83.72% in precision, recall, and F-score, respectively. The optimized model shows a 22.10% improvement in F-score over the default-weight model.
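The keyword-similarity step can be illustrated with a minimal sketch. The abstract does not state which similarity measure or decision threshold is used, so the Jaccard overlap, the threshold of 0.3, the function names, and the example keywords below are all assumptions chosen for illustration, not the method evaluated in the paper. The sketch also assumes the Chinese text has already been machine-translated into English and that keywords have already been extracted.

```python
def keyword_similarity(keywords_a, keywords_b):
    """Jaccard overlap of two keyword lists (an assumed measure)."""
    a = {k.lower() for k in keywords_a}
    b = {k.lower() for k in keywords_b}
    if not a or not b:
        return 0.0  # no keywords on one side: treat as not similar
    return len(a & b) / len(a | b)


def is_comparable(keywords_a, keywords_b, threshold=0.3):
    """Treat two texts as comparable when keyword overlap reaches the
    threshold. The value 0.3 is hypothetical, not taken from the paper."""
    return keyword_similarity(keywords_a, keywords_b) >= threshold


# Hypothetical keywords; the first list stands in for a Chinese article
# after machine translation into English.
zh_translated = ["earthquake", "Sichuan", "rescue", "casualties"]
en_article = ["earthquake", "Sichuan", "relief", "rescue", "aftershock"]

score = keyword_similarity(zh_translated, en_article)
print(f"similarity = {score:.2f}, comparable = {is_comparable(zh_translated, en_article)}")
```

In practice the threshold would be tuned on held-out document pairs, since it trades precision against recall in the resulting corpus.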