A Study on Methods for Obtaining Equivalent Chinese and Cambodian (Khmer) Named Entities

Posted on: 2017-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: Q Xia
Full Text: PDF
GTID: 2278330488950199
Subject: Computer technology
Abstract/Summary:
Named entity pairs are an important basic resource for natural language processing and are widely applied in fields such as cross-language information retrieval and machine translation. Compared with other major languages, methods for obtaining Chinese-Khmer named entity pairs are still in their infancy, owing to the small size of the available corpora and the lack of fundamental research. This study investigates how to obtain Chinese-Khmer named entity pairs. The main contents of the thesis are summarized as follows:

1. Extraction of Chinese-Khmer named entity pairs based on Wikipedia
This method uses Wikipedia as the source of named entity pairs and treats Wikipedia's multilingual descriptions of the same entry as the bridge between Chinese and Khmer. Extraction rules are defined according to the page structure of Wikipedia, and a set of high-quality Chinese-Khmer named entity pairs is extracted to build a Chinese-Khmer named entity corpus.

2. Chinese-Khmer transliteration model based on machine learning
This model obtains Chinese-Khmer named entity pairs by transliterating Khmer named entities into Chinese named entities. It recasts the transliteration problem as two sequence tagging problems, syllable segmentation tagging and syllable translation tagging, each of which is solved with machine learning models such as the maximum entropy model and conditional random fields. The machine-learning-based transliteration model produces better results than a statistical machine translation model.

3. Mining of Chinese-Khmer named entity pairs from comparable corpora with multi-feature similarities
Named entities are first recognized in Chinese-Khmer comparable corpora. Four features are then described according to the characteristics of the different types of named entities and their behavior in comparable corpora: transliteration, translation, the term vector of the named entity's context, and the length of the named entity. A similarity is computed for each feature of a candidate pair, with different weights assigned to the feature similarities for different types of named entity pairs. The final similarity of a candidate pair is the weighted sum of its feature similarities, and this score is used to mine named entity pairs from the comparable corpora. A number of Chinese-Khmer named entity pairs are obtained from the comparable corpora in this way.
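The weighted summation of feature similarities described in the third part can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the weight values, the entity type names, the helper functions (`cosine`, `length_similarity`, `combined_similarity`, `best_match`), and the choice of cosine similarity for context term vectors are all assumptions made for illustration, and each feature similarity is assumed to be already scaled to [0, 1].

```python
from math import sqrt

# Hypothetical per-type weights (illustrative values, not from the thesis).
# Each row sums to 1 so the combined score stays in [0, 1].
WEIGHTS = {
    "person": {"transliteration": 0.6, "translation": 0.1,
               "context": 0.2, "length": 0.1},
    "organization": {"transliteration": 0.1, "translation": 0.5,
                     "context": 0.3, "length": 0.1},
}


def cosine(u, v):
    """Cosine similarity of two sparse term vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def length_similarity(len_a, len_b):
    """Ratio of the shorter length to the longer one, in [0, 1]."""
    if len_a == 0 or len_b == 0:
        return 0.0
    return min(len_a, len_b) / max(len_a, len_b)


def combined_similarity(features, entity_type):
    """Weighted sum of the feature similarities for one candidate pair."""
    weights = WEIGHTS[entity_type]
    return sum(weights[name] * features[name] for name in weights)


def best_match(candidates, entity_type):
    """Return the candidate key with the highest combined similarity.

    candidates maps a candidate identifier to its feature-similarity dict.
    """
    return max(candidates,
               key=lambda key: combined_similarity(candidates[key], entity_type))
```

A caller would compute the four feature similarities for every candidate pair, for example `cosine` over context term vectors and `length_similarity` over entity lengths, and then keep the candidate returned by `best_match`, possibly only when its score exceeds a threshold chosen on held-out data.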
Keywords/Search Tags: named entity pairs, bilingual Chinese-Khmer, transliteration model, Wikipedia, comparable corpora