Research On Cambodian Named Entity Recognition Using Cross-Language Features

Posted on: 2019-06-06  Degree: Master  Type: Thesis
Country: China  Candidate: Y J Guo  Full Text: PDF
GTID: 2438330563457605  Subject: Instrumentation engineering
Abstract/Summary:
Bilingual word alignment is an important task in natural language processing. Its purpose is to find word-level correspondences between two texts that are translations of each other at the sentence level, and it underpins many other natural language processing tasks. Named entity recognition has long been an active and difficult problem in the field, and it is an important basis for statistical machine translation and cross-language information retrieval. Khmer natural language processing started late and is limited by the scarcity of corpus resources. This thesis uses mature English named entity recognition technology to support Khmer named entity recognition. Building on an analysis of existing work, the thesis makes the following contributions:

1. Word alignment method based on a non-parametric Bayesian model
The method replaces the categorical distributions of IBM Model 4 with Pitman-Yor (PY) processes to construct a non-parametric Bayesian model that incorporates language features, and proposes a bilingual word alignment method on that basis. The IBM word alignment models are the main models used in most statistical machine translation systems. Their drawbacks are that they do not account for differences between the two languages and that they tend to overfit during training, which makes them poorly suited to Khmer, where corpora are scarce. To avoid these problems, the thesis uses the non-parametric Bayesian model and adds the Khmer linguistic feature of postposed attributives to align English and Khmer words. This method outperforms the IBM model on word alignment.

2. Khmer named entity recognition method integrating cross-language features
This method addresses the lack of effective entity features for Khmer named entities, with the aim of raising the recognition accuracy for Khmer named entities. Because named entity recognition for English is relatively mature, the thesis uses an English-Khmer parallel corpus as a bridge to transfer that knowledge into Khmer. First, entities on the English side are tagged with existing English named entity recognition technology. Next, the English entity tags are mapped to the aligned Khmer words according to the word alignment. A label propagation algorithm over the Khmer words then yields a distribution of entity tags for every Khmer word. Finally, a threshold converts these distributions into Boolean values, which are used as features in a conditional random field model. Person names, place names, and organization names are identified.

3. Prototype system for Khmer named entity recognition integrating cross-language features
Based on the research results, a prototype system for Cambodian named entity recognition with cross-language features was designed and developed. The thesis introduces the tools and system framework required to build the system and describes in detail how the system is used. The system recognizes person names, place names, and organization names in Khmer documents.
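The cross-language feature pipeline in contribution 2 can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the `PER`/`LOC`/`ORG` tag set, the `proj_` feature prefix, the 0.5 threshold, and the placeholder Khmer tokens are all assumptions made for illustration. The sketch also replaces the label propagation step with a plain relative-frequency estimate, which is a much simpler technique.

```python
from collections import defaultdict

# Assumed tag set: person, location, organization.
ENTITY_TAGS = ("PER", "LOC", "ORG")


def project_tags(english_tags, alignment, khmer_tokens):
    """Copy each English entity tag onto the Khmer token aligned to it.

    alignment is a list of (english_index, khmer_index) pairs.
    Khmer tokens with no aligned entity tag stay None.
    """
    projected = [None] * len(khmer_tokens)
    for en_idx, km_idx in alignment:
        tag = english_tags[en_idx]
        if tag != "O":
            projected[km_idx] = tag
    return projected


def tag_distribution(projected_corpus):
    """Estimate P(tag | Khmer word) by relative frequency.

    This stands in for the label propagation step described above.
    projected_corpus is a list of (khmer_tokens, projected_tags) pairs.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for tokens, tags in projected_corpus:
        for tok, tag in zip(tokens, tags):
            totals[tok] += 1
            if tag is not None:
                counts[tok][tag] += 1
    return {
        tok: {t: counts[tok][t] / totals[tok] for t in ENTITY_TAGS}
        for tok in totals
    }


def boolean_features(distribution, threshold=0.5):
    """Turn each tag probability into a Boolean feature via a threshold."""
    return {
        tok: {f"proj_{t}": p >= threshold for t, p in dist.items()}
        for tok, dist in distribution.items()
    }


# Placeholder tokens; real input would be segmented Khmer text.
khmer = ["km_i", "km_visit", "km_phnom_penh"]
english_tags = ["O", "O", "LOC"]
alignment = [(0, 0), (1, 1), (2, 2)]

projected = project_tags(english_tags, alignment, khmer)
features = boolean_features(tag_distribution([(khmer, projected)]))
print(features["km_phnom_penh"])
```

The resulting per-word Boolean dictionaries could be merged into the per-token feature dictionaries that a conditional random field toolkit consumes, alongside whatever monolingual features are available.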
Keywords/Search Tags: Word Alignment, Pitman-Yor processes, Named Entity Recognition, Cross-language Features, Conditional Random Fields