Research On Chinese-Lao Bilingual Named Entity Recognition And Alignment Method

Posted on: 2019-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: R Han
GTID: 2438330563957653
Subject: Computer technology
Abstract/Summary:
Lao text contains a large number of proper nouns, such as person names, place names, and organization names. These named entities carry a large amount of information, and grasping the main content of an article through its named entities is a basis for understanding Lao text correctly. Compared with languages such as English and Japanese, Lao has fewer speakers, and Internet technology in Laos developed relatively late, so corpus resources are extremely scarce. This scarcity also makes cross-lingual information processing between Lao and Chinese considerably harder. Named entity research is well developed for widely used languages such as English, Chinese, and Thai, but there are few studies on low-resource languages such as Lao. An in-depth study of Lao named entities is therefore important both for the analysis of Lao itself and for Lao-Chinese translation. In view of this situation, this thesis covers the following research.

Firstly, a Lao named entity recognition method based on conditional random fields (CRF) is studied. Word vectors and word vector clusters are added as features to the conditional random field to identify Lao named entities. The word vectors are also improved, and a weighted word vector is proposed. Experiments verify that incorporating word vectors as features into conditional random fields improves named entity recognition performance.

Secondly, a bilingual named entity alignment method based on multi-feature fusion and a support vector machine (SVM) model is studied. Lao and Chinese named entities are first identified in a bilingual corpus and then matched using multiple features: transliteration features, translation features, co-occurrence frequency features, and mutual information features. The feature weights are adjusted to obtain the best results. Two methods are used to filter the candidate named entity equivalence pairs. The first is a threshold method: the score obtained by combining the features of a Chinese and a Lao named entity is compared against a set threshold, and obviously wrong named entity pairs are filtered out, which improves the overall performance of the system. The second uses a support vector machine as the alignment model for Chinese-Lao bilingual named entities, performing binary classification of the candidate named entity pairs with the same four features. This method considers the distribution of all features together when deciding whether a candidate is a correct named entity equivalence pair; it achieves high accuracy and improves the performance of the system.

Finally, based on the above research, a Chinese-Lao bilingual named entity dictionary is generated, and a bilingual named entity translation system is designed and implemented.
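The threshold-based filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the weight values, the threshold, the function names (`fused_score`, `filter_pairs`), and the assumption that each feature is already normalized to the range 0 to 1 are all hypothetical, chosen only to show how four feature scores might be fused into one score and compared against a cutoff.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (Chinese entity, Lao entity)

# Hypothetical weights for the four features named in the abstract.
# The thesis tunes these weights; the values here are placeholders.
WEIGHTS: Dict[str, float] = {
    "transliteration": 0.4,
    "translation": 0.3,
    "cooccurrence": 0.2,
    "mutual_information": 0.1,
}


def fused_score(features: Dict[str, float],
                weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of feature scores; a missing feature counts as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def filter_pairs(candidates: List[Tuple[Pair, Dict[str, float]]],
                 threshold: float = 0.5,
                 weights: Dict[str, float] = WEIGHTS
                 ) -> List[Tuple[Pair, float]]:
    """Keep candidate pairs whose fused score reaches the threshold,
    sorted from highest to lowest score."""
    kept = []
    for pair, features in candidates:
        score = fused_score(features, weights)
        if score >= threshold:
            kept.append((pair, score))
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Placeholder entity names and made-up feature values, for illustration only.
candidates = [
    (("zh_entity_1", "lo_entity_1"),
     {"transliteration": 0.9, "translation": 0.8,
      "cooccurrence": 0.7, "mutual_information": 0.6}),
    (("zh_entity_1", "lo_entity_2"),
     {"transliteration": 0.1, "translation": 0.2,
      "cooccurrence": 0.1, "mutual_information": 0.0}),
]
aligned = filter_pairs(candidates, threshold=0.5)
```

In the SVM variant described in the abstract, the same four feature values would instead be passed as a feature vector to a binary classifier, so that the decision boundary is learned from labeled pairs rather than set by hand.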
Keywords/Search Tags:named entity recognition, named entity alignment, word embedding, CRF, SVM