Parallel Segment Extraction Of Chinese-Khmer Comparable Corpus Based On Dirichlet Process

Posted on: 2019-10-21 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Nuo | Full Text: PDF
GTID: 2438330563957686 | Subject: Software engineering
Abstract/Summary:
Acquiring parallel resources has long been an active and difficult problem in Natural Language Processing. Parallel resources are an important foundation for Machine Translation and cross-language information retrieval, and extracting bilingual parallel corpora is a top priority for minority languages whose information processing is still underdeveloped. In recent years, as China has opened up further to Southeast Asia, political, cultural and economic exchanges between China and Cambodia have become more frequent, and the language barrier has become an obstacle to bilateral cooperation. Khmer-Chinese language information processing is therefore becoming increasingly important. At present, research on Khmer, a language with 14 million users, is at an initial stage, and research on bilingual information processing involving Khmer has hardly begun.

Several factors make Khmer-Chinese parallel corpora hard to obtain. Web content comes in complex and varied formats and there are few Khmer-Chinese bilingual websites, so it is difficult to collect a parallel corpus of adequate size and quality from the web. Building a large-scale bilingual parallel corpus by hand requires experts who understand both Khmer and Chinese, and is time-consuming and costly. The parallel corpora that already exist for Khmer cover a single domain, include few text types, and are poorly up to date, which in turn hampers downstream Natural Language Processing. If parallel text could instead be extracted from large-scale, wide-coverage comparable data and then used for translation knowledge extraction, the limits on parallel corpus size and timeliness would be eased, which would greatly benefit Natural Language Processing. Building on an analysis of existing research, this thesis studies how to extract parallel resources from comparable corpora. Its main contents are as follows:

(1) Construction of a phrase-based bilingual LDA topic model. A bilingual topic model is constructed to obtain the topic distribution of the bilingual corpus. The bilingual phrase LDA topic model inherits the key property of the traditional LDA topic model: a high-dimensional document is reduced to three levels (document, topic and word), which greatly simplifies topic prediction. It also modifies the bag-of-words assumption of traditional LDA by introducing phrases (N-grams), so that context is taken into account during topic prediction, and it extends traditional LDA to the cross-lingual setting. Compared with the traditional method, the bilingual phrase LDA model predicts topics more effectively.

(2) Parallel fragment extraction from comparable corpora based on the Dirichlet process. Using the Dirichlet process, bilingual documents in a comparable corpus are processed to extract bilingual parallel pairs that can be used directly. To overcome the scarcity of bilingual parallel corpora, this thesis presents a Dirichlet-process-based method for extracting parallel fragments from comparable corpora. The method does not rely on dictionaries or on an existing bilingual parallel corpus; instead, a nonparametric Bayesian model samples bilingual parallel fragments directly from the bilingual comparable corpus, with good results. Comparative experiments show that the Dirichlet-process-based extraction method yields high-quality parallel pairs and does not require large pre-existing bilingual resources, so it is well suited to languages with scarce bilingual resources.
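
The abstract does not give the details of the Dirichlet process model used in (2). As a rough illustration only, the sketch below samples cluster assignments from a Chinese restaurant process, a standard sequential construction of the Dirichlet process. It is not the thesis's extraction method: the function name `crp_assignments`, the concentration parameter `alpha`, and the reading of "items" as candidate fragments are all assumptions made for this example. It shows the property a nonparametric prior offers, namely that the number of clusters is not fixed in advance but grows with the data.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster assignments for n_items from a Chinese restaurant process.

    Hypothetical helper for illustration; not taken from the thesis.
    alpha is the concentration parameter: larger values favour new clusters.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    counts = []        # counts[k] = number of items currently in cluster k
    assignments = []   # assignments[i] = cluster index chosen for item i
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    assignments, counts = crp_assignments(20, alpha=1.0, seed=1)
    print("assignments:", assignments)
    print("cluster sizes:", counts)
```

Running the sampler with different values of `alpha` gives a feel for how the prior trades off reusing existing clusters against opening new ones; a full model would combine this prior with a likelihood over bilingual fragments, which the abstract does not specify.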
Keywords/Search Tags: Chinese and Khmer, Phrases, Bilingual LDA, Dirichlet Process, Comparable Corpus, Parallel Fragments