
Comparable Corpus Acquisition Of Cambodian-Chinese Parallel Sentence Pairs Based On Bidirectional Recurrent Neural Network

Posted on: 2020-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Li
Full Text: PDF
GTID: 2518305969475174
Subject: Electronics and Communications Engineering
Abstract/Summary:
Bilingual parallel sentences are a particularly important language resource for cross-language natural language processing research. Chinese-Khmer parallel sentences in particular matter for advancing Khmer language information processing. Acquiring bilingual parallel sentence pairs normally requires a large number of parallel texts, but for Khmer-Chinese few parallel texts are available from which sentence pairs can be extracted. Existing methods for acquiring bilingual parallel sentences also have complicated structures and depend on parallel texts that are in short supply. We therefore propose a method based on a bidirectional recurrent neural network that obtains Khmer-Chinese parallel sentence pairs from a comparable corpus. The method addresses these problems in existing approaches and supports cross-language natural language processing for Khmer and Chinese. Building on a discussion and analysis of existing work, this thesis studies how to obtain Khmer-Chinese parallel sentence pairs from a comparable corpus and completes the following research work:

1. Construction of a Khmer-Chinese bilingual word vector model based on multiple canonical correlation analysis (CCA)
As the underlying input of a neural network, bilingual word vectors can effectively improve performance on natural language processing tasks. Existing methods need a large amount of bilingual parallel text to obtain bilingual word vectors, and such text is scarce for Khmer-Chinese. Because English is a common language, English-Chinese and English-Khmer parallel texts are easier to obtain. We therefore extend the CCA-based cross-language word vector model and propose a method for constructing Khmer-Chinese bilingual word vectors with a multiple-CCA algorithm that uses English as the intermediate language. First, English and Chinese word vectors are projected into an English-Chinese vector space, and English and Khmer word vectors into an English-Khmer vector space; CCA then yields English-Chinese and English-Khmer bilingual word vectors. Next, with English as the intermediate language, these two sets of bilingual word vectors are projected into a common third vector space, and CCA is applied again to obtain the projection matrices for the new space. Finally, the Khmer-English-Chinese multilingual word vectors are computed; these contain the Khmer-Chinese bilingual word vectors. Compared with traditional methods, this method constructs Khmer-Chinese bilingual word vectors directly, avoids the scarcity of initial Khmer-Chinese parallel sentence pairs that other models face, and yields higher-quality Khmer-Chinese bilingual word vectors, which serve as the input-layer representation of the neural network.

2. Obtaining parallel sentence pairs from a comparable corpus based on a bidirectional recurrent neural network
Using a recurrent neural network, bilingual parallel sentence pairs can be extracted from a bilingual comparable corpus whose texts do not correspond sentence by sentence. The method exploits the properties of the Gated Recurrent Unit (GRU): a Bi-GRU network transforms the sequence of input word vectors into a sentence vector. The similarity between a source-language sentence vector and a target-language sentence vector is then computed with a sentence similarity measure based on the Manhattan distance, and a threshold is set. A source-target sentence pair whose similarity satisfies the threshold condition is taken as a bilingual parallel sentence pair. Compared with existing methods that obtain parallel sentence pairs from parallel web pages, through machine translation, or with other neural network models, this method improves efficiency and the accuracy of the extracted sentence pairs, and it does not require a bilingual parallel corpus. This addresses the problem of acquiring a Khmer-Chinese bilingual parallel corpus.

3. Construction of a prototype system for acquiring parallel sentence pairs from a comparable corpus based on deep learning
Based on these research results, a prototype system for acquiring parallel sentence pairs from a comparable corpus based on deep learning was designed and developed. The tools and system framework needed to build the system are introduced, and the design process of the system is described. The system obtains parallel sentence pairs from the Khmer-Chinese comparable corpus.
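
The similarity-and-threshold step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not state the exact similarity formula, so the exp(-L1 distance) form, the function names, the toy vectors, and the threshold value of 0.5 are all assumptions made for illustration. In the thesis the sentence vectors come from the Bi-GRU encoder; here plain Python lists stand in for them.

```python
import math


def manhattan_similarity(u, v):
    """Map the Manhattan (L1) distance between two vectors into (0, 1].

    Assumption: similarity = exp(-L1 distance). The source only says the
    similarity is "based on the Manhattan distance".
    """
    if len(u) != len(v):
        raise ValueError("sentence vectors must have the same dimension")
    distance = sum(abs(a - b) for a, b in zip(u, v))
    return math.exp(-distance)


def extract_parallel_pairs(src_vectors, tgt_vectors, threshold=0.5):
    """Return (src_index, tgt_index, score) for every pair whose
    similarity is at or above the threshold (hypothetical default)."""
    pairs = []
    for i, src in enumerate(src_vectors):
        for j, tgt in enumerate(tgt_vectors):
            score = manhattan_similarity(src, tgt)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Toy stand-ins for Bi-GRU sentence vectors (illustrative values only).
khmer_vectors = [[0.1, 0.2], [0.9, 0.8]]
chinese_vectors = [[0.9, 0.7], [0.1, 0.3]]

pairs = extract_parallel_pairs(khmer_vectors, chinese_vectors, threshold=0.5)
for src_idx, tgt_idx, score in pairs:
    print(f"source {src_idx} <-> target {tgt_idx}: similarity {score:.3f}")
```

Comparing every source sentence with every target sentence is quadratic in corpus size; it keeps the sketch short but a real comparable corpus would likely need candidate filtering first.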
Keywords/Search Tags:Bilingual Word Vector, Canonical Correlation Analysis, Bidirectional Gated Recurrent Unit, Sentence Similarity, Bilingual Parallel Sentence Pair