
Research On The Calculation Method Of Similarity Between Chinese And Lao Bilingual Sentences

Posted on: 2020-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: L He
Full Text: PDF
GTID: 2438330599455739
Subject: Computer system architecture
Abstract/Summary:
A large-scale Chinese-Lao parallel sentence corpus is an indispensable resource for Chinese-Lao machine translation, and calculating the similarity of Chinese-Lao bilingual sentences is the most basic and important step in constructing such a corpus. Because Lao language resources are scarce and the accuracy of Lao word segmentation is low, there is still no good method for relating Chinese and Lao sentences and calculating their similarity. To solve this problem, the following research work has been carried out.

First, Lao word segmentation. Inspired by four-tag word-position labeling for Chinese word segmentation, and based on the linguistic characteristic that Lao words are composed of syllables, this paper uses a manually segmented corpus to label syllables with four word-position tags (BMES) and pre-trains a Bidirectional Long Short-Term Memory (BLSTM) neural network model. Lao sentences are first divided into syllables, which are trained into vectors. These vectors are then used as the input of the BLSTM model to predict the label of each syllable, and the final label sequence is determined by a sequence inference algorithm. The experimental results show that this method segments Lao words better than previous word segmentation methods.

Second, the calculation of similarity between Chinese and Lao bilingual sentences. Inspired by cross-lingual distributed representation learning, this paper uses the Deep Canonical Correlation Analysis (DeepCCA) model to relate bilingual sentences and calculate their similarity. The two sentences are first vectorized, and the pre-trained DeepCCA model then maps the two sentence vectors into a new space. Finally, the cosine distance between the mapped sentence vectors is used to calculate the similarity between the Chinese and Lao sentences in the new space. Experiments show that this method can effectively calculate the similarity between Chinese and Lao sentences.

Third, further improving the accuracy of the similarity calculation. Building on the deep canonical correlation analysis method, this paper extracts similarity features between Chinese and Lao sentences (sentence length, number matching, and linear and non-linear sentence vectors) and calculates the similarity between Chinese and Lao bilingual sentences by multi-feature fusion. From the aligned corpus, four features of Chinese-Lao bilingual text are extracted: number matching, sentence length, DeepCCA similarity, and linear CCA similarity. The features are then weighted to obtain the best results. The experimental results show that this method calculates the similarity between Chinese and Lao sentences better.
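
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract does not give the feature formulas or the weights, so everything here is an assumption chosen for illustration: the function names, the length ratio, the Jaccard overlap on ASCII digit strings, and the weighted average. The BLSTM and DeepCCA models are not reproduced; the sketch assumes syllable tags and mapped sentence vectors are already available.

```python
import math
import re


def bmes_to_words(syllables, tags):
    """Join syllables into words from B/M/E/S (begin, middle, end, single) tags."""
    words, current = [], []
    for syllable, tag in zip(syllables, tags):
        if tag in ("B", "S") and current:
            # A new word starts while one is still open: flush the open word.
            words.append("".join(current))
            current = []
        if tag == "S":
            words.append(syllable)
        elif tag in ("B", "M"):
            current.append(syllable)
        elif tag == "E":
            current.append(syllable)
            words.append("".join(current))
            current = []
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    if current:
        words.append("".join(current))
    return words


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def length_feature(len_a, len_b):
    """Hypothetical length feature: ratio of the shorter to the longer length."""
    longer = max(len_a, len_b)
    return min(len_a, len_b) / longer if longer else 0.0


def number_feature(text_a, text_b):
    """Hypothetical number-matching feature: Jaccard overlap of ASCII digit runs.

    Assumption: numbers written in Lao digits or Chinese numerals are ignored.
    """
    nums_a = set(re.findall(r"[0-9]+", text_a))
    nums_b = set(re.findall(r"[0-9]+", text_b))
    if not nums_a and not nums_b:
        return 1.0  # assumption: no numbers on either side counts as a match
    return len(nums_a & nums_b) / len(nums_a | nums_b)


def fuse(features, weights):
    """Weighted average of named feature scores (weights are placeholders)."""
    total = sum(weights.values())
    return sum(weights[name] * features[name] for name in weights) / total
```

In use, the four scores (DeepCCA cosine similarity, linear CCA cosine similarity, length, and number matching) would be collected in a dictionary and passed to `fuse` together with weights tuned on held-out aligned data. The abstract says only that the features are weighted to obtain the best results, so the tuning procedure is left open here.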
Keywords/Search Tags:Chinese and Lao, Sentence Similarity Calculation, Lao Word Segmentation, Deep Canonical Correlation Analysis, Multi-Feature