Research On The Extraction Method Of Lao-Chinese Bilingual Parallel Sentence Pairs Based On Graph Matching

Posted on: 2020-06-28
Degree: Master
Type: Thesis
Country: China
Candidate: S Z Li
Full Text: PDF
GTID: 2438330596497560
Subject: Computer technology
Abstract/Summary:
With the expansion and deepening of Chinese-Lao relations, strengthening academic research on the relationship between the Chinese and Lao languages is of great practical significance. In many natural language processing tasks, parallel corpora provide essential experimental data for building statistical machine translation models, and mining parallel sentence pairs is a key part of constructing a bilingual corpus. Generally speaking, two mutually translated texts correspond almost one to one at the document and paragraph level, so alignment research focuses mainly on sentence alignment and word alignment. This thesis uses the best matching of bipartite graphs to obtain strictly 1:1 Chinese-Lao parallel sentence pairs with high similarity, which are then used in the construction of a Chinese-Lao bilingual parallel corpus. The main research results are as follows:

(1) Chinese-Lao Bilingual Sentence Similarity Calculation Based on Matching of Mutual-Translation Feature Word Pairs

Since almost all parallel corpora and bilingual texts on the Internet are aligned only at the paragraph or document level, the corpus must be converted into the required sentence-aligned format. For the alignment blocks (paragraph-aligned or document-aligned) in the Chinese-Lao bilingual corpus, this thesis proposes a method that calculates sentence similarity from the matching of feature word pairs and a Chinese-Lao bilingual similarity dictionary. The method identifies Chinese-Lao parallel sentence pairs with high similarity for inclusion in the parallel corpus. Because each word pair has a prior probability of alignment, the matching values of the feature word pairs can be used to calculate and evaluate the similarity of candidate sentence pairs. Alignment then proceeds according to this similarity: Lao and Chinese sentences whose similarity is high, or which meet certain conditions, are aligned, which simplifies the sentence alignment workflow. The experimental results show that this method improves the accuracy of bilingual sentence similarity calculation in parallel corpora to some extent.

(2) Computation of Sentence Matching Values for Chinese-Lao Bilingual Sentences with Multiple Features

Names of people and places are important features for sentence alignment between Chinese and Lao texts, but they cannot be translated directly with bilingual dictionaries, which leads to confusion, arbitrariness and inconsistency in translation. Chinese-Lao personal names are extracted through a large number of questionnaires and manual identification. The thesis summarizes the characteristics of place names and the transliteration rules for Chinese-Lao personal names and place names, shows that these rules apply to most Chinese-Lao personal and place names, and reports better translation quality. A Chinese-Lao dictionary of place names is also constructed. Since the relative positions of numeral sequences in sentences of the two languages are roughly identical and are easy to distinguish during sentence matching, the thesis extracts Chinese-Lao numeral features and calculates their matching values. Following Gale, length-normalized variables of the bilingual texts are used to predict the Chinese-Lao length feature matching more accurately. After the multi-feature matching values of the Chinese-Lao sentences are calculated, a weight is assigned to each feature, and the final similarity of each Chinese-Lao sentence pair is obtained by multi-feature fusion. The experimental results show that the accuracy of sentence similarity calculation improves after multi-feature fusion.

(3) Graph Matching Based Parallel Sentence Pair Extraction from Chinese-Lao Bilingual Sentences

The parallel sentence pairs selected in this thesis are strictly 1:1 sentence beads. If only simple traversal is used, two or more Lao sentences may correspond to the same Chinese sentence, and there is no guarantee that the sum of weights is maximal or that the matching is the best one. After multi-feature fusion, the similarity of each Chinese-Lao sentence pair is calculated and used as the weight of the connecting edge, with the Lao sentences and the Chinese sentences as the vertices of the graph. Using bipartite graphs, the sentence alignment problem is transformed into the problem of finding the best matching of a bipartite graph, which yields strict 1:1 sentence beads. Experimental results show that, compared with the SVM-based multi-feature fusion method and the methods of other scholars, this method improves the accuracy of Chinese-Lao parallel sentence pair extraction to some extent.

In summary, the matching values of feature word pairs are computed first, and the matching values of Chinese-Lao sentence pairs are then computed by fusing multiple features. These matching values are used as edge weights in a bipartite graph, and strictly 1:1 parallel sentence pairs with high similarity are obtained with the best matching algorithm for bipartite graphs. The experiments show that the accuracy of parallel sentence pair extraction is improved to some extent.
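
The contrast between simple traversal and best matching can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not name the matching algorithm, so the sketch assumes the Hungarian (Kuhn-Munkres) algorithm, one standard way to find a maximum-weight bipartite matching. The function names, the weighted-sum fusion rule and the similarity values are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def fuse_features(scores: List[float], weights: List[float]) -> float:
    """Weighted sum of per-feature matching values (assumed fusion rule)."""
    return sum(s * w for s, w in zip(scores, weights))


def greedy_traversal(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Each Lao sentence i independently picks its most similar Chinese sentence."""
    return [(i, max(range(len(row)), key=row.__getitem__))
            for i, row in enumerate(sim)]


def best_matching(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Maximum-weight 1:1 matching via the Hungarian algorithm, O(n^2 * m).

    sim[i][j] is the fused similarity of Lao sentence i and Chinese sentence j.
    Requires len(sim) <= len(sim[0]); every row is matched to a distinct column.
    """
    n, m = len(sim), len(sim[0])
    if n > m:
        raise ValueError("need at least as many columns as rows")
    inf = float("inf")
    # Minimise cost = -similarity. Arrays are 1-indexed; index 0 is a sentinel.
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (0 = free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = -sim[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to the sentinel.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j])


if __name__ == "__main__":
    # Hypothetical fused similarities: rows are Lao, columns are Chinese.
    sim = [
        [0.90, 0.80, 0.10],
        [0.85, 0.20, 0.10],
        [0.10, 0.30, 0.70],
    ]
    print(greedy_traversal(sim))  # Lao 0 and Lao 1 both pick Chinese 0
    print(best_matching(sim))     # every sentence is used exactly once
```

On this toy matrix, traversal returns [(0, 0), (1, 0), (2, 2)], so Lao sentences 0 and 1 both claim Chinese sentence 0. The best matching returns [(0, 1), (1, 0), (2, 2)], a strict 1:1 assignment with the largest total similarity (2.35) among all six possible assignments.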
Keywords/Search Tags: Sentence similarity, Mutual translation of feature word pairs, Multi-feature fusion, Graph matching, Sentence alignment