A Research On Sentence Alignment Of Han-Lao Bilingual Sentences Fused With Multi-features

Posted on: 2022-09-20
Degree: Master
Type: Thesis
Country: China
Candidate: Q H Tan
GTID: 2518306524952329
Subject: Computer technology
Abstract/Summary:
Laos is an important neighbor of China with close economic ties to it, so research on natural language processing for Lao is of strategic significance. Cross-language information processing tasks such as machine translation and information retrieval require Chinese-Lao bilingual parallel corpora. Bilingual sentence alignment, a key technology for constructing such corpora, aims to extract parallel sentence pairs with the same meaning from bilingual texts, and is therefore an important research topic. This thesis proposes a multi-feature Chinese-Lao bilingual sentence alignment method to address the problems in this area. The main work is as follows:

(1) A similarity calculation method for Chinese-Lao bilingual sentences fused with text features. Chinese-Lao parallel corpora are scarce, and Lao and Chinese differ markedly in semantic expression and sentence structure, which makes it difficult to measure the similarity of Chinese-Lao sentence pairs. This thesis proposes a similarity calculation method that incorporates text features and builds a sentence similarity model. First, text features such as part of speech and number co-occurrence in Chinese and Lao are fused with pretrained GloVe word vectors to enrich the sentence features and improve the accuracy of the model. Second, long-distance context features and deep semantic information are captured by a multi-layer Siamese network composed of bidirectional long short-term memory (BiLSTM) networks with self-attention, which ensures that semantic information is used effectively. Finally, transfer learning is used to initialize the model's parameters, and different fine-tuning strategies are applied to improve the model's generalization ability. Experiments show that the proposed method reaches a precision of 82.5%, a recall of 85.78%, and an F1 score of 84%.

(2) A Chinese-Lao bilingual sentence alignment method combining character features and correlational neural networks. To address the difficulty of obtaining high-quality Chinese-Lao parallel corpora, this chapter proposes a Chinese-Lao bilingual sentence alignment method that combines character feature vectors with a correlational neural network. Building on method (1) and on further analysis of transfer learning methods, a GCNN-CorrNet model fused with character features is constructed. The model makes full use of character-level word formation information and Chinese and Lao text features to enrich the semantic information in the word vectors, and maps the Chinese and Lao sentence vector representations into a shared semantic space to jointly learn a general semantic representation. The similarity distance between the two languages is computed from this general representation in the shared space, which determines whether a Chinese-Lao sentence pair is parallel and yields more accurate bilingual sentence matching. The experimental results show that the F1 score of the GCNN-CorrNet model reaches 84.30% after fusing character features and text features, which demonstrates the effectiveness of the method.

(3) A multi-feature Chinese-Lao bilingual sentence alignment method. Chinese-Lao bilingual texts contain non-monotonic alignments (cross alignment and null alignment), which degrade sentence alignment. In addition, the names of people and places that appear as news elements are mostly out-of-vocabulary words, which further increases the difficulty of Chinese-Lao sentence alignment. This thesis presents a Chinese-Lao bilingual sentence alignment method that integrates local and global semantic information. First, sentence length features and person and place name features are integrated into GloVe word vectors, and a bidirectional gated recurrent unit (BiGRU) encodes the character and word vectors to obtain finer-grained local sentence information. Second, an interactive attention mechanism is introduced to extract the global information in bilingual sentences, ensuring that contextual semantic features are used effectively. Finally, a KM algorithm based on a multilayer perceptron is introduced, which can handle non-monotonically aligned text and improves the generalization ability of the model. The experimental results show that this method significantly improves alignment performance on Chinese-Lao bilingual news corpora.
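
The final step of (3) can be illustrated with a small sketch. It assumes that "KM" refers to the Kuhn-Munkres (Hungarian) assignment algorithm, and it replaces the multilayer perceptron scorer with a hand-written similarity matrix. The function names, the threshold value and the scores are hypothetical and do not come from the thesis. The sketch shows how a one-to-one assignment over pairwise scores can recover a cross alignment, and how a score threshold leaves a sentence unaligned (null alignment).

```python
def hungarian_min_cost(cost):
    """Kuhn-Munkres assignment for an n x m cost matrix with n <= m.

    Returns (row, col) pairs with minimum total cost (0-based indices).
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (m + 1)      # column potentials
    p = [0] * (m + 1)        # p[j] = row matched to column j (0 = free)
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0]


def align_sentences(sim, threshold=0.5):
    """Align source and target sentences from a similarity matrix.

    sim[i][j] is a hypothetical similarity between source sentence i and
    target sentence j. Pairs scoring below `threshold` are dropped, so the
    sentences involved stay unaligned.
    """
    n, m = len(sim), len(sim[0])
    transposed = n > m
    work = [list(col) for col in zip(*sim)] if transposed else sim
    # Maximising similarity is the same as minimising negated similarity.
    cost = [[-s for s in row] for row in work]
    pairs = hungarian_min_cost(cost)
    if transposed:
        pairs = [(j, i) for i, j in pairs]
    return sorted((i, j) for i, j in pairs if sim[i][j] >= threshold)


# Hypothetical scores: sentences 0 and 1 are swapped between the two texts,
# and source sentence 2 has no good counterpart.
sim = [
    [0.1, 0.9, 0.2],
    [0.8, 0.2, 0.1],
    [0.1, 0.3, 0.2],
]
print(align_sentences(sim))  # [(0, 1), (1, 0)]
```

Because the assignment is computed over the whole score matrix rather than left to right, the order of sentences in the two texts does not constrain the result, which is why this family of methods suits non-monotonic alignment.
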
Keywords/Search Tags: Bilingual alignment, Laos, multiple features, transfer learning