
Research And Application Of Neural Network-based Sentence Alignment

Posted on: 2021-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Huang
GTID: 2428330605474881
Subject: Software engineering
Abstract/Summary:
Neural machine translation (NMT) has rapidly become the mainstream approach to machine translation because of its outstanding translation performance. Training an NMT model usually depends on parallel corpora, and whether enough parallel data is available is often one of the decisive factors for the performance of an NMT system. As a way to obtain parallel sentences from large amounts of candidate monolingual text, sentence alignment for machine translation has been widely studied. Previous sentence alignment techniques usually depend on feature engineering, using sentence length, length ratio, bilingual dictionaries and similar features to align sentences. However, the feature extraction process is often cumbersome, and such features usually cannot cover all of the bilingual information in a sentence pair. Given the success of neural networks on sequence processing problems, this paper uses a neural network-based sentence alignment method that learns bilingual information automatically through network training. In addition, this paper applies sentence alignment to extract parallel sentences from the World Wide Web and uses the extracted sentences to improve the performance of an NMT model. The main content covers the following three aspects:

(1) Sentence alignment based on sentence representation. Neural networks have been very successful on text sequence tasks, and sentence alignment is itself a bilingual text sequence problem. This paper therefore proposes to align sentences with an NMT-based model that takes sentences as language-agnostic input, maps them to sentence embeddings in a joint space, and relies on vector similarity to identify parallel sentence pairs. We evaluate this method on the BUCC-2017 sentence alignment task, and the results show that it can effectively align sentences in four language pairs. In addition, our analysis of the model reveals some deficiencies in the BUCC-2017 dataset.

(2) Unsupervised sentence alignment. Previous sentence alignment techniques rely heavily on parallel resources such as parallel sentences and bilingual dictionaries. In low-resource scenarios, the lack of such resources makes it difficult to align sentences this way. To ease the scarcity of parallel corpora for low-resource language pairs, we attempt to align sentences in an unsupervised way. Using a simple and effective approach, we construct bilingual sentence embeddings from unsupervised bilingual word embeddings. We conduct experiments on the BUCC-2017 sentence alignment task for three language pairs, and the results show that it is also feasible to obtain parallel sentences in an unsupervised way.

(3) Mining and using parallel sentences from the World Wide Web by constructing a parallel sentence extraction system. We construct a parallel sentence extraction system and mine parallel sentences from the web using sentence alignment. The bilingual parallel sentences obtained through the system are then used to train NMT models. Our experiments show that the system can effectively obtain bilingual parallel sentences, including in low-resource scenarios, and that the extracted sentences improve the performance of the NMT model. It is worth noting that the corpora the system builds while acquiring parallel sentences can also be used for other NLP tasks, such as parallel document extraction from the web.
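The similarity-based alignment described in aspect (1) can be illustrated with a minimal sketch. It assumes sentence embeddings in a joint space are already available as plain lists of floats; the encoder itself is not shown. The function names, the mutual-nearest-neighbour filter, and the threshold value are illustrative assumptions, not the scoring rule used in the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mine_parallel_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, similarity) for candidate parallel pairs.

    Assumption: a pair is kept only if each sentence is the other's nearest
    neighbour and their cosine similarity reaches `threshold`. Both choices
    are hypothetical simplifications for illustration.
    """
    if not src_vecs or not tgt_vecs:
        return []

    # Full similarity matrix: sims[i][j] compares source i with target j.
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]

    # Nearest target for every source sentence.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sims]
    # Nearest source for every target sentence.
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sims[i][j])
        for j in range(len(tgt_vecs))
    ]

    return [
        (i, j, sims[i][j])
        for i, j in enumerate(best_tgt)
        if best_src[j] == i and sims[i][j] >= threshold
    ]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real sentence embeddings are much larger.
    source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    target = [[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]]
    for i, j, score in mine_parallel_pairs(source, target):
        print(f"source {i} <-> target {j} (cosine = {score:.3f})")
```

The mutual-nearest-neighbour check is one simple way to reduce false matches when many sentences have no translation in the other corpus. This sketch compares every source sentence with every target sentence, so a large corpus would need an approximate nearest-neighbour index instead.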
Keywords/Search Tags: Neural Machine Translation, Sentence Representation, Sentence Alignment