Bilingual Word Alignment System Based on English-Chinese Parallel Corpus

Posted on: 2020-10-02 | Degree: Master | Type: Thesis
Country: China | Candidate: J J Zhou | Full Text: PDF
GTID: 2428330590471684 | Subject: Electronic and communication engineering
Abstract/Summary:
Statistical machine translation is built on bilingual parallel corpora: a translation model is constructed through statistical analysis of a large amount of parallel text. Bilingual word alignment is a key step in a statistical machine translation system, and its accuracy directly affects the performance of the translation system. In addition, corpora annotated with word alignment information have great application value. They can provide important support for natural language processing tasks such as dictionary compilation, cross-language information retrieval, and semantic disambiguation. How to obtain high-quality bilingual word alignments is therefore of great research value.

Existing word alignment methods usually rely on statistical information and do not fully consider the linguistic characteristics of the different languages. Training usually requires a large amount of labeled word alignment data, while the manually labeled alignment data available is too small to meet that requirement. Traditional word alignment models also suffer from sparse lexical features, which results in poor alignment of low-frequency words in the corpus. To address these problems, this thesis uses deep learning methods to study word alignment. Specifically, the work consists of the following three parts:

(1) A word alignment method based on recurrent neural networks. The method incorporates the traditional Hidden Markov Model into a recurrent neural network. By considering the context of the sentence and using lexical similarity, low-frequency words in a sentence are replaced by common words with similar meanings. The correspondence with target-language words is then found through the common word, which yields the alignment information for the low-frequency word. The model is trained in an unsupervised manner, which saves the cost of manually annotating an alignment corpus. Experimental results show that the method improves the quality of word alignment.

(2) A word alignment method that incorporates dependency relations. The method requires dependency parsing of the input sentences. A bidirectional Long Short-Term Memory network extracts contextual word embedding features, and an attention mechanism is introduced to control the fusion of features, yielding a dependency parser with better parsing results. The bilingual training corpus is then labeled with dependency relations by this parser. Dependency relation information and part-of-speech information are used as features and integrated into a log-linear model to obtain word alignment information.

(3) On the basis of word alignment, a phrase-based statistical machine translation system is implemented, comprising translation model training, language model training, and a decoding module. The translation system in this thesis is compared with commonly used online translation platforms.
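
The low-frequency word replacement idea in part (1) can be illustrated with a minimal sketch. This is not the thesis's model: the thesis derives similarity from a recurrent network and sentence context, whereas the sketch below assumes small hand-written embedding vectors, a hypothetical frequency threshold `min_count`, and a hypothetical helper `replace_rare_words`. It only shows the substitution step, in which a rare word is swapped for its most similar common word before alignment.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def replace_rare_words(sentence, counts, embeddings, min_count=2):
    """Replace each word seen fewer than min_count times with the most
    similar common word. Words without an embedding are left unchanged."""
    common = [w for w in embeddings if counts[w] >= min_count]
    result = []
    for word in sentence:
        if counts[word] >= min_count or word not in embeddings or not common:
            result.append(word)
            continue
        nearest = max(common, key=lambda c: cosine(embeddings[word], embeddings[c]))
        result.append(nearest)
    return result


# Toy corpus and toy embeddings (assumed values, for illustration only).
corpus = [
    ["the", "car", "is", "fast"],
    ["the", "car", "is", "red"],
    ["the", "automobile", "is", "fast"],
]
counts = Counter(word for sentence in corpus for word in sentence)
embeddings = {
    "car": [0.9, 0.1],
    "automobile": [0.85, 0.15],
    "fast": [0.1, 0.9],
    "red": [0.2, 0.7],
    "the": [0.5, 0.5],
    "is": [0.4, 0.6],
}

replaced = replace_rare_words(corpus[2], counts, embeddings)
print(replaced)  # ['the', 'car', 'is', 'fast']
```

Here "automobile" occurs once, so it is replaced by "car", its nearest common neighbour under the toy vectors. An aligner could then reuse the alignment statistics already gathered for "car" and transfer the resulting link back to the original rare word.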
Keywords/Search Tags: word alignment, parallel corpus, neural network, statistical machine translation