
Bilingual Word Embedding Based Word Alignment On Large-Scale Corpus

Posted on: 2018-07-12    Degree: Master    Type: Thesis
Country: China    Candidate: K Huang    Full Text: PDF
GTID: 2428330542465869    Subject: Computer Science and Technology
Abstract/Summary:
The big data era brings both opportunities and challenges to corpus studies. Growing data scale has improved statistical machine translation, a field that depends heavily on corpora, yet the shortcomings of traditional machine translation methods make it difficult to process large-scale corpora effectively. New machine translation schemes capable of handling large-scale datasets are therefore urgently needed.

As an indispensable component of machine translation, word alignment has drawn wide research attention, since machine translation research generally requires word-aligned data. However, the large word translation probability table produced by traditional word alignment algorithms makes alignment difficult in a distributed setting. The rise of deep learning and the application of word embeddings in natural language processing offer word alignment a new direction, and the emergence of bilingual word embeddings makes it possible to compute cross-language word similarity directly from embeddings. Our work proposes a method that computes word translation probability from bilingual word embeddings, and further implements a word alignment algorithm based on this probability. Compared with the large word translation probability table of traditional methods, the lightweight word vector table is far cheaper to transmit, which makes our approach better suited to big data and distributed computing.

Because existing bilingual word embedding algorithms perform poorly at scale, our work proposes a parallel implementation based on Spark. Current bilingual word embedding theory offers two main ways to obtain word vectors: the monolingual-vector-based method and the bilingual training method. Each outperforms the other in different circumstances, so our work implements parallel Spark versions of both. For the monolingual-vector-based method, we offer two parallel implementations, Scheme A and Scheme B, to meet users' accuracy and performance requirements respectively. For the bilingual training method, we first build a parallel Skip-gram model with Negative Sampling, the prerequisite of this bilingual word embedding method, and then implement a complete parallel bilingual word embedding algorithm on top of it. Our experiments show that all the parallel algorithms reduce word vector training time and produce high-quality bilingual word vectors.

Efficient parallel bilingual word embedding algorithms make it convenient to align words using bilingual word embeddings. Our work first builds a basic word alignment model and then tunes it according to the characteristics of bilingual word embeddings. Experimental results show that the bilingual word embedding based word alignment algorithm surpasses traditional word alignment algorithms in alignment accuracy. We also implement a parallel version of this alignment algorithm for better efficiency. Finally, using the parallel methods above, our work builds a word-aligned English-Chinese corpus of 16 million sentence pairs; this high-quality, large-scale bilingual corpus is built within 3 hours, including everything from word vector training to corpus building.
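As a concrete illustration of the embedding-based translation probability and alignment described above, the following is a minimal sketch in Python. It assumes the bilingual embeddings already live in one shared vector space; the function names, the greedy linking strategy, and the softmax normalization over cosine similarities are illustrative assumptions, not the thesis's exact formulation.

    # Minimal sketch: embedding-based translation probability and greedy
    # word alignment. Assumes src_emb / tgt_emb map words to vectors in
    # ONE shared bilingual space (an assumption; the thesis's exact model
    # is not specified in the abstract).
    import numpy as np

    def translation_probs(src_vec, tgt_matrix):
        """Softmax over cosine similarities between one source word vector
        and every target word vector (rows of tgt_matrix)."""
        src = src_vec / np.linalg.norm(src_vec)
        tgt = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
        sims = tgt @ src                     # cosine similarity per target word
        exp = np.exp(sims - sims.max())     # numerically stable softmax
        return exp / exp.sum()

    def align(src_sent, tgt_sent, src_emb, tgt_emb):
        """Greedy alignment: link each source token to its most probable
        target token under the embedding-based translation distribution."""
        tgt_matrix = np.stack([tgt_emb[w] for w in tgt_sent])
        return [(i, int(np.argmax(translation_probs(src_emb[w], tgt_matrix))))
                for i, w in enumerate(src_sent)]

Note that in a distributed setting only the compact word vector tables need to be shipped to workers, rather than a full word translation probability table, which is exactly the transmission advantage the abstract claims.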
To further improve the accuracy of bilingual word embeddings and the quality of word alignment, our work analyzes the weaknesses of existing approaches, proposes the MPS-Neg bilingual word embedding model, and builds an MPS-Neg word alignment algorithm on top of it. The MPS-Neg model reinforces the relationship between the word vectors of the two languages and preserves more translation information, which lets it serve word alignment tasks better than other bilingual word embedding models. Experiments show that the MPS-Neg algorithm is more accurate than word alignment algorithms based on existing bilingual word embedding models, and improves accuracy by 9 percentage points over traditional word alignment algorithms. With no loss of accuracy, MPS-Neg is also shown to be more efficient than traditional methods.
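The abstract does not specify the MPS-Neg objective itself. For reference, the standard skip-gram negative-sampling loss that bilingual training methods of this kind build on is, for an input word w_I, an observed context word w_O, and k negative samples drawn from a noise distribution P_n:

    \mathcal{L} = -\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
                  - \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
                    \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]

Here \sigma is the sigmoid function and v, v' are the input and output vectors of a word. MPS-Neg presumably extends an objective of this form to tie the two languages' vectors together, but the abstract gives no further detail.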
Keywords/Search Tags: bilingual word embedding, word alignment model, bilingual corpus, parallel algorithm