Training Large-Scale Statistical Machine Translation Models On Spark

Posted on: 2017-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhou
Full Text: PDF
GTID: 2428330485458823
Subject: Computer Science and Technology
Abstract/Summary:
With the development of computer technology, the application data generated on the Internet is increasing rapidly, and the size of available corpora is growing accordingly. Research shows that the quality of statistical machine translation (SMT) improves as the size of the training corpora increases. However, with the ever-increasing amount of parallel training corpora, existing training methods based on single machines and traditional distributed tools are no longer efficient enough to train translation models, and thus fail to meet the requirements of SMT researchers.

In this thesis, we focus on translation models and propose parallel algorithms to speed up their training. First, by analyzing the parallelism of translation model training algorithms, we identified two fundamental building blocks as our main research objects: distributed maximum likelihood estimation (MLE) and a distributed parameter management mechanism. For distributed MLE, we proposed a join-based MLE algorithm that scales almost linearly with the data size. For distributed parameter management, we proposed two methods: one based on broadcast variables, which shares parameters over the network, and one based on HDFS, which shares parameters by loading them from HDFS. Both methods support reliable distributed parameter loading, access, and updating.

Then, we proposed the following two training frameworks and studied a series of parallel model training algorithms on top of them:

(1) The distributed EM training framework, which achieves better performance by optimizing data storage and partitioning. Within this framework we implemented parallel training algorithms for IBM Model 1 and the HMM alignment model, as well as the Align_on_MGIZA algorithm, which invokes MGIZA++ in a distributed fashion to train word alignments.

(2) The translation model training framework, within which we implemented parallel training algorithms for the phrase-based and the hierarchical phrase-based translation models.

Finally, we integrated this series of parallel training algorithms into a large-scale translation model training toolkit, Seal. Experimental results show that, compared with MapReduce-based methods, the parallel training algorithms for IBM Model 1 and the HMM model in Seal achieve 2-5x speedups, and the Align_on_MGIZA algorithm achieves a 1-2x speedup. The training algorithms for the phrase-based and hierarchical phrase-based translation models are 2-4x and 5-8x faster than the MapReduce methods, respectively. Seal efficiently accelerates model training in large-scale SMT systems, exhibits good scalability, and better meets the requirements of training on large-scale corpora.
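To make the join-based MLE concrete, the following is a minimal Spark sketch of the underlying pattern: joint counts and marginal counts are aggregated with reduceByKey, then combined with a join to produce relative-frequency estimates. The input path, field layout, and all names here are illustrative assumptions, not Seal's actual interface; the same count-join-normalize pattern also applies to relative-frequency estimation of phrase tables.

```scala
import org.apache.spark.sql.SparkSession

object JoinBasedMLE {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JoinBasedMLE").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: one tab-separated (sourceWord, targetWord)
    // aligned pair per line.
    val pairs = sc.textFile("hdfs:///corpus/aligned_pairs")
      .map(_.split("\t"))
      .collect { case Array(s, t) => (s, t) }

    // count(s, t): joint counts aggregated per word pair.
    val pairCounts = pairs.map(p => (p, 1L)).reduceByKey(_ + _)

    // count(s): marginal counts per source word.
    val srcCounts = pairCounts
      .map { case ((s, _), c) => (s, c) }
      .reduceByKey(_ + _)

    // Join on the source word and normalize: P(t | s) = count(s, t) / count(s).
    val probs = pairCounts
      .map { case ((s, t), c) => (s, (t, c)) }
      .join(srcCounts)
      .map { case (s, ((t, c), total)) => (s, t, c.toDouble / total) }

    probs.saveAsTextFile("hdfs:///models/lexical_probs")
    spark.stop()
  }
}
```

Because both counting stages reduce locally before shuffling and the final join partitions by source word, the work grows roughly in proportion to the corpus size, which is consistent with the near-linear data scalability claimed above.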
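The broadcast-variable flavor of parameter management can be illustrated with one EM iteration of IBM Model 1: the current translation table is shipped to every executor as a read-only broadcast variable, expected counts are aggregated with reduceByKey, and the table is renormalized for the next iteration. This is a sketch under simplifying assumptions, not Seal's implementation; in particular it collects the counts back to the driver, whereas the thesis describes fully distributed parameter updating. All identifiers are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object Model1EMStep {
  type TransTable = Map[(String, String), Double]

  // One EM iteration: corpus is an RDD of (sourceSentence, targetSentence)
  // token arrays; table holds the current estimates of t(e | f).
  def emStep(sc: SparkContext,
             corpus: RDD[(Array[String], Array[String])],
             table: TransTable): TransTable = {
    // Ship the current parameters to all executors once.
    val bcast = sc.broadcast(table)

    // E-step: distribute each target word's count over the source words
    // in proportion to the current probabilities (small uniform fallback
    // for unseen pairs, e.g. in the first iteration).
    val counts = corpus.flatMap { case (src, tgt) =>
      val t = bcast.value
      tgt.flatMap { e =>
        val weights = src.map(f => (f, t.getOrElse((e, f), 1e-6)))
        val z = weights.map(_._2).sum
        weights.map { case (f, w) => ((e, f), w / z) }
      }
    }.reduceByKey(_ + _)

    // M-step (simplified): normalize expected counts per source word
    // on the driver and return the updated table for rebroadcast.
    val collected = counts.collect()
    val totals = collected
      .groupBy { case ((_, f), _) => f }
      .map { case (f, cs) => (f, cs.map(_._2).sum) }
    bcast.unpersist()
    collected.map { case ((e, f), c) => ((e, f), c / totals(f)) }.toMap
  }
}
```

The broadcast keeps one read-only copy of the parameters per executor rather than per task, so the network cost of sharing the table is paid once per iteration.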
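The HDFS-based alternative avoids holding the whole table in executor memory: each iteration writes the updated parameters to HDFS and the next iteration reloads them as a dataset that can be joined against the corpus statistics. A minimal sketch, assuming a simple tab-separated on-disk layout (hypothetical, not Seal's storage format):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object HdfsParamSharing {
  // Persist the current table t(e | f) to HDFS as "e \t f \t prob" lines.
  def saveTable(table: RDD[((String, String), Double)], path: String): Unit =
    table.map { case ((e, f), p) => s"$e\t$f\t$p" }.saveAsTextFile(path)

  // Reload the table for the next iteration; downstream stages can join
  // it against expected counts instead of broadcasting it.
  def loadTable(sc: SparkContext, path: String): RDD[((String, String), Double)] =
    sc.textFile(path).map(_.split("\t")).collect {
      case Array(e, f, p) => ((e, f), p.toDouble)
    }
}
```

The trade-off between the two mechanisms mirrors the abstract: broadcasting spends network bandwidth and executor memory for fast local lookup, while the HDFS route trades disk and shuffle I/O for a much smaller memory footprint on large parameter tables.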
Keywords/Search Tags: statistical machine translation, translation model, word alignment model, large-scale training, parallel algorithm