
Offline Model Training Method And System For Large-Scale Distributed Statistical Machine Translation

Posted on: 2018-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: W J Yang
GTID: 2428330512998204
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology and wide-ranging academic and cultural exchanges, parallel corpora for machine translation are growing explosively, and at the same time the demand for machine translation keeps increasing. The quality of Statistical Machine Translation (SMT) is largely determined by the scale of the parallel corpus. However, as the parallel corpus grows, the model training time of a typical stand-alone machine translation system increases sharply, which severely restricts both algorithm research and practical applications of SMT. Research on large-scale distributed machine translation systems therefore has great research significance and practical value.

Existing distributed SMT tools usually suffer from several problems, such as poor parallel performance and scalability and an incomplete SMT offline training pipeline. In fact, it is extremely difficult to implement an efficient distributed SMT offline training pipeline. First, an SMT system consists of complicated tasks that require different computing resources, making it hard to design efficient parallel algorithms. Second, model training incurs massive I/O and inter-node network communication overhead, which greatly affects parallel efficiency. Third, if the data skew problems that arise during model training are not handled properly, they can easily slow down the entire training process.

To address these difficulties, and based on an analysis of the deficiencies of existing work and the challenges of distributed machine translation models, this paper implements a complete, efficient, flexible and scalable distributed SMT offline training pipeline, which provides solid support for SMT model research and applications. The main contents and contributions of this paper are as follows:

(1) This paper analyzes each model in the SMT offline training pipeline and implements its large-scale distributed training. The word alignment part covers parallel preprocessing and parallel training of the corresponding word alignment models. The translation model part contains three different parallel translation models. Distributed language model training supports four different probability-smoothing algorithms.

(2) System load balancing and network communication optimization. In word alignment training, a proper block threshold is set during data preprocessing. In translation model training, the training data is numerically encoded to reduce the size of the intermediate data. These measures reduce the system load and network traffic while improving training efficiency.

(3) Since parameters are estimated many times during model training, this paper optimizes the join-based parallel Maximum Likelihood Estimation algorithm with two strategies. The first is to broadcast the smaller table to the nodes holding the large distributed table, avoiding a global join. The second is to apply the same partition function to both distributed tables so that their records already satisfy the same partitioning rules, avoiding a data shuffle during execution (see the first sketch after the abstract).

(4) For data skew problems in model training, two methods are studied and implemented. One is to increase the degree of parallelism of model training (i.e., to repartition the data). The other is a two-stage aggregation strategy: each key is first extended with a random prefix so that records are distributed evenly, partial aggregation is performed on each node, and finally the prefix is removed and a global aggregation is carried out (see the second sketch after the abstract).
(5) Finally, based on the widely used distributed data-parallel computing platform Spark, this paper implements Seal, a prototype large-scale distributed SMT offline training pipeline, and compares its performance with other systems on massive corpora. Experiments show that the parallel performance of Seal is superior to that of existing stand-alone and distributed machine translation training tools, and that Seal also scales better.
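The join optimizations of contribution (3) are not spelled out in the abstract; the following is only a minimal Spark/Scala sketch under assumed data layouts. It supposes a hypothetical RDD of tab-separated (source phrase, target phrase, count) statistics and shows two ways to estimate relative frequencies: broadcasting a small marginal-count table so no global join is needed, and co-partitioning two large keyed tables so the subsequent join causes no extra shuffle. All names and paths (pairCounts, marginalCounts, the HDFS locations) are illustrative, not the thesis's actual API.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object MleJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mle-join-sketch"))

    // Hypothetical phrase co-occurrence counts: (srcPhrase, (tgtPhrase, count))
    val pairCounts = sc.textFile("hdfs:///corpus/pair-counts")
      .map { line =>
        val Array(src, tgt, c) = line.split("\t")
        (src, (tgt, c.toLong))
      }

    // Strategy 1: broadcast the small table of source-side marginal counts,
    // so each partition computes P(tgt|src) locally without a global join.
    val marginalCounts: Map[String, Long] =
      pairCounts.mapValues(_._2).reduceByKey(_ + _).collectAsMap().toMap
    val bcMarginals = sc.broadcast(marginalCounts)

    val translationProbs = pairCounts.map { case (src, (tgt, c)) =>
      ((src, tgt), c.toDouble / bcMarginals.value(src))
    }
    translationProbs.saveAsTextFile("hdfs:///model/translation-probs")

    // Strategy 2: when both tables are large, give them the same partitioner
    // up front so the later join is a narrow dependency (no extra shuffle).
    // The marginals here merely stand in for any second large keyed table.
    val partitioner = new HashPartitioner(sc.defaultParallelism)
    val left  = pairCounts.partitionBy(partitioner).cache()
    val right = pairCounts.mapValues(_._2).reduceByKey(partitioner, _ + _)
    val joined = left.join(right)   // co-partitioned: joined without reshuffling
    joined.count()

    sc.stop()
  }
}
```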
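Contribution (4)'s two-stage aggregation is a key-salting pattern; the sketch below is again only an illustration, assuming (key, count) records and a hypothetical helper name. The key is extended with a random prefix so that a hot key spreads over many partitions, partial sums are computed, the prefix is stripped, and a final aggregation produces the global counts.

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

object SkewSketch {
  /** Two-stage ("salted") aggregation for skewed keys: add a random prefix,
    * sum locally per salted key, strip the prefix, then sum globally.
    * `numSalts` controls how widely a hot key is spread across partitions. */
  def saltedSum(records: RDD[(String, Long)], numSalts: Int): RDD[(String, Long)] = {
    records
      .map { case (key, value) =>                 // stage 0: salt the key
        (s"${Random.nextInt(numSalts)}#$key", value)
      }
      .reduceByKey(_ + _)                         // stage 1: partial sums per salted key
      .map { case (saltedKey, partial) =>         // strip the random prefix
        (saltedKey.substring(saltedKey.indexOf('#') + 1), partial)
      }
      .reduceByKey(_ + _)                         // stage 2: global sums per original key
  }
}
```

The abstract's other remedy, simply raising the degree of parallelism by repartitioning, helps when skew is mild but cannot split the records of a single hot key across workers; the salting step above does.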
Keywords/Search Tags:Statistical Machine Translation, Distributed Machine Translation Model, Distributed Data-Parallel Computing Platform