Training Large-Scale Statistical Machine Translation Models On Spark

Posted on: 2017-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhou
Full Text: PDF
GTID: 2428330485458823
Subject: Computer Science and Technology
Abstract/Summary:
With the development of computer technology, the application data generated on the Internet is increasing rapidly, and the size of available corpora is growing accordingly. Research shows that the quality of statistical machine translation (SMT) improves as the size of the training corpora increases. However, with the ever-increasing amount of parallel training corpora, existing training methods based on single machines and traditional distributed tools are no longer efficient enough to train translation models, and thus fail to meet the requirements of SMT researchers.

In this thesis, we focus on translation models and propose parallel algorithms to speed up their training. First, by analyzing the parallelism of translation model training algorithms, we identified two fundamental building blocks as our main research objects: distributed maximum likelihood estimation (MLE) and a distributed parameter management mechanism. For distributed MLE, we proposed a join-based MLE algorithm that scales almost linearly with the data size. For distributed parameter management, we proposed two methods: one based on broadcast variables, which shares parameters over the network, and one based on HDFS, which shares parameters by loading them from HDFS. Both methods support reliable distributed parameter loading, access, and updating.

Then, we proposed the following two training frameworks and studied a series of parallel model training algorithms on top of them:

(1) The distributed EM training framework, which achieves better performance by optimizing data storage and partitioning. Within this framework we implemented parallel training algorithms for IBM Model 1 and the HMM alignment model, as well as the Align_on_MGIZA algorithm, which invokes MGIZA++ in a distributed fashion to train word alignments.

(2) The translation model training framework, within which we implemented parallel training algorithms for the phrase-based and the hierarchical phrase-based translation models.

Finally, we integrated this series of parallel training algorithms into a large-scale translation model training toolkit, Seal. Experimental results show that, compared with MapReduce-based methods, the parallel training algorithms for IBM Model 1 and the HMM model in Seal achieve 2-5x speedups, and the Align_on_MGIZA algorithm achieves a 1-2x speedup. The training algorithms for the phrase-based and hierarchical phrase-based translation models are 2-4x and 5-8x faster than the MapReduce methods, respectively. Seal efficiently accelerates model training in large-scale SMT systems, exhibits good scalability, and better meets the requirements of training on large-scale corpora.
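To make the join-based MLE concrete, the following is a minimal Spark sketch of the underlying pattern: joint counts and marginal counts are aggregated with reduceByKey, then combined with a join to produce relative-frequency estimates. The input path, field layout, and all names here are illustrative assumptions, not Seal's actual interface; the same count-join-normalize pattern also applies to relative-frequency estimation of phrase tables.

```scala
import org.apache.spark.sql.SparkSession

object JoinBasedMLE {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JoinBasedMLE").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: one tab-separated (sourceWord, targetWord)
    // aligned pair per line.
    val pairs = sc.textFile("hdfs:///corpus/aligned_pairs")
      .map(_.split("\t"))
      .collect { case Array(s, t) => (s, t) }

    // count(s, t): joint counts aggregated per word pair.
    val pairCounts = pairs.map(p => (p, 1L)).reduceByKey(_ + _)

    // count(s): marginal counts per source word.
    val srcCounts = pairCounts
      .map { case ((s, _), c) => (s, c) }
      .reduceByKey(_ + _)

    // Join on the source word and normalize: P(t | s) = count(s, t) / count(s).
    val probs = pairCounts
      .map { case ((s, t), c) => (s, (t, c)) }
      .join(srcCounts)
      .map { case (s, ((t, c), total)) => (s, t, c.toDouble / total) }

    probs.saveAsTextFile("hdfs:///models/lexical_probs")
    spark.stop()
  }
}
```

Because both counting stages reduce locally before shuffling and the final join partitions by source word, the work grows roughly in proportion to the corpus size, which is consistent with the near-linear data scalability claimed above.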
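The broadcast-variable flavor of parameter management can be illustrated with one EM iteration of IBM Model 1: the current translation table is shipped to every executor as a read-only broadcast variable, expected counts are aggregated with reduceByKey, and the table is renormalized for the next iteration. This is a sketch under simplifying assumptions, not Seal's implementation; in particular it collects the counts back to the driver, whereas the thesis describes fully distributed parameter updating. All identifiers are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object Model1EMStep {
  type TransTable = Map[(String, String), Double]

  // One EM iteration: corpus is an RDD of (sourceSentence, targetSentence)
  // token arrays; table holds the current estimates of t(e | f).
  def emStep(sc: SparkContext,
             corpus: RDD[(Array[String], Array[String])],
             table: TransTable): TransTable = {
    // Ship the current parameters to all executors once.
    val bcast = sc.broadcast(table)

    // E-step: distribute each target word's count over the source words
    // in proportion to the current probabilities (small uniform fallback
    // for unseen pairs, e.g. in the first iteration).
    val counts = corpus.flatMap { case (src, tgt) =>
      val t = bcast.value
      tgt.flatMap { e =>
        val weights = src.map(f => (f, t.getOrElse((e, f), 1e-6)))
        val z = weights.map(_._2).sum
        weights.map { case (f, w) => ((e, f), w / z) }
      }
    }.reduceByKey(_ + _)

    // M-step (simplified): normalize expected counts per source word
    // on the driver and return the updated table for rebroadcast.
    val collected = counts.collect()
    val totals = collected
      .groupBy { case ((_, f), _) => f }
      .map { case (f, cs) => (f, cs.map(_._2).sum) }
    bcast.unpersist()
    collected.map { case ((e, f), c) => ((e, f), c / totals(f)) }.toMap
  }
}
```

The broadcast keeps one read-only copy of the parameters per executor rather than per task, so the network cost of sharing the table is paid once per iteration.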
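The HDFS-based alternative avoids holding the whole table in executor memory: each iteration writes the updated parameters to HDFS and the next iteration reloads them as a dataset that can be joined against the corpus statistics. A minimal sketch, assuming a simple tab-separated on-disk layout (hypothetical, not Seal's storage format):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object HdfsParamSharing {
  // Persist the current table t(e | f) to HDFS as "e \t f \t prob" lines.
  def saveTable(table: RDD[((String, String), Double)], path: String): Unit =
    table.map { case ((e, f), p) => s"$e\t$f\t$p" }.saveAsTextFile(path)

  // Reload the table for the next iteration; downstream stages can join
  // it against expected counts instead of broadcasting it.
  def loadTable(sc: SparkContext, path: String): RDD[((String, String), Double)] =
    sc.textFile(path).map(_.split("\t")).collect {
      case Array(e, f, p) => ((e, f), p.toDouble)
    }
}
```

The trade-off between the two mechanisms mirrors the abstract: broadcasting spends network bandwidth and executor memory for fast local lookup, while the HDFS route trades disk and shuffle I/O for a much smaller memory footprint on large parameter tables.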
Keywords/Search Tags: statistical machine translation, translation model, word alignment model, large-scale training, parallel algorithm