
N-gram Language Model Based On Distributed System

Posted on: 2012-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Shen
GTID: 2248330395955497
Subject: Computer system architecture

Abstract/Summary:
With the progress of natural language processing (NLP) technology and the emergence of large-scale corpora, training large-scale language models has become practical. The n-gram language model (LM) is an important tool in many areas of NLP research, such as information retrieval, machine translation, and speech recognition. Using higher-order models and more training data can significantly improve application performance, so research on large-scale language modeling has drawn increasing attention. Because the resources of a single PC are insufficient for large-scale language modeling, a distributed system is needed.

This thesis presents the construction of a large-scale distributed n-gram language model using the Hadoop MapReduce platform and the distributed database HBase. We propose a method that focuses on the training time and storage size of the model, exploring different HBase table structures and compression approaches. The method is used to build the training and testing processes on the Hadoop MapReduce framework and HBase. The experiments evaluate and compare five different database table structures, training unigram, bigram, and trigram models on 40 million words; the results suggest that a table based on the half n-gram structure is a good choice for a distributed language model. The results of this work can be applied and further developed in machine translation and other large-scale distributed language processing areas.
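As an illustration of the ideas above, the sketch below shows, in plain Python rather than Hadoop Java code, how a MapReduce job can count n-grams and how a "half n-gram" table layout might store them: the row key is the first half of the n-gram's words and the column qualifier is the remaining half, so many counts share one row. The exact split rule and table schema are assumptions for illustration, not the thesis's definitive design.

```python
from collections import Counter, defaultdict

def mapper(line, max_order=3):
    """Map phase: emit (ngram, 1) for every n-gram up to max_order."""
    words = line.split()
    for n in range(1, max_order + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n]), 1

def reducer(pairs):
    """Reduce phase: sum the counts per n-gram key.

    In real Hadoop the shuffle groups keys across machines; here a
    single Counter simulates that grouping."""
    counts = Counter()
    for ngram, c in pairs:
        counts[ngram] += c
    return counts

def half_ngram_key(ngram):
    """Assumed half-n-gram split: (row key, column qualifier) =
    (first half of the words, remaining words)."""
    words = ngram.split()
    half = (len(words) + 1) // 2
    return " ".join(words[:half]), " ".join(words[half:])

# Tiny two-sentence corpus standing in for the 40-million-word data.
corpus = ["the cat sat", "the cat ran"]
counts = reducer(p for line in corpus for p in mapper(line))

# Pack counts into a dict-of-dicts standing in for an HBase table:
# one row per half-n-gram prefix, one column per suffix.
table = defaultdict(dict)
for ngram, c in counts.items():
    row, col = half_ngram_key(ngram)
    table[row][col] = c
```

With this layout, all n-grams that share a prefix (e.g. "the cat", "the cat sat", "the cat ran") cluster in rows with related keys, which is one reason a half-n-gram table can reduce row count and lookup cost compared to one row per full n-gram.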
Keywords/Search Tags: Distributed, Language Model, Smoothing Methods, HDFS, MapReduce, Hadoop, HBase