
N-gram Language Model Based On Distributed System

Posted on: 2012-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Shen
GTID: 2248330395955497
Subject: Computer system architecture

Abstract/Summary:
With the progress of natural language processing (NLP) technology and the emergence of large-scale corpora, training large-scale language models has become practical. The n-gram language model (LM) is an important tool in many areas of NLP research, such as information retrieval, machine translation, and speech recognition. Using higher-order models and more training data can significantly improve application performance, so research on large-scale language modeling has drawn increasing attention. Because the resources of a single PC are insufficient for large-scale language modeling, a distributed system is needed.

This thesis presents the construction of a large-scale distributed n-gram language model using the Hadoop MapReduce platform and the distributed database HBase. We propose a method that focuses on the training time and storage size of the model, exploring different HBase table structures and compression approaches. The method is used to build the training and testing processes on the Hadoop MapReduce framework and HBase. The experiments evaluate and compare five different database table structures, training unigram, bigram, and trigram models on 40 million words; the results suggest that a table based on the half n-gram structure is a good choice for a distributed language model. The results of this work can be applied and further developed in machine translation and other large-scale distributed language processing areas.
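As an illustration of the ideas above, the sketch below shows, in plain Python rather than Hadoop Java code, how a MapReduce job can count n-grams and how a "half n-gram" table layout might store them: the row key is the first half of the n-gram's words and the column qualifier is the remaining half, so many counts share one row. The exact split rule and table schema are assumptions for illustration, not the thesis's definitive design.

```python
from collections import Counter, defaultdict

def mapper(line, max_order=3):
    """Map phase: emit (ngram, 1) for every n-gram up to max_order."""
    words = line.split()
    for n in range(1, max_order + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n]), 1

def reducer(pairs):
    """Reduce phase: sum the counts per n-gram key.

    In real Hadoop the shuffle groups keys across machines; here a
    single Counter simulates that grouping."""
    counts = Counter()
    for ngram, c in pairs:
        counts[ngram] += c
    return counts

def half_ngram_key(ngram):
    """Assumed half-n-gram split: (row key, column qualifier) =
    (first half of the words, remaining words)."""
    words = ngram.split()
    half = (len(words) + 1) // 2
    return " ".join(words[:half]), " ".join(words[half:])

# Tiny two-sentence corpus standing in for the 40-million-word data.
corpus = ["the cat sat", "the cat ran"]
counts = reducer(p for line in corpus for p in mapper(line))

# Pack counts into a dict-of-dicts standing in for an HBase table:
# one row per half-n-gram prefix, one column per suffix.
table = defaultdict(dict)
for ngram, c in counts.items():
    row, col = half_ngram_key(ngram)
    table[row][col] = c
```

With this layout, all n-grams that share a prefix (e.g. "the cat", "the cat sat", "the cat ran") cluster in rows with related keys, which is one reason a half-n-gram table can reduce row count and lookup cost compared to one row per full n-gram.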
Keywords/Search Tags: Distributed, Language Model, Smoothing Methods, HDFS, MapReduce, Hadoop, HBase