The Research On Chinese Word Segmentation Based On Conditional Random Fields In Big Data Environment

Posted on: 2018-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: C Peng
Full Text: PDF
GTID: 2348330542461690
Subject: Software engineering
Abstract/Summary:
The rapid development of Internet technology has ushered in the era of big data, and how to extract valuable information from massive data has become a focus of both industry and academia. As the foundation of Chinese natural language processing, Chinese word segmentation is of great importance. Segmentation methods based on statistical learning have been widely adopted in practice because of their excellent accuracy; however, the time consumed by model training needs to be reduced. Fortunately, the emergence of a variety of distributed parallel processing frameworks provides strong support for overcoming this shortcoming. To reduce model training time, we combine statistical Chinese word segmentation with a distributed parallel processing platform. Hadoop is a mainstream distributed computing platform based on the MapReduce framework: to improve efficiency, it divides a job into sub-tasks and then integrates the results of each sub-task.

Accordingly, this paper carries out the following work. First, three mainstream Chinese word segmentation methods are analyzed and compared, with the statistical learning approach as the focus. We choose the conditional random fields (CRF) model for the segmentation task, as it overcomes the disadvantages of the hidden Markov model and the maximum entropy model. Training a CRF amounts to estimating the model parameters, each of which is the weight of a feature function. The L-BFGS algorithm is used for training, and numerous experiments show that most of the training time is spent on gradient computation; we reduce this cost with a parallel algorithm. The other step of CRF-based segmentation is model prediction, for which the Viterbi algorithm is an effective method; this algorithm is likewise parallelized with the MapReduce framework in this paper. Finally, the parallel L-BFGS and parallel Viterbi algorithms are tested on a Hadoop cluster and compared with the serial method along four dimensions: precision, recall rate, F1-score, and model training speed. The results show that the training speed of the CRF model is obviously improved while the other three metrics remain almost unchanged.
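The map/reduce split of the gradient computation described above can be sketched as follows. This is a minimal illustration, not the thesis's Hadoop implementation: `example_gradient` is a hypothetical stand-in for the per-example CRF gradient term (empirical minus expected feature counts), and a thread pool stands in for Hadoop map tasks. The key structure is the same: partition the training data, compute partial gradients in parallel (map), then sum them into the full gradient (reduce).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-example "gradient" over two features. In a real CRF this
# term is (empirical feature counts) - (expected counts under the model).
def example_gradient(x):
    return [1.0 * x, -0.5 * x]

def shard_gradient(shard):
    # Map step: each worker sums the gradients of its own data shard.
    total = [0.0, 0.0]
    for x in shard:
        g = example_gradient(x)
        total = [a + b for a, b in zip(total, g)]
    return total

def parallel_gradient(data, n_shards=4):
    # Partition the training data across shards (like Hadoop input splits).
    shards = [data[i::n_shards] for i in range(n_shards)]
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partials = list(pool.map(shard_gradient, shards))
    # Reduce step: sum the partial gradients into the full gradient.
    total = [0.0, 0.0]
    for p in partials:
        total = [a + b for a, b in zip(total, p)]
    return total
```

Because the gradient is a sum over independent training examples, the parallel result is identical to the serial one; only the wall-clock time changes.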
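As a minimal sketch of the prediction step, the following Viterbi decoder finds the highest-scoring tag path by dynamic programming. The tag set, observation scores, and transition scores are illustrative assumptions, not the thesis's trained model; the "M" (middle-of-word) tag of the usual BMES scheme is omitted to keep the example small.

```python
def viterbi(obs, trans, tags):
    """Find the highest-scoring tag path.
    obs: list of per-position dicts, tag -> observation score
    trans: dict (prev_tag, tag) -> transition score
    """
    # dp[i][t] = (best score of a path ending in tag t at position i, best previous tag)
    dp = [{t: (obs[0][t], None) for t in tags}]
    for i in range(1, len(obs)):
        row = {}
        for t in tags:
            prev = max(tags, key=lambda p: dp[i - 1][p][0] + trans[(p, t)])
            row[t] = (dp[i - 1][prev][0] + trans[(prev, t)] + obs[i][t], prev)
        dp.append(row)
    # Backtrack from the best final tag.
    last = max(tags, key=lambda t: dp[-1][t][0])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(dp[i][path[-1]][1])
    return path[::-1]

TAGS = ["B", "E", "S"]  # begin-of-word / end-of-word / single-character word
BAD = -1e9              # forbid linguistically invalid transitions, e.g. B -> B
trans = {("B", "E"): 0.0, ("E", "B"): 0.0, ("E", "S"): 0.0,
         ("S", "B"): 0.0, ("S", "S"): 0.0,
         ("B", "B"): BAD, ("B", "S"): BAD, ("E", "E"): BAD, ("S", "E"): BAD}
obs = [{"B": 1.0, "E": -1.0, "S": 0.2},   # first character looks like a word start
       {"B": -1.0, "E": 1.0, "S": 0.1}]   # second character looks like a word end
print(viterbi(obs, trans, TAGS))  # -> ['B', 'E']
```

In a MapReduce setting, decoding parallelizes naturally across sentences: each map task runs this decoder independently on its own subset of the input text.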
Keywords/Search Tags:Chinese Word Segmentation, Conditional Random Fields, Big Data, Hadoop