The Research On Chinese Word Segmentation Based On Conditional Random Fields In Big Data Environment

Posted on: 2018-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: C Peng
Full Text: PDF
GTID: 2348330542461690
Subject: Software engineering
Abstract/Summary:
The rapid development of Internet technology has ushered in the era of big data, and how to extract valuable information from massive data has become a focus of both industry and academia. As the foundation of Chinese natural language processing, Chinese word segmentation is of great importance. Segmentation methods based on statistical learning have been widely adopted in practice because of their excellent accuracy; however, the time consumed by model training needs to be reduced. Fortunately, the emergence of a variety of distributed parallel processing frameworks provides strong support for overcoming this shortcoming. To reduce model training time, we combine statistical Chinese word segmentation with a distributed parallel processing platform. Hadoop is a mainstream distributed computing platform based on the MapReduce framework: to improve efficiency, it divides a job into sub-tasks and then integrates the results of each sub-task.

Accordingly, this paper carries out the following work. First, three mainstream Chinese word segmentation methods are analyzed and compared, with the statistical learning approach as the focus. We choose the conditional random fields (CRF) model for the segmentation task, as it overcomes the disadvantages of the hidden Markov model and the maximum entropy model. Training a CRF amounts to estimating the model parameters, each of which is the weight of a feature function. The L-BFGS algorithm is used for training, and numerous experiments show that most of the training time is spent on gradient computation; we reduce this cost with a parallel algorithm. The other step of CRF-based segmentation is model prediction, for which the Viterbi algorithm is an effective method; this algorithm is likewise parallelized with the MapReduce framework in this paper. Finally, the parallel L-BFGS and parallel Viterbi algorithms are tested on a Hadoop cluster and compared with the serial method along four dimensions: precision, recall rate, F1-score, and model training speed. The results show that the training speed of the CRF model is obviously improved while the other three metrics remain almost unchanged.
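The map/reduce split of the gradient computation described above can be sketched as follows. This is a minimal illustration, not the thesis's Hadoop implementation: `example_gradient` is a hypothetical stand-in for the per-example CRF gradient term (empirical minus expected feature counts), and a thread pool stands in for Hadoop map tasks. The key structure is the same: partition the training data, compute partial gradients in parallel (map), then sum them into the full gradient (reduce).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-example "gradient" over two features. In a real CRF this
# term is (empirical feature counts) - (expected counts under the model).
def example_gradient(x):
    return [1.0 * x, -0.5 * x]

def shard_gradient(shard):
    # Map step: each worker sums the gradients of its own data shard.
    total = [0.0, 0.0]
    for x in shard:
        g = example_gradient(x)
        total = [a + b for a, b in zip(total, g)]
    return total

def parallel_gradient(data, n_shards=4):
    # Partition the training data across shards (like Hadoop input splits).
    shards = [data[i::n_shards] for i in range(n_shards)]
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partials = list(pool.map(shard_gradient, shards))
    # Reduce step: sum the partial gradients into the full gradient.
    total = [0.0, 0.0]
    for p in partials:
        total = [a + b for a, b in zip(total, p)]
    return total
```

Because the gradient is a sum over independent training examples, the parallel result is identical to the serial one; only the wall-clock time changes.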
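As a minimal sketch of the prediction step, the following Viterbi decoder finds the highest-scoring tag path by dynamic programming. The tag set, observation scores, and transition scores are illustrative assumptions, not the thesis's trained model; the "M" (middle-of-word) tag of the usual BMES scheme is omitted to keep the example small.

```python
def viterbi(obs, trans, tags):
    """Find the highest-scoring tag path.
    obs: list of per-position dicts, tag -> observation score
    trans: dict (prev_tag, tag) -> transition score
    """
    # dp[i][t] = (best score of a path ending in tag t at position i, best previous tag)
    dp = [{t: (obs[0][t], None) for t in tags}]
    for i in range(1, len(obs)):
        row = {}
        for t in tags:
            prev = max(tags, key=lambda p: dp[i - 1][p][0] + trans[(p, t)])
            row[t] = (dp[i - 1][prev][0] + trans[(prev, t)] + obs[i][t], prev)
        dp.append(row)
    # Backtrack from the best final tag.
    last = max(tags, key=lambda t: dp[-1][t][0])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(dp[i][path[-1]][1])
    return path[::-1]

TAGS = ["B", "E", "S"]  # begin-of-word / end-of-word / single-character word
BAD = -1e9              # forbid linguistically invalid transitions, e.g. B -> B
trans = {("B", "E"): 0.0, ("E", "B"): 0.0, ("E", "S"): 0.0,
         ("S", "B"): 0.0, ("S", "S"): 0.0,
         ("B", "B"): BAD, ("B", "S"): BAD, ("E", "E"): BAD, ("S", "E"): BAD}
obs = [{"B": 1.0, "E": -1.0, "S": 0.2},   # first character looks like a word start
       {"B": -1.0, "E": 1.0, "S": 0.1}]   # second character looks like a word end
print(viterbi(obs, trans, TAGS))  # -> ['B', 'E']
```

In a MapReduce setting, decoding parallelizes naturally across sentences: each map task runs this decoder independently on its own subset of the input text.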
Keywords/Search Tags:Chinese Word Segmentation, Conditional Random Fields, Big Data, Hadoop