Research And Application Of Chinese Word Segmentation Based On Conditional Random Fields

Posted on: 2010-01-20
Degree: Master
Type: Thesis
Country: China
Candidate: J Yan
Full Text: PDF
GTID: 2178360275451380
Subject: Computer software and theory
Abstract/Summary:
Natural Language Processing (NLP) is not only an applied technology that answers broad social needs but also a field of considerable scientific significance. Because of the special characteristics of the Chinese language, most Chinese NLP tasks have to be built on Chinese word segmentation, so the accuracy of segmentation directly affects a whole series of follow-up research and applications. Because of the complexity of the language, Chinese word segmentation has always been the first problem to solve in Chinese NLP.

Conditional Random Fields (CRF), proposed in recent years, are both a conditional probability model for labeling and segmenting sequential data and an undirected graphical model that computes the conditional probability of the output nodes given the input nodes. A CRF does not need the strict independence assumptions of a generative model such as the Hidden Markov Model, and it overcomes the label bias problem exhibited by the Maximum Entropy Model and other discriminative models. The model can easily incorporate arbitrary features of the input sequence, as well as other information such as word-building rules.

This thesis first introduces the state of NLP research and the importance of word segmentation in NLP, then reviews common word segmentation methods with their strengths and weaknesses, and analyzes the difficult problems of Chinese word segmentation. It then describes the definition of the CRF model, its structure, parameter estimation, and corpus selection, and applies the model to Chinese word segmentation through Chinese character tagging. We conduct a large number of closed-test experiments with the CRF model on the Yangtze River Daily benchmark corpus, and analyze how the model parameters and the choice of character labels affect the results. Because a CRF can accept any feature, we add new features to the model, such as the word-formation power of a character. The experimental results on the corpus show that introducing the word-position probability feature significantly improves accuracy, recall, and F1 score.

Chinese word segmentation has a wide range of applications; this thesis mainly introduces its application to automatic proofreading of Chinese text, an area that opens up broad possibilities for natural language processing. Based on the distribution of single characters that remain after segmenting Chinese text containing errors, combined with a character trigram model, we present an effective automatic text proofreading algorithm. Experiments show that our method achieves better precision and recall.
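As an illustration of the character-tagging formulation mentioned in the abstract, the following minimal Python sketch converts a segmented sentence into per-character labels, recovers words from labels, and scores a predicted segmentation by word-level precision, recall, and F1. The four-label B/M/E/S scheme (begin, middle, end, single) and all function names here are assumptions chosen for illustration; the abstract does not state which label set or scoring code the thesis uses, and no CRF training is shown.

```python
from typing import List, Set, Tuple


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one B/M/E/S tag per character (assumed scheme)."""
    tags: List[str] = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their tags; a word closes on E or S."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Represent each word as a (start, end) character offset pair."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level scores: a predicted word is correct only if both boundaries match."""
    g, p = word_spans(gold), word_spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["自然", "语言", "处理", "很", "重要"]
    pred = ["自然语言", "处理", "很", "重要"]
    print(words_to_tags(gold))
    print(precision_recall_f1(gold, pred))
```

In a tagging-based segmenter, a sequence model such as a CRF would predict the tag sequence for the raw characters, and a function like `tags_to_words` would turn those predictions back into words for scoring.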
Keywords/Search Tags:Natural Language Processing, Chinese Word Segmentation, Hidden Markov Model, Maximum Entropy Model, Conditional Random Fields, Automatic Proofreading