The Research On CRFs-based Chinese Automatic Segmentation

Posted on: 2010-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Luo
GTID: 2178360275458234
Subject: Computer application technology
Abstract/Summary:
Chinese Word Segmentation is a basic task of Natural Language Processing. It is the basis of machine translation, question answering and automatic summarization, and it is also one of the key technologies in Chinese search engines. However, because of the complexity of the Chinese language, it remains a challenge. Building on existing theories, this thesis focuses on Chinese Word Segmentation based on Conditional Random Fields (CRFs). Its goal is to improve the performance of segmentation systems so that they can better serve subsequent syntactic analysis and other natural language processing tasks.

The main work of this thesis can be summarized as follows:

(1) This thesis gives a brief description of the theories behind several related models. Furthermore, it discusses the definition of CRFs as motivated by the principle of Maximum Entropy. CRFs are among the best conditional probabilistic models for labeling and segmenting sequential data. As an undirected graphical model, a CRF not only avoids the label bias problem but can also incorporate arbitrary features of the input sequence to obtain optimal labeling results.

(2) Analyzing the results obtained by CRFs alone, with characters as the labeling units, we find that the errors occur mainly in labels with low marginal probabilities. Two methods, Forward Maximum Matching (FMM) and a class-based hidden Markov model (HMM), are each introduced to correct these errors. Experimental results show that the methods that use the CRF marginal probabilities outperform CRFs alone.

(3) The character-based segmentation method and the novel word-based method are compared and analyzed. Based on the characteristics of the Chinese language, role features are introduced, which help improve the performance of the novel word-based method on out-of-vocabulary (OOV) word recognition.

The contributions of this study can be summarized as follows:

(1) Marginal probability information helps improve segmentation performance, a finding that may carry over to other labeling tasks in natural language processing.

(2) Previous methods have been combined and improved, and the experimental results show that CRF-based methods are effective for Chinese Word Segmentation.
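The idea in point (2) of the main work, correcting low-confidence CRF labels with Forward Maximum Matching, can be illustrated with a small sketch. This is not the thesis's actual procedure, which the abstract does not specify. The B/M/E/S tag set, the 0.8 threshold, the toy lexicon, and the function names (`words_to_tags`, `tags_to_words`, `fmm`, `correct_low_confidence`) are all assumptions made for illustration. The CRF tags and marginal probabilities in the demo are hand-written stand-ins, not the output of a trained model.

```python
from typing import List, Set


def words_to_tags(words: List[str]) -> List[str]:
    """Map segmented words to per-character B/M/E/S tags (assumed tag set)."""
    tags: List[str] = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from per-character tags; a word ends at E or S."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


def fmm(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Forward Maximum Matching: at each position take the longest lexicon word,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def correct_low_confidence(chars: str, tags: List[str], marginals: List[float],
                           lexicon: Set[str], threshold: float = 0.8) -> List[str]:
    """Keep CRF words whose labels are all confident; re-segment with FMM each
    run of adjacent words that contains a label below the threshold."""
    result, pending, pos = [], "", 0
    for w in tags_to_words(chars, tags):
        low = any(p < threshold for p in marginals[pos:pos + len(w)])
        pos += len(w)
        if low:
            pending += w  # collect uncertain words into one span
        else:
            if pending:
                result.extend(fmm(pending, lexicon))
                pending = ""
            result.append(w)
    if pending:
        result.extend(fmm(pending, lexicon))
    return result


if __name__ == "__main__":
    sentence = "我爱北京天安门"
    # Hypothetical CRF output: "天安" and "门" are split wrongly, with low marginals.
    crf_tags = ["S", "S", "B", "E", "B", "E", "S"]
    crf_marginals = [0.99, 0.98, 0.97, 0.96, 0.90, 0.55, 0.60]
    toy_lexicon = {"北京", "天安门"}
    print(correct_low_confidence(sentence, crf_tags, crf_marginals, toy_lexicon))
```

The sketch leaves confident CRF decisions untouched and falls back to the lexicon only where the model is unsure. That is one plausible reading of why marginal probabilities are useful as a correction signal. A class-based HMM could replace `fmm` in the same role.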
Keywords/Search Tags: Maximum Entropy, Conditional Random Fields, Class-based Hidden Markov Model, Marginal Probabilities, Labeling Units