The Research On CRFs-based Chinese Automatic Segmentation

Posted on:2010-04-01

Degree:Master

Type:Thesis

Country:China

Candidate:Y Y Luo

GTID:2178360275458234

Subject:Computer application technology

Abstract/Summary:

Chinese word segmentation is a basic task in natural language processing (NLP). It underlies machine translation, question answering, and automatic summarization, and it is also one of the key technologies in Chinese search engines. However, because of the complexity of the Chinese language, it remains a challenge. Building on existing theory, this thesis studies Chinese word segmentation based on Conditional Random Fields (CRFs). The goal is to improve the performance of segmentation systems so that they better serve subsequent syntactic analysis and other natural language processing tasks.

The main work of this thesis can be summarized as follows:

(1) The thesis briefly describes the theory behind several related models. It then discusses the definition of CRFs as motivated by the principle of maximum entropy. CRFs are among the best conditional probabilistic models for labeling and segmenting sequential data. As undirected graphical models, they not only avoid the label bias problem but can also incorporate arbitrary features of the input sequence to obtain optimal labeling results.

(2) An analysis of the results obtained by CRFs alone, using characters as labeling units, shows that errors occur mainly at labels with low marginal probabilities. Two methods, Forward Maximum Matching (FMM) and a class-based hidden Markov model (HMM), are each introduced to correct these errors. Experimental results show that the method based on CRF marginal probabilities outperforms the CRF-only method.

(3) The character-based segmentation method is compared and analyzed against the novel word-based method. Based on the characteristics of Chinese, role features are introduced, which help improve the out-of-vocabulary (OOV) recognition performance of the word-based method.

The contributions of this study can be summarized as follows:

(1) Marginal probability information helps improve segmentation performance, and the approach can be applied to other labeling tasks in natural language processing.

(2) Previous methods are combined and improved, and the experimental results show that CRF-based methods are effective for Chinese word segmentation.
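
The correction idea in item (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not state the tag set, the threshold, or the exact correction rule, so the B/M/E/S tags, the 0.8 threshold, the toy lexicon, the hand-written marginal probabilities, and the function names (`tags_to_words`, `forward_maximum_matching`, `correct_with_fmm`) are all assumptions made for illustration. The sketch converts character labels into words, then re-segments only the low-confidence stretches with Forward Maximum Matching.

```python
from typing import List, Sequence, Set


def tags_to_words(chars: str, tags: Sequence[str]) -> List[str]:
    """Turn per-character B/M/E/S tags (assumed tag set) into words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E (end) or S (single)
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that stops on B or M
        words.append(current)
    return words


def forward_maximum_matching(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: take the longest lexicon match."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def correct_with_fmm(chars: str, tags: Sequence[str], marginals: Sequence[float],
                     lexicon: Set[str], threshold: float = 0.8) -> List[str]:
    """Keep confident CRF words; re-segment low-marginal stretches with FMM.

    The rule (a word is uncertain if any of its labels has a marginal
    probability below `threshold`) is an assumption of this sketch.
    """
    result, pending, pos = [], "", 0
    for word in tags_to_words(chars, tags):
        confident = min(marginals[pos:pos + len(word)]) >= threshold
        pos += len(word)
        if confident:
            if pending:  # flush the uncertain stretch collected so far
                result.extend(forward_maximum_matching(pending, lexicon))
                pending = ""
            result.append(word)
        else:
            pending += word
    if pending:
        result.extend(forward_maximum_matching(pending, lexicon))
    return result


# Toy data: tags and marginals are invented, not the output of a trained CRF.
sentence = "研究生命起源"
crf_tags = ["B", "M", "E", "S", "B", "E"]
crf_marginals = [0.95, 0.55, 0.52, 0.60, 0.97, 0.96]
toy_lexicon = {"研究", "生命", "起源"}

print(tags_to_words(sentence, crf_tags))
print(correct_with_fmm(sentence, crf_tags, crf_marginals, toy_lexicon))
```

In this toy case the raw labels yield 研究生 / 命 / 起源, and the low-marginal stretch 研究生命 is re-segmented by FMM into 研究 / 生命. The result depends entirely on the lexicon: FMM is greedy, so a lexicon that also contained 研究生 would reproduce the original error. That sensitivity is one plausible reason to compare FMM with a statistical corrector such as the class-based HMM mentioned above.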

Keywords/Search Tags:

Maximum Entropy, Conditional Random Fields, Class-based Hidden Markov Model, Marginal Probabilities, Labeling Units

Related items

1	Study Of Automatic Segmentation Technique Based On Conditional Random Fields
2	Research And Application Of Chinese Word Segmentation Based On Conditional Random Fields
3	Information Diffusion Models In Micro-blogging Networks Based On Hidden Markov Theory And Conditional Random Fields
4	The Research Of Applying Conditional Random Fields To Chinese Lexical Analysis And Chunk Parsing
5	Research Of Named Entity Recognition Based On Conditional Random Fields
6	Conditional Random Fields Based English Name Entity Recognition
7	Research On Image Segmentation Method Based On MAP-MRF
8	The Application And Research Of Condition Random Fields And Maximum Entropy In Tag Mining
9	Named Entity Recognition Based On Conditional Random Fields
10	Research On Online Detection Method Of Reputation Fraud Campaign Based On Conditional Random Fields