Research Of Chinese Word Segmentation With Conditional Random Fields

Posted on: 2009-12-24 | Degree: Master | Type: Thesis
Country: China | Candidate: Q Z Shen
GTID: 2178360245463706 | Subject: Computer application technology
Abstract/Summary:
During the last decade, Natural Language Processing (NLP) has become an active research field. Because of the special characteristics of the Chinese language, Chinese word segmentation plays a critical role in many Chinese NLP applications and has become a bottleneck in Chinese information processing.

Conditional Random Fields (CRFs) are both a conditional probabilistic model for labeling and segmenting sequential data and an undirected graphical model that computes the conditional probability of the output nodes given the input nodes. They relax the strong independence assumptions of generative models (e.g. the Hidden Markov Model) and overcome the label-bias problem exhibited by the Maximum Entropy Markov Model and other discriminative models. CRFs can easily incorporate arbitrary features of the input sequence as well as other information, such as word-formation rules.

This thesis proposes a CRF-based Chinese word segmentation system, focusing on the importance of parameter selection and on different tagging strategies. Within the CRF framework, we also explore new features, such as the word formation power of a character. Evaluation on the SIGHAN PKU benchmark corpus shows that the new features significantly improve the F1 score, by 3.5%, and that our system achieves an F1 of 94.5%. This suggests that CRFs work well and hold great potential for Chinese word segmentation. In addition, we explore the effect of integrating different models, including CRFs, HMM and MEMM. Evaluation on the same corpus shows that these models are quite complementary: the integrated system achieves an F1 of 95.6%, substantially outperforming the state-of-the-art systems.
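
For readers unfamiliar with how segmentation is cast as sequence labeling, the sketch below may help. It is not taken from the thesis: the abstract does not say which tagging strategies were compared, so the common four-tag B/M/E/S scheme (begin, middle, end, single) is assumed here purely for illustration, and all function names are hypothetical. It also shows a word-level F1 computed over word spans, which is the usual form of the F1 metric quoted for segmentation benchmarks.

```python
from typing import List, Set, Tuple


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one B/M/E/S tag per character (assumed scheme)."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, everything in between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Represent each word as a (start, end) character span."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold: List[str], pred: List[str]) -> float:
    """Word-level F1: a predicted word counts only if both boundaries match."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = ["我", "喜欢", "自然语言"]
pred = ["我", "喜欢", "自然", "语言"]
gold_tags = words_to_tags(gold)
f1 = segmentation_f1(gold, pred)
```

In such a setup a CRF would be trained to predict the tag sequence from character-level features, and `tags_to_words` would turn its output back into words for scoring.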
Keywords/Search Tags:Natural Language Processing, Chinese Word Segmentation, Conditional Random Fields, Word Formation Power, Model Integration