Research On Web Text Segmentation Based On Conditional Random Fields

Posted on: 2014-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Cui
GTID: 2248330398450623
Subject: Computer application technology
Abstract/Summary:
Natural language processing has become an active research topic in the field of information processing. Most Chinese processing tasks build on word segmentation. After years of in-depth study, Chinese automatic word segmentation achieves good results on traditional text. The segmentation of web text, however, is still unsatisfactory, mainly because web text differs from traditional text, which poses a new problem for segmentation.

Conditional Random Fields (CRFs) do not require strict independence assumptions and overcome the label-bias problem, so they are widely applied to Chinese automatic word segmentation and have yielded good results. This thesis, building on character-based and word-based CRF models, studies methods suitable for web text segmentation.

The major work of this research includes four parts:

(1) This research introduces the Hidden Markov model, the Maximum Entropy Markov model and the Conditional Random Fields model, and elaborates the advantages of the CRF model in Chinese annotation.

(2) This study analyzes the characteristics of web text, improves the label set based on character-based and word-based CRF models, and then adopts feature templates suited to web text to improve the system's performance on web text segmentation.

(3) Because web text differs from traditional text, it contains many out-of-vocabulary (OOV) words. This research proposes a method combining average mutual information and C-value to filter out the unknown words that cannot be recognized by the CRF model.

(4) This research summarizes segmentation errors by analyzing the segmentation results, and then proposes rule-based corrections that take the features of the web text corpus into account. In this way, the overall segmentation is improved.

Experiments on a web text test corpus show that the improved CRF-based segmentation method increases precision, recall and F-score, which verifies the feasibility of the approach.
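
To make the character-based formulation mentioned in the abstract concrete, the following is a minimal sketch of how segmentation can be cast as per-character labeling. It assumes the common four-tag set B/M/E/S (begin, middle, end, single); the abstract says the thesis uses an improved label set but does not specify it, so this tag set and the helper names `words_to_tags` and `tags_to_words` are illustrative assumptions, not the author's method. The CRF model itself, which would predict the tags, is not shown.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Turn a segmented sentence into one B/M/E/S tag per character.

    Assumption: the common four-tag scheme, not the thesis's own label set.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first char is B, last is E, everything in between is M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    gold = ["我", "喜欢", "自然语言"]
    tags = words_to_tags(gold)
    print(tags)
    print(tags_to_words("".join(gold), tags))
```

In a setup like this, training data is produced with `words_to_tags`, a sequence labeler predicts tags for unseen text, and `tags_to_words` converts the predictions back into a segmentation.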
Keywords/Search Tags: Web Text, Automatic Segmentation, Conditional Random Fields, A Lattice of Words, Out-of-vocabulary