
The Research Of Chinese Word Segmentation Based On CRF

Posted on: 2007-11-08
Degree: Master
Type: Thesis
Country: China
Candidate: F Jiang
Full Text: PDF
GTID: 2178360212957584
Subject: Computer software and theory
Abstract/Summary:
Word segmentation is a fundamental problem in Chinese natural language processing (NLP). It is a prerequisite for many NLP tasks, and its accuracy directly affects their results. Because of the complexity of the Chinese language, word segmentation has long been a difficult problem in NLP.

Conditional Random Fields (CRFs) are undirected graphical models that bring together the best of generative models and Maximum Entropy Markov Models (MEMMs). Like MEMMs, CRFs can accommodate many statistically correlated features of the input, and they are trained discriminatively. Like generative models, they can trade off decisions at different sequence positions to obtain a globally optimal labeling.

There are two kinds of statistical word segmentation: one is based on character labeling and the other on omni-segmentation. CRFs are mainly used in the first approach, which can use only limited domain information. This thesis presents the theory and technology of CRF-based Chinese word segmentation and implements a CRF-based omni-segmentation Chinese word segmentation system. Since there are no clear boundaries between Chinese words, a CRF cannot be applied directly to word segmentation. To solve this problem, the forward matrix and the backward matrix of the words are used to establish the CRF model. In contrast to character labeling, the CRF-based omni-segmentation approach can easily use lexicon and domain knowledge. The model uses Chinese words and their tags as features, and some strategies are applied to improve the performance of the training system.

The CRF model is trained on one month of the PKU (People's Daily) corpus, and the precision is 0.967. The result shows that CRF-based omni-segmentation is an effective method for Chinese word segmentation.
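As a rough illustration of the omni-segmentation idea described in the abstract, the sketch below enumerates every candidate word in a sentence as an edge in a word lattice and then picks the highest-scoring path with a Viterbi-style dynamic program. It is not the thesis's implementation: the toy lexicon, the hand-set log-scores, `OOV_SCORE`, `MAX_WORD_LEN`, and the function names are all assumptions made for this example. In the thesis the scores would come from a CRF trained over words and their tags, which is not reproduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical log-scores for a toy lexicon; a trained model would supply these.
LEXICON: Dict[str, float] = {
    "研究": -2.0,
    "研究生": -3.0,
    "生命": -2.5,
    "命": -5.0,
    "起源": -2.5,
}
OOV_SCORE = -10.0  # assumed penalty for a single character not in the lexicon
MAX_WORD_LEN = 4   # assumed upper bound on candidate word length

Edge = Tuple[int, int, float]  # (start, end, score) for sentence[start:end]


def build_lattice(sentence: str, lexicon: Dict[str, float]) -> List[Edge]:
    """Enumerate all candidate words (omni-segmentation) as lattice edges."""
    edges: List[Edge] = []
    for start in range(len(sentence)):
        longest = min(len(sentence), start + MAX_WORD_LEN)
        for end in range(start + 1, longest + 1):
            word = sentence[start:end]
            if word in lexicon:
                edges.append((start, end, lexicon[word]))
            elif end == start + 1:
                # Single-character fallback keeps the lattice connected.
                edges.append((start, end, OOV_SCORE))
    return edges


def best_segmentation(sentence: str, lexicon: Dict[str, float]) -> List[str]:
    """Return the highest-scoring path through the word lattice."""
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[i]: best score of a path ending at i
    back = [0] * (n + 1)              # back[i]: start of the last word on that path
    best[0] = 0.0
    # Edges are generated in order of start position, so best[start] is final
    # by the time any edge leaving `start` is processed.
    for start, end, score in build_lattice(sentence, lexicon):
        if best[start] + score > best[end]:
            best[end] = best[start] + score
            back[end] = start
    words: List[str] = []
    pos = n
    while pos > 0:
        words.append(sentence[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


print(best_segmentation("研究生命起源", LEXICON))
```

With these assumed scores the lattice contains both "研究" and "研究生" as candidates starting at the first character, and the best path is 研究 / 生命 / 起源. Because candidates are whole words drawn from a lexicon, adding domain vocabulary only requires adding entries, which is the advantage over character labeling that the abstract points to.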
Keywords/Search Tags:Word Segmentation, Conditional Random Fields, Omni-segmentation