The Research On Character-word Based Joint Decoding For Chinese Word Segmentation

Posted on: 2012-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: D Q Tong
GTID: 2218330368488091
Subject: Computer application technology
Abstract/Summary:
Chinese Word Segmentation (CWS), an active research area for many years, plays a prominent role in Chinese language processing and directly influences the accuracy of subsequent Natural Language Processing tasks. Although CWS has achieved great success in recent years, the complexity of Chinese leaves a series of problems unsolved. Many CWS systems show impressive results, but only on test corpora from specific domains. For this reason, domain-adaptive word segmentation was introduced as a task in SIGHAN Bakeoff 2010.

In this paper, building on existing methods, we propose a new joint decoding strategy that combines a character-based and a word-based conditional random field (CRF) model in a unified framework. In the character-based CRF method, the globally optimal path is taken as the final output. Inspection of the results shows, however, that the globally optimal path is often not locally optimal. We therefore place all local results into a unified framework and choose the best combination among all possible paths. By using a word lattice to integrate word-level information, the two methods are effectively combined.

In line with the characteristics of cross-domain segmentation, context information is used to guide CWS. So that unknown candidate words appear in the lattice on an equal footing with known words, similar contexts among synonyms are used to recall some out-of-vocabulary (OOV) words.

The proposed method is evaluated on the simplified Chinese test data from SIGHAN Bakeoff 2010. Except for the literature domain, the F-scores are higher than the best performance in the corresponding open test, with OOV recall rates of 70.7%, 84.3%, 79.0% and 86.2%, respectively. The experiments also show that joint decoding performs better than either single method, and that using the word-level information of candidate words further improves the recognition of unknown words.
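The idea of choosing the best combination among all paths in a word lattice can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `best_lattice_path`, the example sentence, and the edge scores are all hypothetical, and the scores simply stand in for whatever the character-based and word-based CRF models would assign to each candidate word. The sketch only shows the dynamic-programming search over lattice edges.

```python
from typing import Dict, List, Tuple


def best_lattice_path(sentence: str,
                      edge_scores: Dict[Tuple[int, int], float]) -> List[str]:
    """Return the highest-scoring segmentation of `sentence`.

    `edge_scores` maps a candidate word span (start, end) to a score
    (higher is better). The scores are hypothetical placeholders, not
    the output of any real CRF model.
    """
    n = len(sentence)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[i]: best score of any path ending at i
    back = [-1] * (n + 1)        # back[i]: start of the last word on that path
    best[0] = 0.0

    # Process end positions left to right so every prefix score is final
    # before it is extended by a longer path.
    for end in range(1, n + 1):
        for (start, stop), score in edge_scores.items():
            if stop != end or best[start] == neg_inf:
                continue
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start

    if best[n] == neg_inf:
        raise ValueError("the lattice has no path covering the whole sentence")

    # Follow the back-pointers from the end of the sentence to recover words.
    words: List[str] = []
    pos = n
    while pos > 0:
        words.append(sentence[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


# Hypothetical lattice for an ambiguous string; all scores are made up.
sentence = "研究生命起源"
edge_scores = {
    (0, 2): -1.0,  # 研究
    (0, 3): -1.2,  # 研究生
    (2, 3): -3.0,  # 生
    (2, 4): -1.0,  # 生命
    (3, 4): -3.0,  # 命
    (4, 6): -0.8,  # 起源
}
segmentation = best_lattice_path(sentence, edge_scores)
print(segmentation)
```

In this toy lattice the locally attractive word 研究生 loses to the path 研究 / 生命 / 起源 once whole-path scores are compared, which is the kind of trade-off a lattice search is meant to resolve.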
Keywords/Search Tags: Cross-Domain CWS, Conditional Random Fields, Joint Decoding, Context Variables, Semantic Resources