Chinese POS Tagging Employing MaxEnt and Word Clustering

Posted on: 2011-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Li
GTID: 2178360305955938
Subject: Computer application technology
Abstract/Summary:
Chinese part-of-speech (POS) tagging is a fundamental task in Chinese information processing and is essential for downstream tasks such as syntactic parsing, chunk analysis, and semantic analysis. This thesis builds a POS tagger based on the Maximum Entropy (MaxEnt) model and word clustering. MaxEnt allows diverse sources of information to be combined without assuming independence between the features, and it tends to yield a relatively strong baseline. First, we tag with a Maximum Entropy model alone to establish a baseline. Second, we automatically cluster all the words in the corpus into 1024 clusters. The cluster of each word is then added to the feature template, which alleviates the data sparseness problem to some extent. We try three clustering algorithms (Maximum Mutual Information, function-word based, and high-frequency-word based) and compare them. Clustering is a form of unsupervised learning, so it can exploit large amounts of unlabeled text and thus reduces the dependency on relatively expensive annotated corpora. In our experiments, the method achieves an accuracy of 93.50% with the 3M TCT training corpus released for CIPS-ParsEval-2009, which is better than the previous method based solely on the Maximum Entropy model. We expect the method to extend to other NLP tasks.
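The idea of adding word clusters to the feature template can be illustrated with a minimal sketch. The abstract does not give the actual template, so everything below is an assumption made for illustration: the function name `extract_features`, the one-word context window, the feature names (`w0`, `c0`, and so on), and the `UNK` and `<PAD>` placeholders are all hypothetical. The sketch only shows how cluster IDs could sit alongside word features, so that a rare word can share evidence with other words in the same cluster.

```python
from typing import Dict, List


def extract_features(words: List[str], i: int,
                     clusters: Dict[str, int], prev_tag: str) -> List[str]:
    """Build string features for the word at position i.

    Hypothetical template: the thesis's real template is not given
    in the abstract. `clusters` maps a word to its cluster ID
    (for example, one of 1024 clusters).
    """
    def word_at(j: int) -> str:
        # Positions outside the sentence get a padding symbol.
        return words[j] if 0 <= j < len(words) else "<PAD>"

    def cluster_at(j: int) -> str:
        if not 0 <= j < len(words):
            return "<PAD>"
        # Words missing from the cluster map fall back to "UNK".
        return str(clusters.get(words[j], "UNK"))

    return [
        # Word-based features, as a plain MaxEnt baseline might use.
        f"w0={word_at(i)}",
        f"w-1={word_at(i - 1)}",
        f"w+1={word_at(i + 1)}",
        f"t-1={prev_tag}",
        # Cluster-based features added on top of the baseline.
        f"c0={cluster_at(i)}",
        f"c-1={cluster_at(i - 1)}",
        f"c+1={cluster_at(i + 1)}",
    ]


if __name__ == "__main__":
    sentence = ["我", "喜欢", "苹果"]
    toy_clusters = {"我": 17, "苹果": 503}  # made-up cluster IDs
    for feature in extract_features(sentence, 1, toy_clusters, "r"):
        print(feature)
```

Features of this kind would then be passed to any MaxEnt (multinomial logistic regression) trainer. Because the cluster features are far less sparse than the word features, they can still fire for words that are rare or unseen in the annotated data, provided those words appeared in the unlabeled text used for clustering.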
Keywords/Search Tags:Part-of-speech tagging, Maximum Entropy, Word Clustering, Data Sparseness