
Research On Methods Of Chinese Word Classification And POS Tagging

Posted on: 2012-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Zhang
Full Text: PDF
GTID: 2218330338974022
Subject: Computer application technology
Abstract/Summary:
Word classification and part-of-speech (POS) tagging are important basic research subjects in Natural Language Processing, and they underpin later work such as shallow parsing, text classification, and machine translation. There are three main approaches to these subjects: rule-based, statistical, and a combination of the two. Statistical methods are further divided into supervised, unsupervised, and semi-supervised learning. This article explores word classification and POS tagging mainly from the statistical point of view. The main work is as follows:

1. The disambiguation of multi-category words is one of the difficulties of POS tagging for Chinese. To tackle this problem, this article integrates three classification models: Support Vector Machine, Maximum Entropy, and Conditional Random Fields. The models vote on each multi-category word, and the POS that receives the most votes is taken as the word's POS. 120 common multi-category words from the People's Daily corpus of January 1998 are tested. The average open-test accuracy reaches 89.69%, a relatively good result.

2. Word classification refers to the grammatical classification of words, namely the category a word shows in the process of phrase combination. Taking the Syntactic Function Information Database as a blueprint, this article uses the 14 properties listed in the Database as the feature space and the statistics of each word's syntactic functions as the feature values, and then normalizes the feature values. Using AP clustering, this article clusters the 3514 words in the Database into 62 classes. Identical or nearly identical words are largely grouped into the same class.

3. This article studies multi-category words using clustering. With the People's Daily corpus of January 1998 as the experimental corpus, we take 12 typical words from this corpus.
These multi-category words are hard to classify: their baselines are low and their discriminability is poor. We cluster these words with the AP clustering algorithm, k-means, and spectral clustering, using Euclidean distance, the Dice coefficient, and cosine similarity as similarity measures. By using context word frequencies as the feature space, extending the words to more classes, and normalizing the feature values, the approach achieves good results.
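The voting step described in item 1 can be sketched as a simple majority vote over the three models' predictions. This is only an illustration, not the thesis's implementation: the function name `vote_pos`, the example tags, and the tie-breaking rule (prefer the earliest-listed model) are assumptions, since the abstract does not say how ties are resolved.

```python
from collections import Counter


def vote_pos(predictions):
    """Return the POS tag with the most votes.

    `predictions` holds one tag per classifier, e.g. in the order
    SVM, Maximum Entropy, CRF. Tie-breaking is an assumption here:
    the tag of the earliest-listed classifier among the tied tags wins.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for tag in predictions:
        if counts[tag] == best:
            return tag


# Hypothetical predictions for one occurrence of a multi-category word.
print(vote_pos(["v", "n", "v"]))
```

With three voters, a three-way disagreement is the only possible tie, so the tie-breaking rule amounts to trusting one designated model in that case.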
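The three similarity measures named in item 3 can be sketched on normalized context-frequency vectors as follows. The abstract does not give the exact formulas used, so this is a hedged sketch: the function names are hypothetical, normalization is assumed to mean dividing by the vector's sum, and the Dice coefficient is assumed to be the common real-valued form 2·(x·y) / (|x|² + |y|²).

```python
import math


def normalize(freqs):
    """Scale raw context-word counts so they sum to 1 (assumed scheme)."""
    total = sum(freqs)
    return [f / total for f in freqs] if total else list(freqs)


def euclidean_distance(x, y):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dice_coefficient(x, y):
    """Real-valued Dice coefficient: 2 * (x . y) / (|x|^2 + |y|^2)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y)
    return 2 * dot / denom if denom else 0.0


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical context-word counts for two occurrences of the same word.
u = normalize([2, 0, 2])
v = normalize([1, 0, 1])
print(euclidean_distance(u, v), dice_coefficient(u, v), cosine_similarity(u, v))
```

Euclidean distance is a dissimilarity, while the Dice coefficient and cosine are similarities, so a clustering algorithm that expects similarities (as AP clustering does) would typically take the negated distance.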
Keywords/Search Tags:word classification, POS tagging, words clustering, Multi-category words, Disambiguation of Multi-category words