Research On Methods Of Chinese Word Classification And POS Tagging

Posted on:2012-03-19

Degree:Master

Type:Thesis

Country:China

Candidate:Y Z Zhang


GTID:2218330338974022

Subject:Computer application technology

Abstract/Summary:

Word classification and POS tagging are fundamental research topics in Natural Language Processing and the basis for further work such as shallow parsing, text classification, and machine translation. There are three main approaches to these topics: rule-based, statistical, and a combination of the two. Statistical methods are further divided into supervised, unsupervised, and semi-supervised learning. This thesis explores word classification and POS tagging mainly from a statistical perspective. The main work is as follows:

1. Disambiguating multi-category words is one of the difficulties in POS tagging of Chinese words. To tackle this problem, this thesis integrates three classification models: Support Vector Machine, Maximum Entropy, and Conditional Random Fields. The models vote on each multi-category word, and the POS that receives the most votes is taken as the word's POS. 120 common multi-category words from the People's Daily corpus of January 1998 are tested. The average accuracy in the open test reaches 89.69%, a relatively good result.

2. Word classification refers to classifying words by grammar, namely by the category a word exhibits when it combines into phrases. Taking the Syntactic Function Information Database as a blueprint, this thesis uses the 14 properties listed in the database as the feature space and the statistics of each word's syntactic functions as feature values, and normalizes the feature space. Using AP (affinity propagation) clustering, it clusters the 3514 words in the database into 62 classes. The same or nearly the same words are mostly grouped into one class.

3. This thesis studies multi-category words by clustering. With the People's Daily corpus of January 1998 as the experimental corpus, we take 12 typical words from it. These multi-category words are poorly classified: their baselines are low and their discriminability is poor. We cluster these words with the AP clustering algorithm, k-means, and spectral clustering, using Euclidean distance, the Dice coefficient, and cosine similarity as similarity measures. By using context word frequencies as the feature space, dividing the words into more classes, and normalizing the feature values, we achieve good results.
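
The three similarity measures named above can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the context window size, and the real-valued form of the Dice coefficient (2·a·b / (|a|² + |b|²)) are assumptions chosen for illustration, and the abstract does not state which variants were used.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=2):
    """Count words within `window` positions of each occurrence of `target`.

    The window size is an assumption; the abstract only says that
    context word frequency is used as the feature space.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def normalize(counts):
    """Scale raw counts so that the feature values sum to 1."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def euclidean(a, b):
    """Euclidean distance between two sparse vectors (smaller = more similar)."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def cosine(a, b):
    """Cosine of the angle between two sparse vectors (larger = more similar)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dice(a, b):
    """Dice coefficient generalized to real-valued vectors (assumed variant)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    denom = sum(v * v for v in a.values()) + sum(v * v for v in b.values())
    return 2.0 * dot / denom if denom else 0.0
```

In such a setup, each occurrence of a multi-category word (or each word) would be represented by a normalized context vector, and the pairwise similarities would be passed to a clustering algorithm such as AP clustering, k-means, or spectral clustering.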

Keywords/Search Tags:

word classification, POS tagging, word clustering, multi-category words, disambiguation of multi-category words

Related items

1	Research On Chinese Parts Of Speech Tagging And POS Guessing Over Unknown Words
2	Research Kazakh Part Of Category Words Tagging
3	Chinese Multi-category Product Words Segmentation And Recognition Based On Electronic Commerce
4	Automatic Recognition Research On Syntactic Category Of Common Words
5	Natural Language Processing, Words Related To Knowledge No Guide For Build And Balanced Classifier
6	Research On Content-based Scene And Object Category Recognition
7	The Retrieval System Based On Mongolian Tagging Corpus
8	Research On The Consistency Check And Auto-collation On POS Tagging Of Chinese Corpus
9	Research On Feature Words Extraction And Emotional Tendency Analysis Of Video Commentary
10	Research Of Sentence-Level Sentiment Analysis Based On Association Rule And Graph Ranking