
Tendentious Classification System For Chinese Text

Posted on: 2010-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Deng
Full Text: PDF
GTID: 2208330332478286
Subject: Computer software and theory
Abstract/Summary:
Text tendency classification is an important part of text classification. It analyzes the words, phrases, and sentences in a text or document, searches for indicators of subjectivity, performs sentiment analysis, and determines which category (approving or disapproving) the text falls into. Text tendency classification has considerable practical value in information filtering, product recommendation, information security, public opinion analysis, automatic abstracting, and information mining, and it is a research focus in the field of text classification.

Building on the mature techniques available for thematic text classification, this paper takes the words in a text as its research objects. It studies Chinese text tendency classification using a similarity calculation method based on statistical classification, together with decision tree and neural network methods based on machine learning. The main research achievements are as follows:

1) Word resource construction: starting from a dictionary of commendatory terms and a dictionary of derogatory terms, this paper extends the word resources by adding Chinese characters or words that carry an approving or disapproving sense within their industry but are not included in those dictionaries. The added characters or words are given double weight in the system. Characters that have no approving or disapproving sense and are missing from the dictionaries, as well as negative words that are also missing from the dictionaries, are added to the system as supplements. Meanwhile, some words, such as "exceed", "good", "strong" and "push", have acquired new meanings with the development of the Internet. Although these new definitions are not "official", they carry strong emotional meaning, so words of this kind are also added to the system.

2) Feature item selection: this paper proposes a feature selection method based on the meanings of the feature entries. For approving feature items, only "nouns" and "adjectives" are selected as feature entries. For disapproving feature items, the following eight parts of speech are selected as feature entries: "nouns", "adjectives", "verbs", "an", "idioms", "adverbs", "vn" and "ad". This approach better handles the loss of accuracy caused by the imbalance between the numbers of approving and disapproving feature entries in the experiments that use the Rocchio method.

3) Based on the results above, the paper designs and implements, in C, a feature item selection module, an item weighting module, and a vector space model construction module. In addition, it uses principal component analysis (PCA) and feature selection to reduce the model's dimensionality.

The experiments apply three classification methods (Rocchio, decision tree, and neural network) and two dimensionality reduction methods (feature selection and principal component analysis) to text tendency classification on seven Chinese text samples. The results show that the Rocchio, decision tree, and neural network methods reach average F1 scores of 68.7%, 75.8% and 74.9%, respectively, in the open test. Combining PCA dimensionality reduction with decision tree classification achieves better performance for Chinese text tendency classification, with an average F1 of 89.6%.
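The pipeline described in the abstract (part-of-speech-based feature selection, double weighting of added lexicon words, and Rocchio-style classification in a vector space model) can be illustrated with a minimal Python sketch. This is not the thesis's C implementation. The tag sets, the function names, the raw term-frequency weighting, the cosine similarity measure, and the toy English tokens are all assumptions made for illustration, and the simple class-centroid variant shown here is only one common form of the Rocchio classifier. Selecting features by the class of the training document is likewise one possible reading of the selection rule.

```python
import math
from collections import Counter

# Assumed tag sets mirroring the abstract: nouns and adjectives for approving
# items, eight parts of speech for disapproving items. The short tag codes
# ("n", "a", "v", "i", "d") are hypothetical stand-ins for the tagger's labels.
APPROVING_POS = {"n", "a"}
DISAPPROVING_POS = {"n", "a", "v", "an", "i", "d", "vn", "ad"}


def select_features(tagged_docs, labels):
    """Collect feature entries, filtering each document's words by the
    part-of-speech set allowed for that document's class."""
    vocab = set()
    for doc, label in zip(tagged_docs, labels):
        allowed = APPROVING_POS if label == "approving" else DISAPPROVING_POS
        vocab.update(word for word, pos in doc if pos in allowed)
    return sorted(vocab)


def vectorize(doc, vocab, boosted=frozenset()):
    """Map a tagged document to a term-frequency vector over vocab.
    Words in `boosted` (the added lexicon entries) get double weight."""
    counts = Counter(word for word, _ in doc)
    return [counts[w] * (2.0 if w in boosted else 1.0) for w in vocab]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def train(tagged_docs, labels, boosted=frozenset()):
    """Build the vocabulary and one centroid (prototype vector) per class."""
    vocab = select_features(tagged_docs, labels)
    by_class = {}
    for doc, label in zip(tagged_docs, labels):
        by_class.setdefault(label, []).append(vectorize(doc, vocab, boosted))
    centroids = {label: centroid(vecs) for label, vecs in by_class.items()}
    return vocab, centroids


def classify(doc, vocab, centroids, boosted=frozenset()):
    """Assign the class whose centroid is most similar to the document."""
    vec = vectorize(doc, vocab, boosted)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy data: each document is a list of (word, part-of-speech) pairs.
train_docs = [
    [("service", "n"), ("excellent", "a"), ("recommend", "v")],
    [("quality", "n"), ("good", "a")],
    [("service", "n"), ("terrible", "a"), ("complain", "v")],
    [("quality", "n"), ("poor", "a"), ("refund", "vn")],
]
train_labels = ["approving", "approving", "disapproving", "disapproving"]

vocab, centroids = train(train_docs, train_labels)
print(classify([("quality", "n"), ("excellent", "a")], vocab, centroids))
# prints "approving"
```

In this sketch the verb "recommend" is dropped from the approving document because verbs are not in the approving tag set, while "complain" is kept from the disapproving document. That shows how the asymmetric rule changes the number of entries each class contributes.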
Keywords/Search Tags: Text tendency classification, Vector space model, Feature reduction, Principal component analysis, Rocchio, Decision tree, Neural network