Tendentious Classification System For Chinese Text

Posted on:2010-07-06

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Deng

Full Text:PDF

GTID:2208330332478286

Subject:Computer software and theory

Abstract/Summary:

Text tendency classification is an important part of text classification. It analyzes the words, phrases, and sentences in a text or document, identifies the subjective factors they indicate, performs sentiment analysis, and determines which category (approving or disapproving) the text falls into. It has considerable practical value in information filtering, product recommendation, information security, public opinion analysis, automatic abstracting, and information mining, and it is a research focus in the field of text classification.

Building on the mature techniques available for thematic text classification, this paper takes the words in a text as its research objects. It studies Chinese text tendency classification using a similarity calculation method based on statistical classification, as well as decision tree and neural network methods based on machine learning. The main research achievements are as follows:

1) Word resource construction: starting from a dictionary of commendatory terms and a dictionary of derogatory terms, this paper extends the word resources by adding Chinese characters or words that carry approving or disapproving meaning within their industry but are not included in those dictionaries. These added characters or words are given double weight in the system. Characters without approving or disapproving meaning that are missing from the dictionaries, as well as negative words that are likewise missing, are added to the system as supplements. In addition, some words, such as "exceed", "good", "strong" and "push", have acquired new meanings with the development of the Internet. Although these new meanings are not "official", they carry strong emotional meaning, so such words are also added to the system.

2) Feature item selection: this paper proposes a feature selection method based on the meanings of the feature entries. For approving feature items, only "nouns" and "adjectives" are selected as feature entries; for disapproving feature items, the following eight parts of speech are selected: "nouns", "adjectives", "verbs", "an", "idioms", "adverbs", "vn" and "ad". This approach better addresses the lower accuracy that results from the imbalance in number between approving and disapproving feature entries in the experiments that adopt Rocchio.

3) Based on the above results, the paper designs and implements, in C, a feature item selection module, an item weight module, and a vector space model construction module. In addition, it uses principal component analysis and feature selection to reduce the model's dimensionality.

The experiments apply three classification methods (Rocchio, decision tree, and neural network) and two dimensionality reduction methods (feature selection and principal component analysis) to text tendency classification on seven Chinese text samples. The results show that the Rocchio, decision tree, and neural network methods reach average F1 scores of 68.7%, 75.8%, and 74.9% in the open test, respectively. Moreover, combining PCA dimensionality reduction with decision tree classification achieves better performance for Chinese text tendency classification, with an average F1 of 89.6%.
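
As an illustration only, the double weighting of added words and the Rocchio similarity step described in the abstract might be sketched as follows. This is not the thesis's C implementation: it is written in Python, it assumes the simplest Rocchio variant (each class prototype is the mean of that class's document vectors, compared by cosine similarity), and the word list, function names, and toy documents are all hypothetical. English tokens stand in for segmented Chinese words.

```python
import math
from collections import Counter

# Hypothetical industry-specific polarity words; the thesis's dictionaries
# are not reproduced here. Words in this set receive double weight.
DOMAIN_TERMS = {"durable", "flimsy"}

def vectorize(tokens):
    """Build a term-weight vector; assumed domain terms count twice."""
    vec = Counter()
    for tok in tokens:
        vec[tok] += 2.0 if tok in DOMAIN_TERMS else 1.0
    return vec

def centroid(vectors):
    """Mean of a class's document vectors (the class prototype)."""
    total = Counter()
    for v in vectors:
        total.update(v)  # adds weights term by term
    return {term: w / len(vectors) for term, w in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labeled_docs):
    """Compute one prototype per label from (tokens, label) pairs."""
    by_label = {}
    for tokens, label in labeled_docs:
        by_label.setdefault(label, []).append(vectorize(tokens))
    return {label: centroid(vs) for label, vs in by_label.items()}

def classify(centroids, tokens):
    """Assign the label whose prototype is most similar to the document."""
    vec = vectorize(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Toy training data, invented for this sketch.
docs = [
    (["good", "durable", "screen"], "approving"),
    (["strong", "good", "battery"], "approving"),
    (["flimsy", "bad", "screen"], "disapproving"),
    (["bad", "slow", "battery"], "disapproving"),
]
centroids = train(docs)
label = classify(centroids, ["durable", "good", "battery"])
```

In this sketch the double weight simply makes an added word contribute twice as much to both the document vector and the class prototype, so a single strong domain word can outweigh several neutral ones. The part-of-speech filtering and PCA steps from the abstract are omitted here.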

Keywords/Search Tags:

Text tendency classification, Vector space model, Feature reduction, Principal component analysis, Rocchio, Decision tree, Neural network

Related items

1	Research On Text Classification Based On Neural Network And Decision Tree And Its Application
2	Text Classification Research Based On Improved PCA-SOM Neural Network
3	Research On Feature Dimension Reduction In Text Classification
4	Application Of Various Classification Methods In Spam Message Recognition
5	The Research And Implementation Of Chinese Text Classification Technology Based On Decision Tree
6	Chinese Text Classification Based On Structural Covering Algorithm
7	Automatic Classification Research On Chinese Web Document Orientation
8	Research Of Text Orientation Classification Based On Neural Network
9	Based On The Chinese Text Of The Rough Set And Neural Network Classification
10	Dimension Reduction Method Research In Text Classification