
Research On Text Classification Methods Based On Content And Emotion

Posted on: 2014-03-29    Degree: Master    Type: Thesis
Country: China    Candidate: C Y Zhang    Full Text: PDF
GTID: 2268330422462834    Subject: Industrial Engineering
Abstract/Summary:
Text classification has a wide range of applications in natural language processing, information organization, and content filtering. The traditional K-Nearest Neighbor (KNN) method is simple, robust, largely parameter-free, and can reach high classification accuracy, but it must compute the distance between a new text and every training text, which requires a great deal of computing time. To address this problem, the training texts are clustered before KNN is applied. First, each class of the training set is clustered with the CHAMELEON algorithm, and the cluster centers are taken as a set of generalized instances. Next, the k1 nearest neighbors of the unknown document are found among the generalized instances. Finally, KNN is applied between the unknown document and the original training texts that generated those k1 generalized instances. Experiments on the Tan corpus and the Fudan corpus show that this method achieves the same precision and recall as traditional KNN at a much lower computational cost.

Consumer product reviews have become an important part of the e-commerce trust mechanism, yet most sites cannot classify reviews as positive or negative on a semantic basis. Using HowNet emotion words as a seed vocabulary, this thesis proposes a bootstrapping algorithm for mining emotion words based on Conditional Random Fields. The mined emotion words are then divided into positive and negative according to mutual information. Based on the number of positive and negative emotion words contained in a sentence, book reviews from an e-commerce site are classified as positive or negative. Of 2,026 book reviews tested, 82% were classified correctly, indicating the effectiveness of the algorithm.

Word segmentation and feature selection are the preliminary steps of text classification. Experiments on the Chinese corpus provided by Microsoft Research show that Conditional Random Fields outperform the Hidden Markov Model for word segmentation. Among four feature selection methods, Information Gain, Mutual Information, Expected Cross Entropy, and the chi-square statistic, comparative experiments show that Information Gain and the chi-square statistic perform well in text classification.
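A minimal sketch of the two-stage KNN search described above is given below. It assumes dense document vectors and Euclidean distance; the clustering step itself (per-class CHAMELEON in the thesis) is taken as given through a precomputed cluster assignment array, and the function name and interface are illustrative only.

```python
import numpy as np
from collections import Counter

def two_stage_knn(query, train_vecs, train_labels, cluster_ids, k1=3, k=5):
    """Two-stage KNN: search cluster centroids ("generalized instances")
    first, then run plain KNN only over the training texts belonging to
    the k1 nearest clusters.  Cluster assignments are assumed to come
    from a prior clustering step (CHAMELEON in the thesis)."""
    train_vecs = np.asarray(train_vecs)
    cluster_ids = np.asarray(cluster_ids)

    # Build one centroid per cluster: the generalized instance set.
    centroids, members = [], []
    for cid in np.unique(cluster_ids):
        idx = np.where(cluster_ids == cid)[0]
        centroids.append(train_vecs[idx].mean(axis=0))
        members.append(idx)
    centroids = np.stack(centroids)

    # Stage 1: find the k1 nearest generalized instances.
    d_cent = np.linalg.norm(centroids - query, axis=1)
    nearest_clusters = np.argsort(d_cent)[:k1]

    # Stage 2: ordinary KNN restricted to texts from those clusters.
    candidate_idx = np.concatenate([members[c] for c in nearest_clusters])
    d_cand = np.linalg.norm(train_vecs[candidate_idx] - query, axis=1)
    top_k = candidate_idx[np.argsort(d_cand)[:k]]
    return Counter(train_labels[i] for i in top_k).most_common(1)[0][0]
```

Because only centroids are compared in the first stage, the number of full distance computations drops from the size of the training set to roughly the number of clusters plus the members of the k1 selected clusters.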
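The polarity assignment and the count-based review labelling can be sketched as follows. The abstract only states that mutual information separates positive from negative emotion words and that the counts of such words decide the review label, so the PMI-difference scoring against seed sets, the co-occurrence data structures, and the tie-breaking rule below are assumptions made for illustration; the CRF-based bootstrapping miner itself is not reproduced here.

```python
import math

def polarity_score(word, pos_seeds, neg_seeds, cooc, count, n):
    """Score a mined emotion word by the difference between its mutual
    information with positive seeds and with negative seeds.
    cooc[(a, b)] and count[a] are co-occurrence and single-word counts
    over n sentences; this interface is an assumption for the sketch."""
    def pmi(a, b):
        joint = cooc.get((a, b), 0) or cooc.get((b, a), 0)
        if joint == 0 or count.get(a, 0) == 0 or count.get(b, 0) == 0:
            return 0.0
        return math.log2(joint * n / (count[a] * count[b]))
    return (sum(pmi(word, s) for s in pos_seeds)
            - sum(pmi(word, s) for s in neg_seeds))

def classify_review(tokens, positive_words, negative_words):
    """Label a segmented review by comparing counts of positive and
    negative emotion words it contains; ties default to positive here,
    which is an assumption (the abstract does not state a tie rule)."""
    pos = sum(w in positive_words for w in tokens)
    neg = sum(w in negative_words for w in tokens)
    return "positive" if pos >= neg else "negative"
```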
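For the feature selection comparison, the chi-square statistic for a term/class pair is typically computed from a 2x2 contingency table of document counts; a small sketch follows (the function name and variable names are illustrative, not taken from the thesis).

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term/class pair, from the counts:
        a = documents of the class that contain the term
        b = documents of other classes that contain the term
        c = documents of the class that do not contain the term
        d = documents of other classes that do not contain the term"""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0
```

Terms are then ranked by their maximum (or class-weighted average) chi-square score across classes, and the top-ranked terms are kept as features.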
Keywords/Search Tags:Text Categorization, Emotion Mining, Chinese Word Segmentation, Feature Selection, Hidden Markov Model, Conditional Random Fields