
Research On Text Classification Methods Based On Content And Emotion

Posted on: 2014-03-29    Degree: Master    Type: Thesis
Country: China    Candidate: C Y Zhang    Full Text: PDF
GTID: 2268330422462834    Subject: Industrial Engineering
Abstract/Summary:
Text classification has a wide range of applications in natural language processing, information organization, and content filtering. The traditional K-Nearest Neighbor (KNN) method is simple, robust, largely parameter-free, and can reach high classification accuracy, but it must compute the distance between a new text and every training text, which requires a great deal of computing time. To address this problem, the training texts are clustered before KNN is applied. First, each class of the training set is clustered with the CHAMELEON algorithm, and the cluster centers are taken as a set of generalized instances. Next, the k1 nearest neighbors of the unknown document are found among the generalized instances. Finally, KNN is applied between the unknown document and the original training texts that generated those k1 generalized instances. Experiments on the Tan corpus and the Fudan corpus show that this method achieves the same precision and recall as traditional KNN at a much lower computational cost.

Consumer product reviews have become an important part of the e-commerce trust mechanism, yet most sites cannot classify reviews as positive or negative on a semantic basis. Using HowNet emotion words as a seed vocabulary, this thesis proposes a bootstrapping algorithm for mining emotion words based on Conditional Random Fields. The mined emotion words are then divided into positive and negative according to mutual information. Based on the number of positive and negative emotion words contained in a sentence, book reviews from an e-commerce site are classified as positive or negative. Of 2,026 book reviews tested, 82% were classified correctly, indicating the effectiveness of the algorithm.

Word segmentation and feature selection are the preliminary steps of text classification. Experiments on the Chinese corpus provided by Microsoft Research show that Conditional Random Fields outperform the Hidden Markov Model for word segmentation. Among four feature selection methods, Information Gain, Mutual Information, Expected Cross Entropy, and the chi-square statistic, comparative experiments show that Information Gain and the chi-square statistic perform well in text classification.
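A minimal sketch of the two-stage KNN search described above is given below. It assumes dense document vectors and Euclidean distance; the clustering step itself (per-class CHAMELEON in the thesis) is taken as given through a precomputed cluster assignment array, and the function name and interface are illustrative only.

```python
import numpy as np
from collections import Counter

def two_stage_knn(query, train_vecs, train_labels, cluster_ids, k1=3, k=5):
    """Two-stage KNN: search cluster centroids ("generalized instances")
    first, then run plain KNN only over the training texts belonging to
    the k1 nearest clusters.  Cluster assignments are assumed to come
    from a prior clustering step (CHAMELEON in the thesis)."""
    train_vecs = np.asarray(train_vecs)
    cluster_ids = np.asarray(cluster_ids)

    # Build one centroid per cluster: the generalized instance set.
    centroids, members = [], []
    for cid in np.unique(cluster_ids):
        idx = np.where(cluster_ids == cid)[0]
        centroids.append(train_vecs[idx].mean(axis=0))
        members.append(idx)
    centroids = np.stack(centroids)

    # Stage 1: find the k1 nearest generalized instances.
    d_cent = np.linalg.norm(centroids - query, axis=1)
    nearest_clusters = np.argsort(d_cent)[:k1]

    # Stage 2: ordinary KNN restricted to texts from those clusters.
    candidate_idx = np.concatenate([members[c] for c in nearest_clusters])
    d_cand = np.linalg.norm(train_vecs[candidate_idx] - query, axis=1)
    top_k = candidate_idx[np.argsort(d_cand)[:k]]
    return Counter(train_labels[i] for i in top_k).most_common(1)[0][0]
```

Because only centroids are compared in the first stage, the number of full distance computations drops from the size of the training set to roughly the number of clusters plus the members of the k1 selected clusters.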
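The polarity assignment and the count-based review labelling can be sketched as follows. The abstract only states that mutual information separates positive from negative emotion words and that the counts of such words decide the review label, so the PMI-difference scoring against seed sets, the co-occurrence data structures, and the tie-breaking rule below are assumptions made for illustration; the CRF-based bootstrapping miner itself is not reproduced here.

```python
import math

def polarity_score(word, pos_seeds, neg_seeds, cooc, count, n):
    """Score a mined emotion word by the difference between its mutual
    information with positive seeds and with negative seeds.
    cooc[(a, b)] and count[a] are co-occurrence and single-word counts
    over n sentences; this interface is an assumption for the sketch."""
    def pmi(a, b):
        joint = cooc.get((a, b), 0) or cooc.get((b, a), 0)
        if joint == 0 or count.get(a, 0) == 0 or count.get(b, 0) == 0:
            return 0.0
        return math.log2(joint * n / (count[a] * count[b]))
    return (sum(pmi(word, s) for s in pos_seeds)
            - sum(pmi(word, s) for s in neg_seeds))

def classify_review(tokens, positive_words, negative_words):
    """Label a segmented review by comparing counts of positive and
    negative emotion words it contains; ties default to positive here,
    which is an assumption (the abstract does not state a tie rule)."""
    pos = sum(w in positive_words for w in tokens)
    neg = sum(w in negative_words for w in tokens)
    return "positive" if pos >= neg else "negative"
```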
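For the feature selection comparison, the chi-square statistic for a term/class pair is typically computed from a 2x2 contingency table of document counts; a small sketch follows (the function name and variable names are illustrative, not taken from the thesis).

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term/class pair, from the counts:
        a = documents of the class that contain the term
        b = documents of other classes that contain the term
        c = documents of the class that do not contain the term
        d = documents of other classes that do not contain the term"""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0
```

Terms are then ranked by their maximum (or class-weighted average) chi-square score across classes, and the top-ranked terms are kept as features.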
Keywords/Search Tags:Text Categorization, Emotion Mining, Chinese Word Segmentation, Feature Selection, Hidden Markov Model, Conditional Random Fields