
Sentiment Classification For Chinese Reviews Based On Machine Learning

Posted on: 2011-11-02
Degree: Master
Type: Thesis
Country: China
Candidate: G Bai
Full Text: PDF
GTID: 2178360305955056
Subject: Computer software and theory
Abstract/Summary:
The growth of the Internet, and especially the boom in Web 2.0 websites, has produced a mass of user-generated content (UGC) that carries the sentiment of its authors. That sentiment may be a liking or dislike for a product, agreement or disagreement about a major event, and so on. Classifying this data gives customers a reference when choosing products and lets businesses grasp the feedback on their products. From the standpoint of national security, it can reveal the attitudes of people at home and abroad toward particular events and can serve as a reference in policy making. Classifying data by sentiment therefore has wide application.

The goal of sentiment classification is to find the emotional tendency of users by analyzing UGC. The tendency can be good or bad, positive or negative, happy or sad, conservative or radical, and so on. Sentiment classification differs from topic classification: it does not depend on topic words but on understanding the emotional tendency of a document. According to the categories of the documents involved, sentiment classification can be divided into two kinds: single-category and multi-category. In single-category sentiment classification, the documents share the same subject and belong to one category. This kind can be used to find the emotional tendency of buyers toward a single kind of product. In multi-category sentiment classification, the documents carry no specific category information and belong to different categories. This kind can be used to analyze the emotional tendency of blog writers or SNS users. Because the two kinds work on different categories of data, they call for different methods.

At present, research in this area focuses mainly on single-category English reviews. Little research can be found on Chinese reviews, and even less on multi-category Chinese reviews.
This thesis works at the review level and focuses on two tasks: single-category sentiment classification and multi-category sentiment classification.

For single-category sentiment classification, we want to measure how splitting reviews into sentences at punctuation marks affects the result. The work therefore starts with classification that takes the whole review as the unit, but concentrates on classification that takes the sentence as the unit. With the review as the unit, we first analyze three machine-learning classifiers, Naive Bayes (NB), Maximum Entropy (ME) and Support Vector Machine (SVM), and then run single-category experiments with each of them. By trying many kinds of text features, we find the best classifier and feature set. With the sentence as the unit, there are two processes: training and prediction. In training, we split the reviews into sentences at punctuation marks and train a model on those sentences, using the best classifier and feature set found earlier. In prediction, we split each review into sentences, score each sentence with the model, and obtain the final review score by summing the scores of all sentences that belong to the review.

For multi-category sentiment classification, we propose three methods:

1) The first is Bridge Category Sentiment Classification, which tests whether a trained model holds for reviews from other categories, that is, whether sentiment classification works only within its own domain. The process is as follows: we train the model on reviews from a single category, using the classifier and feature set selected in the single-category experiments, and then apply it to reviews from the other categories.

2) The second is Neglect Category Sentiment Classification, which tests whether the text features of multi-category reviews are universal.
The process is as follows: we train the model on reviews from all categories, using the classifier and feature set selected in the single-category experiments, and then predict all reviews from the different categories.

3) The last method is Cross Category Sentiment Classification. Because single-category sentiment classification performs well, we want to test whether the multi-category problem can be transformed into the single-category problem. The process is as follows: first, we train a separate model for each category on that category's own reviews, using the classifier and feature set selected in the single-category experiments. To transform the multi-category problem into a single-category one, we use a Domain Classifier. When this Domain Classifier is trained with Unigram features and an SVM, its accuracy is 95%; when topic words are added to the feature set, its accuracy rises to 100%. That is, given a new review, the Domain Classifier can assign its category correctly. Once the category is known, the system classifies the review's sentiment with the model for that category.

The single-category experiments show that SVM with a feature set made up of Unigram, Polarity and Neg performs best, reaching an accuracy of 85%. Splitting reviews into sentences at punctuation marks raises review-level accuracy from 85% to 91%, so it is very effective in improving accuracy.
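The sentence-level prediction step described above (split a review at punctuation, score each sentence, sum the scores) can be sketched roughly as follows. This is only an illustration under stated assumptions: the delimiter set, the tiny word lists and the function names are all hypothetical, and `toy_sentence_score` merely stands in for the trained sentence model, which in the thesis is an SVM with Unigram, Polarity and Neg features.

```python
import re

# Assumed set of sentence-ending marks (Chinese and ASCII); the thesis
# does not list the punctuation it actually splits on.
_SENTENCE_END = re.compile(r"[。！？!?；;\n]+")


def split_sentences(review):
    """Split a review into non-empty sentences at punctuation marks."""
    return [s.strip() for s in _SENTENCE_END.split(review) if s.strip()]


# Hypothetical word lists, used only so the sketch runs without a trained model.
POSITIVE = ("好", "喜欢", "满意")
NEGATIVE = ("差", "失望", "糟糕")


def toy_sentence_score(sentence):
    """Stand-in for the trained sentence model: positive hits minus negative hits."""
    pos = sum(sentence.count(w) for w in POSITIVE)
    neg = sum(sentence.count(w) for w in NEGATIVE)
    return float(pos - neg)


def classify_review(review, score_sentence=toy_sentence_score):
    """Score every sentence and sum the scores to get the review score.

    A total of zero is labeled negative here; that tie-break is an
    arbitrary choice for the sketch, not something the thesis specifies.
    """
    total = sum(score_sentence(s) for s in split_sentences(review))
    label = "positive" if total > 0 else "negative"
    return label, total
```

Any scorer that maps a sentence to a signed number, for example the decision value of a trained classifier, could be passed as `score_sentence` in place of the toy one.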
The multi-category experiments show that a model trained earlier on one category does not apply to reviews from other categories, that is, sentiment classification is effective only within a specific domain, and that the text features of multi-category reviews are not universal. However, the multi-category problem can be transformed into the single-category problem well: Cross Category Sentiment Classification performs well, with an average accuracy of 82%, almost the same level as single-category sentiment classification.
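The Cross Category procedure (first decide the review's category with a Domain Classifier, then hand the review to that category's own sentiment model) can be sketched as below. Everything concrete in the sketch is an assumption made for illustration: the two categories, the topic words, the sentiment words and all function names are hypothetical, and the keyword counters stand in for the SVM models used in the thesis.

```python
# Hypothetical categories and topic words; the thesis does not list its own.
TOPIC_WORDS = {
    "hotel": ("房间", "酒店", "前台"),
    "phone": ("手机", "屏幕", "电池"),
}


def classify_domain(review):
    """Toy Domain Classifier: choose the category with the most topic-word hits."""
    return max(
        TOPIC_WORDS,
        key=lambda cat: sum(review.count(w) for w in TOPIC_WORDS[cat]),
    )


def make_keyword_model(positive, negative):
    """Build a toy per-category sentiment model (stand-in for a trained SVM)."""
    def predict(review):
        score = sum(review.count(w) for w in positive)
        score -= sum(review.count(w) for w in negative)
        return "positive" if score > 0 else "negative"
    return predict


# One sentiment model per category, each built only from that category's vocabulary.
MODELS = {
    "hotel": make_keyword_model(("干净", "舒适"), ("吵", "脏")),
    "phone": make_keyword_model(("清晰", "流畅"), ("卡", "发热")),
}


def cross_category_sentiment(review):
    """Route the review to the sentiment model of its predicted category."""
    domain = classify_domain(review)
    return domain, MODELS[domain](review)
```

The point of the design is that each sentiment model only ever sees reviews from its own category, so the multi-category task reduces to the single-category one as long as the routing step is accurate.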
Keywords/Search Tags:Sentiment Classification, Naive Bayes, Maximum Entropy, Support Vector Machine