
Fuse Multi-features To Identify Product Review Spam

Posted on: 2013-08-22
Degree: Master
Type: Thesis
Country: China
Candidate: M Wu
GTID: 2308330461476049
Subject: Computer application technology
Abstract/Summary:
In the past few years, sentiment analysis and opinion mining have become some of the most popular research tasks, and these studies all assume that the opinion sources are genuine and trustworthy. However, on many product websites, reviewers may write fake reviews that contain undeservedly positive or maliciously negative opinions, and this problem has attracted wide attention from researchers. Work on product review spam identification has achieved some results, both domestically and internationally, but it has overlooked two aspects of spam detection: the stability of the model and the imbalance of the corpus. Ignoring them lowers identification accuracy. This thesis therefore focuses on the identification of product review spam. The research is organized as follows:
(1) The logistic regression approach to product review spam identification recently proposed by N. Jindal et al. uses too many features, which may cause overfitting. To address this, we analyze the model, propose applying significance testing to its features, and then rebuild the regression model using only the significant features. Experiments on the Amazon dataset show that the new regression model built on the significant features performs better than the model built on all features. The new model not only addresses the overfitting problem but also achieves the same performance at lower computational cost, which shows that modeling on significant features helps improve detection quality.
(2) Because SVM and Naive Bayes perform poorly on product review spam identification when the data are imbalanced, we combine the significant features from (1) with a random forest model to identify review spam. Building the model with the balanced random forest algorithm or the weighted random forest algorithm effectively reduces the error caused by the imbalanced data set and greatly improves review spam recognition accuracy. The experimental results show that the proposed method is effective in review spam recognition compared with SVM and Naive Bayes, and that the random forest combined with significant features performs better.
(3) Combining the work in (1) and (2), we design a prototype system that fuses multiple features to identify product review spam, and we briefly describe each functional module of the system. The system effectively combines the significant features with the advantages of the random forest model, overcoming both the overfitting caused by using all features and the classification error caused by imbalanced data.
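The two ideas in (1) and (2) can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes the significance test is a two-sided Wald z-test on logistic regression coefficients, that the balanced random forest draws equal-sized bootstrap samples from each class for every tree, and that the weighted random forest uses class weights inversely proportional to class frequency. All function names, feature names, and numbers below are hypothetical.

```python
import math
import random


def wald_p_value(coef, std_err):
    """Two-sided p-value of a Wald z-test for H0: coefficient == 0."""
    z = abs(coef / std_err)
    # For a standard normal, 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


def significant_features(estimates, alpha=0.05):
    """Keep the features whose coefficient is significant at level alpha.

    `estimates` maps a feature name to (coefficient, standard error), as
    reported by whatever logistic regression fit is in use.
    """
    return [name for name, (coef, se) in estimates.items()
            if wald_p_value(coef, se) < alpha]


def balanced_bootstrap(labels, rng):
    """Row indices for one tree of a balanced random forest.

    Every class is sampled with replacement down to the size of the
    smallest class, so each tree sees a balanced training set.
    """
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_minority = min(len(rows) for rows in by_class.values())
    sample = []
    for rows in by_class.values():
        sample.extend(rng.choices(rows, k=n_minority))
    return sample


def class_weights(labels):
    """Weights inversely proportional to class frequency (one common
    heuristic for a weighted random forest, assumed here)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# Hypothetical coefficient estimates: name -> (coefficient, standard error).
estimates = {
    "review_length": (0.80, 0.20),
    "rating_deviation": (1.50, 0.30),
    "title_caps": (0.10, 0.25),
}
kept = significant_features(estimates)

# Hypothetical imbalanced labels: 5 spam reviews (1) and 95 genuine ones (0).
labels = [1] * 5 + [0] * 95
tree_rows = balanced_bootstrap(labels, random.Random(0))
weights = class_weights(labels)
```

Here `kept` holds the features that would be used to rebuild the regression model, `tree_rows` is the balanced training sample for a single tree, and `weights` gives the rare spam class a larger weight than the genuine class.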
Keywords/Search Tags: Logistic Regression (LR), product review spam, Significance Testing (ST), significant features, random forests (RF)