Fuse Multi-features To Identify Product Review Spam

Posted on:2013-08-22

Degree:Master

Type:Thesis

Country:China

Candidate:M Wu

Full Text:PDF

GTID:2308330461476049

Subject:Computer application technology

Abstract/Summary:


In the past few years, sentiment analysis and opinion mining have become some of the most popular research tasks, and these studies all assume that opinion sources are genuine and trustworthy. However, on many product websites, reviewers may write false reviews that contain undeservedly positive or maliciously negative opinions, and this has attracted wide attention from scholars. Work on product review spam identification has achieved some results, both at home and abroad, but it has overlooked two aspects of spam detection: the stability of the model and the imbalance of the corpus. Ignoring them lowers identification accuracy. This thesis therefore focuses on the identification of product review spam. The research is organized as follows:

(1) The logistic-regression-based review spam identification recently proposed by N. Jindal et al. uses a large number of features, which may cause overfitting. We analyze the model and propose applying significance testing to these features, then rebuilding the regression model with only the significant ones. Our experiments on the Amazon dataset show that the new regression model based on the significant features performs better than the model based on all features. The new model not only addresses the overfitting problem but also achieves the same performance at lower computational cost, which shows that modeling on the significant features helps improve detection quality.

(2) SVM and Naive Bayes perform poorly in product review spam identification because of imbalanced data. We therefore combine the significant features from (1) with random forests and propose a random-forest-based model to identify review spam. Building the model with the balanced random forest algorithm or the weighted random forest algorithm effectively reduces the error caused by the imbalanced dataset and greatly improves review spam recognition accuracy. The experimental results show that the proposed method is more effective in review spam recognition than SVM and Naive Bayes, and that random forests combined with the significant features perform better.

(3) Combining the work in (1) and (2), we design a prototype system, called Fuse Multi-features to Identify Product Review Spam, and briefly describe each of its functional modules. The system combines the significant features with the advantages of the random forest model, overcoming the overfitting caused by using all product features and the classification error caused by imbalanced data.
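The two ideas in (1) and (2) can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the thesis: the data is synthetic rather than the Amazon dataset, the significance test is a Wald test on the logistic regression coefficients (the abstract does not say which test is used), and the function names `fit_logistic`, `significant_features`, and `balanced_bootstrap` are hypothetical. Only the sampling step of a balanced random forest is shown, not the tree training.

```python
import math
import numpy as np


def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Fit logistic regression by Newton-Raphson.

    Returns the coefficients (intercept first) and their estimated covariance.
    """
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    eye = ridge * np.eye(Xb.shape[1])           # tiny ridge keeps the solve stable
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ beta)))
        hessian = Xb.T @ (Xb * (p * (1.0 - p))[:, None]) + eye
        beta = beta + np.linalg.solve(hessian, Xb.T @ (y - p))
    # Recompute the Hessian at the final coefficients; its inverse estimates the covariance.
    p = 1.0 / (1.0 + np.exp(-(Xb @ beta)))
    hessian = Xb.T @ (Xb * (p * (1.0 - p))[:, None]) + eye
    return beta, np.linalg.inv(hessian)


def significant_features(X, y, alpha=0.05):
    """Indices of columns whose Wald-test p-value is below alpha (assumed test)."""
    beta, cov = fit_logistic(X, y)
    z = beta / np.sqrt(np.diag(cov))
    # Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return [j for j in range(X.shape[1]) if pvals[j + 1] < alpha]  # skip the intercept


def balanced_bootstrap(y, rng):
    """Draw, with replacement, as many rows from each class as the minority class has."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    parts = [rng.choice(np.flatnonzero(y == c), size=n_min, replace=True) for c in classes]
    return np.concatenate(parts)


# Synthetic, imbalanced data: only column 0 carries signal; columns 1 and 2 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
true_p = 1.0 / (1.0 + np.exp(-(-2.0 + 2.0 * X[:, 0])))
y = (rng.random(2000) < true_p).astype(float)

selected = significant_features(X, y)
idx = balanced_bootstrap(y, rng)
print("significant feature indices:", selected)
print("class counts in one balanced sample:", np.bincount(y[idx].astype(int)))
```

In a balanced random forest, each tree would be trained on a fresh sample such as `X[idx][:, selected]`, so every tree sees both classes equally often. A weighted random forest would instead keep all rows and give misclassified minority-class reviews a higher cost.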

Keywords/Search Tags:

Logistic Regression (LR), product review spam, Significance Testing (ST), significant features, random forests (RF)


Related items

1	Research On Identifying Review Spam For Product Reviews
2	Review Spam Detection Based On User Evaluation
3	The Research Of Web Pages Filtering Based On Random Forests Algorithms
4	Research And Practice Of Spam Comment Detection In Product Review Website
5	Detecting Review Spammers Based On Review Feature
6	Research On Review Spam Detection Based On Hierarchical Neural Network And Multivariate Features
7	An empirical study of Classification and Regression Tree and Random Forests
8	Research On Identifying Review Spam For Product Reviews Based On Data Mining
9	The Design And Implementation Of Web Spam Detection System
10	On Detecting The Cloaked WEB SPAM