Identifying Deceptive Reviews Based On Labeled And Unlabeled Data

Posted on: 2016-03-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y F Ren
Full Text: PDF
GTID: 1108330461453060
Subject: Computer software and theory
Abstract/Summary:
The rapid development of the Web has dramatically changed the ways people express themselves and interact with others. It is now common practice for consumers to read online reviews or comments for advice before purchasing goods or services. Businesses also benefit from this practice, because they can adjust their products and marketing strategies by collecting and analyzing feedback from the Web. Hence, sentiment analysis and opinion mining based on product reviews have become a popular topic in NLP (Natural Language Processing).

Research on sentiment analysis and opinion mining rests on a shared assumption: the datasets of reviews and opinions are trustworthy and reliable. Since review information can guide purchasing behavior, positive reviews can bring large economic benefit and fame to organizations or individuals. This creates a powerful incentive to produce deceptive reviews, so identifying and filtering out deceptive reviews has both practical significance and theoretical value. In this dissertation, we model the problem of deceptive review detection from different angles, based on hotel reviews. We make the following contributions:

1. Previous research mainly focuses on heuristic methods or simple modeling of the review text, which limits performance on this task. We first construct the dataset. Drawing on knowledge from computational linguistics and psycholinguistics, we use supervised learning methods to evaluate the performance of different feature models and select the best mixed features. Then, two semi-supervised learning methods are developed to exploit the large amount of unlabeled data. One is the Co-training algorithm, which uses two views, the review text and the reviewers, to build the final classifier. The other is the Tri-training algorithm, which uses three views to build the final classifier. These three views consist of lexical, syntactic, and psycholinguistic features, respectively. The two semi-supervised methods not only exploit the large amount of unlabeled data but also achieve better results than single-view methods for deceptive review detection.

2. Faced with the difficulty of constructing datasets that contain deceptive reviews, we propose to build an accurate classifier from some truthful reviews and a large number of unlabeled reviews. First, some deceptive reviews are extracted with high confidence from the unlabeled reviews. Second, n representative truthful reviews and deceptive reviews, respectively, are computed based on LDA (Latent Dirichlet Allocation). For the remaining unlabeled reviews (which we call spy reviews, and which are easily mislabeled), the category labels are decided by combining population and individual properties. Finally, multiple kernel learning is used to construct the final classifier. The results demonstrate the effectiveness of the proposed method.

3. Assigning category labels to spy reviews inevitably mislabels some of them, and these mislabeled examples affect the generalization of the final classifier. We propose to compute, for every spy review, probability weights of belonging to the positive and negative classes. The probability weights are obtained by combining the population property and the individual property. They are then incorporated into an SVM (Support Vector Machine) to obtain an accurate classifier. The results show that using probability weights achieves better performance than assigning a category label.

4. Human-annotated datasets inevitably include some mislabeled examples. We propose a novel method that identifies deceptive reviews from the viewpoint of correcting mislabeled examples. First, we partition a dataset into several subsets. Then we construct a classifier set for each subset and select the best one to evaluate the whole dataset. Meanwhile, error count variables are defined to compute the probability that an example has been mislabeled. Finally, the mislabeled examples are corrected based on two threshold schemes, majority and non-objection. The results show a significant improvement of our method over the existing baselines.

5. Faced with the hidden nature and diversity of deceptive reviews, we propose a multiple kernel SVM to enhance classification ability by mapping the features to a broader space. First, review texts are modeled by integrating knowledge from computational linguistics and psycholinguistics. Then, a genetic algorithm (GA) is used to optimize the parameters and weights of the kernel functions. According to the characteristics of the problem, we devise a special encoding method and genetic operators, and use adaptive crossover and mutation probabilities to speed up the convergence of the population and avoid premature convergence. Experimental results show that the proposed method outperforms the current best baseline.
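
The label-correction step in contribution 4 can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names, the binary 0/1 label encoding (0 = truthful, 1 = deceptive), the ratio used as the mislabeling probability, and the reading of "non-objection" as "every classifier disagrees with the given label" are all assumptions made for illustration. The classifier predictions are taken as given, so the partitioning of the dataset and the selection of the best classifier per subset are outside the sketch.

```python
from typing import List, Sequence


def error_counts(labels: Sequence[int],
                 predictions: Sequence[Sequence[int]]) -> List[int]:
    """Count, for each example, how many classifiers disagree with its label.

    predictions[j][i] is the label classifier j predicts for example i.
    """
    return [
        sum(1 for preds in predictions if preds[i] != labels[i])
        for i in range(len(labels))
    ]


def mislabel_probability(labels: Sequence[int],
                         predictions: Sequence[Sequence[int]]) -> List[float]:
    """Assumed estimate: fraction of classifiers that disagree with the label."""
    k = len(predictions)
    return [count / k for count in error_counts(labels, predictions)]


def correct_labels(labels: Sequence[int],
                   predictions: Sequence[Sequence[int]],
                   scheme: str = "majority") -> List[int]:
    """Flip the binary labels that the chosen threshold scheme flags.

    "majority":      more than half of the classifiers disagree with the label.
    "non_objection": every classifier disagrees (assumed interpretation).
    """
    k = len(predictions)
    corrected = []
    for label, count in zip(labels, error_counts(labels, predictions)):
        if scheme == "majority":
            flagged = count > k / 2
        elif scheme == "non_objection":
            flagged = count == k
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        corrected.append(1 - label if flagged else label)
    return corrected


# Hypothetical toy data: 4 reviews, 3 classifiers.
given_labels = [1, 0, 1, 0]
classifier_predictions = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
print(correct_labels(given_labels, classifier_predictions, "majority"))
print(correct_labels(given_labels, classifier_predictions, "non_objection"))
```

On this toy data the majority scheme flips the second and third labels, while the stricter non-objection scheme flips only the second, which every classifier contradicts. The trade-off in the sketch is that the majority rule corrects more examples but risks flipping correct labels, whereas the non-objection rule is conservative.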
Keywords/Search Tags: deceptive reviews, supervised learning, semi-supervised learning, support vector machine, computational linguistics