
Research Of E-Commerce Review Spam Detection Based On Imbalanced Data Processing

Posted on: 2021-03-03 Degree: Master Type: Thesis
Country: China Candidate: L Shi Full Text: PDF
GTID: 2428330623472811 Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of the Internet, spam reviews have become numerous. They distort the shopping advice available to customers and disrupt the normal order of the network environment, so effective identification of e-commerce spam reviews is an urgent need for both consumers and businesses. In recent years, as the e-commerce model and e-commerce law have continued to improve, product review data has changed as well. To identify spam reviews effectively, the indexes used to select them need further improvement. However, because of the restrictions imposed by e-commerce law, spam review data is now much scarcer than before. The resulting data imbalance affects the recognition of spam reviews and has to be addressed. This thesis studies the identification of e-commerce spam reviews in the following four aspects.

First, reasonable identification indexes for spam reviews are determined. Taking e-commerce law and regulation into account, this thesis presents the seven indexes that are most reliable for identifying e-commerce spam reviews, such as product name, product property, the length of the review text, the positive (negative) words in the review, the votes on the review, the credit of the user who wrote the review, etc. The indexes are based on a thorough study of such indexes at home and abroad and summarize the findings of leading researchers. The indexes selected in this thesis are then compared with those selected by existing scholars and verified on different classifiers, and they give better results.

Second, the e-commerce spam review data is pre-processed. The spam review data in this thesis is taken from public data on a big data platform and consists of English-language reviews of Unlocked-Mobile on the Amazon website. The dataset contains 413,670 records with six attributes. Preprocessing includes sample deduplication, attribute filtering, manual labeling, text segmentation, stop word deletion, stemming, extraction of product attribute words, extraction of emotional feature words from the review text, and index assignment, which pave the way for the subsequent identification of spam reviews.

Third, an improved hybrid sampling algorithm is proposed at the data layer. The majority-class samples are processed with an improved K-means algorithm. Euclidean distance is first used to measure the distance between each pair of samples, from which the cluster centers are obtained; samples are then removed by calculating the distance between each sample in a cluster and the center of that cluster; finally, some samples close to the cluster centers are selected, which gives the new majority-class sample set. For the minority-class samples, an improved Borderline-SMOTE algorithm is adopted. The Euclidean distance between the majority samples and the minority samples is calculated first, and the boundary samples are determined from this distance. The quality of the boundary sample set is then improved and new samples are synthesized from it with SMOTE, which gives the new minority sample set. The new majority sample set should be equal in size to the new minority sample set.

Fourth, a combined classifier algorithm of heterogeneous individuals is proposed at the classification algorithm layer. The samples obtained by hybrid sampling are used in a combined classifier of heterogeneous individuals built from Naive Bayes, decision tree, Support Vector Machine, and C4.5. The model is built from the training data and then verified.

In the end, the comprehensive modified algorithm, which combines the hybrid sampling algorithm with the combined classifier algorithm of heterogeneous individuals, is applied to the identification of e-commerce spam reviews, and it improves the precision of spam review identification. By reselecting the indexes and processing the imbalanced spam review data, spam reviews become easier to identify, which plays an active role in both theoretical research and practical application.
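The hybrid sampling step described in the abstract can be sketched roughly as follows. This is a minimal illustration in pure Python, not the thesis's improved algorithms: it uses plain K-means to keep the majority samples closest to each cluster center, and the standard Borderline-SMOTE rule (a minority sample is "borderline" when at least half, but not all, of its nearest neighbours belong to the majority class). All function names, parameters (`target`, `k`, `m`), and the toy 2-D data are assumptions made for the example, and it assumes at least two minority samples.

```python
import math
import random

def assign(points, centers):
    """Group points by their nearest center (Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means; an empty cluster keeps its previous center."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = assign(points, centers)
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, assign(points, centers)

def undersample_majority(majority, target, k=3, seed=0):
    """Keep roughly `target` majority samples: those nearest each cluster center."""
    centers, clusters = kmeans(majority, k, seed=seed)
    kept = []
    for center, cluster in zip(centers, clusters):
        quota = round(target * len(cluster) / len(majority))
        cluster.sort(key=lambda p: math.dist(p, center))
        kept.extend(cluster[:quota])
    return kept

def borderline_smote(minority, majority, n_new, m=5, k=3, seed=0):
    """Synthesize n_new minority samples by interpolating from borderline ones."""
    rng = random.Random(seed)
    labelled = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    danger = []
    for p in minority:
        neighbours = sorted(
            (q for q in labelled if q[0] is not p),
            key=lambda q: math.dist(p, q[0]),
        )[:m]
        n_major = sum(1 for _, label in neighbours if label == 0)
        if neighbours and len(neighbours) / 2 <= n_major < len(neighbours):
            danger.append(p)
    seeds = danger or minority  # assumption: fall back if nothing is borderline
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(seeds)
        near = sorted((q for q in minority if q is not p),
                      key=lambda q: math.dist(p, q))[:k]
        q = rng.choice(near)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic

def hybrid_sample(majority, minority, target, seed=0):
    """Undersample the majority, then oversample the minority to the same size."""
    new_major = undersample_majority(majority, target, seed=seed)
    n_new = max(0, len(new_major) - len(minority))
    new_minor = minority + borderline_smote(minority, majority, n_new, seed=seed)
    return new_major, new_minor

# Toy data: 60 majority ("genuine") and 8 minority ("spam") feature vectors.
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
minority = [(rng.gauss(1.5, 0.5), rng.gauss(1.5, 0.5)) for _ in range(8)]
new_major, new_minor = hybrid_sample(majority, minority, target=20)
print(len(new_major), len(new_minor))
```

Oversampling the minority class up to the size of the reduced majority set, rather than to a fixed number, is one simple way to satisfy the abstract's requirement that the two new sample sets end up equal in size.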
Keywords/Search Tags: index selection, review spam, imbalanced data, classification, combined classifier