Research On Outlier Detections For Review Spam Filtering

Posted on: 2015-12-01  Degree: Master  Type: Thesis
Country: China  Candidate: Z Y Ding  Full Text: PDF
GTID: 2348330509460899  Subject: Management Science and Engineering
Abstract/Summary:
Outlier detection techniques have been widely applied in medicine, finance, information security, and other fields. They can effectively identify both meaningless data to be cleaned and interesting information to be recommended. Finding the data of interest and filtering out “junk” information from massive data, accurately and efficiently, is an important research problem. There are a variety of outlier detection methods, including distance-based, density-based, and cluster-based algorithms. Among them, clustering has been more widely used in text filtering: documents are grouped into sets according to their degree of similarity, and the sets are then classified according to the definition of an outlier, thereby achieving text filtering.

In this paper, a new definition of an outlier on uncertain data is given, and a distance-based outlier detection method and a tuple-compress-based outlier detection method are proposed. These algorithms are then applied to the problem of spam comment filtering. The characteristics of review spam are treated as uncertain data objects, and the uncertain database contains all the characteristics of spam comments. For each comment, its characteristics are compared with those in the uncertain database, and the probability that it is a normal comment is computed. The outlier detection algorithm then selects as outliers the comments whose probability falls below the threshold.

Innovations in this paper:
(1) Because conventional databases cannot handle uncertain data, a new definition of an outlier on uncertain data is given. A distance-based outlier detection method and a tuple-compress-based outlier detection method are proposed. Experimental results show that the proposed approach can efficiently detect outliers in a data set.
(2) The characteristics of spam, and the difficulties of combining outlier detection with spam filtering, are analyzed. A new way of filtering spam comments, such as those on online shopping websites and forums, is proposed.
(3) The data set consists of product comments from “Taobao” and “flooding” (filler) comments from the “Tianya” forum. Experiments with the new algorithm produced satisfactory results.
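The abstract does not spell out its definition of an outlier on uncertain data, so the following is only a minimal sketch of one common distance-based formulation, not the thesis's own algorithm. It assumes each uncertain object is a list of (point, probability) instances whose probabilities sum to 1, and flags an object when its expected fraction of neighbors within a given radius falls below a threshold. The names `neighbor_probability`, `find_outliers`, `radius`, and `threshold` are hypothetical.

```python
from itertools import product
from math import dist

# Assumed representation: an uncertain object is a list of
# (point, probability) instances, with probabilities summing to 1.


def neighbor_probability(a, b, radius):
    """Probability that uncertain objects a and b lie within `radius`,
    assuming their instances are chosen independently."""
    return sum(
        pa * pb
        for (xa, pa), (xb, pb) in product(a, b)
        if dist(xa, xb) <= radius
    )


def find_outliers(objects, radius, threshold):
    """Return indices of objects whose expected fraction of neighbors
    is below `threshold`. Requires at least two objects."""
    n = len(objects)
    outliers = []
    for i, obj in enumerate(objects):
        expected_neighbors = sum(
            neighbor_probability(obj, other, radius)
            for j, other in enumerate(objects)
            if j != i
        )
        if expected_neighbors / (n - 1) < threshold:
            outliers.append(i)
    return outliers


# Toy data (made up for illustration): three objects near the origin
# and one that is most likely far away.
objects = [
    [((0.0, 0.0), 0.5), ((0.1, 0.0), 0.5)],
    [((0.0, 0.1), 1.0)],
    [((0.1, 0.1), 0.7), ((0.2, 0.1), 0.3)],
    [((5.0, 5.0), 0.9), ((0.0, 0.0), 0.1)],
]
print(find_outliers(objects, radius=0.5, threshold=0.5))  # → [3]
```

In a spam-filtering setting, the points would be feature vectors extracted from comments; how those features and probabilities are obtained is not described in the abstract and is left open here.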
Keywords/Search Tags: outliers, spam comments, uncertain data, text mining