Research Of Spam Comment Identification In The Microblog Based On AdaBoost-LC

Posted on: 2015-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: L Huang
Full Text: PDF
GTID: 2268330422972528
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet and Web 2.0, social network services have grown explosively. As an important representative of these services, microblogging has gradually become one of the main activities of Internet users. Because microblogs are easy, convenient, fast, far-reaching, highly efficient, and "back-to-face" in their style of interaction, they have also attracted a large number of spammers. For various purposes, spammers publish a wide variety of spam comments on microblogs. These spam comments not only disrupt communication between Internet users, and in some cases deceive them, but also hinder data mining work on comments. Identifying and filtering spam comments in microblogs has therefore become important.

This thesis studies the identification of spam comments in microblogs. The main research work and results are as follows:

① Because comments are short, the features obtained after word segmentation are sparse. To address this, the thesis represents each comment by a feature vector of nine feature values, which describe the content of the comment from different angles. On this basis, the thesis proposes a method based on AdaBoost-LC to identify spam comments in microblogs. In this method, a binary classifier, the simplest kind of linear classifier, serves as the base classifier, and the AdaBoost algorithm, an ensemble learning algorithm, is then used to improve the accuracy of the base classifier.

② The AdaBoost-LC algorithm has two shortcomings in the comment spam recognition setting: severe degradation caused by the rapidly growing weights of "difficult" samples, and the high cost of normal comments being incorrectly identified as spam. To address these, an improved algorithm, AdaBoost-Ex, is proposed to identify spam comments.

③ New features of spam comments emerge over time and the classifier's performance degrades, which raises the problem of relearning. To solve it, the thesis designs a modular incremental learning model. The model only needs to learn rules from new samples while keeping the rules it has already learned. The new classifiers learned from new samples are added to the incremental learning system together with their weights. Overall, this gives the algorithm the ability to learn incrementally, which makes it more practical.

Finally, the proposed methods were evaluated on real datasets extracted from the comments on popular Sina microblog posts. The results show that the methods identify spam comments in microblogs effectively.
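The boosting scheme in item ① and the incremental step in item ③ can be illustrated with a minimal Python sketch. The abstract does not specify the linear base classifier, the nine features, or how new classifiers are merged, so everything here is an assumption made for illustration: the base learner is a weighted nearest-centroid rule (one simple linear classifier), the boosting loop is standard discrete AdaBoost, and `incremental_update` simply appends classifiers trained on new samples to the existing ones. All function names are hypothetical, and the AdaBoost-Ex modifications from item ② are not modeled.

```python
import math

def train_linear(X, y, w):
    """Weighted nearest-centroid rule, a linear classifier sign(v.x + b).
    Assumes both classes (+1 spam, -1 normal) are present in y."""
    dim = len(X[0])
    pos, neg = [0.0] * dim, [0.0] * dim
    w_pos = w_neg = 0.0
    for xi, yi, wi in zip(X, y, w):
        if yi == 1:
            w_pos += wi
            for j in range(dim):
                pos[j] += wi * xi[j]
        else:
            w_neg += wi
            for j in range(dim):
                neg[j] += wi * xi[j]
    pos = [p / w_pos for p in pos]   # weighted centroid of the spam class
    neg = [n / w_neg for n in neg]   # weighted centroid of the normal class
    v = [p - n for p, n in zip(pos, neg)]
    # The decision boundary passes through the midpoint of the two centroids.
    b = -sum(vj * (p + n) / 2.0 for vj, p, n in zip(v, pos, neg))
    return v, b

def predict_linear(model, x):
    v, b = model
    return 1 if sum(vj * xj for vj, xj in zip(v, x)) + b >= 0 else -1

def adaboost_fit(X, y, rounds=10):
    """Standard discrete AdaBoost over the linear base learner above."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []                    # list of (alpha, model) pairs
    for _ in range(rounds):
        model = train_linear(X, y, w)
        preds = [predict_linear(model, x) for x in X]
        err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        if err >= 0.5:               # no better than chance: stop boosting
            break
        perfect = err < 1e-10
        err = max(err, 1e-10)        # avoid division by zero in alpha
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, model))
        if perfect:
            break
        # Raise the weights of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * p) for wi, p, yi in zip(w, preds, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * predict_linear(m, x) for alpha, m in ensemble)
    return 1 if score >= 0 else -1

def incremental_update(ensemble, X_new, y_new, rounds=10):
    """Keep the old classifiers and add ones trained only on the new samples
    (a naive stand-in for the thesis's modular incremental model)."""
    return ensemble + adaboost_fit(X_new, y_new, rounds)

# Toy data with two hypothetical features per comment (the thesis uses nine).
X_toy = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y_toy = [-1, -1, -1, 1, 1, 1]
toy_ensemble = adaboost_fit(X_toy, y_toy)
```

Because the ensemble is just a list of weighted classifiers, adding new ones never requires retraining on the original samples, which is the property item ③ asks for. How the weights of old and new classifiers should be balanced is left open in this sketch.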
Keywords/Search Tags: Microblog, Spam comment identification, Feature vector, AdaBoost algorithm, Incremental learning