
Research On The Method Of Identifying Microblogging Spam Reviews

Posted on: 2018-08-16
Degree: Master
Type: Thesis
Country: China
Candidate: D M Lan
GTID: 2348330518957159
Subject: Software engineering
Abstract/Summary:
Spam comments are comments posted on a microblog that are irrelevant to the original post, meaningless, or malicious. Early approaches relied on manual identification, mainly through verification codes and auditing mechanisms. Later approaches identified spam automatically, mainly from keywords, the number of links, and a relevance threshold. More recent work first applies rule-based methods to screen out comments containing hyperlinks, special characters, and other obvious spam markers, and then recognizes the remaining spam comments by combining topic features with a classifier.

At present there are two main ways to obtain microblog data: a web crawler and the API of the microblog open platform. The crawler is slow and takes a large amount of time to gather the experimental data needed in this thesis, while the number of API calls is limited by the microblog platform's servers. Neither is well suited to collecting experimental data. This thesis therefore proposes an acquisition method based on cookies and regular expressions, which retrieves the original microblog post, the author information, and the comments on the post. The thesis implements this design and uses it to collect the comments on the post about his divorce published by Wang Baoqiang, a verified microblog user. The experimental results show that, compared with the two methods above, this method is both easy to operate and fast at acquiring data.

A microblog post and a comment are each limited to at most 140 characters, so their content is short and the topic features of a post are not pronounced. Relevance between a comment and the post therefore cannot be the only criterion for recognizing spam comments, because relying on a single factor may increase the misjudgment rate. To address this, the thesis uses Co-Training to improve classifier performance and proposes a Co-Training-based method for recognizing spam comments.

For the original post and the author information, a feature vocabulary is built from the preprocessed relevant phrases, the sentiment words that appear in the post itself, and the sentiment words with an intensity above 5 in the lexicon of the Information Retrieval Laboratory at Dalian University of Technology. For the comments, the thesis first screens out obvious spam with predefined rules and preprocesses the remaining comments. Then, on the one hand, it computes the Chinese thesaurus (synonym word forest) similarity between the phrases of each remaining comment and the feature vocabulary, and feeds the results to an AdaBoost classifier. On the other hand, it extracts features from the comments and uses them as feature vectors to train an SVM classifier. Finally, the two classifiers are co-trained with a Co-Training procedure designed for microblog spam comments, and the trained model judges whether a comment is spam. The method improves classification accuracy and also greatly reduces the work of labeling samples. A comparison with two other typical methods shows that the proposed method is feasible and effective.
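The rule-based screening step that precedes classification can be illustrated with a minimal Python sketch. The abstract names the rule categories (hyperlinks, special characters) but not the actual rules, so the patterns, the `max_special_ratio` threshold, and the function names below are assumptions made for illustration, not the thesis's implementation.

```python
import re

# Assumed rule: any http(s) or www-style link marks a comment as obvious spam.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)

# Assumed rule: a "special" character is anything that is not a word character
# (this includes Chinese characters), whitespace, or common punctuation.
SPECIAL_RE = re.compile(r"[^\w\s,.!?;:，。！？；：、@#]")


def is_obvious_spam(comment: str, max_special_ratio: float = 0.3) -> bool:
    """Return True if the comment trips one of the hypothetical rules."""
    text = comment.strip()
    if not text:
        return True  # an empty comment carries no content
    if URL_RE.search(text):
        return True
    special_count = len(SPECIAL_RE.findall(text))
    return special_count / len(text) > max_special_ratio


def split_comments(comments):
    """Separate obvious spam from comments that go on to the classifiers."""
    obvious, remaining = [], []
    for comment in comments:
        (obvious if is_obvious_spam(comment) else remaining).append(comment)
    return obvious, remaining
```

Only the comments returned in `remaining` would be preprocessed and passed to the AdaBoost and SVM classifiers; the threshold would need tuning on labeled data.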
Keywords/Search Tags:spam comments on microblog, collaborative training, synonym word forest, support vector machine