
Research On Identification And Filtering Of Spam Comments For BBS Comments

Posted on: 2015-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: H Q Ma
Full Text: PDF
GTID: 2268330422469476
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the vigorous development of the Internet and the emerging SoLoMo (Socialize, Localize and Mobilize) trend, many large portal websites and BBS have witnessed a significant increase in user interaction. Much of this activity consists of illegal content and spam advertising released by posting machines and the Internet Water Army (paid posters), which seriously degrades the users' visiting experience, reduces user activity and website traffic, and interferes with data mining on the comments. To identify and filter junk posts effectively, this paper carries out the following work.

Firstly, the author collected data from a BBS. Using a purpose-built web crawler, the author gathered the main posts, comments and related additional information within a certain period, such as the poster, post time, page views and replies, and stored them in a local database.

Secondly, accurate recognition and filtering of spam comments requires a clear understanding of the features of BBS. Drawing on an understanding of network language features, the author browsed a large number of replies, then analyzed and summarized the behavioral and language characteristics of spam comments. The paper then defines different kinds of spam comments and proposes a targeted filtering mechanism for each kind.

This paper adopts a multi-level filtering method to improve the recognition rate of junk posts. In the preprocessing stage, the author uses a stop list and a dictionary of common BBS words to identify and filter part of the junk posts and spam. For the posts that pass on to further processing, the paper employs regular expression matching to filter out some advertising posts.

Finally, based on an analysis of the various filtered spam comments, the author summarizes the reply tendencies of posters and identifies some professional spam-comment makers. Tests show that the identification and filtering method adopted in this paper can identify spam posts effectively and produce a reasonable classification of posters' reply tendencies.
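
The multi-level filtering described above can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the word lists, the regular expressions, the labels and the function names `classify` and `reply_tendency` are all hypothetical, chosen only to show how a stop-list/dictionary stage, a regular-expression stage and a per-poster tally might fit together. English tokens are assumed for readability; a Chinese BBS would need a word segmenter in place of `tokenize`.

```python
import re
from collections import Counter

# Hypothetical stop list and BBS common-word dictionary (illustrative entries only).
STOP_WORDS = {"the", "a", "an", "is", "to", "of"}
BBS_FILLER = {"up", "bump", "sofa", "mark", "ding", "lol"}

# Hypothetical advertising patterns: URLs, long digit runs, sales phrases.
AD_PATTERNS = [
    re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE),
    re.compile(r"\d{7,}"),
    re.compile(r"\b(?:cheap|discount|buy now|add me)\b", re.IGNORECASE),
]


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def classify(comment):
    """Return 'filler', 'advertising' or 'ok' for a single comment."""
    # Level 1: nothing is left once stop words and BBS filler words are removed.
    content = [t for t in tokenize(comment)
               if t not in STOP_WORDS and t not in BBS_FILLER]
    if not content:
        return "filler"
    # Level 2: any advertising pattern matches the raw comment text.
    if any(p.search(comment) for p in AD_PATTERNS):
        return "advertising"
    return "ok"


def reply_tendency(comments):
    """Tally labels per poster; comments is an iterable of (poster, text)."""
    tendency = {}
    for poster, text in comments:
        tendency.setdefault(poster, Counter())[classify(text)] += 1
    return tendency
```

Running the cheap dictionary check first means the regular expressions only see comments that survive preprocessing, which mirrors the ordering of stages in the abstract. A poster whose tally is dominated by `'filler'` or `'advertising'` would be a candidate for closer inspection; any threshold for that decision is left open here.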
Keywords/Search Tags: BBS, Spam comments, Web crawler, Multi-level filtering, Cosine similarity, Reply tendency