
Research On Identifying Comments Spam For Blog Comments

Posted on: 2012-05-05
Degree: Master
Type: Thesis
Country: China
Candidate: B N Deng
GTID: 2178330338495369
Subject: Computer application technology
Abstract/Summary:
Blogs are free, open, and built for sharing, so they attract large numbers of spam comments containing advertising, hyperlinks, and abusive or defamatory content. These spam comments make it harder for users to read comments and interact with one another, and they also interfere with mining the content of the comments. This thesis addresses the identification of spam comments on blogs. The main work covers the following aspects.

Comments may be long or short. If a short comment that contains only common internet words is judged by comparing its similarity with the article, a normal short comment is easily misidentified as spam, so the thesis uses different methods for short and long comments. For short comments, we compare the number of garbage words with the number of normal words in the comment to determine its type, and filter out the short spam comments.

For long comments, we improve the traditional cosine similarity formula by incorporating near-synonym relationships between words, word position information, and keyword similarity, which compensates for the traditional formula's inability to recognize near-synonyms. Because the correlation between keywords and topics changes over time, we propose applying the improved formula to the long comments over k rounds. In each round, we use the legitimate comments and the near-synonym information to adjust the keyword weights so that they reflect the degree of correlation between keywords and topics, and we select near-synonyms and high-frequency words from the legitimate comments to extend the keyword set so that it adapts to the diversity of topics.

Finally, after all comments have been identified, we use common internet words and the updated keywords to filter the identified spam comments a second time, which reduces the probability that legitimate comments are identified as spam. Experimental results show that the method improves both accuracy and recall in identifying spam comments.
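The two building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the word lists, the whitespace tokenizer, and the tie-breaking rule (spam only when garbage words strictly outnumber normal words) are all assumptions made for illustration, and the similarity function is the traditional cosine formula that the thesis takes as its starting point, not the improved one.

```python
import math
from collections import Counter

# Hypothetical lexicons; the abstract does not list the actual words used.
GARBAGE_WORDS = {"buy", "cheap", "click", "discount"}
NORMAL_WORDS = {"thanks", "agree", "nice", "interesting"}


def is_short_spam(comment: str) -> bool:
    """Flag a short comment as spam when garbage words outnumber normal words.

    The strict '>' comparison is an assumption; the abstract only says the
    two counts are compared.
    """
    tokens = comment.lower().split()  # assumed tokenizer: whitespace split
    garbage = sum(t in GARBAGE_WORDS for t in tokens)
    normal = sum(t in NORMAL_WORDS for t in tokens)
    return garbage > normal


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Traditional cosine similarity over raw term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

Because this baseline matches only identical tokens, a comment that uses near-synonyms of the article's keywords scores zero on those terms, which is the shortcoming the improved formula is described as addressing.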
Keywords/Search Tags: Blog, Spam Comments, Parasynonyms, Keywords Similarity, Cosine Similarity