
Study On Spam-microblog Detection Based On Integrated Multi-feature Clustering

Posted on: 2016-06-25
Degree: Master
Type: Thesis
Country: China
Candidate: J Liu
Full Text: PDF
GTID: 2308330479984807
Subject: Computer software and theory
Abstract/Summary:
Over the last few years, microblogs have become one of the main ways for users to follow news and interact online, thanks to the convenience of reaching friends and information. This ease of use has also attracted spammers. Large numbers of advertising microblogs are posted every day, and it is hard to tell whether their content is harmful, which seriously degrades the user experience. Worse, many microblogs promoting prize winnings, lotteries, or illegal medicine steal users' private information and threaten users' health and property. Such content should be removed. Text classification based on feature extraction is a common method for identifying spam, and the choice of features determines the accuracy of spam-microblog classification. This thesis studies spam-microblog detection on Sina Microblog in depth. The specific research contents are as follows:

Firstly, the basic feature of a spam microblog is that it includes third-party contact information, of which URLs are the most widely used. Much anti-spam research has addressed the detection of malicious URLs, and the method Sina Microblog uses to detect spam microblogs is also based on URLs, so many spammers are turning to new contact channels. Previous studies have largely ignored this point. To detect spam microblogs more broadly, in addition to URLs we also consider the other contact channels spammers may use: obfuscated URLs, QQ numbers, WeChat numbers, and phone numbers.

Secondly, to overcome the randomness and fuzziness of microblog content features, we propose a spam-microblog detection method based on clustering similar microblogs. In Sina Microblog, almost 30% of microblogs have a length of less than 15. Spam microblogs this short look very similar to normal microblogs and carry too little effective information to be marked as spam. To reach more victims, spammers also use more than one account to post spam microblogs, and the same spam microblog is posted many times. Similar content descriptions and the same contact channels are reused heavily, which differs from normal microblogs. After similar microblogs are clustered, these spam microblogs fall into one cluster, and the features of the cluster overcome the randomness and fuzziness and reveal what the microblogs are. Finally, a comparative experiment on a real microblog dataset shows that the features of a cluster of similar microblogs are more effective than the features of a single microblog: the accuracy of spam-microblog detection improves by 10%.

Lastly, many spammers try to obfuscate their spam microblogs. For example, instead of real spam information, popular news or online events are used as the content description. In that case content features are useless, because the content cannot show whether a microblog is spam. Based on this analysis, we introduce user features. Obfuscating spam-microblog content is very easy, whereas user attributes such as registration time and microblog count are fixed and very difficult to obfuscate. These features help identify a microblog as spam. For example, if a newly registered user posts many microblogs every day, the microblogs this user posts are more likely to be spam. Based on all of this analysis, we finally propose our detection method: spam-microblog detection based on integrated multi-feature clustering. Experiments on a real Sina Microblog dataset show that integrated multi-feature clustering greatly improves the ability to identify spam, and our method achieves a much better F-score.
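
The contact-channel and similar-microblog clustering ideas in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The regular expressions, the character-bigram Jaccard similarity, the 0.6 threshold, the greedy single-pass grouping, and the names `extract_contacts`, `cluster_microblogs`, and `cluster_features` are all assumptions chosen for illustration. Obfuscated URLs and user features are left out.

```python
import re

# Hypothetical contact patterns (assumptions; the thesis does not publish its rules).
FLAGS = re.IGNORECASE | re.ASCII
CONTACT_PATTERNS = {
    "url": re.compile(r"https?://[\w./?=&%#-]+", FLAGS),
    "qq": re.compile(r"qq\D{0,3}(\d{5,11})", FLAGS),
    "wechat": re.compile(r"(?:wechat|weixin)\W{0,3}([a-z][\w-]{5,19})", FLAGS),
    # 11-digit mainland mobile number, not embedded in a longer digit run.
    "phone": re.compile(r"(?<!\d)(1[3-9]\d{9})(?!\d)", FLAGS),
}


def extract_contacts(text):
    """Return the set of (kind, value) contact tokens found in a microblog."""
    found = set()
    for kind, pattern in CONTACT_PATTERNS.items():
        for match in pattern.finditer(text):
            # Use the capturing group when the pattern has one, else the whole match.
            found.add((kind, match.group(match.lastindex or 0)))
    return found


def shingles(text, n=2):
    """Character n-grams of the text with whitespace removed and case folded."""
    cleaned = re.sub(r"\s+", "", text.lower())
    return {cleaned[i:i + n] for i in range(max(len(cleaned) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_microblogs(posts, threshold=0.6):
    """Greedy single-pass clustering of microblog texts.

    A post joins the first cluster that shares a contact token with it or whose
    first member is textually similar enough; otherwise it starts a new cluster.
    """
    clusters = []
    for index, text in enumerate(posts):
        grams = shingles(text)
        contacts = extract_contacts(text)
        for cluster in clusters:
            same_contact = bool(contacts & cluster["contacts"])
            if same_contact or jaccard(grams, cluster["shingles"]) >= threshold:
                cluster["members"].append(index)
                cluster["contacts"] |= contacts
                break
        else:
            clusters.append(
                {"members": [index], "shingles": grams, "contacts": set(contacts)}
            )
    return clusters


def cluster_features(cluster):
    """Illustrative cluster-level features that a classifier could consume."""
    return {
        "size": len(cluster["members"]),
        "distinct_contacts": len(cluster["contacts"]),
    }
```

Calling `cluster_microblogs` on a list of microblog texts returns one dictionary per cluster. In this sketch, posts that repeat the same QQ number, WeChat ID, phone number, or URL end up together even when their wording differs, which mirrors the observation that spammers reuse the same contact channel across many posts.
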
Keywords/Search Tags: Spam-microblog detection, multi-contact detection, similar microblogs clustering, integration user feature