Research And Application Of Detections On The Outbreak Of Micro-blogging Spam

Posted on: 2014-11-11
Degree: Master
Type: Thesis
Country: China
Candidate: J C He
GTID: 2268330425475890
Subject: Computer technology
Abstract/Summary:
With the development of Web 2.0 technology in recent years, social networks, the representatives of the Web 2.0 era, have gradually penetrated into people's daily lives, influencing and changing many aspects of them. At the beginning of 2009, micro-blogging, an important part of social networks, came into public view in China. Because posts are brief and quick to publish, feel real-time and dynamic, and invite frequent interaction between users, micro-blogging is increasingly favored by many users.

However, because the barrier to publishing a micro-blog post is very low and the related legal system is imperfect, large amounts of spam and nonsense text have appeared on micro-blogging platforms. This spam not only harms the user experience but also makes users easy to deceive, which can result in loss of money. Worse, owing to hacking, computer vulnerabilities, viruses and other causes, many user accounts have been hijacked. Spam is then published through these accounts, resulting in an outbreak of that spam within a short time.

This paper focuses on micro-blogging text. Because detecting spam outbreaks requires both grouping similar posts and labeling them, we take the clustering and classification of micro-blogging texts as our research objects.

Because the traditional simhash algorithm is ineffective for clustering micro-blogging texts, this paper introduces a feature extraction method that continuously extracts text blocks, and a weight-setting method called FF-FID (Feature Frequency-Feature In Documents). Using these two methods to calculate a text's simhash fingerprint makes clustering more effective on micro-blogging texts. In addition, unreadable texts that contain mainly the same content are hard to aggregate, so we use singular transition as a key feature and propose a clustering algorithm for massive micro-blogging texts that combines the advantages of K-Means and DBSCAN. Experimental results indicate that this algorithm performs well when clustering micro-blogging texts that share similar user behavior and text content.

For the classification of micro-blogging text, this paper combines the definition of spam with the readability and the content of the text to classify a text cluster. First, we build a decision tree classifier on user behavior to classify the readability of a text. Second, we use the Chinese text content to decide whether the text is spam. Experimental results show that classification is more precise when user behavior and text features are both considered than when only text features are considered.

Finally, on the basis of the study described above, we design and implement a system for detecting outbreaks of micro-blogging spam, and we process all original texts from a one-hour period with it. Experimental results show that this system basically meets the demand for detecting spam outbreaks and is highly practical.
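
The fingerprinting step described above can be illustrated with a minimal weighted simhash sketch. It is not the thesis's implementation: the abstract does not give the FF-FID formula or the exact block extraction rule, so this sketch assumes that text blocks are overlapping character n-grams and uses plain in-text frequency as a stand-in weight. The names `text_blocks`, `simhash` and `hamming_distance` are hypothetical.

```python
import hashlib
from collections import Counter


def text_blocks(text: str, size: int = 2) -> list[str]:
    """Overlapping character blocks of a fixed size.

    Assumption: this is one possible reading of "continuously extracted
    text blocks"; the abstract does not define the extraction rule.
    """
    if len(text) <= size:
        return [text] if text else []
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def simhash(text: str, bits: int = 64, size: int = 2) -> int:
    """Weighted simhash fingerprint of a text (bits must be at most 128)."""
    # Stand-in weight: how often each block occurs in this text.
    # The thesis uses FF-FID here, whose formula is not given in the abstract.
    weights = Counter(text_blocks(text, size))
    totals = [0] * bits
    for block, weight in weights.items():
        digest = hashlib.md5(block.encode("utf-8")).digest()
        block_hash = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each block votes +weight or -weight on every bit position.
            totals[i] += weight if (block_hash >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two texts would then be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold, like the weighting, would need to be tuned on real micro-blogging data.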
Keywords/Search Tags: micro-blogging spam, singular transition, simhash, clustering, classification