The Research And Implementation On The Technology Of Spammer Detection For Sina Microblog

Posted on: 2016-05-06
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
GTID: 2348330509960909
Subject: Computer technology
Abstract/Summary:
Social networking media such as Twitter and Sina Microblog have become an important way for people to access information, share experiences, and communicate with each other. As these open, real-time platforms attract more and more users, they accumulate a large amount of user information and attention. At the same time, the number of spammers has also grown. This surge fills social media with false opinions and spam and degrades the social networking environment. Spammer detection has therefore become a major concern.

Spammer detection technology can improve the quality of the user experience on social networks, help keep public opinion on social networks from being distorted, and limit the adverse impact of biased information spreading. It can also serve as a data deduplication step, which is useful in studies such as public opinion analysis and influence analysis. The study of spammer detection therefore has important practical significance.

In this paper, considering the recent development of Chinese microblogs, Sina Microblog is selected as the research subject. We design and implement a method for identifying spammers on Sina Microblog. The main work and achievements are as follows:

(1) Constructing a user feature vector for spammer identification.
Based on the features of Sina Weibo users, we consider three aspects: user information, user behavior, and blog content. We analyze properties including the number of accounts a user follows (Friends), the number of fans (Followers), the number of blogs posted (Tweets), posting patterns, the URL rate (URatio), and the similarity ratio among a user's blogs (Simratio), and we propose a new feature, the topic mobility ratio (Tmratio). Each important feature is analyzed using its cumulative distribution function (CDF) in order to construct feature vectors that can be used for spammer detection.

(2) Studying and designing a short-text clustering algorithm. To obtain text similarity (Simratio) and topic mobility (Tmratio) in a form the recognition model can handle, we define the overclass K-means algorithm and use the overclasses to partition the text data, thereby obtaining each user's topic mobility. In addition, we label all standardized texts with the Simhash algorithm, compute their similarity using Hamming distance, and cluster them, thereby obtaining the similarity feature among texts.

(3) Establishing a spammer detection model based on logistic regression and analyzing the application of various machine learning methods to spammer detection. Because the logistic regression algorithm is simple and convenient, it is selected to build the detection model. A labeled data set is used for model training, and the feature coefficients are obtained by gradient descent, yielding a spammer detection model capable of automatic recognition. Cross-validation is used to evaluate the classification performance of the logistic regression spammer detection model.
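The training step in item (3) can be illustrated with a minimal sketch of logistic regression fitted by batch gradient descent. This is not the thesis's implementation: the function names, the learning rate, the number of epochs, and the two-feature toy data (loosely named after URatio and Simratio) are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss.

    lr and epochs are illustrative defaults, not values from the thesis.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for one sample: p(spammer) - label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the mean gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (spammer) if the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Made-up toy data: [URatio, Simratio] per user; 1 = spammer, 0 = normal.
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic_regression(X, y)
predictions = [predict(w, b, x) for x in X]
```

In practice the model would be evaluated on held-out folds rather than on the training data, as the cross-validation step in the abstract describes; the sketch omits that for brevity.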
Furthermore, through a series of experiments, we examine how sensitive the spammer detection model is to factors such as the size of the training set and the choice of input features.
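The Simhash and Hamming distance step from item (2) can be sketched as follows. The abstract does not give the tokenization, token weighting, fingerprint width, or distance threshold, so everything here is an assumption: tokens are split on whitespace (Chinese text would need a word segmenter first), every token has weight 1, fingerprints are 64 bits taken from an MD5 digest, and `similar_pair_ratio` is a hypothetical stand-in for a similarity feature, not the thesis's definition of Simratio.

```python
import hashlib
from itertools import combinations

BITS = 64  # assumed fingerprint width

def token_hash(token):
    # First 8 bytes of the MD5 digest as a 64-bit integer.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def simhash(text):
    """Return a 64-bit Simhash fingerprint of whitespace-separated tokens."""
    counts = [0] * BITS
    for token in text.split():
        h = token_hash(token)
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(BITS):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def similar_pair_ratio(texts, threshold=3):
    """Hypothetical feature: share of text pairs within `threshold` bits."""
    pairs = list(combinations([simhash(t) for t in texts], 2))
    if not pairs:
        return 0.0
    close = sum(1 for a, b in pairs if hamming_distance(a, b) <= threshold)
    return close / len(pairs)
```

Texts whose fingerprints differ in only a few bits are treated as near-duplicates, so a user who repeatedly posts almost identical blogs would score high on a feature of this kind.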
Keywords/Search Tags:Sina Microblog, Spammer Detection, Feature Vector, Short Text Clustering, Logistic Regression, Overclass K-means