
Spammer Detection Research Based On Micro-blog Statistics Characteristic

Posted on: 2018-07-14
Degree: Master
Type: Thesis
Country: China
Candidate: J T Zhao
GTID: 2348330569486392
Subject: Electronic and communication engineering
Abstract/Summary:
In social networks, spammers who generate large volumes of unwanted or disruptive information not only annoy users but also create network security problems. The most common existing detection methods build supervised classification models from the statistical characteristics of users. Supervised spammer detection achieves high accuracy, but improving the generalization ability of the classifier requires a large number of labeled training samples. Obtaining labeled data consumes a lot of resources, which makes purely supervised spammer detection impractical. The main problem in constructing a classifier is therefore how to use unlabeled data effectively to improve the accuracy of the classification model. To address this problem, this thesis presents two pieces of work:

1. A semi-supervised detection model based on tri-training is designed for micro-blog spammer detection. The model performs well when labeled data is scarce and addresses the labeling bottleneck of traditional micro-blog spammer detection. In addition, a similarity calculation between micro-blog users is integrated into the tri-training algorithm to avoid introducing noisy data. The main steps are as follows. First, train three initial classifiers on a small amount of labeled data. Then, calculate the similarity between the unlabeled samples and the labeled samples; an unlabeled user is added, with its predicted label, to the training data of one classifier when the other two classifiers and the similarity result all agree on that label. Finally, repeat these steps until none of the three classifiers is updated. The experimental results show that the proposed algorithm performs well when labeled data is scarce.

2. A labeled-sample selection method is designed for MSDTT (the tri-training-based detection model of the first work). The method avoids selecting samples that are similar to those already in the labeled set, so samples distributed in different regions have the same opportunity to be labeled. Moreover, by comparing the information entropy of the candidate samples, the selected labeled samples carry more category information. The main steps are as follows. First, pick a sample at random from the sample set and add it, together with the users similar to it, to the candidate samples. Then, compare the information entropy of the candidate samples and manually label the one with the highest information entropy. Finally, repeat these steps until a sufficient number of labeled samples has been selected. The experimental results show that the proposed algorithm avoids selecting samples similar to the labeled set and gives samples in different regions the same opportunity to be labeled, which ensures the stability of MSDTT.
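The tri-training steps of the first work can be illustrated with a minimal sketch. It is not the thesis implementation, and everything concrete in it is an assumption made for illustration: the nearest-centroid base learner, Euclidean distance as the similarity measure, a stratified bootstrap to diversify the three classifiers, and the names `CentroidClassifier`, `nearest_label`, `tri_train`, and `vote`. The error-rate checks of standard tri-training are omitted to keep the sketch short.

```python
import math
import random
from collections import Counter


def nearest_label(x, X, y):
    """Label of the labeled sample closest to x (assumed similarity: Euclidean)."""
    best = min(range(len(X)), key=lambda i: math.dist(x, X[i]))
    return y[best]


class CentroidClassifier:
    """Toy base learner: assign x to the class whose mean vector is closest."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda c: math.dist(x, self.centroids[c]))


def stratified_bootstrap(X, y, rng):
    """Resample with replacement inside each class so every class stays present."""
    Xb, yb = [], []
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        for _ in rows:
            Xb.append(rng.choice(rows))
            yb.append(label)
    return Xb, yb


def tri_train(X_lab, y_lab, X_unlab, seed=0, max_rounds=20):
    rng = random.Random(seed)
    # Step 1: train three initial classifiers on the small labeled set.
    boots = [stratified_bootstrap(X_lab, y_lab, rng) for _ in range(3)]
    clfs = [CentroidClassifier().fit(Xb, yb) for Xb, yb in boots]
    previous = None
    for _ in range(max_rounds):
        pseudo = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            picked = []
            for x in X_unlab:
                label = clfs[j].predict(x)
                # Step 2: accept x for classifier i only if the other two
                # classifiers and the similarity result agree on its label.
                if label == clfs[k].predict(x) == nearest_label(x, X_lab, y_lab):
                    picked.append((x, label))
            pseudo.append(picked)
        # Step 3: stop once no classifier would receive different training data.
        if pseudo == previous:
            break
        previous = pseudo
        clfs = [
            CentroidClassifier().fit(
                boots[i][0] + [x for x, _ in pseudo[i]],
                boots[i][1] + [lab for _, lab in pseudo[i]],
            )
            for i in range(3)
        ]
    return clfs


def vote(clfs, x):
    """Final prediction: majority vote of the three classifiers."""
    return Counter(c.predict(x) for c in clfs).most_common(1)[0][0]
```

With two hypothetical numeric features per user and labels 0 (legitimate) and 1 (spammer), `tri_train(X_lab, y_lab, X_unlab)` returns the three classifiers and `vote(clfs, x)` classifies a new user. The similarity check acts as a third opinion that filters out pseudo-labels the two peer classifiers agree on by mistake.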
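The sample selection steps of the second work can be sketched in the same spirit. The abstract does not say how the information entropy of a candidate is computed, so this sketch assumes it is the Shannon entropy of class probabilities supplied by some model. The callables `predict_proba` and `oracle` (standing in for the manual labeling step), the `radius` threshold that defines "similar", and the function names are all hypothetical.

```python
import math
import random


def entropy(probs):
    """Shannon entropy in bits of a class-probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_samples(pool, predict_proba, oracle, n_select, radius, seed=0):
    """Pick up to n_select samples to label, spread over different regions."""
    rng = random.Random(seed)
    remaining = list(range(len(pool)))
    chosen = []  # (index into pool, label from the oracle)
    while remaining and len(chosen) < n_select:
        # Step 1: a random sample plus the users similar to it form the candidates.
        anchor = rng.choice(remaining)
        candidates = [
            i for i in remaining if math.dist(pool[i], pool[anchor]) <= radius
        ]
        # Step 2: label the candidate with the highest information entropy.
        best = max(candidates, key=lambda i: entropy(predict_proba(pool[i])))
        chosen.append((best, oracle(pool[best])))
        # Exclude samples similar to the newly labeled one, so that later
        # picks come from other regions of the feature space.
        remaining = [
            i for i in remaining if math.dist(pool[i], pool[best]) > radius
        ]
    return chosen
```

Each iteration removes the labeled sample and its neighbourhood from the pool, so the loop finishes after at most `len(pool)` rounds and no two labeled samples lie within `radius` of each other.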
Keywords/Search Tags: micro-blog, semi-supervised learning, spammer detection, sample selection