A Data-Driven Detection Method for Spam

Posted on: 2017-08-09
Degree: Master
Type: Thesis
Country: China
Candidate: Q X Liu
Full Text: PDF
GTID: 2348330509450137
Subject: Control engineering
Abstract/Summary:
As a by-product of email, spam has seriously affected people's lives and work, and detecting it has become an urgent problem. This thesis therefore studies the detection of common text spam. The work is as follows:
(1) A spam detection method based on the boosting tree is proposed. The boosting tree uses a decision tree as the base classification algorithm within the boosting framework. Different training sample sets are derived from the text of historical mail (the training set) through the boosting framework, and a base decision tree classifier is trained on each sample set. After T rounds of training, T base decision tree classifiers are obtained. These base classifiers are then combined by weighting to produce the final classifier, which is used to detect and classify email. A comparison of the traditional Bayes algorithm, the decision tree algorithm and the boosting tree algorithm shows that the spam classifier based on the boosting tree algorithm outperforms the others.
(2) A spam detection method based on random forest is proposed. Random forest uses the decision tree as its base classifier. Multiple training sample sets are drawn from the training set by bagging, and a decision tree model is built for each of them. Each decision tree then “votes” with its classification result, and the class that receives the most votes is taken as the final classification. Simulation results for the random forest and decision tree algorithms show that the spam classifier based on the boosting tree algorithm performs better.
(3) A new two-step elastic net-decision tree spam classification algorithm is proposed. It compensates for the poor classification performance that results when some algorithms are used for feature reduction or regression analysis. First, the elastic net algorithm is used to reduce the dimension of the email text data. The resulting low-dimensional variables are then used as inputs to the decision tree, which determines the mail's category. Simulation results show that the two-step elastic net-decision tree classification algorithm has clear advantages over the PLS, PCA and Lasso algorithms.
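The boosting procedure described in item (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not name a boosting variant, so AdaBoost-style sample reweighting is assumed here, the base learners are depth-one decision trees (stumps) that test for the presence of a single word, and the toy messages and all function names are hypothetical.

```python
import math

# Hypothetical toy training set: label 1 = spam, -1 = legitimate mail.
TRAIN = [
    ("win free money now", 1),
    ("free prize claim now", 1),
    ("cheap loans win big", 1),
    ("meeting agenda for monday", -1),
    ("project report attached", -1),
    ("lunch on monday", -1),
]

def tokenize(text):
    """Represent a message as the set of lower-cased words it contains."""
    return set(text.lower().split())

def stump_predict(stump, words):
    """Depth-one tree: return `polarity` if the word is present, else the opposite."""
    word, polarity = stump
    return polarity if word in words else -polarity

def fit_stump(docs, labels, weights, vocab):
    """Choose the (word, polarity) stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for word in vocab:
        for polarity in (1, -1):
            err = sum(w for d, y, w in zip(docs, labels, weights)
                      if stump_predict((word, polarity), d) != y)
            if err < best_err:
                best, best_err = (word, polarity), err
    return best, best_err

def adaboost(train, rounds):
    """Train `rounds` stumps (T in the abstract), reweighting samples each round."""
    docs = [tokenize(text) for text, _ in train]
    labels = [label for _, label in train]
    vocab = sorted(set().union(*docs))
    weights = [1.0 / len(docs)] * len(docs)
    ensemble = []
    for _ in range(rounds):
        stump, err = fit_stump(docs, labels, weights, vocab)
        err = min(max(err, 1e-10), 1 - 1e-10)    # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this base classifier
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, d))
                   for d, y, w in zip(docs, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def classify(ensemble, text):
    """Weighted vote of all base classifiers: 1 = spam, -1 = legitimate."""
    words = tokenize(text)
    score = sum(alpha * stump_predict(stump, words) for alpha, stump in ensemble)
    return 1 if score > 0 else -1

if __name__ == "__main__":
    model = adaboost(TRAIN, rounds=3)
    for message in ("claim your free prize", "monday project meeting"):
        print(message, "->", "spam" if classify(model, message) == 1 else "legitimate")
```

Each round fits one base classifier to the current sample weights, which corresponds to the "different training samples" in item (1), and the final decision is the weighted sum of the T base classifiers. A realistic system would use deeper trees and a much larger labelled corpus.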
Keywords/Search Tags: spam, boosting tree algorithm, random forest, elastic net-decision tree method