Research And Application Of Personalized Spam Filtering Technology Based On Ensemble Learning

Posted on: 2021-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: P C Xiang
Full Text: PDF
GTID: 2428330614471704
Subject: Computer technology
Abstract/Summary:
With the advent of the Internet era, e-mail has become an indispensable way for people to transmit information in their daily work and study, owing to its convenience and speed. Meanwhile, the emergence of spam has caused many problems: spam consumes network resources, distracts users from work and study, threatens user privacy, and degrades the Internet environment. Research on spam filtering technology therefore has great practical significance. To address shortcomings in current research on spam filtering, this thesis proposes the following solutions:

(1) To address the problem that existing spam filtering techniques extract incomplete mail features, this thesis proposes Ada-CK, a mail classification method based on ensemble learning that exploits the structure of a mail message by dividing it into a mail header and a mail body. A CART decision tree classifier is built on the behavioral features of the header, and a K-nearest neighbor classifier is built on the semantic features of the body text. Within the K-nearest neighbor classifier, the thesis puts forward an improved text similarity comparison method based on a similarity threshold: text keywords are divided into approximate words and general words, a similarity is computed for each group separately, and the two are linearly combined to obtain the final text similarity. Following the ensemble learning idea of AdaBoost, the CART decision tree for the mail header and the K-nearest neighbor method for the mail body serve as base classifiers. After multiple base classifiers are trained with different sample weights and sample features, the classification result and weight (discourse power) of each base classifier are obtained, and from these the final mail classification result. In the experiments, Ada-CK is compared with Ada-CART and Ada-KNN, which each use a single type of base classifier, and with the other methods Co-PRFC, L1-SVM and TSVM-NB. The results show that Ada-CK clearly outperforms the other methods on the precision metric, which matches the high-accuracy requirement of mail applications.

(2) Because different mailbox users perceive spam differently, the thesis proposes ALUP, an active learning method based on user personalization. By introducing the concept of a user interest set derived from e-mail text, the thesis puts forward a user interest set model and a specific classification method. In the incremental learning process, active learning is used to select incremental samples with high uncertainty, based on the distribution density of the samples, for update training. This avoids the high time complexity caused by adding all incremental samples to the training set. In the experiments, ALUP is compared with the other mail classification methods ALNSTC, SVM-AL and MFL. The results show that ALUP maintains high classification accuracy while significantly reducing time consumption, meeting the accuracy and speed requirements of online mail applications and adapting to each user's personal preferences.
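The abstract does not give the formulas Ada-CK uses, so the following is only a minimal sketch of the standard AdaBoost rule for turning a base classifier's weighted error into its "discourse power" and combining the votes. The function names, the +1/-1 label encoding, and the error rates are hypothetical and chosen for illustration; they are not taken from the thesis.

```python
import math


def classifier_weight(error_rate):
    """Standard AdaBoost weight for a base classifier with the given weighted error.

    Assumption: this is the textbook formula 0.5 * ln((1 - e) / e);
    the thesis may use a different variant.
    """
    eps = 1e-10  # clamp to avoid log(0) or division by zero
    e = min(max(error_rate, eps), 1 - eps)
    return 0.5 * math.log((1 - e) / e)


def weighted_vote(predictions, weights):
    """Combine base-classifier votes (+1 = spam, -1 = legitimate) by weighted sum."""
    score = sum(w * p for p, w in zip(predictions, weights))
    return 1 if score >= 0 else -1


# Hypothetical error rates: a header classifier (e.g. CART) and a body classifier (e.g. KNN).
header_alpha = classifier_weight(0.10)
body_alpha = classifier_weight(0.30)

# The header classifier votes spam, the body classifier votes legitimate.
label = weighted_vote([1, -1], [header_alpha, body_alpha])
```

Under these assumed error rates the more accurate header classifier carries the larger weight, so its vote decides the final label when the two base classifiers disagree.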
Keywords/Search Tags: Spam filtering, Ensemble learning, Decision tree, K-nearest neighbor, User interest set, Active learning