Research On Spam Filtering Technology Based On Bayesian Classification

Posted on: 2021-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: L Wang
Full Text: PDF
GTID: 2428330647967280
Subject: Control engineering
Abstract/Summary:
With the rapid development of internet technology, email has become an indispensable part of people's daily work thanks to its convenience, speed, and environmental friendliness. At the same time, spam has severely affected users and caused substantial financial losses and security threats to society, so research on spam filtering technology is of great significance. Building on existing theory and research, this thesis systematically introduces spam filtering methods and proposes several improvements that address shortcomings of the current naive Bayes approach to spam filtering. The main research contents are as follows:
(1) Anti-spam technologies are studied in depth, including email pre-processing, text representation models, and feature extraction. The principle and origin of the naive Bayes classification algorithm are studied, and its advantages and disadvantages in text classification are analyzed.
(2) The principle of the random forest algorithm and its advantages in feature selection are analyzed, and a classification algorithm that combines random forest with naive Bayes is proposed. To address the curse of dimensionality that is common in spam filtering systems, random forest feature selection is used to filter out feature words with a Gini impurity of zero in the mail set, and the naive Bayes algorithm is then used to calculate the posterior probability of the test mail after feature selection, which gives the category to which the test message belongs.
(3) A naive Bayes classification algorithm based on a tree structure is proposed. The naive Bayes algorithm consumes a lot of system and network resources during the training stage that precedes classification, which seriously reduces classification efficiency. To address this, a tree structure replaces the array originally used in the algorithm to maintain the number of occurrences of feature words in the training samples. In addition, to address the poor classification performance of the Bayesian algorithm when the number of mail sample attributes is large, a squaring operation is applied to the conditional probabilities of feature words.
(4) The classification performance of the filtering algorithms is tested on a purpose-built mail filtering system. The experimental results show that the naive Bayes algorithm combined with random forest has better classification performance than the original algorithm. The tree-structured naive Bayes algorithm also outperforms the original algorithm: it takes significantly less time to train on email samples, and as the number of samples grows its training time increases only slowly. By selecting an appropriate value z for the number of squaring operations, the false positive rate for spam is reduced, and the improved algorithm filters spam more effectively.
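
The idea in item (3) of keeping feature-word counts in a tree rather than an array can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the character trie, the class names `TrieNode` and `TrieNaiveBayes`, the Laplace smoothing, and the way the exponent `z` is applied (raising every conditional probability to the power `z`, uniformly for all classes) are assumptions made for illustration only. The random forest feature selection step is not covered here.

```python
import math
from collections import defaultdict


class TrieNode:
    """One character of a feature word; counts are stored on the word's last node."""
    __slots__ = ("children", "counts")

    def __init__(self):
        self.children = {}               # next character -> TrieNode
        self.counts = defaultdict(int)   # label -> occurrences of the word ending here


class TrieNaiveBayes:
    """Multinomial naive Bayes whose word counts live in a character trie (hypothetical)."""

    def __init__(self, z=1.0):
        self.root = TrieNode()
        self.z = z                            # assumed exponent on P(word | label)
        self.doc_counts = defaultdict(int)    # label -> number of training messages
        self.word_totals = defaultdict(int)   # label -> total word occurrences
        self.vocab_size = 0                   # number of distinct words seen

    def _node(self, word, create=False):
        """Walk the trie along `word`; return its last node, or None if absent."""
        node = self.root
        for ch in word:
            nxt = node.children.get(ch)
            if nxt is None:
                if not create:
                    return None
                nxt = node.children[ch] = TrieNode()
            node = nxt
        return node

    def train(self, words, label):
        """Add one tokenized message with its label to the counts."""
        self.doc_counts[label] += 1
        for word in words:
            node = self._node(word, create=True)
            if not node.counts:
                self.vocab_size += 1          # first occurrence of this word
            node.counts[label] += 1
            self.word_totals[label] += 1

    def log_posteriors(self, words):
        """Unnormalized log posterior for each label, with Laplace smoothing."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)
            denom = self.word_totals[label] + self.vocab_size
            for word in words:
                node = self._node(word)
                count = node.counts.get(label, 0) if node else 0
                # P(word | label) ** z becomes z * log P(word | label) in log space
                score += self.z * math.log((count + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, words):
        scores = self.log_posteriors(words)
        return max(scores, key=scores.get)


model = TrieNaiveBayes(z=1.0)
model.train(["win", "cash", "now"], "spam")
model.train(["free", "cash", "prize"], "spam")
model.train(["meeting", "at", "noon"], "ham")
model.train(["project", "report", "attached"], "ham")
print(model.classify(["free", "cash"]))
```

Looking up or updating a word costs time proportional to the word's length, independent of vocabulary size, which is one plausible reason a tree could train faster than scanning an array. Setting `z=2` in this sketch corresponds to squaring each conditional probability once.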
Keywords/Search Tags: spam, random forest, naive Bayes, feature selection