Research And Implementation Of Spam Filtering System Based On Improved Bayesian Algorithm

Posted on: 2022-09-29 | Degree: Master | Type: Thesis
Country: China | Candidate: R F Huang | Full Text: PDF
GTID: 2518306347492864 | Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rapid development of the Internet, e-mail has become an important communication tool in daily life. E-mail content is diverse: it is not limited to text but also includes pictures, videos and other multimedia information. It is easy to send and receive, low in cost and highly efficient, and it largely meets the communication needs of study, work and life. At the same time, however, the volume of spam is growing explosively. Spam is typically deceptive, inflammatory, commercial or unhealthy in content, and it seriously affects people's daily life. Anti-spam technology has long been an active research topic worldwide, and finding an accurate and efficient anti-spam technique is particularly important.

Classic spam filtering technologies include blacklist- and whitelist-based filtering, rule-based filtering, and filtering based on statistical classification of content. Spam filtering based on the Bayesian algorithm belongs to the last category. The Bayesian algorithm derives from Bayes' theorem in classical probability theory and therefore rests on rigorous mathematical logic. Compared with other algorithms, the Bayesian algorithm offers higher accuracy and faster computation in spam filtering. Compared with the general Bayesian model, the naive Bayesian model greatly reduces model complexity because it assumes that the feature attributes are independent of each other. However, the commonly used naive Bayes model also has shortcomings. In natural-language text classification, the feature attributes of documents are correlated to some degree, which violates the independence assumption and leads to errors in the prediction results.

To improve the accuracy of the Bayesian algorithm in spam processing, this thesis improves the algorithm by setting a judgment threshold in the classification module. The threshold is chosen with the practical situation in mind: users want normal mail not to be misclassified as spam, so as to avoid greater economic losses.

This thesis designs and implements a spam filtering system based on the improved Bayesian algorithm, with particular attention to the modularization of the system. The system is divided into four modules: a preprocessing module, a training module, a classification module and a cross-validation module. The process and implementation of each module are described in detail to facilitate later modification and maintenance. Together, the four modules carry out the tasks of the system, including document parsing, word segmentation, data cleaning, feature selection, probability calculation, training of the thesaurus, classification judgment and cross-testing.

The data set is a corpus downloaded from the Kaggle forum, in which the proportion of spam to normal e-mail closely simulates the proportion of e-mail types in real life, so that the subsequent experiments are more generally applicable. To verify the effectiveness and superiority of the improved Bayesian algorithm, two groups of experiments were carried out, one on the selection of the judgment threshold and one on the number of training samples. By comparing accuracy, recall and F1 value and analyzing the trend graphs, the best judgment threshold and the best number of training samples were obtained, and the improved Bayesian algorithm was verified to filter more accurately than the naive Bayes algorithm.
Keywords/Search Tags: spam filtering, Bayesian theorem, Bayesian classifier, judgment threshold, number of training samples
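
As an illustration only, the judgment-threshold idea described in the abstract might be sketched as follows. The abstract does not give the thesis's formulas, so the details here are assumptions: a multinomial naive Bayes model with Laplace smoothing, a whitespace tokenizer standing in for real word segmentation, and a rule that labels a message as spam only when its posterior spam probability reaches a configurable threshold. The names `ThresholdNaiveBayes`, `tokenize`, `threshold` and `alpha` are hypothetical and do not come from the thesis.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real word segmentation."""
    return text.lower().split()


class ThresholdNaiveBayes:
    """Naive Bayes spam filter with a judgment threshold (illustrative sketch).

    Assumption: a message is labeled spam only if P(spam | message) >= threshold,
    so a higher threshold makes it harder to misclassify normal mail as spam.
    """

    def __init__(self, threshold=0.9, alpha=1.0):
        self.threshold = threshold  # minimum posterior needed to call a message spam
        self.alpha = alpha          # Laplace smoothing constant
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}
        self.vocab = set()

    def fit(self, texts, labels):
        """Count words per class; expects at least one example of each class."""
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def _log_joint(self, tokens, label):
        """Return log P(label) + sum of log P(token | label) over known tokens."""
        total_docs = sum(self.doc_counts.values())
        log_p = math.log(self.doc_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # words never seen in training are skipped
                log_p += math.log((self.word_counts[label][tok] + self.alpha) / denom)
        return log_p

    def spam_probability(self, text):
        """Posterior P(spam | text), computed from the two log joint scores."""
        tokens = tokenize(text)
        diff = self._log_joint(tokens, "ham") - self._log_joint(tokens, "spam")
        if diff > 700:  # exp() would overflow; the posterior is effectively 0
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))

    def predict(self, text):
        """Apply the judgment threshold to the posterior spam probability."""
        return "spam" if self.spam_probability(text) >= self.threshold else "ham"


if __name__ == "__main__":
    # Toy training data, made up for this example.
    texts = [
        "win free money now",
        "free prize claim now",
        "meeting at noon tomorrow",
        "lunch with the project team",
    ]
    labels = ["spam", "spam", "ham", "ham"]
    model = ThresholdNaiveBayes(threshold=0.9).fit(texts, labels)
    for message in ["free money now", "meeting tomorrow"]:
        print(message, "->", model.predict(message),
              round(model.spam_probability(message), 3))
```

Raising `threshold` trades recall on spam for fewer false positives on normal mail, which is the trade-off the abstract says the threshold experiments explore. The best value would have to be found empirically on a real corpus.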