Research On Spam Text Classification Based On Improved Naive Bayes Algorithm

Posted on: 2022-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: G Peng
Full Text: PDF
GTID: 2518306602470674
Subject: Computer technology

Abstract/Summary:
With the rapid development of communication technology and the popularity of 5G networks, e-mail data is growing explosively. As spam proliferates, classifying and filtering it becomes increasingly urgent. The naive Bayes algorithm performs well on spam data of this volume and variety. However, because naive Bayes rests on the assumption that feature attributes are independent, and because experiments are often limited to a single sample data set, model training can be insufficient, data redundancy large, and classification accuracy unsatisfactory. This paper proposes optimizations of traditional naive Bayes for spam text filtering that both compensate for the limitation of a single sample data set and introduce a feature-weighting strategy into the feature classification step. The main contributions of this paper are as follows:

(1) To address the traditional naive Bayes assumption of independent feature attributes, this paper combines the naive Bayes classification model with an improved word-segmentation feature-weighting strategy. For sample data with multiple categories and multiple features, a weight value and a feature class are attached to each feature attribute by adding a semaphore to the sample data, which enriches the feature extraction of the word-segmentation vectors. The naive Bayes classification model then computes the posterior probability of each sample from the conditional probabilities, realizing the classification and filtering of spam text.

(2) Building a naive Bayes classification model requires a large data set to train and test the model thoroughly, but the sample data selected in this paper is limited. To overcome the limitations of a single data set and insufficient model training, this paper adopts a ten-fold cross-validation modeling strategy: in the data preprocessing stage, the standard data set is partitioned according to the cross-validation scheme, and the naive Bayes classification model is then constructed on these partitions. This strategy effectively mitigates the data-set limitation.

The experimental results show that the proposed methods yield a clear improvement in spam text classification. Regarding the single-data-set limitation, ten-fold cross-validation both increases the usable sample data and provides sufficient training for the classification model. Regarding the word-segmentation feature-weighting strategy, the feature-independence limitation is effectively avoided; the improved method also executes efficiently, reducing data redundancy and improving the accuracy of spam text classification.
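To make the feature-weighting idea concrete, the following minimal Python sketch shows a weighted naive Bayes spam classifier. The thesis's own weighting scheme is not detailed in this abstract, so TF-IDF weights over pre-segmented tokens (scikit-learn's TfidfVectorizer feeding MultinomialNB) stand in for it; the messages and labels are hypothetical.

# Minimal sketch of a feature-weighted naive Bayes spam classifier.
# The thesis's specific weighting scheme is not given in the abstract,
# so TF-IDF weighting over pre-segmented tokens stands in for it here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, already word-segmented messages (hypothetical data).
texts = ["win free prize now", "meeting agenda attached",
         "cheap loans click here", "lunch tomorrow ?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# TF-IDF assigns each token a weight instead of a raw count, one common
# way to soften the equal-importance (independence) assumption.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["free prize meeting"]))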
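The ten-fold cross-validation strategy can likewise be sketched with scikit-learn's cross_val_score; the toy corpus below is a hypothetical stand-in for the standard data set used in the thesis.

# Minimal sketch of ten-fold cross validation for a naive Bayes text
# classifier; the corpus is a hypothetical stand-in, repeated so that
# every one of the ten folds still contains both classes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

spam = ["win free prize now", "cheap loans click here", "limited offer act fast"]
ham = ["meeting agenda attached", "lunch tomorrow ?", "please review the report"]
texts = (spam + ham) * 10
labels = ([1] * len(spam) + [0] * len(ham)) * 10

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# cv=10 splits the data into ten folds: each fold serves once as the
# held-out test set while the other nine folds are used for training.
scores = cross_val_score(model, texts, labels, cv=10)
print(scores.mean())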
Keywords/Search Tags:spam, naive Bayes, Ten-fold cross validation, feature weight