Research And Design Of Multiple Mail Filtering System Based On BP Neural Network

Posted on: 2019-04-02  Degree: Master  Type: Thesis
Country: China  Candidate: Z K Wang  Full Text: PDF
GTID: 2428330590478652  Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, people use e-mail more and more frequently, and it has gradually become an important communication medium. However, as e-mail has become widespread, spam has proliferated, is poorly controlled, and even disrupts people's normal work and life. Existing spam filtering still has many shortcomings and does not filter spam well. To address these shortcomings, research on strengthening spam filtering technology is particularly important.

This study designs a spam filtering system model based on a statistical method. The models are trained with the BP neural network learning algorithm. In the experiments, the public PU corpus is preprocessed and used for training to obtain a large number of models, from which models are then selected. Finally, the main model and the multiple collaborative models of the spam filtering system are obtained by combining the selected models. During filtering, the mail is split into multiple data streams that enter the FC layer, and each stream produces its own result in the Output layer. The results are then weighted according to each sub-model's false reporting rate (FALLOUT) to obtain the final judgment.

The preprocessing stage includes word frequency statistics based on Hadoop, dictionary dimensionality reduction based on an improved TF algorithm, and generation of the vector matrix of the vector space model (VSM). The word frequency statistics produce a feature word list for all mails, a feature word list for ham, a feature word list for spam, and a feature word list for each individual mail. The traditional TF algorithm is improved for data preprocessing, and the word statistics are used to reduce the dimension of the corpus feature word set. The dimension is kept within 2000, and better experimental results are obtained. The sparse matrix in VSM form is generated by a Java program.

To select the main and auxiliary models, the data are partitioned into three subgroups, A, B and C. The subgroups are combined into the training schemes A+B_C, A+C_B and A_B+C. The main and auxiliary models are then chosen by calculating each model's average accuracy in simulation. Model selection is a key part of this research. The experiments compare the models under different matching schemes, compare the optimal single model with a model trained by the SVM algorithm, and compare the optimal single model with the combined system model, verifying the performance of the system model step by step. At the end of the experiments, the performance of the system model is further tested and evaluated by calculating the recall rate, correct rate, F value, accuracy, AUC (Area Under Curve) value, computational cost measured in MACCs and FLOPS, and memory occupancy. The final conclusion is that an odd number of optimal models should be combined into one classifier. Through multiple filtering, the judgment accuracy and the generalization ability of the system can be improved, and false positive judgments of legitimate mail can be effectively reduced.
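The fallout-weighted combination of sub-model outputs described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual weighting formula, so the rule used here (weight proportional to 1 - fallout, normalised to sum to 1) is an assumption, as are the function names, the 0.5 decision threshold, and the toy data. It is not the thesis's implementation.

```python
from typing import List, Sequence


def fallout(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """False-positive rate: share of ham (label 0) wrongly flagged as spam (label 1)."""
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    ham_total = sum(1 for t in y_true if t == 0)
    return false_pos / ham_total if ham_total else 0.0


def fallout_weights(fallouts: Sequence[float]) -> List[float]:
    """Assumed rule (not from the thesis): weight ~ (1 - fallout), normalised to sum to 1."""
    raw = [1.0 - f for f in fallouts]
    total = sum(raw)
    if total == 0:
        # Every sub-model has fallout 1.0: fall back to equal weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def combine(scores: Sequence[float], weights: Sequence[float],
            threshold: float = 0.5) -> int:
    """Weighted sum of sub-model spam scores; returns 1 (spam) or 0 (ham)."""
    combined = sum(s * w for s, w in zip(scores, weights))
    return 1 if combined >= threshold else 0


# Toy validation labels (0 = ham, 1 = spam) and predictions of three sub-models.
y_true = [0, 0, 0, 0, 1, 1]
m1 = [0, 0, 0, 0, 1, 1]  # no ham misclassified
m2 = [1, 0, 0, 0, 1, 1]  # one of four ham mails misclassified
m3 = [1, 1, 0, 0, 1, 0]  # two of four ham mails misclassified

weights = fallout_weights([fallout(y_true, m) for m in (m1, m2, m3)])

# Hypothetical spam scores of the three sub-models for one new mail.
scores = [0.1, 0.9, 0.6]
decision = combine(scores, weights)
```

In this toy case the sub-model with the lowest fallout gets the largest weight, so its low spam score pulls the combined score below the threshold and the mail is kept as ham. With equal weights the same scores would cross the threshold and the mail would be flagged as spam.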
Keywords/Search Tags:Mail Filtering, VSM Formal Matrix, Primary And Secondary Multi-Filter Model, AUC Value, Performance Evaluation