Email Classification Based On Word2vec

Posted on: 2021-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: P E Miao
Full Text: PDF
GTID: 2428330602476677
Subject: Software engineering
Abstract/Summary:
With the continuous development of information technology, email has become the most widely used service on the Internet because it is cheap, practical, and immediate. Its convenience greatly facilitates people's daily communication and has given a tremendous boost to economic development, but it has also brought an unwelcome by-product: spam. The spread of spam causes great economic losses and threatens information security. It also harms the social atmosphere, distorts people's outlook on life and values, and causes many social problems. It undermines people's confidence in network communication, which hinders the development of the Internet. How to solve the spam problem and improve spam filtering technology has therefore become an urgent question.

In current domestic and international research on spam filtering, classification based on email content has become the mainstream. However, traditional machine learning algorithms inevitably run into problems in text representation, such as excessively high dimensionality, overly sparse data, features that are treated as independent of one another, and the loss of too many important features, so classification accuracy falls short of what is expected. Working from the content of the email, this thesis uses the Skip-gram model in Word2vec with the negative sampling strategy to train distributed word vectors for the text, and adjusts the model to counter overfitting. The work of this thesis is as follows:

(1) The data set is Trec06c, the public Chinese spam corpus provided by the International Text Retrieval Conference. Word vectors are trained after word segmentation, with the dimension of the word vectors uniformly set to 200. Because email content varies in length, the word vectors of each email are combined by weighted averaging and then fed to the BP neural network model as input. Compared with One-hot vectors, the prediction accuracy of the model improves.

(2) Because deep learning models have a large number of parameters, they are prone to overfitting and slow computation. A Dropout layer is therefore added to the model, and the data set is divided into batches of a fixed batch size that are fed to the model one batch at a time. Based on the accuracy convergence curves, multiple comparison experiments are run to choose the best Dropout coefficient, which gives the model better generalization.

(3) In the past, the Sigmoid classifier was generally used for binary classification. However, the Sigmoid function converges more slowly the closer it gets to its limits, so this thesis uses the Softmax classifier instead and adds an L2 regularization penalty term to it, which makes the model more robust to the samples.

(4) To improve the convergence speed and to make up for the drawback of the traditional gradient descent method, in which the learning rate is fixed and cannot be adjusted, the Adam algorithm, an adaptive learning rate optimization algorithm, is used instead.

(5) The best results are compared with the traditional Bayesian model and the KNN model. The improved BP neural network model is better than these traditional machine learning algorithms in accuracy, precision, and recall.
Keywords/Search Tags:Spam, Word Vector, Machine Learning, Overfitting, Speed Up Convergence
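
The building blocks named in the abstract (weighted averaging of word vectors, Dropout, a Softmax classifier with an L2 penalty, and the Adam update) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the per-word weights, the penalty coefficient `lam`, and the Adam defaults (`lr=0.001`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are all hypothetical choices, since the abstract does not specify them. Toy 2-dimensional vectors stand in for the 200-dimensional ones.

```python
import math
import random


def weighted_average(word_vectors, weights):
    """Collapse one email's word vectors into a single fixed-length vector.

    The weighting scheme is an assumption; the abstract only says
    that a weighted average is used.
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(word_vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(word_vectors, weights)) / total
        for i in range(dim)
    ]


def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`
    and rescale the surviving units by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def l2_cross_entropy(probs, label, weights, lam):
    """Cross-entropy for the true class plus lam * sum of squared weights."""
    penalty = lam * sum(w * w for row in weights for w in row)
    return -math.log(probs[label]) + penalty


def adam_step(param, grad, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v
```

In a full model these pieces would be applied in order: `weighted_average` produces the network input for each email, `dropout` is applied to hidden activations during training only, `softmax` and `l2_cross_entropy` give the training loss, and `adam_step` is applied to every parameter with its own `m` and `v` state.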