Research On Chinese Spam Filtering Based On SDA

Posted on: 2020-12-19
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Zhang
GTID: 2428330590977365
Subject: Computer application technology
Abstract/Summary:
With the rapid development of internet technology, email is widely used on the internet because of its low cost and convenient transmission, and it has gradually become an important communication tool. Driven by commercial interests, however, enterprises and individuals often use mass mailings for marketing, so that the number of spam messages users receive far exceeds the number of legitimate emails. According to statistics, the daily volume of fake emails worldwide in 2018 is estimated to be as high as 6.4 billion. The FBI recently reported that the cost of corporate email fraud (business email compromise, BEC) has reached $12 billion over the past few years. Research on anti-spam technology is therefore necessary.

Traditional spam filtering methods suffer from low accuracy and from difficulty extracting data features when applied to Chinese spam. Shallow neural network models cannot represent the complex objective functions involved in spam filtering, and they easily fall into local optima during training. This paper proposes a Chinese spam filtering method based on the stacked denoising autoencoder (SDA). First, the deep network is pre-trained layer by layer without supervision to initialize the network parameters; then, through supervised learning, the parameters are fine-tuned by backpropagation to obtain the optimal model parameters. The resulting deep network model can filter short-text Chinese spam. The model is further optimized to address known weaknesses of deep networks: slow training, poor robustness, sensitivity to noise, and a tendency to over-fit.

The main work of this paper is as follows:

(1) 11,360 items of short-text Chinese spam data are extracted from the TREC06C dataset as sample data. The CBOW model in Word2vec is then used to obtain the word vectors needed for deep network classification, and the stacked denoising autoencoder, a deep learning model used in natural language processing, is applied to Chinese spam filtering.

(2) Because the deep network is prone to over-fitting during training, Dropout is added to the W2C_SDA deep network Chinese spam filtering model. Experiments show that with Dropout the results are more stable and the network generalizes better.

(3) The W2C_SDA model consists of a stacked denoising autoencoder and a Softmax classifier. To speed up convergence and alleviate over-fitting, L2 regularization is added to the Softmax classifier. Experiments show that adding L2 regularization improves accuracy by 0.2%.

(4) The optimal parameters of the W2C_SDA deep network are obtained experimentally, and its classification performance is compared with that of a Bernoulli Bayesian filtering model and a KNN filtering model on the same dataset. Experiments show that this method is more effective for Chinese spam filtering than either of those two models.
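
The unsupervised layer-by-layer pre-training step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the masking noise, sigmoid units, tied weights, squared reconstruction error, layer sizes, learning rate, and the random toy data standing in for word vectors are all assumptions made for illustration, and the function names (`corrupt`, `train_dae_layer`, `pretrain_stack`) are hypothetical. The supervised fine-tuning stage with the Softmax classifier, Dropout, and L2 regularization is omitted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def corrupt(x, noise_level, rng):
    """Masking noise (assumed): zero each entry with probability noise_level."""
    mask = rng.random(x.shape) >= noise_level
    return x * mask


def train_dae_layer(x, n_hidden, noise_level=0.3, lr=0.5, epochs=300, seed=0):
    """Train one denoising autoencoder layer with tied weights.

    Returns (W, b_hidden, b_visible, losses); losses holds the mean
    squared reconstruction error measured at each epoch.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_visible = x.shape
    W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    losses = []
    for _ in range(epochs):
        x_tilde = corrupt(x, noise_level, rng)
        h = sigmoid(x_tilde @ W + b_h)    # encode the corrupted input
        x_hat = sigmoid(h @ W.T + b_v)    # decode with the transposed weights
        err = x_hat - x                   # reconstruct the CLEAN input
        losses.append(float(np.mean(err ** 2)))
        # Gradient of the squared error (up to a constant factor),
        # averaged over the samples in the batch.
        d_out = err * x_hat * (1.0 - x_hat) / n_samples
        d_hid = (d_out @ W) * h * (1.0 - h)
        grad_W = x_tilde.T @ d_hid + (h.T @ d_out).T  # encoder + decoder paths
        W -= lr * grad_W
        b_h -= lr * d_hid.sum(axis=0)
        b_v -= lr * d_out.sum(axis=0)
    return W, b_h, b_v, losses


def pretrain_stack(x, layer_sizes, **kwargs):
    """Greedy layer-wise pre-training: each layer is trained on the
    clean hidden codes produced by the layer below it."""
    params, histories = [], []
    layer_input = x
    for n_hidden in layer_sizes:
        W, b_h, _, losses = train_dae_layer(layer_input, n_hidden, **kwargs)
        params.append((W, b_h))
        histories.append(losses)
        layer_input = sigmoid(layer_input @ W + b_h)
    return params, histories


# Toy data (assumption): 200 sparse binary vectors standing in for
# the word-vector features of short messages.
toy_rng = np.random.default_rng(42)
toy_x = (toy_rng.random((200, 20)) < 0.2).astype(float)
params, histories = pretrain_stack(toy_x, [10, 5])
print("layer 1 reconstruction error:", histories[0][0], "->", histories[0][-1])
```

In a full pipeline the returned `(W, b_hidden)` pairs would initialize the hidden layers of the classifier, after which all parameters would be fine-tuned with labeled data.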
Keywords/Search Tags:Spam, Word embedding, Stacked denoising autoencoder, Dropout, L2 regularization