Research On Spam-filtering Based On Deep Learning

Posted on: 2020-09-26
Degree: Master
Type: Thesis
Country: China
Candidate: H Huang
GTID: 2428330590996066
Subject: Computer technology
Abstract/Summary:
With the rapid development of Internet applications, advances in advertising technology, and the popularity of email, spam increasingly pervades our lives, and how to identify it effectively has become an active research topic. Natural language is strongly tied to its structure, and converting Chinese mail directly into vectors produces excessively high dimensionality; both affect the accuracy of the final classification. Commonly used spam filtering methods fall into two groups: content-based recognition and email source-based recognition. For example, Naive Bayes text classification is a content-based technique, while whitelisting and blacklisting mechanisms are source-based. As the number and variety of emails grow, rule-based approaches require constant updates to the email signature rule base and a great deal of manual effort. Content-based approaches have yielded promising results, but spammers now distribute large amounts of image spam, which is more difficult to detect and consumes more network bandwidth. This thesis analyzes and summarizes the commonly used spam filtering methods and focuses on deep learning-based classification algorithms to build spam filtering models. The specific research work and contributions are:

1. A Skip-gram based CNNs-Highway mail filtering model (SGCH) is proposed. Earlier word representations are mainly One-hot, whose drawbacks are excessive dimensionality and sparse data; for spam filtering, the semantic information of the words before and after a given word cannot be well preserved. Word embedding, by contrast, produces word vectors that effectively preserve lexical, grammatical, and semantic information. The proposed method uses the Skip-gram model from word embedding to map the word distribution into a low-dimensional space, which solves the problem of excessively high-dimensional One-hot encoded word vectors. It then extracts text features with a cascaded network that combines CNNs using different convolution kernels with a Highway network. Experiments on different Chinese and English mail data sets demonstrate its effectiveness.

2. A spam filtering model based on a deep convolutional neural network and a bidirectional GRU network (DCNN-BiGRU) is proposed. CNNs learn local features very well but cannot learn relationships within sequences. Recurrent neural networks learn relationships within sequences well but cannot learn local information the way convolutional neural networks do. To compensate for the shortcomings of each, this thesis proposes a model that combines an improved very deep convolutional neural network with a GRU network. Experiments on Chinese and English mail data sets demonstrate its effectiveness.

3. A multimodal spam filtering method based on decision-level fusion is proposed. The methods above all improve on the original spam filtering technology. In recent years, however, spammers have embedded spam text in images to evade spam filters. The shortcoming of single-modality mail detection is that it does not analyze all the information in a mail. Building on the two text filtering methods above and on image classification technology, this thesis proposes a multimodal architecture based on decision-level fusion, and experiments demonstrate its effectiveness.
Keywords/Search Tags:spam filtering, text classification, word2Vector, deep learning
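
The abstract does not say which rule the thesis uses to fuse the text and image decisions. As an illustration only, the sketch below shows one common form of decision-level fusion: each modality's classifier outputs a spam probability, and the two are combined by a weighted average before thresholding. The function names (`fuse_scores`, `is_spam`), the weight `w_text`, and the threshold are hypothetical and are not taken from the thesis.

```python
from typing import Optional


def fuse_scores(p_text: Optional[float], p_image: Optional[float],
                w_text: float = 0.5) -> float:
    """Combine per-modality spam probabilities into one score.

    Assumption: a weighted average is used as the fusion rule. If a mail
    has only one modality (no text or no image), that modality's score
    is returned unchanged.
    """
    if not 0.0 <= w_text <= 1.0:
        raise ValueError("w_text must be in [0, 1]")
    if p_text is None and p_image is None:
        raise ValueError("at least one modality score is required")
    if p_image is None:
        return p_text
    if p_text is None:
        return p_image
    # Both modalities present: weighted average of the two decisions.
    return w_text * p_text + (1.0 - w_text) * p_image


def is_spam(p_text: Optional[float], p_image: Optional[float],
            w_text: float = 0.5, threshold: float = 0.5) -> bool:
    """Label a mail as spam when the fused score reaches the threshold."""
    return fuse_scores(p_text, p_image, w_text) >= threshold


# A mail whose text looks harmless but whose embedded image looks like spam:
# the text-only decision misses it, the fused decision does not.
print(is_spam(0.2, None))  # text-only decision
print(is_spam(0.2, 0.9))   # fused text + image decision
```

In this sketch the text score would come from a text model such as SGCH or DCNN-BiGRU and the image score from an image classifier; fusing at the decision level means the two models can be trained and replaced independently.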