Research And Analysis Of Spam Classification Based On CNN Two-Way LSTM Attention Mechanism

Posted on: 2021-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Wu
GTID: 2428330602979037
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of science and technology, e-mail has become a widely used means of communication. Its ubiquity has also brought problems: spam interferes with normal e-mail communication, users often receive irrelevant messages, and some messages induce users to click embedded links, leading to computer virus infections and even unauthorized credit card use, which causes serious harm to e-mail users and the public. At present, there are two main approaches to filtering spam: identification based on the sending source, and identification based on text content. This thesis applies deep learning to build a spam classification model. Building on existing text classification algorithms, it improves the term weight calculation formula and the original neural network model, and proposes a new CNN-BiLSTM-Attention model for the e-mail text classification task.

1. The thesis first introduces the traditional e-mail text classification process, including preprocessing, weight calculation, text vectorization, and the classifier, and describes in detail a variety of classification methods. The application of text classification is covered from two perspectives: machine learning algorithms and neural network algorithms.

2. To address the unbalanced weight allocation of terms in e-mail classification and the shortcomings of Inverse Document Frequency (IDF), the IDF formula is improved by adding a text quantity ratio factor and the chi-square statistic of the features. The improved IDF is combined with the TFC formula to obtain a new weighting algorithm, TFC'. TFC', TF-IDF (term frequency-inverse document frequency), and TFC are compared experimentally to verify the accuracy of the algorithm, with Naive Bayes as the classifier on the resulting vectorized data. The experimental results show that the improved TFC' achieves about 85% accuracy in mail classification.

3. A Convolutional Neural Network (CNN) is combined with a bidirectional Long Short-Term Memory network (Bi-LSTM). The Bi-LSTM captures the global features of the text more accurately, which makes up for the shortcomings of the CNN and, to a certain extent, addresses the problem of feature extraction accuracy in text classification.

4. An attention mechanism is added to better extract the important terms in the text, weighting the input text features according to a probability distribution. The attention mechanism is combined with the Bi-LSTM layer so that term features are extracted in finer detail. The text data, vectorized with the improved weight calculation combined with Word2vec, serves as the input to the CNN layer. The experimental results show that spam classification accuracy increases to 92.7%.
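
As a rough illustration of the baseline weighting that item 2 of the abstract builds on, the following sketch computes TFC weights, taken here to mean TF-IDF with cosine normalization (the usual definition, assumed rather than confirmed by the abstract). The improved TFC' formula, with its text quantity ratio factor and chi-square term, is not given in the abstract and is therefore not reproduced. The function name `tfc_weights` and the toy e-mails are hypothetical.

```python
import math
from collections import Counter


def tfc_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed baseline: weight = tf * log(N / df), then cosine-normalized
    per document. This is NOT the thesis's improved TFC' formula.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # Raw TF-IDF: term count times log(N / df).
        raw = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        # Cosine normalization: scale the document vector to unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append(
            {t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()}
        )
    return weighted


# Hypothetical toy corpus of tokenized e-mails.
emails = [
    ["win", "free", "prize", "click", "link"],
    ["meeting", "agenda", "attached"],
    ["free", "meeting", "room", "available"],
]
weights = tfc_weights(emails)
```

In this toy corpus, a term that appears in only one e-mail (such as "win") receives a higher weight than a term shared across e-mails (such as "free"). That is the behavior IDF-based schemes are designed to produce. The resulting vectors could then be passed to a classifier such as Naive Bayes, as the abstract describes.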
Keywords/Search Tags:spam, text classification, deep learning, neural network, weight