Research On Spam Filtering Algorithm Based On FastText

Posted on:2021-01-09

Degree:Master

Type:Thesis

Country:China

Candidate:M Yan

GTID:2428330611466168

Subject:Software engineering

Abstract/Summary:

E-mail plays an irreplaceable role in the Internet age. Spam classification not only blocks the spread of invalid information and improves user experience, but also blocks harmful information and so reduces potential danger. In the past, many scholars proposed spam classification methods based on traditional machine learning, which suffer from disadvantages such as difficult manual feature selection and long training time. In recent years, deep learning has made remarkable achievements in the field of natural language processing, and fastText, a shallow neural network, performs well in spam classification.

Based on a thorough investigation, this thesis finds that fastText has the following two deficiencies: (1) Noise words are generated in the n-gram feature processing stage. These words appear very frequently but lack actual semantic information, which reduces the accuracy of mail classification. (2) Email texts are short, so sparse vectors and sparse matrices are generated when modeling the vector space. The feature space cannot be fully mapped, which degrades classification.

In response to these problems, this thesis improves the fastText algorithm in three ways.

(1) The TF-fastText algorithm is proposed. The TF-IDF-N algorithm is used in the input layer to calculate the weights of the feature words after n-gram processing. According to these weights, meaningless words with high frequency and low discrimination are removed, which reduces the effect of noise words and noise data and improves the accuracy of mail classification. Experiments combining TF-IDF-N with traditional algorithms show that the TF-IDF-N improvement is effective. TF-fastText is then evaluated against traditional machine learning algorithms and neural network algorithms in email classification experiments. The results show that the algorithm not only improves the accuracy of email classification but also takes less time.

(2) The LDA-fastText algorithm is proposed. After the topic words in the corpus are extracted, they are compared with the original word sequence, and the words under the same topic word are added to the original word sequence. This reduces vector sparsity, helps the hidden layer represent words with highly discriminative features, and improves classification accuracy. The algorithm is evaluated against traditional machine learning algorithms, neural network algorithms, and TF-fastText in email classification experiments. The results show that its classification accuracy is slightly higher, but its time cost is also slightly higher.

(3) The TFL-fastText algorithm is proposed. It combines the advantages of the two algorithms above, removing redundant entries and supplementing the sparse matrix. It is compared in email classification experiments with traditional machine learning algorithms (Naive Bayes, KNN, SVM), neural network algorithms (fastText, RNN, CNN), and the improved TF-fastText and LDA-fastText. The results show that this algorithm has the highest classification accuracy and the lowest time cost, which demonstrates the effectiveness of TFL-fastText.
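
The two preprocessing ideas described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not define TF-IDF-N, so standard TF-IDF is used as a stand-in, and the `topics` dictionary is a hypothetical placeholder for the output of an LDA model trained elsewhere. The function names, the threshold, and the toy emails are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_ngrams(tokens, n=2):
    """Return the unigrams plus word n-grams (joined with '_') of one document."""
    grams = list(tokens)
    grams += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf_filter(docs, threshold=0.0):
    """Drop features whose TF-IDF weight is at or below `threshold`.

    Standard TF-IDF is an assumption here; the thesis's TF-IDF-N variant
    is not specified in the abstract.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each feature occurs.
    df = Counter(f for doc in docs for f in set(doc))
    filtered = []
    for doc in docs:
        tf = Counter(doc)
        keep = [
            f for f in doc
            if (tf[f] / len(doc)) * math.log(n_docs / df[f]) > threshold
        ]
        filtered.append(keep)
    return filtered


def add_topic_words(doc, topics):
    """Append words that share a topic with a word already in the document.

    `topics` maps a topic id to its top words (hypothetical LDA output).
    """
    present = set(doc)
    extra = []
    for words in topics.values():
        if present & set(words):
            extra += [w for w in words if w not in present and w not in extra]
    return doc + extra


# Toy corpus, invented for this example.
emails = [
    "win a free prize now".split(),
    "a meeting is scheduled now".split(),
    "claim a free gift now".split(),
]
features = [word_ngrams(tokens) for tokens in emails]
filtered = tfidf_filter(features, threshold=0.0)
topics = {0: ["free", "prize", "gift", "win"]}
augmented = [add_topic_words(doc, topics) for doc in filtered]
```

In this toy corpus, "a" and "now" occur in every email, so their IDF is zero and they are filtered out as noise, while the first and third emails gain the topic words they were missing. The augmented sequences would then be passed to a fastText classifier; that step is omitted here.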

Keywords/Search Tags:

spam classification, deep learning, fastText, TF-IDF, LDA

Related items

1	Spam Text Classification Method Based On Deep Learning
2	Research On Spam Text Filtering Based On Deep Learning
3	Research And Analysis Of Spam Classification Based On CNN Two-Way LSTM Attention Mechanism
4	Spam Messages Based On Integrated Learning Multiple Classification Study
5	Research On Spam-filtering Based On Deep Learning
6	Research On Key Techniques And Applications In Text Classification
7	Research On Internet Spam Identification Method
8	Content-based Anti-Spam Filtering
9	Image Spam Classification Based On Deep Learning
10	Research On Key Technologies Of Chinese Text Classification Based On Deep Learning