
Research On Spam Filtering Algorithm Based On FastText

Posted on: 2021-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: M Yan
GTID: 2428330611466168
Subject: Software engineering
Abstract/Summary:
E-mail plays an irreplaceable role in the Internet age. Spam classification not only blocks the spread of useless information and improves the user experience, but also blocks harmful information and so reduces potential danger. Many earlier spam classification methods were based on traditional machine learning and suffered from drawbacks such as difficult manual feature selection and long training times. In recent years, deep learning has made remarkable progress in natural language processing, and fastText, a shallow neural network, performs well in spam classification.

After a thorough survey, this thesis identifies two deficiencies in fastText: (1) The n-gram feature processing stage generates noise words. These words appear very frequently but carry little semantic information, which reduces the accuracy of mail classification. (2) Email texts are short, so modeling them in a vector space produces sparse vectors and sparse matrices. The feature space cannot be fully mapped, which degrades classification.

To address these problems, this thesis improves the fastText algorithm in three ways.

(1) The TF-fastText algorithm is proposed. The TF-IDF-N algorithm is used in the input layer to compute weights for the feature words produced by n-gram processing. Based on these weights, meaningless words with high frequency and low discriminative power are removed, which reduces the effect of noise data and improves the accuracy of mail classification. Experiments that combine TF-IDF-N with traditional algorithms show that the TF-IDF-N improvement is effective. TF-fastText is then evaluated alongside traditional machine learning algorithms and neural network algorithms in email classification experiments. The results show that TF-fastText not only improves the accuracy of email classification but also takes less time.

(2) The LDA-fastText algorithm is proposed. After the topic words in the corpus are extracted, they are compared with the original word sequence. Words that belong to the same topic are then added to the original word sequence, which reduces vector sparsity, helps the hidden layer learn highly discriminative word representations, and improves classification accuracy. LDA-fastText is evaluated alongside traditional machine learning algorithms, neural network algorithms, and TF-fastText in email classification experiments. The results show that its classification accuracy is slightly higher, but its time cost is also slightly higher.

(3) The TFL-fastText algorithm is proposed. It combines the advantages of the two algorithms above, removing redundant entries and supplementing the sparse matrix. It is compared in email classification experiments with traditional machine learning algorithms (Naive Bayes, KNN, SVM), neural network algorithms (fastText, RNN, CNN), and the improved TF-fastText and LDA-fastText algorithms. The results show that TFL-fastText achieves the highest classification accuracy and the lowest time cost, which demonstrates its effectiveness.
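
The two preprocessing ideas in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: plain TF-IDF stands in for TF-IDF-N, whose exact weighting the abstract does not give, and a hand-written topic-to-words table stands in for the output of a trained LDA model. The function names, the threshold, and the sample messages are all hypothetical.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_max=2):
    """Return every word n-gram of length 1..n_max as a space-joined string."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def find_noise_ngrams(docs, n_max=2, threshold=0.0):
    """Return n-grams whose highest TF-IDF score in any message is <= threshold.

    Assumption: plain TF-IDF stands in for the thesis's TF-IDF-N weighting.
    """
    grams_per_doc = [word_ngrams(doc, n_max) for doc in docs]
    doc_freq = Counter()
    for grams in grams_per_doc:
        doc_freq.update(set(grams))  # count each n-gram once per message

    best_score = {}
    for grams in grams_per_doc:
        for gram, count in Counter(grams).items():
            tf = count / len(grams)
            idf = math.log(len(docs) / doc_freq[gram])
            best_score[gram] = max(best_score.get(gram, 0.0), tf * idf)
    return {gram for gram, score in best_score.items() if score <= threshold}


def expand_with_topic_words(tokens, topic_words):
    """Append words that share a topic with a word already in the message.

    Assumption: topic_words maps a topic label to its top words, as could be
    read off a trained LDA model; here it is written by hand.
    """
    expanded = list(tokens)
    for words in topic_words.values():
        if any(word in tokens for word in words):
            for word in words:
                if word not in expanded:
                    expanded.append(word)
    return expanded


# Hypothetical toy corpus: three tokenized messages.
docs = [
    "click here to win a prize".split(),
    "please reply to confirm a meeting".split(),
    "send the report to a client".split(),
]
noise = find_noise_ngrams(docs)  # n-grams present in every message score 0
features = [g for g in word_ngrams(docs[0]) if g not in noise]

topic_words = {"lottery": ["win", "prize", "jackpot"], "office": ["meeting", "report"]}
expanded = expand_with_topic_words(["click", "win"], topic_words)
```

In this sketch, filtering removes features that occur in every message, and expansion lengthens a short message with same-topic words, which loosely mirrors the TF-fastText and LDA-fastText steps. Applying the filter and then the expansion would approximate the combined TFL-fastText preprocessing, but the resulting features would still have to be passed to a classifier, which is not shown here.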
Keywords/Search Tags: spam classification, deep learning, fastText, TF-IDF, LDA