Research On Spam Text Filtering Based On Deep Learning

Posted on: 2020-05-20    Degree: Master    Type: Thesis
Country: China    Candidate: X Sun
GTID: 2428330578979397    Subject: Computer technology
Abstract/Summary:
With the rapid development of big data, cloud computing, and Internet of Things (IoT) technologies, applications on the Internet have become increasingly complex and diverse. The large volume of spam messages not only consumes substantial computing and communication resources but also has a negative impact on people's daily lives. Spam text is one of the most important components of spam messages. This thesis studies spam text filtering algorithms and related technologies and proposes novel text filtering algorithms. The main work and contributions are as follows.

(1) A Recurrent Neural Network (RNN) used for sentence classification cannot extract keyword features. To address this, we propose TC-LSTM, a novel algorithm that combines a convolutional neural network (CNN) with LSTM for spam text filtering. Because of its CNN structure, TC-LSTM works well on spam text with obvious keyword features; because of the LSTM, it is also effective on sentences that contain no significant keywords. Experiments show that TC-LSTM outperforms CNN and LSTM on spam text filtering, and experiments on different datasets show that the proposed method is more effective than other typical methods.

(2) We study how different ways of using word embeddings affect performance and verify the results on spam text datasets. We compare three settings: pre-trained word vectors that are kept fixed during training; pre-trained word vectors that are fine-tuned during training; and randomly initialized word vectors that are trained jointly with the model. We run experiments on different spam text datasets and analyze the results to further improve the performance of TC-LSTM.

(3) We propose TC-LSTM-TFIDF, a new algorithm that improves on TC-LSTM. It incorporates TF-IDF to assign a different weight to each word, which improves the performance of TC-LSTM. Because the algorithm takes into account the influence of each word on the classification label, it extracts features better than previous work. Experiments show that the proposed method markedly improves on TC-LSTM and outperforms other typical methods.
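
The abstract does not say how TF-IDF is combined with the network in TC-LSTM-TFIDF. As a rough illustration only, the sketch below assumes the simplest reading: each word's embedding vector is scaled by that word's TF-IDF weight before being passed to the model. The function names `tfidf_weights` and `weight_embeddings`, the smoothed IDF formula, and the toy data are all assumptions made for this example and are not taken from the thesis.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one dict per tokenized document, mapping token -> TF-IDF weight.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, chosen here for
    illustration; the thesis may use a different variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        })
    return weights


def weight_embeddings(doc, doc_weights, embeddings):
    """Scale each token's vector by its TF-IDF weight (assumed combination step)."""
    return [[doc_weights[tok] * x for x in embeddings[tok]] for tok in doc]
```

Under this assumption, a token that is frequent in one message but rare across the corpus (for example a spam keyword) contributes a larger vector to the downstream CNN and LSTM layers than a token that appears in most messages.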
Keywords/Search Tags:Spam Text Filtering, Deep Learning, Model Combination, Text Classification, Natural Language Processing