A Study of Deep Learning-Based Methods of SMS (Short Message Service) Spam Detection

Posted on: 2020-06-25
Degree: Master
Type: Thesis
Country: China
Candidate: T Liang
GTID: 2428330578965051
Subject: Electronic and communication engineering
Abstract/Summary:
With the spread of communication technologies and smartphones, the number of short messages sent and received has risen sharply. A large share of them, however, are unsolicited commercial advertisements, fraudulent messages, and illegal pyramid-scheme promotions that disrupt people's daily lives, along with rumors intended to undermine social harmony and stability. Research on SMS spam detection, and on accurately identifying and filtering spam, therefore has practical and economic value for maintaining national security and social stability. Traditional text classification techniques tend to combine the vector space model (VSM) with a support vector machine (SVM). In terms of text representation, however, VSM ignores the effect that word order, grammar, and semantics have on classification. The Word2vec model addresses this deficiency by training word embeddings, which map words from a high-dimensional space to distributed representations in a lower-dimensional space. This also corrects the inappropriate orthogonal relation between words. Building on the advantages of Word2vec for text representation, this research proposes a PTF-IDF weighted word vector method for text representation, referred to here as the PTF-IDF weighted word vector model. A convolutional neural network (CNN) model for spam message detection is also designed based on deep learning theory. The work of this research is as follows.

First, in the initial exploration stage, this research applies the traditional combination of VSM and SVM to spam message detection. The original short message dataset was preprocessed through steps such as removal of duplicate texts, data cleaning, sentence segmentation, and part-of-speech tagging. A document-term matrix based on the bag-of-words model was then constructed to represent the texts as vectors. An SVM classification model was then built and evaluated on the experimental data: 20,000 short messages extracted from the original dataset, divided into two halves, one for training and one for testing. The results, reported as Precision (P), Recall (R), and F1-score (the evaluation metrics for classification), suggest that the limitations of the VSM restrict its classification performance.

Second, regarding text representation, short messages are brief, contain few feature words, and have strong contextual semantic relations. If word embeddings trained by Word2vec are used directly for text representation, the contribution each word embedding makes to the text cannot be measured. Furthermore, word embeddings alone cannot carry sufficient text information. This research therefore proposes a PTF-IDF weighted word vector method for text representation, which uses part of speech as a supplement to text semantics. The method introduces contribution factors into the TF-IDF algorithm, so that the feature weights of word embeddings are calculated from the perspective of both word frequency and part of speech. In the classification experiment that combined this text representation with the SVM model, the values of the part-of-speech contribution factor were varied. The best classification performance was achieved when the part-of-speech contribution factor values were 0.6, 0.3, and 0.2. Comparisons with the TF-IDF model, the mean word vector model, and the TF-IDF weighted word vector model all indicated that the PTF-IDF weighted word vector model represents text better than the other models.

Third, a convolutional neural network model for spam message detection is designed based on deep learning theory. In the input layer of the CNN, the PTF-IDF weight of each word vector is calculated using the part-of-speech contribution factor values 0.6, 0.3, and 0.2, and, together with the word vectors, the short message text is expressed as a two-dimensional matrix that serves as the input of the CNN model. Three convolution kernels of different sizes were devised to extract local features of short messages at different granularities, which improves the precision of the extracted features. In the pooling layer, the 1-max pooling strategy was employed to further extract the most representative features of each short message. These features were then combined in the fully connected layer and passed to the Softmax layer to detect spam short messages. Several groups of comparative experiments verified that the CNN model designed in this thesis improved precision, recall, and F1-score, reaching 97.01%, 94.10%, and 95.53%, respectively.
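The abstract does not give the PTF-IDF formula, so the following is only a minimal sketch of the idea it describes: scale each word's TF-IDF weight by a part-of-speech contribution factor, then use the weight to scale that word's embedding and stack the rows into the two-dimensional matrix fed to the CNN. Everything specific here is an assumption made for illustration: the mapping of 0.6, 0.3, and 0.2 to nouns, verbs, and all other tags, the multiplicative form of the weight, the smoothed IDF (borrowed from the common `ln((1 + N) / (1 + df)) + 1` variant), and the names `POS_FACTOR`, `ptf_idf_weights`, and `weighted_matrix`.

```python
import math
from collections import Counter

# ASSUMPTION: the abstract reports factor values 0.6, 0.3 and 0.2 but does not
# say which parts of speech they belong to; this mapping is hypothetical.
POS_FACTOR = {"noun": 0.6, "verb": 0.3}
DEFAULT_FACTOR = 0.2


def ptf_idf_weights(doc, corpus):
    """Return {word: weight} for one tagged document.

    `doc` and every entry of `corpus` are lists of (word, pos) pairs.
    Assumed form: weight = POS factor * term frequency * smoothed IDF.
    """
    counts = Counter(word for word, _ in doc)
    # If a word occurs with several tags, the last tag seen is used.
    pos_of = {word: pos for word, pos in doc}
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        tf = count / len(doc)
        # Document frequency: number of corpus documents containing the word.
        df = sum(1 for d in corpus if any(w == word for w, _ in d))
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        factor = POS_FACTOR.get(pos_of[word], DEFAULT_FACTOR)
        weights[word] = factor * tf * idf
    return weights


def weighted_matrix(doc, corpus, embeddings):
    """Stack weight-scaled word vectors, one row per token that has an embedding."""
    weights = ptf_idf_weights(doc, corpus)
    return [
        [weights[word] * x for x in embeddings[word]]
        for word, _ in doc
        if word in embeddings
    ]


# Toy data: two tagged messages and made-up 2-dimensional embeddings.
corpus = [
    [("free", "adj"), ("prize", "noun"), ("claim", "verb")],
    [("meeting", "noun"), ("moved", "verb")],
]
embeddings = {"free": [1.0, 0.0], "prize": [0.0, 1.0], "claim": [1.0, 1.0]}

weights = ptf_idf_weights(corpus[0], corpus)
matrix = weighted_matrix(corpus[0], corpus, embeddings)
```

In practice the embeddings would come from a trained Word2vec model and the tags from a part-of-speech tagger; the toy dictionary above only stands in for them. With equal term frequency and document frequency, the noun in this example receives twice the weight of the verb, which is the effect the contribution factors are meant to have.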
Keywords/Search Tags: text classification, word embedding, TF-IDF, part-of-speech contribution factor, convolutional neural network