
Application Of Various Classification Methods In Spam Message Recognition

Posted on: 2018-04-18
Degree: Master
Type: Thesis
Country: China
Candidate: B Z Zhang
GTID: 2348330518483227
Subject: Applied Statistics
Abstract/Summary:
With advances in science and technology, SMS and e-mail have become an indispensable part of daily life. Information technology makes life more convenient, but it also brings problems. Among the text messages and e-mails we receive every day, many are spam, and some carry attacks that threaten the security of our information and property. People therefore keep trying to identify spam with existing technology, hoping for high accuracy. Identifying spam messages and e-mail is essentially a text mining problem, for which text processing and classification techniques are particularly important; this paper focuses on these two aspects. We first introduce methods of text data processing, including the selection of feature words and the construction of the feature vector space model. We then introduce the theory behind several classification methods, such as k-NN, SVM, RF and NB. Because the text data used in this paper is imbalanced, we also introduce alternative measures of a classifier's predictive performance, such as accuracy, recall rate and true positive rate.

Finally, we use the English SMS text data from https://www.kaggl.com/uciml/sms-spam-collection-dataset to build a vector space model of feature words, train a classifier with each of these methods, and compare their predictive performance by cross-validation. The main evaluation criteria are the accuracy rate, the recall rate of normal SMS and the recall rate of spam SMS. The comparison shows that the recall rate of normal SMS is very high for every classifier, with almost no difference between them, whereas the recall rate of spam SMS differs greatly: naive Bayes has the highest and k-NN the lowest. This conclusion is in line with the practical use of naive Bayes in e-mail filtering and text classification, and with the fact that k-NN is poorly suited to imbalanced data.
Keywords/Search Tags: Text feature words, Vector space model, Logistic, k-NN, SVM, Decision Tree, Naive Bayes, RF, Combination method
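
The evaluation criteria named in the abstract (overall accuracy, recall of normal SMS, recall of spam SMS) can be illustrated with a minimal sketch. The function names and the toy labels below are hypothetical and chosen only for illustration; they are not the thesis's code, data or results. The point is that on imbalanced data a classifier can score high accuracy and high recall on normal messages while still missing much of the spam.

```python
def accuracy(y_true, y_pred):
    """Share of all messages whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred, label):
    """Share of messages truly belonging to `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    if not relevant:
        raise ValueError(f"no examples with true label {label!r}")
    return sum(p == label for p in relevant) / len(relevant)


# Hypothetical imbalanced toy set: 8 normal ("ham") messages and 2 spam messages.
y_true = ["ham"] * 8 + ["spam"] * 2
# Hypothetical predictions: every ham is caught, but one of the two spam messages is missed.
y_pred = ["ham"] * 8 + ["ham", "spam"]

acc = accuracy(y_true, y_pred)
ham_recall = per_class_recall(y_true, y_pred, "ham")
spam_recall = per_class_recall(y_true, y_pred, "spam")

print(f"accuracy={acc:.2f} ham_recall={ham_recall:.2f} spam_recall={spam_recall:.2f}")
```

In this toy case accuracy and ham recall both look strong, while spam recall is only one half, which is why reporting recall separately for each class is more informative than accuracy alone when the classes are uneven.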