A Multi-level Framework To Filtering Spam Messages Based On Text Content

Posted on: 2017-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: J Mi
Full Text: PDF
GTID: 2308330503458928
Subject: Computer Science and Technology
Abstract/Summary:
As the form and textual features of short messages keep changing, filtering spam messages accurately and quickly has become urgent. Existing spam SMS filtering methods mainly include black and white lists, keyword matching, active reporting by users, and content-based filtering. Among them, content-based filtering responds more effectively to the diversity of constantly changing message forms and does not rely on any other information about the SMS. For text content, however, traditional filtering algorithms ignore the obvious textual characteristics of spam messages, which hurts the filter's performance. These methods also have no good solution to the sparse-vector problem caused by short content.

In this paper, we propose a new framework for building classifiers that filter out spam messages based on their text. The framework makes use of noise information that can contribute greatly to classification and that is available only before pre-processing. It abstracts this noise information as custom properties and uses them as the first feature set to filter typical spam messages. After that, it applies an LDA topic model to the training set to obtain the distribution between topics and texts and the distribution between topics and words, from which it finds more synonyms for the original keywords. In this way, the framework extends the features effectively and reduces the negative effect of sparse vectors on the classification results.

Finally, this paper describes the experiments. The data sets are real messages collected from the public, which can represent the varying proportions of spam and legitimate messages that users receive. We followed a careful experimental procedure to evaluate the new spam filter from three perspectives, 'spam', 'legal', and 'weighted', so as to analyze the results from different angles. We also investigated the effect of training-corpus size, number of sub-classifiers, and feature-set size on the filter's performance. The results show that this filtering framework can effectively improve the accuracy of filtering spam messages based on text content.
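
The two feature stages described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not list the actual custom properties, so the cues in `noise_features` (URLs, long digit runs, exclamation marks, currency symbols, upper-case ratio) are assumptions, and the `TOPIC_WORD` table is a hand-written toy stand-in for the topic-word distribution that a trained LDA model would supply. The names `noise_features`, `expand_tokens`, and `TOPIC_WORD` are hypothetical.

```python
import re

# Hypothetical "noise" cues; the abstract does not name the real properties.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
LONG_NUMBER_RE = re.compile(r"\d{7,}")


def noise_features(text):
    """Extract surface cues from the raw message, before any pre-processing
    (lower-casing, punctuation removal) would discard them."""
    letters = [c for c in text if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        "has_url": bool(URL_RE.search(text)),
        "has_long_number": bool(LONG_NUMBER_RE.search(text)),
        "exclamations": text.count("!"),
        "currency_symbols": sum(text.count(s) for s in "$£€¥"),
        "upper_ratio": upper / len(letters) if letters else 0.0,
    }


def expand_tokens(tokens, topic_word, top_n=2):
    """Extend a short token list with the top_n most probable words of every
    topic that contains at least one of the message's tokens."""
    expanded = list(tokens)
    for word_probs in topic_word:
        if any(t in word_probs for t in tokens):
            ranked = sorted(word_probs, key=word_probs.get, reverse=True)
            for word in ranked[:top_n]:
                if word not in expanded:
                    expanded.append(word)
    return expanded


# Toy topic-word distributions, invented for illustration only.
TOPIC_WORD = [
    {"prize": 0.4, "win": 0.3, "cash": 0.2, "claim": 0.1},
    {"meeting": 0.5, "lunch": 0.3, "tomorrow": 0.2},
]

if __name__ == "__main__":
    message = "WIN cash now!!! Visit www.example.com or call 08001234567"
    print(noise_features(message))
    print(expand_tokens(["win", "now"], TOPIC_WORD))
```

In this sketch the first function would feed a first-level filter for typical spam, and the second would enlarge the sparse bag-of-words vector of a short message before it reaches a text classifier. How the thesis combines the two levels and its sub-classifiers is not stated in the abstract.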
Keywords/Search Tags: spam message filtering, text classification, feature extension, classification algorithm