
Research And Design Of Content-Based Spam Detection Framework

Posted on: 2015-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Luo
Full Text: PDF
GTID: 2268330428463619
Subject: Control Science and Engineering
Abstract/Summary:
Email is widely used in daily communication because it is convenient, economical, and easy to use. However, the huge volume of spam now causes serious problems for email communication, and a variety of techniques have been developed to mitigate it. One of the main approaches is content-based spam filtering, for which classification methods and feature selection algorithms are the critical techniques.

Text categorization builds a spam detection model either from rules or by statistical learning. Content-based spam detection, which is self-learning, adaptive, and highly accurate, relies only on statistical learning and does not need to consider the semantic context.

Dimensionality reduction selects the most discriminative features from the original high-dimensional feature space for classifier training and, to a large extent, determines the precision and efficiency of spam filtering. The vector space model (VSM) is often used to represent the text of emails, but its dimensionality is always very high, so how to handle the original high-dimensional feature space is an important question. Because of its good classification performance and simplicity of implementation, feature selection is a vital object of study for spam detection.

In this thesis, we propose OCFSVM, a framework that combines SVM with the feature selection algorithm OCFS (Orthogonal Centroid Feature Selection) for spam filtering. Extensive comparison experiments were performed on six spam corpora. The results show that, compared with other traditional combinations, the combination of SVM and OCFS achieves better accuracy and F-measure.

The major work and innovations of this thesis are as follows:

(1) The OCFS and SVM algorithms were adopted for text classification, and a combined framework was proposed that reduces redundancy while keeping high accuracy for content-based spam detection. The OCFSVM framework was built on Matlab, Weka, C#, and Eclipse.

(2) A series of extensive comparison experiments was designed with three different classifiers and three different feature selection algorithms on the English corpus PU, the Chinese corpus ZH1, and a self-collected mixed Chinese-English corpus.

(3) F-measure and accuracy were adopted as the evaluation metrics. Conclusions were drawn from a comprehensive analysis of the experimental results, which showed that OCFSVM detects spam effectively under different conditions and achieves a noteworthy improvement in detection performance over traditional frameworks.
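The feature selection step and the two evaluation metrics named above can be illustrated with a small sketch. It is not the thesis implementation (which was built with Matlab, Weka, C#, and Eclipse). It assumes the commonly cited OCFS criterion, in which each feature is scored by the class-weighted squared distance between the class centroids and the global centroid, and it assumes that F-measure means the balanced F1 score. The toy term counts, the labels, and all function names are hypothetical.

```python
from collections import defaultdict


def ocfs_scores(X, y):
    """Score each feature as sum_j (n_j / n) * (m_j[i] - m[i])**2 (assumed OCFS criterion)."""
    n = len(X)
    d = len(X[0])
    # Global centroid m: the mean of every feature over all documents.
    global_centroid = [sum(row[i] for row in X) / n for i in range(d)]
    # Group document vectors by class label.
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    scores = [0.0] * d
    for rows in by_class.values():
        n_j = len(rows)
        for i in range(d):
            class_mean = sum(r[i] for r in rows) / n_j
            scores[i] += (n_j / n) * (class_mean - global_centroid[i]) ** 2
    return scores


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring features, in column order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def project(X, keep):
    """Keep only the selected feature columns."""
    return [[row[i] for i in keep] for row in X]


def accuracy_and_f1(y_true, y_pred, positive="spam"):
    """Accuracy and balanced F-measure (F1), treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return correct / len(pairs), f1


# Hypothetical term counts for a three-word vocabulary: "free", "meeting", "the".
X = [[3, 0, 1], [2, 0, 1], [0, 2, 1], [0, 3, 1]]
y = ["spam", "spam", "ham", "ham"]

scores = ocfs_scores(X, y)
keep = select_top_k(scores, k=2)
X_reduced = project(X, keep)
print(scores, keep, X_reduced)
```

In a pipeline of the kind the abstract describes, the reduced vectors (`X_reduced` here) would then be used to train an SVM classifier, and `accuracy_and_f1` would be applied to its predictions on held-out email.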
Keywords/Search Tags: Content-based, Spam, Detection, OCFS, SVM, OCFSVM