
E-Mail Filtering System Based On Information Fusion Criterion

Posted on: 2009-10-15 | Degree: Master | Type: Thesis
Country: China | Candidate: S Wu | Full Text: PDF
GTID: 2178360245469760 | Subject: Communication and Information System
Abstract/Summary:
Email is now one of the most common network applications and a primary means of communication. Content-based spam filtering is therefore an important problem in Internet security technology. Applying machine learning approaches such as text categorization to spam detection is an efficient way to deal with large volumes of spam.

Based on a large number of statistical analyses, this thesis examines the characteristics of e-mail and the inadequacy of traditional spam filtering technology. We compare the advantages, disadvantages, and scope of application of various feature selection methods, and use Cross Entropy (CE) in place of the IDF function of the Term Frequency Inverse Document Frequency (TFIDF) algorithm; the resulting method is named Term Frequency Cross Entropy (TFCE). We also propose a new judgment criterion based on triangle module fusion, which further improves the accuracy of feature selection and reduces the probability of mail misjudgment and missed judgment.

The thesis consists of the following parts. First, it summarizes the state of spam filtering, including the definition of spam, the harm it causes, and existing filtering techniques. Second, it reviews common approaches to feature pruning, anti-spam filters, and mail corpora, with emphasis on feature selection methods, filtering algorithms, and the theory of TFCE. Third, it describes the framework and implementation of the new algorithms, including the architecture, function model, organization model, and flowchart of the spam filter. Fourth, building on research into information fusion technology, it gives a detailed analysis of the spam fusion judgment criterion. Finally, simulation results are presented to verify performance: one experiment compares various feature selection methods, including TFCE; the other compares the information fusion criterion based on the triangle module with a single judgment criterion.

The simulation results suggest that the average accuracy of TFCE is higher than that of the traditional feature selection methods, and that the information fusion criterion based on the triangle module performs better than a single judgment criterion. The thesis closes with suggestions for further improving the performance of a spam filtering system based on TFCE feature selection and the triangle module fusion algorithm and for reducing mail misjudgment and missed judgment, which offers a new possibility for the development of e-mail filtering technology.
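
The abstract does not give the TFCE formula or the fusion operator, so the following is only a minimal sketch under stated assumptions, not the thesis's implementation. It assumes the expected cross entropy commonly used in text categorization, CE(t) = P(t) * sum over classes c of P(c|t) * log(P(c|t) / P(c)), as the replacement for IDF, and it assumes the triangle module fusion combines two spam probabilities as pq / (pq + (1-p)(1-q)). The function names `cross_entropy_scores`, `tfce_weights`, and `fuse` are hypothetical.

```python
import math
from collections import Counter


def cross_entropy_scores(docs, labels):
    """Assumed expected cross entropy per term:
    CE(t) = P(t) * sum_c P(c|t) * log(P(c|t) / P(c)),
    with probabilities estimated from document frequencies."""
    n_docs = len(docs)
    class_counts = Counter(labels)
    doc_freq = Counter()           # term -> number of documents containing it
    doc_freq_by_class = Counter()  # (term, class) -> documents of that class containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            doc_freq_by_class[(term, label)] += 1

    scores = {}
    for term, df in doc_freq.items():
        total = 0.0
        for c, n_c in class_counts.items():
            p_c_given_t = doc_freq_by_class[(term, c)] / df
            if p_c_given_t > 0:  # a zero probability contributes nothing to the sum
                total += p_c_given_t * math.log(p_c_given_t / (n_c / n_docs))
        scores[term] = (df / n_docs) * total
    return scores


def tfce_weights(tokens, ce_scores):
    """TF * CE weight for each term of one message (CE stands in for IDF)."""
    counts = Counter(tokens)
    n = len(tokens)
    return {t: (k / n) * ce_scores.get(t, 0.0) for t, k in counts.items()}


def fuse(p, q):
    """Assumed fusion operator for two spam probabilities in [0, 1]:
    agreeing judgments reinforce each other, and 0.5 is neutral."""
    num = p * q
    den = num + (1 - p) * (1 - q)
    return num / den if den else 0.5  # p=1, q=0 (or the reverse) is undefined; return neutral


# Toy corpus (made-up data, for illustration only).
docs = [["free", "money", "now"], ["free", "offer"],
        ["meeting", "now"], ["project", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
scores = cross_entropy_scores(docs, labels)
weights = tfce_weights(["free", "free", "now"], scores)
```

In this toy corpus "free" occurs only in spam and receives a positive score, while "now" occurs equally in both classes and scores zero. This is the behaviour that makes a class-aware term usable where IDF, which ignores class labels, is not.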
Keywords/Search Tags: spam, feature selection, cross entropy, information fusion, triangle module