Study And Implementation Of The Anti-Spam System Based On Bayesian Algorithm

Posted on: 2007-07-23  Degree: Master  Type: Thesis
Country: China  Candidate: H J Chen  Full Text: PDF
GTID: 2178360212459300  Subject: Software engineering
Abstract/Summary:
With the development and popularization of the Internet, e-mail has come into wide use. E-mail brings convenience, but it has also brought a new problem: large volumes of spam. Spam not only consumes a large amount of network resources; the dissemination of unhealthy information also does great harm to society, so the study of spam filtering is of great significance. This thesis first outlines e-mail and filtering technology. It then analyzes the Linux kernel and introduces the related technologies, including packet interception in Linux, the packet structure and the functions that operate on it in the Linux operating system, the definitions of kernel space and user space, and the basic methods of kernel programming. On this basis, a mail filtering model is proposed whose main functional module runs in Linux kernel space, so that e-mail can be analyzed and filtered efficiently. The general structure of the model is described, together with its component modules and the functions of their submodules. The approach is a content-based anti-spam technology. Under an improved Bayes algorithm, combinations of characteristics that indicate spam are referred to as "attributes", and these attributes are used to construct the feature vector of each message in the vector space model. This avoids the shortcomings of filtering rules that rely purely on IP addresses, message headers, and envelope information. The decision procedure is improved to reduce the risk of legitimate mail being judged as spam, which further ensures that legitimate mail arrives reliably and in real time and helps the filter learn to classify mail more accurately. To improve system performance, the thesis also studies the technologies used in spam filtering systems, including Chinese word segmentation, the dictionary mechanism, and automated text classification.
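The idea of scoring a message from its "attributes" while guarding against false positives can be sketched as follows. This is a minimal illustration, not the improved Bayes algorithm of the thesis, whose details the abstract does not give. It assumes a plain naive Bayes model with add-one smoothing, and the class name `NaiveBayesSpamFilter`, the `spam_threshold` parameter, and the sample attributes are all hypothetical. Requiring a spam probability well above 0.5 is one simple way to lower the risk of legitimate mail being judged as spam.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Naive Bayes over message attributes (hypothetical sketch)."""

    def __init__(self, spam_threshold=0.9):
        # A threshold above 0.5 biases decisions toward "ham" (assumption).
        self.spam_threshold = spam_threshold
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, attributes, label):
        """Record one labelled message given as a list of attributes."""
        self.counts[label].update(attributes)
        self.docs[label] += 1

    def _log_score(self, attributes, label, vocab_size):
        # Smoothed class prior, so an empty class never yields log(0).
        prior = (self.docs[label] + 1) / (sum(self.docs.values()) + 2)
        total = sum(self.counts[label].values())
        score = math.log(prior)
        for attribute in attributes:
            # Add-one (Laplace) smoothing for unseen attributes.
            count = self.counts[label][attribute]
            score += math.log((count + 1) / (total + vocab_size))
        return score

    def spam_probability(self, attributes):
        """Posterior probability that the message is spam."""
        vocab_size = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        log_spam = self._log_score(attributes, "spam", vocab_size)
        log_ham = self._log_score(attributes, "ham", vocab_size)
        # Subtract the maximum before exponentiating to avoid underflow.
        top = max(log_spam, log_ham)
        spam = math.exp(log_spam - top)
        ham = math.exp(log_ham - top)
        return spam / (spam + ham)

    def is_spam(self, attributes):
        return self.spam_probability(attributes) >= self.spam_threshold
```

With a threshold of 0.9, a message that is only moderately spam-like is still delivered; only strong evidence causes rejection. The threshold value itself is an assumption and would need tuning on real mail.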
After analyzing Chinese word segmentation technologies, this thesis combines the leftward-increase maximum matching algorithm with the rightward-decrease minimum matching algorithm to segment text, and uses mutual information to resolve ambiguities, which improves the precision of Chinese word segmentation. Addressing the limitations of current dictionary mechanisms, the thesis puts forward an improved PATRICIA-tree-based dictionary mechanism for Chinese word segmentation, which improves the speed and efficiency of segmentation while reducing space complexity and making the dictionary easier to build and maintain. After comparing a range of feature selection functions, the thesis adopts expected cross entropy to select features, which provides a basis for accurate text classification. Finally, the thesis analyzes two ways of improving the Bayes algorithm and points out that they are essentially the same; it therefore adopts the improved Bayes algorithm to reduce the risk of false judgment.
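Combining two matching directions and resolving disagreements with mutual information can be sketched as below. The sketch substitutes the standard forward and backward maximum matching algorithms, which may differ from the leftward-increase and rightward-decrease variants used in the thesis. The dictionary, the corpus counts, and the function names are hypothetical, and scoring a candidate by the summed pointwise mutual information of adjacent words is an assumption about how disambiguation might be done.

```python
import math


def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word each time."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word each time."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def pmi(x, y, unigram, bigram, total):
    """Pointwise mutual information of an adjacent word pair."""
    p_xy = bigram.get((x, y), 0) / total
    p_x = unigram.get(x, 0) / total
    p_y = unigram.get(y, 0) / total
    if p_xy == 0 or p_x == 0 or p_y == 0:
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))


def segment(text, dictionary, unigram, bigram, total, max_len=4):
    """Use both directions; on disagreement keep the higher-PMI candidate."""
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if fwd == bwd:
        return fwd

    def score(words):
        return sum(pmi(a, b, unigram, bigram, total)
                   for a, b in zip(words, words[1:]))

    return max((fwd, bwd), key=score)


# Hypothetical dictionary and corpus counts, for illustration only.
DICTIONARY = {"研究", "研究生", "生命", "起源"}
UNIGRAM = {"研究": 20, "生命": 15, "起源": 10, "研究生": 5, "命": 5}
BIGRAM = {("研究", "生命"): 8, ("生命", "起源"): 6}
TOTAL = 100
```

For the string 研究生命起源, the forward pass greedily takes 研究生 and strands 命, while the backward pass yields 研究 / 生命 / 起源. Because the two passes disagree, `segment` falls back on the mutual information score, which under these made-up counts favors the backward result.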
Keywords/Search Tags:E-mail filtration, Linux kernel, Linux firewall, Netfilter