| With the development and spread of Internet technology, email has become one of the most important and common communication media for almost everyone because of its low cost, convenience, and speed. At the same time, its byproduct, spam (also known as junk mail), is eroding the whole Internet like a violent flood. Spam wastes storage space, bandwidth, and computational resources, damages the image of ISPs, costs email users considerable time checking additional emails, and even affects the development of the whole Internet. As a result, it is significant to research and develop an effective spam filtering system. First, we analyze and summarize the basic email protocols, the harm caused by spam, and the main causes of the huge amount of spam. We then investigate behavior-based and content-based anti-spam techniques in depth and propose a multilayered spam filtering model that combines both. Finally, we implement the model in a real enterprise environment. Behavior-based spam filtering mainly mines features from the header and body of individual emails and builds a user relationship model to effectively address the false positive problem. Content-based filtering, in turn, adopts multilayered filtering and uses a proposed optimized text classification algorithm to further increase spam recall and reduce false positives. The main work and contributions of the thesis are: 1. Build a complete spam filtering system, including behavior-based and content-based spam filtering, and implement it in a real enterprise production environment. 2. Based on the source of spam, use IP blacklists, real-time blacklists, and SPF technology in the session stage to detect obvious spam and reduce the load on the system. 3. Discover the features of users' behavior for different enterprise users, and build different levels of frequency limits. 4. 
Build a content-based spam filtering model and, based on the features of email content, build a same-domain filter-exemption mechanism and a recent-contact mechanism to effectively reduce false positives without affecting spam recall. 5. Research and implement an optimized Naive Bayes classification algorithm, which exhibits better results than SVM on three public email datasets. Combine the algorithm with a user fame model and adopt a flexible threshold to get better results. Experimental results and practical application experience show that the spam filtering system can find most spam with a very low false positive rate. Both behavior-based filtering and content-based filtering are necessary, and the system has the merits of high efficiency, high accuracy, easy maintenance, and good extensibility. |
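
The layering described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: all names (`NaiveBayes`, `classify`, the message fields, `sender_fame`) are hypothetical, the classifier is plain multinomial Naive Bayes with Laplace smoothing rather than the optimized variant, real-time blacklist and SPF checks (which need DNS lookups) are omitted, and the way the threshold depends on sender fame is an assumed formula chosen only for illustration.

```python
import math
from collections import Counter


class NaiveBayes:
    """Textbook multinomial Naive Bayes with Laplace smoothing (not the optimized variant)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        # Assumes at least one training document per class.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Turn the two log scores into P(spam | text); clamp to avoid overflow.
        diff = min(log_score["ham"] - log_score["spam"], 700.0)
        return 1.0 / (1.0 + math.exp(diff))


def classify(msg, model, ip_blacklist, recent_contacts, sender_fame):
    """Return 'reject', 'deliver', or 'spam' for one message (a dict of strings)."""
    # Layer 1 (session stage): drop mail from known bad sources before reading content.
    if msg["client_ip"] in ip_blacklist:
        return "reject"

    sender, recipient = msg["sender"], msg["recipient"]

    # Layer 2 (behavior): same-domain and recent-contact mail skips content filtering.
    if sender.split("@")[-1] == recipient.split("@")[-1]:
        return "deliver"
    if sender in recent_contacts.get(recipient, set()):
        return "deliver"

    # Layer 3 (content): Naive Bayes with a flexible threshold.
    # Assumed rule: fame in [0, 1] raises the threshold from 0.5 up to 0.9.
    fame = sender_fame.get(sender, 0.5)
    threshold = 0.5 + 0.4 * fame
    return "spam" if model.spam_probability(msg["body"]) >= threshold else "deliver"


# Tiny made-up training set, for illustration only.
model = NaiveBayes()
model.train("cheap pills buy now", "spam")
model.train("win money now", "spam")
model.train("meeting agenda attached", "ham")
model.train("project report for review", "ham")

ip_blacklist = {"203.0.113.7"}
recent_contacts = {"bob@corp.example": {"carol@partner.example"}}
sender_fame = {}


def make_msg(ip, sender, body, recipient="bob@corp.example"):
    return {"client_ip": ip, "sender": sender, "recipient": recipient, "body": body}
```

Ordering the checks from cheapest to most expensive means that obvious spam never reaches the classifier, and mail from colleagues or recent contacts is never at risk of a content-based false positive.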