Research On Spam Filtering Technologies Based On Content Characteristics Analysis

Posted on: 2013-01-01; Degree: Master; Type: Thesis
Country: China; Candidate: J Zhao
GTID: 2218330371969290; Subject: Communication and Information System
Abstract/Summary:
With the rapid development of Internet technology, e-mail, which is low-cost and simple to operate, has become a common way to exchange information in daily life. However, large volumes of spam have a serious impact on society and often cause substantial economic losses. Spam consumes network resources, wastes users' time and money, and even spreads harmful information and viruses. For this reason, research on effective anti-spam technology has far-reaching social significance and considerable economic value.

Content-based filtering is becoming a focus of anti-spam research because it filters well and tracks changes in spam characteristics in a timely manner. It has already produced results in both research and application, but several problems remain to be resolved:
(1) The large number of training samples and the high vector dimension lead to high time and space complexity;
(2) Classifying emails by analyzing their content alone suffers from uncertainty and timeliness problems;
(3) The structural features of email are ignored;
(4) A single technology can hardly satisfy the requirements of spam filtering.

This dissertation addresses the above problems and improves filtering accuracy. Its innovative work mainly includes the following aspects:

(1) This dissertation proposes a feature selection approach based on an improved odds ratio, which reduces the operation time and space complexity.
To address the high time and space complexity of content-based filtering, this dissertation improves the odds ratio used to select feature items. Firstly, it evaluates the feature selection methods used for email filtering in terms of classifier adaptability, data set dependence, and time complexity. Experimental results show that odds ratio performs better than the other methods. Secondly, it analyses the feature items selected by odds ratio and the computational formula of odds ratio, which shows that odds ratio has difficulty selecting feature items with high word frequency or feature items that contribute to both categories. Finally, to address this weakness, it improves the computational formula of odds ratio by taking the frequency factor and category information into account. Experimental results indicate that the improved odds ratio further reduces operation time and space complexity, while the precision of spam filtering remains high.

(2) This dissertation proposes an improved naive Bayes algorithm that combines feature information with noncharacteristic information, which increases the precision of spam filtering.
Content-based filtering is uncertain when it classifies emails by analyzing content alone, and it ignores the structural features of email. To address these problems, this dissertation proposes an improved naive Bayes algorithm that combines feature information with noncharacteristic information. The improved algorithm considers how both the email header and the email body contribute to distinguishing ham from spam, which overcomes the dependence on content analysis alone, increases the precision of spam filtering, and reduces the rate at which ham emails are misclassified. In this method, the dissertation first analyzes structural features, extracts noncharacteristic information (the attributes of email header fields that differ between ham and spam), and selects typical feature information from the email content; it then combines the feature information with the noncharacteristic information to improve the naive Bayes formula. Experimental results show that the approach improves the recall and precision of spam filtering.

(3) This dissertation designs and implements a multi-layer spam filtering module and deploys it on a mail server to filter spam.
Because a single technology can hardly satisfy the requirements of spam filtering, this dissertation designs and implements a multi-layer spam filtering module that combines several technologies, and applies it on a mail server to filter spam. The module includes blacklist and whitelist technology, keyword filtering, and content-based filtering. These technologies work together to achieve high-performance spam filtering. In particular, the content-based filtering layer uses the improved approaches proposed in this dissertation, which improves filtering accuracy.
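As an illustration of the baseline that contribution (1) starts from, the following is a minimal sketch of feature selection with the standard log odds ratio. It is not the improved formula of the dissertation, which the abstract does not state. The function names `odds_ratio_scores` and `select_features`, the document-frequency probability estimates, and the additive smoothing constant are all assumptions made for this example.

```python
import math
from collections import Counter


def odds_ratio_scores(spam_docs, ham_docs, smoothing=0.5):
    """Score each term by the standard log odds ratio with respect to spam.

    Each document is a list of tokens. P(t|class) is estimated from document
    frequency with additive smoothing (an assumption of this sketch, chosen
    so that the logarithm and the division are always defined).
    """
    # Document frequency: in how many documents of each class a term occurs.
    spam_df = Counter(t for doc in spam_docs for t in set(doc))
    ham_df = Counter(t for doc in ham_docs for t in set(doc))
    n_spam, n_ham = len(spam_docs), len(ham_docs)

    scores = {}
    for term in set(spam_df) | set(ham_df):
        p_spam = (spam_df[term] + smoothing) / (n_spam + 2 * smoothing)
        p_ham = (ham_df[term] + smoothing) / (n_ham + 2 * smoothing)
        # OR(t) = log[ P(t|spam) (1 - P(t|ham)) / ((1 - P(t|spam)) P(t|ham)) ]
        scores[term] = math.log(p_spam * (1 - p_ham) / ((1 - p_spam) * p_ham))
    return scores


def select_features(scores, k):
    """Keep the k terms with the highest odds ratio score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy corpus, invented purely for illustration.
spam = [["free", "money", "now"], ["free", "offer"]]
ham = [["meeting", "now"], ["project", "report"]]
scores = odds_ratio_scores(spam, ham)
top_terms = select_features(scores, 2)
```

In this toy corpus a term that appears equally often in both classes (here "now") receives a score of zero and is never selected, however frequent it is. This is in line with the weakness the abstract attributes to the unmodified odds ratio for terms that contribute to both categories.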
Keywords/Search Tags: Spam Filtering, Feature Selection, Odds Ratio, Naive Bayes