
Design And Implementation Of An Email Filter

Posted on: 2011-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: W Dai
Full Text: PDF
GTID: 2178360305454660
Subject: Computer software and theory
Abstract/Summary:
With the popularity of the Internet, email has become one of its most important and most efficient applications because it is fast, convenient, and low-cost, especially in office settings. Email is also well suited to making announcements and to recording discussions and meetings. But spam has also grown quickly and spread around the world. Some people generate large volumes of spam by exploiting flaws in servers, network protocols, and filtering policies, so email users must spend a lot of time filtering it out. Spam not only takes up unnecessary storage, computing, and network resources, but also disrupts users' normal work, life, and study. How to filter spam effectively is a major problem that we must face and one of the Internet's most pressing problems. This paper designs an email filter to identify and filter spam. It uses mainstream techniques, algorithms, and strategies, with adjustments, and adds my own algorithms and methods.

The current mainstream techniques are as follows (with the methods and strategies adjusted and optimized):

1) Real-time black and white lists: Using the international real-time blacklist (RBL), the server identifies the primary domains or IP addresses where spam is concentrated, restricts them, and generates a blacklist. During the SMTP session, any sender whose IP address or primary domain matches the blacklist is blocked directly.
This paper also compiles statistics on the IP addresses, primary domains, and mailboxes that send junk mail. For example, when rule-based and content-based analysis shows that a primary domain sends a large amount of junk mail, those results are added to the blacklist. Spam reported through user feedback is added to the blacklist as well, on the premise that users do not want to see it. Testing showed, however, that users sometimes report mail indiscriminately, so the design adopts two optimizations: a) reporting and rejecting are combined, so that a single action both reports the sender and blacklists it; b) a spam-type selection is added to the report dialog, with the default focus set on the "Cancel" button. This makes reporting and rejecting convenient for users, and because the focus is on "Cancel", users are less likely to confirm a report without reading the text, which reduces indiscriminate reports.
2) DNS reverse resolution: The From field of a received message indicates the sender, but spammers usually use an empty mail address to hide their identity. For example, a mail client such as Foxmail can be set to deliver mail directly, so that no SMTP server needs to be configured (the sending computer acts as its own SMTP server) and the From field can be set arbitrarily. The IP address, however, is unique: the receiving server can check which IP address the message actually came from and perform a reverse lookup on it. If the resolved domain name matches the claimed sender domain, the message is accepted; otherwise it is considered spam.
3) Dynamic rule filtering based on statistics: Traditional rule-based filtering relies on manually observed characteristics of spam. Keywords are set on the message header, subject, and body, and messages that match them are treated as spam, for example messages whose subject contains pornographic or reactionary keywords.
In this paper, dynamic rule filtering based on statistics works as follows. On the one hand, instead of relying on manual effort, statistical methods are used to analyze spam samples; patterns that occur frequently become keywords or rules, each with a score. An incoming message is checked against these rules and scored, and if its score reaches the set threshold it is considered spam. On the other hand, the rules are dynamic and more timely: messages that the rules judge to be spam are added to the spam sample set, which is then statistically analyzed to update the rule base. This avoids the earlier problem of a limited, fixed rule set.
4) Naive Bayes classification: Naive Bayes classification is a probabilistic method that uses prior knowledge to predict future events. Two things need to be determined: the extraction of email features and the definition of message types. Since this paper deals with spam, there are two types, spam and non-spam. As for features, 80%-90% of messages are short texts of 0-5 KB, so the message body can be traversed in full (experiments show this is not very time-consuming). The body is preprocessed to filter out interference such as stop words, function words, auxiliary words, and punctuation, and is then segmented with the forward maximum matching algorithm. TF and IDF are computed for each word, and the five words with the largest TF×IDF values form the feature vector that characterizes the message. The posterior probability of each type is computed according to Bayes' theorem, and the type with the maximum probability is taken as the final classification.
5) Honeypot mailboxes: The algorithms above and those described below all require spam samples for their statistics. To collect these samples, honeypot mail technology is used: a number of mailboxes are generated automatically to act as traps that lure spam.

In this paper, the mainstream techniques and algorithms are improved and adjusted through the following schemes:

1) Mining the behavioral characteristics of spam: Analysis of mail-sending logs shows that the behavior of spammers is mostly the same, even when they deliberately avoid sending identical or similar content. Statistical analysis shows that spammers send a large number of messages in a short time, that the message contents are similar and the messages small, and that, because they are not sure which addresses exist, they attempt to send to accounts that do not exist. This paper blocks spam on the basis of these characteristics.
2) Mail similarity: The similarity measure was introduced because experiments found many messages with similar content, of which some were intercepted and some were not. Observation and analysis showed that spammers control a large number of IP addresses and accounts and avoid reusing the same IP address or account, thereby bypassing policies that limit sending volume per IP address and account. In this paper, using the ideas of 2-grams and zebra hashing, each message is mapped to a matrix (called the transfer matrix). The matrix is encoded into a signature that uniquely identifies the message and is stored in a hash table. Each incoming message is mapped to a matrix and signature in the same way and looked up in the hash table; a hit means the message is similar to a known one. Spam that the filter fails to identify can be added as samples so that subsequent similar messages are blocked. Samples can be collected through the honeypot mailboxes described above, and mail sent to nonexistent accounts can also serve as spam samples.
3) Spam URL filtering: Analysis of spam showed that a large proportion of it consists of only a link, which often leads to advertisements set up by the spammer or even to embedded Trojans. Users often click out of curiosity and so fall into the spammer's trap. The algorithms and strategies described above do not cover this case, so this paper proposes an approach based on filtering spam URLs.
On the one hand, spam URLs are obtained or mined from search engines or instant messaging platforms. On the other hand, links are extracted from messages caught by the spam strategies above and, after passing through a whitelist (to exclude legitimate URLs that appear in spam), are added to the spam URL set. In the user logs obtained from search engines or instant messaging platforms, sites with little traffic, sites that users jump away from after staying only a short time, and URLs that are dead links were studied, and the majority turned out to be junk sites. From the browser's point of view, links that cause the browser to crash or restart, or that maliciously change the home page or user settings, were analyzed, and these sites and links are almost always spam URLs. As with the blacklist, URLs from spam that the filter fails to recognize can be added to the set directly.

The mail filter combines offline and online computation: computations with high complexity are performed offline and those with low complexity online, which satisfies both efficiency and robustness requirements. Filtering performance is evaluated with two metrics, precision and recall. In experiments, both precision and recall reached 95% or more, which meets the needs of email filtering in a real environment.
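
As an illustration of the two evaluation metrics named in the abstract, the following minimal Python sketch computes precision and recall for a spam filter from its predictions and the true labels. The function name `precision_recall` and the toy labels are assumptions made for illustration only; they are not the thesis's code or data.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for boolean labels, where True means spam.

    precision = correctly flagged spam / all messages flagged as spam
    recall    = correctly flagged spam / all messages that really are spam
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if a and not p)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy labels (made up for this example): five messages, three of them spam.
predicted = [True, True, False, True, False]
actual = [True, False, False, True, True]
precision, recall = precision_recall(predicted, actual)
```

Here a false positive is a legitimate message flagged as spam, which lowers precision, and a false negative is spam that got through, which lowers recall.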
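
The statistics-based dynamic rule filtering in item 3) of the mainstream techniques can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the thesis's implementation: rules are single whitespace-separated words, a word becomes a rule when it appears in at least `min_fraction` of the spam samples, its score is that fraction, and the names `learn_rules`, `score`, `is_spam`, and `filter_and_update` and the threshold of 1.5 are all hypothetical. A practical system would also compare against non-spam samples so that common legitimate words do not become rules.

```python
from collections import Counter


def learn_rules(spam_samples, min_fraction=0.5):
    """Derive keyword rules from spam samples.

    A word that appears in at least `min_fraction` of the samples becomes
    a rule, scored by the fraction of samples that contain it.
    """
    doc_freq = Counter()
    for text in spam_samples:
        doc_freq.update(set(text.lower().split()))
    total = len(spam_samples)
    return {w: n / total for w, n in doc_freq.items() if n / total >= min_fraction}


def score(message, rules):
    """Sum the scores of all rules whose keyword occurs in the message."""
    words = set(message.lower().split())
    return sum(weight for word, weight in rules.items() if word in words)


def is_spam(message, rules, threshold=1.5):
    """A message is considered spam when its score reaches the threshold."""
    return score(message, rules) >= threshold


def filter_and_update(message, spam_samples, threshold=1.5):
    """Classify a message and, if it is flagged, add it to the sample set
    so that the next call to learn_rules reflects it (the dynamic part)."""
    flagged = is_spam(message, learn_rules(spam_samples), threshold)
    if flagged:
        spam_samples.append(message)
    return flagged


# Made-up samples for illustration only.
spam_samples = [
    "cheap pills buy now",
    "buy cheap watches now",
    "cheap loans apply now",
]
rules = learn_rules(spam_samples)
```

With these samples, `is_spam("buy cheap tickets now", rules)` is true and `is_spam("see you at the meeting", rules)` is false. Rebuilding the rules from the growing sample set is what keeps them current in this sketch.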
Keywords/Search Tags: Spam-emails, filter, rules, statistics, black and white lists, user behavior