Research Of Spam Filtering System Based On LSA And MD5

Posted on: 2009-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: J T Sun
Full Text: PDF
GTID: 2178360245956832
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the growing problem of junk mail (also referred to as "spam") has caused wide public concern. Many techniques can now be applied to the problem, and content-based spam filtering is one of the mainstream approaches. It uses automated text categorization and information filtering to identify spam.

This thesis applies latent semantic analysis (LSA): the system is trained on a given mail collection so that the classifier can categorize and filter inspected mail and identify spam. However, LSA usually inherits its term-weighting scheme from the vector space model and fails to take its own characteristics into account. As a result, prior information and global information about the documents are not incorporated, and the method lacks flexibility in practical applications. To address this, the thesis introduces new weighting functions to improve on the existing ones. The results show that LSA with the modified weighting function performs better than LSA with the original one.

In addition, the sender address of most spam today changes dynamically, while the content of the text or attachment stays the same. In a large-scale LAN with tens of thousands of users, spam usually spreads across the network by mass mailing. In view of these characteristics, an "email fingerprint" for mass-mailed spam is generated with MD5 on the basis of the LSA analysis, which addresses the low effectiveness of filtering techniques on mass-mailed spam.

The designed system was evaluated on a dataset from a Chinese anti-spam alliance, and the results were compared with those of a Naive Bayes filter. The comparison shows that the system based on LSA and MD5 outperforms Naive Bayes. The experiments produced the expected results and validate the feasibility and advantages of the new spam filtering method. However, much more work is needed before the filter can be used in practice.
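The email-fingerprint idea can be illustrated with a minimal sketch. The abstract does not say how the message is normalized or how the LSA output feeds into the digest, so everything below is an assumption made for illustration: the names `email_fingerprint` and `MassMailDetector`, the whitespace and case normalization, and the `threshold` parameter are hypothetical and are not the thesis's actual design. The sketch only shows the underlying principle, that identical content sent from varying addresses yields the same MD5 digest.

```python
import hashlib
import re
from collections import Counter


def email_fingerprint(body: str) -> str:
    """Return the MD5 hex digest of a normalized message body.

    Assumption: lowercasing and collapsing whitespace is used as the
    normalization step; the thesis may normalize differently.
    """
    normalized = re.sub(r"\s+", " ", body.lower()).strip()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


class MassMailDetector:
    """Hypothetical helper: flags a body once its fingerprint repeats."""

    def __init__(self, threshold: int = 3):
        # Number of identical fingerprints that counts as mass mailing.
        self.threshold = threshold
        self.counts = Counter()

    def observe(self, body: str) -> bool:
        """Record one message; return True if it looks mass-mailed."""
        fingerprint = email_fingerprint(body)
        self.counts[fingerprint] += 1
        return self.counts[fingerprint] >= self.threshold


if __name__ == "__main__":
    detector = MassMailDetector(threshold=2)
    # The sender address is deliberately ignored, since it varies.
    print(detector.observe("Cheap pills, buy now"))
    print(detector.observe("cheap   pills, BUY now"))
```

An exact digest of this kind changes completely if a single word of the body changes, which is presumably why the thesis builds the fingerprint on top of the LSA analysis rather than on the raw text alone.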
Keywords/Search Tags: Feature Selection, Latent Semantic Analysis, Message-Digest Algorithm 5, Sliding Windows, Email Fingerprint, Spam Filtering