
Research On Key Issues Of Spam Detection And Filtration

Posted on: 2011-11-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W H Liu
Full Text: PDF
GTID: 1118330332472165
Subject: Computer application technology
Abstract/Summary:
As the Internet comes to play an important role in almost every aspect of our lives, spam is growing faster than ever. Omnipresent spam has caused serious problems for people's social and economic lives, and finding effective methods of spam detection and filtering is an urgent task for both researchers and engineers.

In recent years there has been a great deal of research on spam detection and filtering. However, as the techniques adopted by spammers become more sophisticated, detection and filtering technologies must improve constantly as well. Against this background, this research makes significant progress on several important aspects of the spam problem. Its main content and contributions are:

1. A theoretical proof of the feasibility of computing spam similarity with fingerprint vectors. Edit distance is among the most accurate techniques for finding similar documents, but its high computational complexity severely limits its scope of application. By comparison, q-gram distance is less accurate but much cheaper and faster to compute. Fingerprinting, which builds on the idea of q-grams, improves computing speed further, but at the cost of an even higher error rate. This thesis provides a theoretical analysis of the relationship between q-gram distance and fingerprinting.

2. In addition to the theoretical analysis, the effectiveness of fingerprint vectors in spam filtering is verified by experiment. Extensive spam detection experiments using fingerprinting were carried out on several public spam corpora. The results indicate that the fingerprinting-based Bayes method significantly improves on the traditional Naive Bayes method in accuracy, storage, and computing demand.

3. A novel online active learning method is proposed. This stream-oriented filtering method uses a committee-based polling technique. To evaluate its effectiveness, several experiments were run on public corpora. The results demonstrate that, compared with other methods, it achieves higher accuracy with fewer labeled examples and less human intervention.

4. A gradient-based method is proposed for Conditional Random Field (CRF) parameter estimation in handling obfuscated emails. Experiments show that, after de-obfuscation, traditional filtering methods achieve higher accuracy.
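
The trade-off among edit distance, q-gram distance, and fingerprinting described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the thesis builds its fingerprint vectors, so the scheme below (hashing each q-gram into one of `dims` buckets and comparing bucket counts) is only an assumption made for illustration. The function names and the parameters `q` and `dims` are likewise hypothetical.

```python
import hashlib
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def qgram_profile(text, q=3):
    """Count every substring of length q in text."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a, b, q=3):
    """L1 distance between q-gram profiles, linear in the input length."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


def fingerprint_vector(text, q=3, dims=64):
    """ASSUMED scheme: hash each q-gram into one of dims buckets."""
    vec = [0] * dims
    for gram, count in qgram_profile(text, q).items():
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += count
    return vec


def fingerprint_distance(a, b, q=3, dims=64):
    """L1 distance between the fixed-size fingerprint vectors."""
    va = fingerprint_vector(a, q, dims)
    vb = fingerprint_vector(b, q, dims)
    return sum(abs(x - y) for x, y in zip(va, vb))


if __name__ == "__main__":
    original = "buy cheap pills now"
    variant = "buy che4p pi11s now"
    print("edit distance:       ", edit_distance(original, variant))
    print("q-gram distance:     ", qgram_distance(original, variant))
    print("fingerprint distance:", fingerprint_distance(original, variant))
```

Under this assumed scheme, hash collisions merge distinct q-grams into one bucket, so the fingerprint distance can never exceed the q-gram distance. That lost information is one way to picture the additional error the abstract attributes to fingerprinting, in exchange for a fixed-size vector that is cheap to store and compare.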
Keywords/Search Tags: Spam, Fingerprint, Edit Distance, Q-gram Distance, Active Learning, Conditional Random Field