
Spam Filtering System Based on an Improved SVM

Posted on: 2012-07-08 | Degree: Master | Type: Thesis
Country: China | Candidate: X L Chen | Full Text: PDF
GTID: 2208330332490338 | Subject: Communication and Information System
Abstract/Summary:
With the development of network and communication technology, e-mail is becoming the most important means of communication because it is convenient, fast, and easy to store and manage. In recent years, however, spam has become a key problem in electronic communication. Spam frustrates, confuses, and annoys e-mail users: it wastes their time, spreads viruses, and so on. Anti-spam research therefore has important practical significance.

Facing the increasingly serious spam problem, spam filtering based on machine learning has become a research hotspot. As one of the most popular machine learning techniques, Support Vector Machines (SVMs) have contributed greatly to the development of spam filtering. However, many issues in the research and application of spam filtering remain unresolved: for example, how to raise the blocking rate for spam while lowering the misclassification rate for legitimate e-mail, how to exploit the structure and nature of the text in filtering, and how to integrate existing spam filtering technologies into a complete filtering solution.

This thesis studies these problems and builds a fully functional spam filtering system. The main work is as follows:

1. To reduce the misclassification rate of legitimate e-mail while filtering spam, this thesis proposes a spam filtering method based on Weighted Support Vector Machines (WSVMs).

SVM-based filtering is well suited to spam filtering, but the standard SVM optimizes overall classification accuracy. As a result, when it is used for spam filtering its accuracy is very high but its precision is low, that is, too many legitimate e-mails are flagged as spam. To solve this problem, this thesis proposes a spam filtering method based on WSVMs.
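The weighting idea can be pictured with a small sketch. It is not the thesis's implementation: it assumes a linear SVM trained by plain subgradient descent, with the hinge penalty of each message scaled by a per-class weight and a per-e-mail weight (the two kinds of weight the abstract describes). The function names, the label encoding (+1 for spam, -1 for legitimate mail), and the toy data are all assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_objective(w, b, X, y, class_weight, sample_weight, C=1.0):
    """0.5*||w||^2 + C * sum_i class_weight[y_i] * sample_weight[i] * hinge_i."""
    hinge = sum(
        class_weight[yi] * si * max(0.0, 1.0 - yi * (dot(w, xi) + b))
        for xi, yi, si in zip(X, y, sample_weight)
    )
    return 0.5 * dot(w, w) + C * hinge


def train_weighted_svm(X, y, class_weight, sample_weight,
                       C=1.0, lr=0.01, epochs=500):
    """Full-batch subgradient descent on the weighted objective above."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)          # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for xi, yi, si in zip(X, y, sample_weight):
            if yi * (dot(w, xi) + b) < 1.0:   # margin violated
                scale = C * class_weight[yi] * si * yi
                grad_w = [g - scale * xj for g, xj in zip(grad_w, xi)]
                grad_b -= scale
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1


# Toy usage (hypothetical data): one feature, and mistakes on legitimate
# mail (-1) cost twice as much as mistakes on spam (+1).
X = [[2.0], [3.0], [-2.0], [-3.0]]
y = [1, 1, -1, -1]
w, b = train_weighted_svm(X, y, {1: 1.0, -1: 2.0}, [1.0] * 4)
```

Raising the weight of the legitimate class makes a false positive more expensive than a missed spam message, which is the trade-off the WSVM method is meant to control.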
The WSVM method introduces two weight variables: one reflects the importance of each class, and the other reflects the importance of each individual e-mail. Experimental results show that adjusting these weights effectively improves filtering performance.

2. Much semantic information is lost in spam filtering because text structure is neglected. To solve this problem, this thesis proposes a word sequence kernel based on a dependence measure and applies it to spam filtering, which improves filtering accuracy.

Most commonly used kernels ignore the structure of the text when classifying, so a great deal of semantic information is lost. The proposed method works in three steps. First, the features of each e-mail are extracted and the dependence measure of each feature is calculated. Then an SVM is trained with the word sequence kernel as its kernel function; during training, the decay factor of each feature is calculated by taking that feature's dependence measure into account. Finally, the optimized SVM filter is used to filter spam. Experimental results show that the improved word sequence kernel achieves the best accuracy compared with the commonly used kernels and the string subsequence kernel.

3. Because a single filtering technique cannot achieve satisfactory results, this thesis proposes a multi-level spam filtering solution and builds a fully functional spam filtering system.

Based on the advantages and disadvantages of the various e-mail filtering techniques, this thesis integrates an IP address filter, a DNS filter, black and white lists, a keyword filter for subjects and attachments, and a text content filter into the multi-level spam filtering solution.
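One way to picture such a multi-level solution is as a chain of stages in which each stage either issues a verdict or passes the message on. The sketch below is an illustration only. The abstract names the filters but not how they are chained, so the stage order, the verdict names, the `Email` fields, and every function name here are assumptions. The DNS check and the content classifier are injected as plain callables so that the example runs without a network or a trained model.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional

SPAM, HAM = "spam", "ham"


@dataclass
class Email:
    sender: str
    sender_ip: str
    subject: str
    body: str
    attachment_names: List[str] = field(default_factory=list)


# A stage returns SPAM or HAM to stop the chain, or None to defer.
Stage = Callable[[Email], Optional[str]]


def ip_filter(blocked_ips) -> Stage:
    return lambda mail: SPAM if mail.sender_ip in blocked_ips else None


def dns_filter(domain_resolves: Callable[[str], bool]) -> Stage:
    def stage(mail: Email) -> Optional[str]:
        domain = mail.sender.rsplit("@", 1)[-1]
        return None if domain_resolves(domain) else SPAM
    return stage


def list_filter(whitelist, blacklist) -> Stage:
    def stage(mail: Email) -> Optional[str]:
        if mail.sender in whitelist:
            return HAM
        if mail.sender in blacklist:
            return SPAM
        return None
    return stage


def keyword_filter(keywords: Iterable[str]) -> Stage:
    lowered = [k.lower() for k in keywords]

    def stage(mail: Email) -> Optional[str]:
        # Only the subject and attachment names are inspected here.
        text = " ".join([mail.subject, *mail.attachment_names]).lower()
        return SPAM if any(k in text for k in lowered) else None
    return stage


def content_filter(is_spam: Callable[[str], bool]) -> Stage:
    # Last stage: always decides, e.g. by calling a trained SVM.
    return lambda mail: SPAM if is_spam(mail.body) else HAM


def classify(mail: Email, stages: Iterable[Stage]) -> str:
    for stage in stages:
        verdict = stage(mail)
        if verdict is not None:
            return verdict
    return HAM  # nothing objected
```

Cheap checks run first, so the comparatively expensive content classifier only sees messages that the earlier stages could not decide.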
In this solution, the strengths of each technique are put to good use and their shortcomings are avoided. Finally, a fully functional spam filtering system was built.
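For the word sequence kernel in item 2, a brute-force sketch may help show what a per-word decay factor does. It assumes a gap-weighted kernel over word pairs in which each occurrence of a pair is weighted by the product of the decay factors of every word it spans. The abstract does not say how a decay factor is derived from the dependence measure, so the `decay` dictionary is simply taken as given, and all names and default values below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def subsequence_features(words, decay, n=2, default_decay=0.5):
    """Map each length-n word subsequence to its total decayed weight."""
    feats = defaultdict(float)
    for idx in combinations(range(len(words)), n):
        weight = 1.0
        # Every word in the spanned window, matched or skipped, decays
        # the weight; wider gaps therefore count for less.
        for k in range(idx[0], idx[-1] + 1):
            weight *= decay.get(words[k], default_decay)
        feats[tuple(words[i] for i in idx)] += weight
    return feats


def word_sequence_kernel(s, t, decay, n=2, default_decay=0.5):
    """Inner product of the two feature maps."""
    fs = subsequence_features(s.split(), decay, n, default_decay)
    ft = subsequence_features(t.split(), decay, n, default_decay)
    return sum(v * ft.get(u, 0.0) for u, v in fs.items())


def normalized_kernel(s, t, decay, n=2, default_decay=0.5):
    """Kernel scaled so that identical texts score about 1."""
    kss = word_sequence_kernel(s, s, decay, n, default_decay)
    ktt = word_sequence_kernel(t, t, decay, n, default_decay)
    if kss == 0.0 or ktt == 0.0:
        return 0.0
    return word_sequence_kernel(s, t, decay, n, default_decay) / sqrt(kss * ktt)
```

Giving a word a decay factor closer to 1 makes pairs that span it count more, which is one plausible way for a dependence measure to influence the kernel. Enumerating all pairs is quadratic in the message length; a dynamic-programming formulation would be needed for long texts.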
Keywords/Search Tags: Spam Filtering, Weighted Support Vector Machines, Word Sequence Kernels, Dependence Measure