Research On Theory Of Spam Filtering And Its Key Techniques

Posted on: 2009-09-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Liu
Full Text: PDF
GTID: 1118360245461932
Subject: Computer application technology
Abstract/Summary:
The flood of spam has become one of the Internet's disasters and has attracted wide attention. Since the first spam appeared in the mid-1980s, various anti-spam strategies and techniques have emerged along with it and developed rapidly up to the present. However, investigations into anti-spam problems have led researchers into an "uncertainty garden". The subjective and objective uncertainties that pervade spam discrimination have caused serious performance bottlenecks in the available automated machine classification and filtering methods. On the other hand, after years of research, it has been found that uncertain intelligent computing techniques can, to some extent, handle certain uncertain problems in practical engineering applications. Although the theory is not perfect, researchers continue to explore the rules behind the uncertainties and have achieved staged successes, since they believe God would not simply toss dice to create human beings. We likewise hold that uncertain intelligent computing techniques can handle, in some respects, the subjective and objective uncertainties involved in discriminating spam. Therefore, studying uncertain intelligent computing theories and applying them to anti-spam is the central task of this dissertation. The involvement of these theories makes research on spam filtering a job full of challenges and delights.

This dissertation draws comprehensively on the latest achievements in uncertain intelligent computing and spam filtering, and investigates both in depth from the two aspects of theory and application. The main research results and innovations can be summarized as follows:

(1) The background of the spam issue is systematically analyzed, and the academic importance and practical value of investigating it are emphasized.
By tracing the latest progress in the spam filtering area, various popular anti-spam approaches are compared. From these comparisons and our analysis, we conclude that statistics-based uncertain intelligent computing theories are feasible tools for improving the performance of spam filtering systems and are worth careful investigation.

(2) Improved approaches and innovative methods for Bayesian networks are proposed. First, for less complicated networks, a PPJT algorithm based on global message propagation is proposed. The new algorithm reduces time complexity while meeting the precision requirement in a less complicated network with small-scale sample input. Second, for complicated networks with a polytree structure, the inference algorithm is extended to a multi-machine mode. By analyzing the structure of such networks and defining a new parallel evidence format suited to a multi-machine environment, a parallel inference algorithm is proposed that improves evidence propagation performance in a large polytree-structured Bayesian network. Finally, parameter learning under incomplete evidence input is investigated. By applying a standard likelihood function to construct an evidence-loss computing model and using the χ² distance to estimate the error caused by evidence loss, an EM algorithm that incorporates a learning rate is derived. Compared with the traditional method, the new algorithm converges much faster without loss of precision and yields a trustworthy Bayesian network parameter estimate under incomplete evidence input.

(3) A kernel function-based Bayesian parameter estimation approach is proposed, which makes the parameter estimation more applicable. Taking both email content and format into account, a Bayesian network for spam classification is constructed.
Test results from online learning on different email test sets show that, by applying the kernel function-based Bayesian parameter estimation approach to the classification network, the new model classifies and filters efficiently.

(4) An improved fitted logistic regression model is used to train the email classifier. By introducing a coefficient function, the characteristic of partial dependency (CPD) is imitated in the model. Test results on various email test sets indicate that the new model is much more sensitive to false positives than to false negatives, and thus realizes a new email classifier with CPD at the algorithm level.

(5) To avoid the difficulties that content-based spam filters have encountered, spam categorization is studied from another point of view, namely spammers' behavior patterns. The new categorization model is constructed by extracting and selecting an email feature vector closely related to spammers' behavior features and applying the SVM method to generate a classification function. After careful model design and simulation tests, we found the new categorization model to be accurate and robust for spam discrimination.

(6) A spam filtering system, SpamWeeder, is designed with a multi-layer structure and located at the front end of the email server. SpamWeeder integrates the approaches brought forward in this dissertation: Naive Bayes email classification based on a multi-level attribute set, email classification based on a Bayesian network, email classification based on features of spammers' behavior, and email classification based on the logistic regression model. Through the coordination of these approaches, SpamWeeder is easy to manage, meets individual requirements, and achieves precise, fast and efficient performance.
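
As a rough illustration of the idea in item (4), a classifier that reacts more strongly to false positives than to false negatives, the following minimal Python sketch trains a logistic regression on a class-weighted log loss. The abstract does not specify the dissertation's coefficient function, so a constant weight on legitimate mail is swapped in here as a stand-in; the names `train_weighted_logreg`, `spam_probability` and `ham_weight`, as well as the weight value, are hypothetical and not taken from the dissertation.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_weighted_logreg(X, y, ham_weight=5.0, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent on a weighted log loss.

    y[i] is 1 for spam and 0 for legitimate mail (ham). The loss term of every
    ham example is multiplied by ham_weight (an assumed constant, standing in
    for the unspecified coefficient function), so mistakes on ham, which would
    become false positives, are penalized more than mistakes on spam.
    """
    n = len(X)
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            weight = 1.0 if yi == 1 else ham_weight
            err = weight * (p - yi)  # gradient of the weighted log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def spam_probability(w, b, x):
    # Estimated probability that the message with feature vector x is spam.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

With a weight above 1, a feature pattern that occurs equally often in spam and in legitimate mail is pushed toward the legitimate side, which lowers the false positive ratio at the cost of a higher false negative ratio. A fixed decision threshold of 0.5 on `spam_probability` is assumed here; how the dissertation sets its threshold is not stated in the abstract.
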
Keywords/Search Tags: Spam, False positive ratio, False negative ratio, Bayesian network, Evidence theory, Kernel density estimation, Support vector machine, Logistic regression