Research And Implementation Of Chinese Spam Filter Technology Based On Content Mining

Posted on: 2009-08-29 | Degree: Master | Type: Thesis
Country: China | Candidate: J M Xu | Full Text: PDF
GTID: 2178360242490825 | Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, online electronic information is booming, and electronic mail has become the fastest and most economical form of communication available. Unfortunately, junk mail (also referred to as "spam") has proliferated at the same time. Junk mail not only fills up mail server storage space but also forces users to spend considerable time deleting it. It is therefore important to develop an automated mail filter.

First, after a thorough survey of the relevant literature and data, we analyze and summarize current anti-spam measures. Spam filtering, an important anti-spam method, currently includes techniques based on IP addresses, manually written rules, and content filtering.

Second, this thesis mainly studies Chinese spam filtering technology based on content mining. The key technologies of an automated mail filter include text mining, preprocessing, and e-mail classification. To reduce the misclassification of both spam and legitimate e-mail, we study the basic principles and implementation mechanisms of these technologies and propose a spam filtering framework based on text mining; each part of this framework is then studied and improved. To increase word segmentation speed and the efficiency of the filtering system, we propose an improved forward maximum matching word segmentation algorithm that supports first-character hashing and binary (half) search. For feature extraction we use a sliding-window Advantage-Method; enlarging the range of vocabulary covered by feature extraction reduces the misclassification rate and increases classification accuracy. For classification, we put forward a hybrid model (HM) that combines the binomial model (BIM) with the multinomial model (MM). Applied to a minimum-risk Bayesian classifier, the hybrid model decreases the misclassification rate and increases the classification accuracy of the spam filtering system.

Finally, a prototype Chinese spam filtering system is designed within the above framework, based on text mining theory, and its design ideas and implementation details are given. In addition, we build an experimental test environment; the results show that the improved Bayesian algorithm not only filters junk mail effectively but also improves recall and precision compared with the traditional naive Bayesian algorithm.
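
The abstract does not give the data structures behind the improved segmentation algorithm, so the following is only a minimal sketch of the general idea of forward maximum matching with a first-character hash and binary search. The function names (`build_index`, `contains`, `segment`), the `max_len` parameter, and the toy dictionary are assumptions made for illustration; they are not taken from the thesis.

```python
from bisect import bisect_left

def build_index(words):
    """Group dictionary words by first character (the hash step).

    Each bucket is kept sorted so it can be binary-searched later.
    """
    index = {}
    for word in set(words):
        if word:
            index.setdefault(word[0], []).append(word)
    for bucket in index.values():
        bucket.sort()
    return index

def contains(index, word):
    """Look up a word: hash on its first character, then binary search."""
    bucket = index.get(word[0])
    if not bucket:
        return False
    i = bisect_left(bucket, word)
    return i < len(bucket) and bucket[i] == word

def segment(text, index, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = text[pos]  # fallback: emit one character
        longest = min(max_len, len(text) - pos)
        for length in range(longest, 1, -1):
            candidate = text[pos:pos + length]
            if contains(index, candidate):
                match = candidate
                break
        tokens.append(match)
        pos += len(match)
    return tokens

# Hypothetical toy dictionary, for illustration only.
index = build_index(["垃圾", "邮件", "垃圾邮件", "过滤", "技术"])
print(segment("垃圾邮件过滤技术", index))
```

In this sketch the first-character hash narrows each lookup to the words sharing the candidate's first character, and the binary search then costs a logarithmic number of comparisons in the size of that bucket. Whether the thesis organizes its dictionary this way is not stated in the abstract.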
Keywords/Search Tags: Spam, Text mining, Chinese word segmentation, Feature extraction, Bayesian algorithm, Mail filtering