
Streaming Data Characteristics-based Spam Filtering Technology

Posted on: 2010-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: J Xu
Full Text: PDF
GTID: 2208360275491800
Subject: Computer application technology
Abstract/Summary:
The escalation of spam has severely interfered with daily life and threatens information security. As anti-spam technologies progress, spammers keep changing their strategies to evade current filters, so spam production and spam filtering remain opposing sides of an ongoing game. The sheer volume and the variable forms of spam pose a severe challenge to filtering. Compared with the general text categorization problem, spam filtering has its own characteristics. In terms of content, spam arrives in different languages, different encodings, and different file formats. In terms of scale, spam behaves like stream data: the volume to be processed is large, it grows without bound, and it changes dynamically. In addition, a practical anti-spam system places high demands on time and space cost so that filtering is fast and efficient. Based on these characteristics, we propose a spam filtering method built on the characteristics of stream data. The method evolves from inexact string matching and fuzzy weighting, and it can adjust the effective features used for filtering in real time. It reduces the time and space cost of filtering considerably while keeping the accuracy of the filter high, and our experiments confirm this advantage in comparison with other spam filtering methods.

The work in this thesis mainly includes the following parts. First, we give an overview of how email works and of the current state of research on anti-spam methods, point out the merits and deficiencies of these methods, and analyze the key difficulties they face. Second, we introduce a string-based spam filtering method and analyze its merits; we then design a fuzzy weighting method based on inexact string matching and demonstrate its advantage in spam filtering by experiment. Third, we review current research progress in stream-data mining and, based on the stream-data characteristics of email, propose a stream-data-based method that builds on the previous work. This method improves the accuracy of the spam filter and significantly reduces the time and space cost of filtering. Fourth, we design and implement a time-stream-based spam filtering system and demonstrate the effectiveness of the proposed method in comparison with other methods. Finally, we summarize the achievements of this thesis and look ahead to new challenges in the anti-spam field.
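
The abstract does not give the algorithm itself, so the following is only an illustrative sketch of the general idea, not the method of the thesis. It assumes character n-grams as a stand-in for inexact string matching, an exponential decay factor as a stand-in for real-time feature adjustment, and a hard cap on the feature table to bound memory. The class name `StreamingNgramFilter` and every parameter (`n`, `decay`, `max_features`) are hypothetical.

```python
def char_ngrams(text, n=4):
    """Return the set of overlapping character n-grams in text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class StreamingNgramFilter:
    """Toy online filter: decayed n-gram weights in a bounded table.

    Hypothetical illustration only; not the algorithm from the thesis.
    """

    def __init__(self, n=4, decay=0.99, max_features=50000):
        self.n = n
        self.decay = decay                # assumed forgetting factor per message
        self.max_features = max_features  # assumed cap on memory use
        self.counts = {}                  # n-gram -> [spam_weight, ham_weight]

    def score(self, text):
        """Return a spam score in [0, 1]; 0.5 means no known feature matched."""
        total, seen = 0.0, 0
        for gram in char_ngrams(text, self.n):
            entry = self.counts.get(gram)
            if entry is None:
                continue
            spam, ham = entry
            # Smoothed spam ratio for this feature.
            total += (spam + 1.0) / (spam + ham + 2.0)
            seen += 1
        return total / seen if seen else 0.5

    def update(self, text, is_spam):
        """Learn from one labeled message arriving on the stream."""
        # Age every stored feature so old evidence fades over time.
        # This is O(table size) per message; a real system would decay lazily.
        for entry in self.counts.values():
            entry[0] *= self.decay
            entry[1] *= self.decay
        idx = 0 if is_spam else 1
        for gram in char_ngrams(text, self.n):
            self.counts.setdefault(gram, [0.0, 0.0])[idx] += 1.0
        if len(self.counts) > self.max_features:
            self._prune()

    def _prune(self):
        """Keep only the max_features n-grams with the largest total weight."""
        ranked = sorted(self.counts.items(),
                        key=lambda kv: kv[1][0] + kv[1][1],
                        reverse=True)
        self.counts = dict(ranked[:self.max_features])
```

In this sketch, features that stop appearing lose weight and are eventually pruned, which is one simple way to keep both memory and scoring time bounded on an unbounded stream.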
Keywords/Search Tags: spam, stream data, time stream, text classification, feature selection