The Study Of Streaming Text Representation Method Based On Suffix Tree Model (STM) And Its Application

Posted on: 2006-04-05  Degree: Master  Type: Thesis
Country: China  Candidate: J Zhang  Full Text: PDF
GTID: 2178360185996992  Subject: Computer application technology
Abstract/Summary:
With the growing popularity of the Internet, the management of data streams is attracting more and more attention. Unlike static data, streaming text requires a representation that can be updated quickly. With network content security as the target application, we review existing representation methods and propose a representation method based on the suffix tree model (STM). The advantages of this method are as follows:

(1) Because the model is backed by a suffix tree, changes to the training corpus affect subsequent operations in real time.
(2) The suffix tree allows fast matching and yields the vector representation of a text directly, avoiding costly steps such as word segmentation and feature extraction.
(3) The model exploits context location and matches strings of variable length, which provides more information to subsequent operations.
(4) Because word segmentation and feature extraction are avoided, the categorization process does not depend on any particular language; the method is language independent.

Building on SpamAssassin, a free spam filtering program, we combined the representation method with a classification algorithm and implemented a spam filtering system. The system is based on the suffix tree model, uses context location and variable-length string matching, and computes the similarity between the test mail and the corpus to determine the class of the e-mail. Experiments and analysis of the algorithm show that: the time complexity of text preprocessing in our system is O(N), which satisfies the required update speed; the time complexity of filtering in our system is O(N); ...
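The idea of matching raw character strings of variable length against an incrementally updated corpus can be illustrated with a small sketch. This is not the thesis's implementation: it assumes a depth-limited suffix trie instead of a true linear-time suffix tree, and the names `SuffixTrie`, `max_depth`, `score`, and `classify`, as well as the scoring rule itself, are hypothetical choices made for illustration only.

```python
class Node:
    """One trie node: an occurrence count plus child nodes keyed by character."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}


class SuffixTrie:
    """Depth-limited suffix trie over raw characters (illustrative assumption)."""

    def __init__(self, max_depth=8):
        self.max_depth = max_depth
        self.root = Node()

    def add(self, text):
        """Insert every suffix of text, truncated to max_depth characters.

        New training text can be added at any time, so later queries see it
        immediately. Cost is O(len(text) * max_depth).
        """
        for start in range(len(text)):
            node = self.root
            for ch in text[start:start + self.max_depth]:
                node = node.children.setdefault(ch, Node())
                node.count += 1

    def longest_match(self, text, start):
        """Length of the longest prefix of text[start:] found in the trie."""
        node = self.root
        length = 0
        for ch in text[start:start + self.max_depth]:
            node = node.children.get(ch)
            if node is None:
                break
            length += 1
        return length

    def score(self, text):
        """Mean longest-match length per position, scaled to at most 1."""
        if not text:
            return 0.0
        total = sum(self.longest_match(text, i) for i in range(len(text)))
        return total / (len(text) * self.max_depth)


def classify(text, spam_trie, ham_trie):
    """Label text by whichever corpus it overlaps more (ties go to ham)."""
    return "spam" if spam_trie.score(text) > ham_trie.score(text) else "ham"


if __name__ == "__main__":
    spam, ham = SuffixTrie(), SuffixTrie()
    spam.add("buy cheap pills now")
    ham.add("meeting agenda for monday")
    print(classify("cheap pills", spam, ham))
```

No word segmentation or feature extraction takes place: the text is handled purely as a character sequence, which is why such an approach does not depend on the language of the mail. A real suffix tree would drop the depth limit while keeping construction linear in the corpus size.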
Keywords/Search Tags: Representation of streaming text, suffix tree model, text categorization, spam filtering