Research On Text Filtering System Based On Active Learning

Posted on: 2012-08-05
Degree: Master
Type: Thesis
Country: China
Candidate: H Sun
GTID: 2178330335960292
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
To address the spam problem caused by the rapid growth of information in telecom and Internet services, we studied the theory and techniques of SMS and email filtering and propose a text filtering system based on active learning. The main contributions of this thesis are as follows.

First, we propose a two-phase active learning method that addresses the shortage of training samples at the start of training and improves the quality of the selected training samples. The traditional single-phase method does not account for the fact that the needs of training may differ from one phase to the next. Our method is based on maximum-minimum entropy theory and treats the training stage as two phases. In the initial phase, the filter has seen little data, so we select deterministic samples. In the later phase, the filter already has sufficient deterministic data, so we select samples near the classification boundary. Experiments showed that our algorithm outperforms the traditional one: it makes the filtering system more accurate and reduces the manual labeling effort.

Second, given the different characteristics of SMS and email filtering, we studied three filtering methods, namely multinomial Bayes, Bernoulli Bayes and the Vector Space Model (VSM), and identified the most effective one for each scenario. For email filtering, the vector space has high dimension, so we prefer multinomial Bayes based on term frequency, or VSM. For SMS filtering, the vector space has few features and low term frequencies, so the Bernoulli Bayes method performs best.

Finally, we designed and implemented the text filtering system based on active learning. The system has three stages: training, filtering and feedback.
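The two-phase sample selection described above can be illustrated with a minimal sketch. The abstract does not give the exact selection rule, so this sketch assumes that "deterministic" samples are those with minimum prediction entropy and that samples near the classification boundary are those with maximum entropy. The function names, the `switch_at` threshold and the use of the labeled-sample count to switch phases are all hypothetical.

```python
import math

def binary_entropy(p_spam):
    """Shannon entropy (in bits) of a two-class prediction with P(spam) = p_spam."""
    if p_spam <= 0.0 or p_spam >= 1.0:
        return 0.0
    q = 1.0 - p_spam
    return -(p_spam * math.log2(p_spam) + q * math.log2(q))

def select_batch(pool_probs, batch_size, labeled_count, switch_at):
    """Return indices of unlabeled samples to send for human labeling.

    pool_probs    -- current P(spam) estimate for each unlabeled sample
    labeled_count -- number of samples labeled so far
    switch_at     -- hypothetical labeled-sample count that ends the initial phase
    """
    scored = [(binary_entropy(p), i) for i, p in enumerate(pool_probs)]
    initial_phase = labeled_count < switch_at
    # Initial phase: lowest entropy first (assumed meaning of "deterministic").
    # Later phase: highest entropy first (samples near the decision boundary).
    scored.sort(key=lambda pair: pair[0], reverse=not initial_phase)
    return [i for _, i in scored[:batch_size]]

pool = [0.02, 0.48, 0.91, 0.55, 0.30]
early = select_batch(pool, 2, labeled_count=10, switch_at=100)
late = select_batch(pool, 2, labeled_count=500, switch_at=100)
print(early, late)
```

With this pool, the early call picks the two most confident predictions (0.02 and 0.91) and the late call picks the two closest to 0.5 (0.48 and 0.55).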
In experiments, the filtering system achieved satisfactory precision and recall. We used four classic feature selection methods to remove noise from the corpus: CHI (chi-square), Document Frequency, Information Gain and Mutual Information. At the training stage, we used active learning to select more valuable data. At the filtering stage, we applied the best method for each filtering scenario. At the feedback stage, we obtained feedback data with threshold-based methods, then applied relevance feedback and pseudo-relevance feedback based on the Rocchio method.
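As an illustration of the feedback stage, the sketch below shows the standard Rocchio update applied to a sparse term-weight profile, together with a threshold-based split of scored documents. It is not the thesis implementation: the function names, the `alpha`, `beta` and `gamma` values, the two thresholds and the choice to drop negative weights are assumptions made for the example.

```python
from collections import defaultdict

def centroid(docs):
    """Mean of sparse term-weight vectors, each given as a dict."""
    if not docs:
        return {}
    total = defaultdict(float)
    for doc in docs:
        for term, weight in doc.items():
            total[term] += weight
    return {term: weight / len(docs) for term, weight in total.items()}

def split_by_threshold(scored_docs, upper, lower):
    """Hypothetical pseudo-feedback step: trust only confidently scored documents."""
    relevant = [doc for doc, score in scored_docs if score >= upper]
    non_relevant = [doc for doc, score in scored_docs if score <= lower]
    return relevant, non_relevant

def rocchio_update(profile, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio: alpha * profile + beta * mean(relevant) - gamma * mean(non_relevant)."""
    pos, neg = centroid(relevant), centroid(non_relevant)
    updated = {}
    for term in set(profile) | set(pos) | set(neg):
        weight = (alpha * profile.get(term, 0.0)
                  + beta * pos.get(term, 0.0)
                  - gamma * neg.get(term, 0.0))
        if weight > 0.0:  # assumption: negative weights are discarded
            updated[term] = weight
    return updated

spam_profile = {"free": 1.0, "prize": 0.5}
scored = [({"free": 1.0, "win": 2.0}, 0.95),
          ({"win": 2.0}, 0.90),
          ({"meeting": 1.0, "free": 1.0}, 0.05),
          ({"lunch": 1.0}, 0.50)]  # ambiguous score, ignored by the split
relevant, non_relevant = split_by_threshold(scored, upper=0.9, lower=0.1)
new_profile = rocchio_update(spam_profile, relevant, non_relevant)
print(sorted(new_profile))
```

In this example the profile gains the term "win" from the confidently scored spam, while "meeting" appears only in the non-relevant document, receives a negative weight and is dropped.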
Keywords: text filtering, active learning, feature selection, relevance feedback