
Research On Text Categorization Method By Active Multi-Field Learning For Spam Filtering

Posted on: 2012-05-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Y Liu
GTID: 1118330341451633
Subject: Computer Science and Technology
Abstract/Summary:
Spam filtering is one of the key technologies for improving the availability of network information. Although spam filtering has been extensively investigated and many advances have been made, practical applications and evaluations show that many problems remain unsolved. As a consequence, both academia and industry have conducted in-depth research on spam filtering technologies in recent years. Currently, many researchers use statistical text categorization (TC) methods to address spam filtering. This dissertation explores five problems in statistical online binary TC for spam filtering: the main framework, the splitting of field documents, the combining of field results, the space-time-efficient classification of field documents, and costly feedback. It proposes a series of methods to solve them. We validate the proposed methods with an email spam filtering experiment on the TREC07P collection, a short message service (SMS) spam filtering experiment on the CSMS collection, and a multiclass document classification experiment on the TanCorp collection, and make the following contributions:

(1) We investigate the text structure of information documents in detail and find that many of them have a multi-field structure. Based on this finding, we propose a multi-field learning (MFL) framework, which uses a divide-and-conquer idea to break a complex TC problem over multi-field documents into several simple field sub-problems. Each sub-problem has its own feature space and statistical TC model. Experimental results show that the MFL framework is an effective main framework for statistical online binary TC. In this framework, text features are more independent between fields, and the TC model in each field is more targeted.
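The divide-and-conquer idea of the MFL framework can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the dissertation: the field names (`subject`, `body`), the per-field model (a tiny online naive Bayes with equal class priors), and the class names `FieldModel` and `MultiFieldClassifier`. The field results are combined with a plain arithmetic average, the simplest of the combining strategies named in contribution (3).

```python
import math
from collections import Counter


class FieldModel:
    """Tiny online naive Bayes for ONE field (illustrative stand-in model)."""

    def __init__(self):
        # Each field keeps its own feature space: separate token counts per class.
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def update(self, text, label):
        """Incrementally learn from one labeled field document."""
        for token in text.lower().split():
            self.counts[label][token] += 1
            self.totals[label] += 1

    def spam_score(self, text):
        """Return a spam probability in [0, 1]; 0.5 means no evidence."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) + 1
        log_odds = 0.0
        for token in text.lower().split():
            # Laplace-smoothed likelihoods; Counter returns 0 for unseen tokens.
            p_spam = (self.counts["spam"][token] + 1) / (self.totals["spam"] + vocab)
            p_ham = (self.counts["ham"][token] + 1) / (self.totals["ham"] + vocab)
            log_odds += math.log(p_spam / p_ham)
        # Clamp to avoid overflow in exp() for very long texts.
        log_odds = max(-30.0, min(30.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


class MultiFieldClassifier:
    """One independent FieldModel per field, combined by arithmetic average."""

    def __init__(self, field_names):
        self.models = {name: FieldModel() for name in field_names}

    def update(self, document, label):
        for name, model in self.models.items():
            model.update(document.get(name, ""), label)

    def field_scores(self, document):
        return {name: model.spam_score(document.get(name, ""))
                for name, model in self.models.items()}

    def spam_score(self, document):
        scores = self.field_scores(document)
        return sum(scores.values()) / len(scores)


# Usage: documents are dicts that map a field name to that field's text.
clf = MultiFieldClassifier(["subject", "body"])
clf.update({"subject": "cheap pills now", "body": "buy cheap pills online"}, "spam")
clf.update({"subject": "meeting agenda",
            "body": "see the agenda for the monday meeting"}, "ham")
score = clf.spam_score({"subject": "cheap pills", "body": "buy pills online"})
```

Because each field model only ever sees its own field's tokens, a word such as "meeting" can carry different evidence in a subject line than in a body, which is one way to read the claim that features are more independent between fields.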
In each sub-problem, feature extraction and TC model construction are both straightforward and efficient.

(2) We investigate the splitting problem of field documents and propose a natural field document splitting (NFDS) strategy and an attribute-specific field document splitting (ASFDS) strategy. The NFDS strategy splits a document into several field documents at the positions given by its natural field structure. The ASFDS strategy, a technique for reusing text features, applies rules to extract highly discriminative text and form a field document that does not exist as such in the original document. Experimental results show that the NFDS strategy is general for the common multi-field structure of information documents, and that the ASFDS strategy is better suited to short text documents because it mitigates the problem of sparse features.

(3) We investigate the combining problem of field results in depth and propose five combining strategies: arithmetical average, support vector model, historical performance of field classifiers, text quantity of field documents, and compound. Experimental results show that all five strategies improve the performance of previous TC algorithms, and that the compound strategy, which considers both the historical performance of field classifiers and the current text quantity of field documents, achieves promising time complexity and precision.

(4) We investigate the token frequency distribution in document collections in detail and find that it commonly follows a power law. Based on this finding, we propose a token frequency index (TFI) based TC algorithm.
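The power-law observation in contribution (4) can be checked roughly on any token collection. The sketch below is a generic diagnostic, not the procedure used in the dissertation (the abstract does not say how the distribution was measured). It ranks tokens by frequency and fits a straight line to log(frequency) against log(rank); the function names `rank_frequency` and `power_law_exponent` are hypothetical. A roughly linear log-log plot with a stable slope is consistent with a power law, although a least-squares fit alone is not a rigorous statistical test.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return token frequencies sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def power_law_exponent(frequencies):
    """Estimate s in frequency ~ C / rank**s by least squares in log-log space.

    `frequencies` must be positive and ordered by rank (rank 1 first).
    """
    n = len(frequencies)
    if n < 2:
        raise ValueError("need at least two ranked frequencies")
    xs = [math.log(rank) for rank in range(1, n + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope is negative for a decaying distribution; return its magnitude.
    return -covariance / variance


# Usage with a synthetic, exactly Zipfian frequency list (frequency = 1000 / rank).
synthetic = [1000.0 / rank for rank in range(1, 51)]
exponent = power_law_exponent(synthetic)
```

On real text, `rank_frequency(text.split())` supplies the input, and the fit is usually restricted to the middle ranks because the highest and lowest ranks tend to deviate from the line.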
The TFI-based algorithm transfers a research idea from text retrieval to TC problems. It uses equal-probability random sampling to compress labeled documents online, and it addresses a hard problem in traditional statistical TC methods: turning an a posteriori rule computed over offline batches into an a priori rule that can be computed online. The TFI data structure has low time complexity for each query and each incremental update, and it compresses data in two ways: through the raw text compression inherent in indexes and through random sampling. The TFI can therefore capture varied content and concept drift in a space-time-efficient way. Experimental results show that the TFI-based TC algorithm solves the space-time-efficient classifying problem of field documents and that, integrated into the MFL framework, it achieves state-of-the-art performance with low space-time complexity and high precision. Moreover, we extend the TFI idea and propose a multiclass token frequency index (MTFI) based TC algorithm. Experimental results show that the MTFI-based TC algorithm is effective for multiclass document classification.

(5) For the costly feedback problem, we propose three active learning strategies: chronological priority, priori range, and variance-based uncertainty sampling. The variance-based uncertainty sampling strategy exploits the disagreement among several field classifiers: it compares the current variance of the field results with a threshold derived from historical variances to choose informative documents for which to request user feedback. Experimental results show that variance-based uncertainty sampling is the best of the three strategies and achieves promising performance with greatly reduced user feedback.
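The variance-based uncertainty sampling idea in contribution (5) can be sketched as follows. The abstract only states that the current variance of the field results is compared with a threshold of historical variances. The concrete choices below are assumptions for illustration: the threshold is taken to be the running mean of all past variances, the first document is always queried, and the class name `VarianceSampler` is hypothetical.

```python
from statistics import pvariance


class VarianceSampler:
    """Ask for user feedback when field classifiers disagree more than usual."""

    def __init__(self):
        self.seen = 0             # number of documents scored so far
        self.mean_variance = 0.0  # running mean of past variances (assumed threshold)

    def should_query(self, field_scores):
        """Return True if this document's label should be requested.

        `field_scores` is a non-empty list of spam scores in [0, 1],
        one per field classifier.
        """
        current = pvariance(field_scores)
        # Assumption: query the very first document, then query only when the
        # disagreement exceeds the historical average disagreement.
        query = self.seen == 0 or current > self.mean_variance
        # Constant-space incremental update of the running mean.
        self.seen += 1
        self.mean_variance += (current - self.mean_variance) / self.seen
        return query


# Usage: each inner list holds the scores of two field classifiers.
sampler = VarianceSampler()
stream = [[0.5, 0.5], [0.9, 0.1], [0.8, 0.7], [0.95, 0.05]]
decisions = [sampler.should_query(scores) for scores in stream]
```

The sampler stores only a counter and one running mean, so each decision costs time linear in the number of fields and constant extra space, in line with the low space-time cost the abstract attributes to variance computing.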
Owing to the low space-time complexity of computing a variance, variance-based uncertainty sampling is an effective active learning strategy.

In conclusion, this dissertation investigates several key problems in spam filtering and proposes a series of TC methods built around the MFL idea. The proposed TC methods can meet the practical application requirements of spam filtering, and further research on MFL can be expected to achieve higher performance.
Keywords/Search Tags:Spam Filtering, Text Categorization, Multi-Field Learning, Active Learning, Power Law, Token Frequency Index, Variance-Based Uncertainty Sampling, TREC Evaluation