
Contributions To Several Key Issues Of Associative Text Classification

Posted on: 2007-02-05    Degree: Doctor    Type: Dissertation
Country: China    Candidate: T Y Qian    Full Text: PDF
GTID: 1118360242461917    Subject: Computer software and theory
Abstract/Summary:
With the rapid expansion of electronic publications on the World Wide Web, intranets and digital libraries, it has become increasingly important to index and organize textual information by category. The sheer volume of information makes it almost impossible to assign category labels manually, so it is helpful to automatically examine texts and categorize documents into one or more topics with a text classification system.

As a rule-based method, associative classification has attracted great interest, and a variety of useful methods have been developed. This dissertation applies associative classification to the text domain. It discusses the key problems arising from current approaches and then develops a new framework to address them. The methods proposed in this thesis are shown to outperform state-of-the-art classification algorithms in terms of both computational performance and classification accuracy.

The support threshold is a crucial parameter in associative classification. This dissertation first demonstrates that a low support threshold leads to better classifiers, since it provides a larger vocabulary and a more general description of the data; at the same time, it makes rule extraction difficult because of the huge number of rules produced. In order to remove genuinely bad rules while keeping beneficial ones, we present a general-to-specific pruning method that eliminates rules with both lower confidence and lower support than a more general rule. However, judging the general-to-specific ordering among rules is quite time consuming, so we exploit a vertical pruning strategy that retains the full power of complete pruning while substantially reducing the time cost.

Feature selection, a major research area in information retrieval owing to its capability for noise removal and dimensionality reduction, has played an important role in text classification. All current systems use feature selection as a separate preprocessing step.
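The general-to-specific pruning idea can be sketched as follows. This is an illustrative reading, not the dissertation's exact algorithm: rules are represented here as (antecedent, confidence, support) tuples, a rule r1 is taken to be more general than r2 when r1's antecedent is a proper subset of r2's, and a specific rule is pruned when some more general rule matches or beats it on both confidence and support.

```python
# Sketch of general-to-specific rule pruning (illustrative, not the
# dissertation's exact algorithm). A rule is pruned when a strictly
# more general rule (antecedent is a proper subset) has confidence
# and support at least as high.

def prune_general_to_specific(rules):
    """rules: list of (antecedent_frozenset, confidence, support)."""
    kept = []
    for items, conf, supp in rules:
        dominated = any(
            g_items < items and g_conf >= conf and g_supp >= supp
            for g_items, g_conf, g_supp in rules
        )
        if not dominated:
            kept.append((items, conf, supp))
    return kept

rules = [
    (frozenset({"ball"}), 0.9, 0.3),
    (frozenset({"ball", "goal"}), 0.7, 0.1),    # dominated: pruned
    (frozenset({"ball", "court"}), 0.95, 0.05), # more confident: kept
]
pruned = prune_general_to_specific(rules)
```

The quadratic scan over rule pairs is exactly the cost the vertical pruning strategy mentioned above is meant to avoid; this sketch only shows what the pruning criterion keeps and discards.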
In the context of associative text classification, predefined features may not naturally become frequent itemsets. We show that the traditional metrics used in feature selection can be transformed into expressions of support and confidence. Moreover, we develop a method that integrates feature selection into the rule extraction process. By doing so, we can compute these metrics almost for free and dynamically determine the best number of features without saving multiple copies of the original data.

This thesis also strengthens the classification method. If multiple rules exhibiting a general-to-specific relationship match the same test file, a "receiver deciding" strategy is adopted to adaptively select the rule that best matches the receiver (the test document). In this way, we achieve a trade-off between high precision and high recall. We also introduce a normalization factor and a confidence interval to counteract the deterioration caused by imbalances in quality and quantity among different classifiers.

Different association patterns are examined as candidate patterns for text classification in this thesis. First, sentence-level frequent itemsets, together with a dedicated rule selection and classification algorithm, are introduced into text classification, and the pros and cons of this approach are thoroughly analyzed. Second, we compare two document-level association patterns, hyperclique patterns and frequent patterns, in terms of rule number, training time and classification performance, and we draw an important conclusion: hyperclique patterns are more suitable for text categorization tasks than frequent itemsets. Moreover, extensive experiments on several real data sets show that our hyperclique-based associative text classification achieves higher performance than SVM.

At the end of this dissertation, we analyze the influence of the characteristics of the Chinese language on text categorization. We design a prefix-hash-tree data structure to convert Chinese documents into transaction data.
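The hyperclique patterns compared above are defined by the h-confidence measure: hconf(P) = supp(P) / max item support in P, and a pattern is a hyperclique pattern when its h-confidence clears a threshold. The following is a minimal sketch of that measure; the toy transactions and the 0.6 threshold are invented for illustration.

```python
# Illustrative computation of h-confidence, the measure defining
# hyperclique patterns: hconf(P) = supp(P) / max_{i in P} supp({i}).
# Transactions and the threshold are made-up example data.

from itertools import combinations

transactions = [
    {"match", "goal", "team"},
    {"match", "goal"},
    {"match", "team"},
    {"stock", "price"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def h_confidence(itemset):
    return support(itemset) / max(support({i}) for i in itemset)

# Keep only the 2-itemsets whose h-confidence clears the example bar.
items = sorted({i for t in transactions for i in t})
hypercliques = [
    set(p) for p in combinations(items, 2)
    if h_confidence(set(p)) >= 0.6
]
```

Because h-confidence penalizes patterns that mix items with very different supports, hyperclique patterns tend to group strongly correlated terms, which is one intuition for why they suit text categorization.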
We also define the class frequency of terms to quantify, partition and filter Chinese words. As a consequence, the size of the database is greatly reduced and classification accuracy is improved. The resulting tool has been successfully applied in the e-government domain.
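The class-frequency filtering idea can be sketched as follows. This is an assumption-laden illustration: the corpus, the cf definition (number of classes whose documents contain a term), and the "drop terms occurring in every class" cutoff are all invented here to make the idea concrete.

```python
# Sketch of class-frequency-based term filtering (illustrative data).
# cf(term) = number of classes whose documents contain the term.
# Terms that occur in every class carry little class information
# and are filtered before the documents become transaction data.

from collections import defaultdict

docs = [
    ("sports", ["match", "goal", "the"]),
    ("finance", ["stock", "price", "the"]),
    ("politics", ["vote", "price", "the"]),
]

def class_frequency(docs):
    seen = defaultdict(set)
    for label, terms in docs:
        for t in terms:
            seen[t].add(label)
    return {t: len(classes) for t, classes in seen.items()}

cf = class_frequency(docs)
n_classes = len({label for label, _ in docs})
# Hypothetical cutoff: drop terms that appear in every class.
vocab = {t for t, f in cf.items() if f < n_classes}
```

Filtering by class frequency shrinks the transaction database before rule mining, which matches the abstract's claim that the database size is greatly reduced.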
Keywords/Search Tags: text classification, association rules, frequent itemsets, hyperclique patterns, rule extraction, feature selection, classification model