
A multiple classifier learning methodology for feature-space heterogeneous text categories

Posted on: 2008-12-31
Degree: Ph.D.
Type: Dissertation
University: George Mason University
Candidate: Hadjarian, Ali R
Full Text: PDF
GTID: 1448390005979765
Subject: Computer Science
Abstract/Summary:
Text categorization is an information access task dealing with the automatic assignment of text documents to a set of pre-defined categories. This dissertation focuses on two aspects of the text categorization problem that remain largely unexplored by the research community: the complexity of the individual categories involved, and the closed-world assumption generally associated with such problems, under which the category of a given test document is necessarily represented in the training set.

The developed learning method tackles the complexities associated with feature-space heterogeneous categories through a novel problem space decomposition approach aimed at detecting the underlying substructures of such categories. Decomposition is achieved with a rule-based clustering approach that is both context-sensitive, in that clustering is performed in relation to the larger categorization problem at hand, and concept-sensitive, in that it is based on the notion of conceptual cohesiveness. A multiple classifier system is formed by training a separate classifier for each individual document cluster. The extendibility of the trained classifiers to categorization scenarios beyond the closed-world setting is facilitated by incorporating as many of the relevant features as possible while mitigating the risk of overfitting the data.

Experimental results demonstrate the advantage of the developed multiple classifier learning method over some of the most frequently used classifiers for the intended categorization problems. This advantage is captured both in a significant increase in overall classification performance and in the general stability of the results with respect to the number of available training documents and the size of the feature subset.

The extendibility of the developed method beyond the learning algorithm under immediate consideration, which forms the basis for a multiple classifier learning methodology, is demonstrated in the context of Nearest Neighbor classification. The results reveal the benefits of the developed variant over the standard k-NN algorithm for the specified problem scenarios.
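
The decompose-then-train idea summarized above can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm: a plain spherical k-means stands in for the rule-based, context- and concept-sensitive clustering, and cosine similarity to a cluster centroid stands in for the per-cluster classifier. All names (`vectorize`, `split_category`, `train`, `classify`) and the toy documents are hypothetical, chosen only for illustration.

```python
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Summed term counts; scaling does not matter under cosine similarity."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def split_category(vectors, k=2, iterations=10):
    """ASSUMPTION: k-means stand-in for the rule-based clustering step."""
    # Farthest-point seeding: start from the first document, then add the
    # document least similar to every seed chosen so far.
    centers = [vectors[0]]
    while len(centers) < min(k, len(vectors)):
        centers.append(min(vectors, key=lambda v: max(cosine(v, s) for s in centers)))
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in vectors:
            best = max(range(len(centers)), key=lambda i: cosine(v, centers[i]))
            groups[best].append(v)
        # Drop empty clusters, then recompute one centroid per cluster.
        centers = [centroid(g) for g in groups if g]
    return centers

def train(labeled_docs, k=2):
    """One (category, centroid) model per sub-cluster of each category."""
    by_category = {}
    for text, category in labeled_docs:
        by_category.setdefault(category, []).append(vectorize(text))
    return [(category, center)
            for category, vectors in by_category.items()
            for center in split_category(vectors, k)]

def classify(models, text):
    """Return the parent category of the best-matching sub-cluster model."""
    v = vectorize(text)
    category, _ = max(models, key=lambda m: cosine(v, m[1]))
    return category

# Toy data: "sports" is heterogeneous (football vs. chess vocabulary).
docs = [
    ("goal striker football match", "sports"),
    ("football league goal keeper", "sports"),
    ("chess opening gambit knight", "sports"),
    ("chess endgame knight bishop", "sports"),
    ("stock market shares price", "finance"),
    ("bond market yield price", "finance"),
]
models = train(docs, k=2)
print(classify(models, "knight gambit chess"))
```

In this toy setup the football and chess documents share no vocabulary, so a single "sports" model would have to average two unrelated term distributions; splitting the category first lets each sub-cluster be matched on its own terms while predictions are still reported at the category level.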
Keywords/Search Tags:Multiple classifier learning, Text, Categories, Categorization, Method, Problem, Developed