A multiple classifier learning methodology for feature-space heterogeneous text categories

Posted on: 2008-12-31

Degree: Ph.D.

Type: Dissertation

University: George Mason University

Candidate: Hadjarian, Ali R

GTID: 1448390005979765

Subject: Computer Science

Abstract/Summary:

Text categorization is an information access task dealing with the automatic assignment of text documents to a set of pre-defined categories. This dissertation focuses on two aspects of the text categorization problem largely unexplored by the research community: the complexity of the individual categories involved, and the closed-world assumption generally associated with such problems, under which the category of a given test document is necessarily represented in the training set.

The developed learning method tackles the complexities associated with feature-space heterogeneous categories through a novel problem space decomposition approach aimed at detecting the underlying substructures of such categories. Here, decomposition is achieved with a rule-based clustering approach that is both context-sensitive, in that clustering is performed in relation to the larger categorization problem at hand, and concept-sensitive, in that it is based on the notion of conceptual cohesiveness. A multiple classifier system is formed by training a separate classifier for each individual document cluster. The extendibility of the trained classifiers to categorization scenarios beyond the closed-world setting is facilitated by incorporating as many of the relevant features as possible while mitigating the risk of overfitting the data.

Experimental results demonstrate the advantage of the developed multiple classifier learning method over some of the most frequently used classifiers for the intended categorization problems. This advantage is captured both in a significant increase in overall classification performance and in the general stability of the results with respect to the number of available training documents and the size of the feature subset.

The extendibility of the developed method beyond the learning algorithm under immediate consideration, which forms the basis for a multiple classifier learning methodology, is demonstrated in the context of Nearest Neighbor classification. The results reveal the benefits of the developed variant over the standard k-NN algorithm for the specified problem scenarios.
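
The decompose-then-classify idea described in the abstract can be illustrated with a rough sketch. This is not the dissertation's method: the rule-based, context- and concept-sensitive clustering is replaced here by a simple greedy cosine-similarity clustering, each per-cluster "classifier" is reduced to a cluster centroid, and all function names, the similarity threshold, and the toy documents are assumptions made for illustration only.

```python
from collections import Counter
import math


def vectorize(text):
    """Bag-of-words term counts (a deliberately simple feature model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Sum of member vectors; cosine ignores scale, so no averaging is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def cluster(vectors, threshold):
    """Greedy single-pass clustering (a stand-in for rule-based clustering).

    Each vector joins the most similar existing cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for v in vectors:
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(v, centroid(c))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([v])
        else:
            best.append(v)
    return clusters


def train(labeled_docs, threshold=0.3):
    """Split every category into sub-clusters; keep one prototype per cluster."""
    by_category = {}
    for text, category in labeled_docs:
        by_category.setdefault(category, []).append(vectorize(text))
    model = []
    for category, vectors in by_category.items():
        for members in cluster(vectors, threshold):
            model.append((category, centroid(members)))
    return model


def predict(model, text):
    """Assign the category of the most similar cluster prototype."""
    v = vectorize(text)
    return max(model, key=lambda entry: cosine(v, entry[1]))[0]


# Toy data: "sports" is heterogeneous (football vs. chess vocabulary).
docs = [
    ("goal striker penalty match", "sports"),
    ("striker scores goal in match", "sports"),
    ("chess opening gambit checkmate", "sports"),
    ("grandmaster chess checkmate endgame", "sports"),
    ("stock market shares dividend", "finance"),
    ("market shares bond dividend yield", "finance"),
]
model = train(docs)
```

In this toy setup the heterogeneous "sports" category is split into separate football and chess prototypes, so a chess document is matched against chess vocabulary rather than a blend of both. Matching a test document to its nearest prototype is loosely analogous to a nearest-neighbor decision, although the k-NN variant evaluated in the dissertation may differ substantially.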

Keywords/Search Tags:

Multiple classifier learning, Text, Categories, Categorization, Method, Problem, Developed

Related items

1	A Study On Chinese Text Categorization
2	Research On The Method Of Chinese Text Categorization Based On Machine Learning
3	A Study On M3-kNN Network And Application In Text Categorization
4	A Study On Chinese Text Automatic Categorization
5	Research On Chinese Text Classifier Based On Probability Method
6	Studies On Some Essential Problems In Automatic Text Categorization
7	The Research And Implementation Of Automatic Text Categorization For Chinese Web Documents
8	A Study On Text Categorization Based On Machine Learning
9	The Compare Two Automated Text Categorization Algorithms Based On The Open Telephone Of Mayor
10	Chinese Text Automatic Classification System - Of Chinese Words And Classifier Design