Infosift: Adapting graph mining techniques for document classification

Posted on: 2005-10-11
Degree: M.S.
Type: Thesis
University: The University of Texas at Arlington
Candidate: Aery, Manu
Full Text: PDF
GTID: 2458390008485521
Subject: Computer Science
Abstract/Summary:
There is a need for a classification system that determines the patterns of term associations that emerge from the documents of a class and uses these patterns to classify similar documents. This thesis proposes a novel graph-based mining approach for document classification. Our approach is based on the premise that representative (common and recurring) structures or patterns can be extracted from a pre-classified document class and then used effectively to classify incoming documents. To the best of our knowledge, there is no existing work in the area of text, email or web page classification based on pattern inference and the use of the learned patterns for classification. A number of factors that influence representative structure extraction and classification are analyzed conceptually and validated experimentally. In our approach, the notion of inexact graph match is leveraged to derive structures that provide coverage for characterizing the contents of a document class. The ability to classify based on similar rather than exact occurrences is singularly important in most classification tasks, as no two samples are exactly the same. Extensive experimentation validates the selection of parameters and the effectiveness of our approach for text, email and web page classification.

The novel idea proposed in the thesis aims to establish the groundwork for adapting graph mining techniques to various classification problems, not necessarily limited to text. (Abstract shortened by UMI.)
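The general idea of extracting recurring structures from a pre-classified class and matching them inexactly can be illustrated with a toy sketch. This is not the thesis's algorithm: the graph representation (edges between adjacent words), the support threshold, the scoring rule, and every name below (`doc_to_edges`, `representative_edges`, `match_score`, `classify`, `min_support`, `min_score`) are assumptions made purely for illustration.

```python
from collections import Counter

def doc_to_edges(text):
    """Assumed representation: one undirected edge per pair of adjacent words."""
    words = text.lower().split()
    return {frozenset(pair) for pair in zip(words, words[1:]) if pair[0] != pair[1]}

def representative_edges(docs, min_support=0.6):
    """Keep edges that recur in at least min_support of the class's documents."""
    counts = Counter(edge for doc in docs for edge in doc_to_edges(doc))
    threshold = min_support * len(docs)
    return {edge for edge, n in counts.items() if n >= threshold}

def match_score(pattern, doc_edges):
    """Fraction of pattern edges found in the document (1.0 is an exact match)."""
    if not pattern:
        return 0.0
    return len(pattern & doc_edges) / len(pattern)

def classify(text, class_patterns, min_score=0.5):
    """Return the best-scoring class, or None if no class reaches min_score."""
    edges = doc_to_edges(text)
    scores = {label: match_score(p, edges) for label, p in class_patterns.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else None

# Hypothetical pre-classified training documents.
training = {
    "sports": ["the team won the match", "the team lost the match"],
    "finance": ["the bank raised interest rates", "the bank cut interest rates"],
}
class_patterns = {label: representative_edges(docs) for label, docs in training.items()}

print(classify("the team drew the match", class_patterns))
print(classify("the bank lowered rates", class_patterns))
```

In this sketch the second call still matches the finance class even though only half of that class's pattern edges are present, which is the sense in which the match is inexact. A fuller treatment would mine multi-edge substructures rather than single edges.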
Keywords/Search Tags: Classification, Document, Graph, Mining, Patterns