XML document classification using structural and textual features

Posted on:2009-03-12

Degree:M.Sc

Type:Thesis

University:University of Calgary (Canada)

Candidate:Khabbazhaye Tajer, Mohammad

GTID:2448390002994494

Subject:Computer Science

Abstract/Summary:

This thesis addresses XML document classification by considering both structural and content-based features of documents. This leads to more informative feature vectors that represent documents from different perspectives. To keep the feature space manageable, we integrate soft clustering and feature reduction into the process. To extract structural information, we use existing rule mining algorithms to capture frequent structural patterns in the form of rules, which are later converted to structural features. To extract the content information of XML documents, we propose a new method based on soft clustering of words, using each cluster as a textual feature. We show that a classifier built using only our textual features outperforms some well-known information retrieval (IR) based document classification techniques. Further, the combination of structural and textual features results in an accurate and robust classifier. We demonstrate the efficiency and effectiveness of incorporating both structural and content information through a comparative analysis of our classifier model and several other document classifiers, including XML document classifiers.

Index terms: XML document, classification, structural features, content features, soft clustering, information retrieval.
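The combined feature vector described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the soft word clusters (`WORD_MEMBERSHIP`), the structural patterns (`STRUCT_PATTERNS`), and all function names are hypothetical stand-ins. In particular, plain root-to-node tag paths are used here in place of the rules a rule mining algorithm would produce, and the membership weights are made-up values rather than the output of a soft clustering step.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical soft word clusters: each word belongs to several clusters
# with a membership weight. Values are illustrative, not learned.
WORD_MEMBERSHIP = {
    "query":  {0: 0.9, 1: 0.1},
    "index":  {0: 0.7, 1: 0.3},
    "kernel": {0: 0.2, 1: 0.8},
}
N_CLUSTERS = 2

# Hypothetical structural patterns (root-to-node tag paths), standing in
# for the frequent patterns a rule mining step would yield.
STRUCT_PATTERNS = ["article/title", "article/body/sec", "book/chapter"]


def tag_paths(elem, prefix=""):
    """Yield the root-to-node tag path of every element in the tree."""
    path = f"{prefix}/{elem.tag}" if prefix else elem.tag
    yield path
    for child in elem:
        yield from tag_paths(child, path)


def textual_features(root):
    """One feature per word cluster: membership-weighted word counts."""
    words = Counter(" ".join(root.itertext()).lower().split())
    feats = [0.0] * N_CLUSTERS
    for word, count in words.items():
        for cluster, weight in WORD_MEMBERSHIP.get(word, {}).items():
            feats[cluster] += weight * count
    return feats


def structural_features(root):
    """One binary feature per pattern: 1.0 if the path occurs in the document."""
    present = set(tag_paths(root))
    return [1.0 if p in present else 0.0 for p in STRUCT_PATTERNS]


def feature_vector(xml_text):
    """Concatenate structural and textual features for one XML document."""
    root = ET.fromstring(xml_text)
    return structural_features(root) + textual_features(root)


doc = ("<article><title>Query index</title>"
       "<body><sec>kernel query</sec></body></article>")
vec = feature_vector(doc)
print(vec)
```

Because a word contributes to every cluster it belongs to, a document's textual features stay low-dimensional (one per cluster) while still reflecting words with more than one sense. The resulting vectors could then be passed to any standard classifier.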

Keywords/Search Tags:

XML document, Structural, Features, Content, Soft clustering, Information

Related items

1	Research On Clustering Ensemble Method For Fusing Structural Information
2	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework
3	Unsupervised Structural Learning And Its Applications
4	Research On Structural Texture Synthesis Method Based On Edge Information
5	Research And Realization Of Web Information Mining Model Based On Topic Features
6	Research On Information Retrieval Method Based On XML Document Structural Semantics And Its Application
7	Research On Web Document Clustering Approaches Based On Phrase Features
8	The Research And Implementation Of An XML Document Structural Clustering Algorithm Using Frequent Path Pattern
9	Research On Text Structural Information Extraction And Clustering Based On XML
10	In The Context Of Contemporary Tries To Analyze "Document Exhibition" Of Document Features