
XML document classification using structural and textual features

Posted on: 2009-03-12
Degree: M.Sc
Type: Thesis
University: University of Calgary (Canada)
Candidate: Khabbazhaye Tajer, Mohammad
Full Text: PDF
GTID: 2448390002994494
Subject: Computer Science
Abstract/Summary:
This thesis addresses XML document classification by considering both structural and content-based features of documents. Combining the two yields more informative feature vectors that represent documents from different perspectives. To keep the feature space manageable, we integrate soft clustering and feature reduction into the process. To extract structural information, we use existing rule mining algorithms to capture frequent structural patterns in the form of rules, which we then convert to structural features. To extract the content information of XML documents, we propose a new method based on soft clustering of words, in which each cluster serves as a textual feature. We show that a classifier built using only our textual features outperforms some well-known information retrieval (IR) based document classification techniques. Further, combining structural and textual features results in an accurate and robust classifier. We demonstrate the efficiency and effectiveness of incorporating both structural and content information through a comparative analysis of our classifier model against several other document classifiers, including XML document classifiers.
Index terms: XML document, classification, structural features, content features, soft clustering, information retrieval.
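The feature construction described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not name the rule mining or soft clustering algorithms, so the sketch assumes both have already run. The word membership table, the tag-path patterns, and all function names (`tag_paths`, `textual_features`, `document_vector`) are hypothetical, and averaging membership degrees is just one plausible way to turn word clusters into textual features.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of a soft clustering step: every word belongs to
# every cluster with some degree of membership (each row sums to 1).
WORD_MEMBERSHIP = {
    "goal":   [0.9, 0.1],
    "match":  [0.7, 0.3],
    "stock":  [0.1, 0.9],
    "market": [0.2, 0.8],
}

# Hypothetical frequent structural patterns, written as root-to-node tag
# paths. In the thesis these would come from mined rules.
STRUCTURAL_PATTERNS = ["article/title", "article/body/section"]


def tag_paths(elem, prefix=""):
    """Yield the root-to-node tag path of every element in the tree."""
    path = f"{prefix}/{elem.tag}" if prefix else elem.tag
    yield path
    for child in elem:
        yield from tag_paths(child, path)


def textual_features(words, membership, n_clusters):
    """Average the membership vectors of the known words in a document."""
    totals = [0.0] * n_clusters
    known = 0
    for word in words:
        if word in membership:
            known += 1
            for k, degree in enumerate(membership[word]):
                totals[k] += degree
    return [t / known for t in totals] if known else totals


def document_vector(xml_text):
    """Concatenate structural indicators with textual cluster features."""
    root = ET.fromstring(xml_text)
    paths = set(tag_paths(root))
    # One binary feature per structural pattern: present or absent.
    structural = [1.0 if p in paths else 0.0 for p in STRUCTURAL_PATTERNS]
    words = " ".join(root.itertext()).lower().split()
    textual = textual_features(words, WORD_MEMBERSHIP, n_clusters=2)
    return structural + textual


DOC = "<article><title>Stock market</title><body><p>goal</p></body></article>"
vector = document_vector(DOC)
print(vector)
```

The resulting vector has one entry per structural pattern followed by one entry per word cluster, so it can be passed to any standard classifier. Because each word contributes to every cluster in proportion to its membership, the textual part stays as small as the number of clusters regardless of vocabulary size.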
Keywords/Search Tags: XML document, Structural, Features, Content, Soft clustering, Information