An examination of KSS for feature selection for text categorization using support vector machines

Posted on: 2006-04-08
Degree: M.C.Sc
Type: Thesis
University: Dalhousie University (Canada)
Candidate: Basu, Atreya
Full Text: PDF
GTID: 2458390008960904
Subject: Computer Science
Abstract/Summary:
There are currently many tools for extracting information from textual data. Some tools extract data such as keyphrases and keywords, while others assess the importance of those keyphrases and keywords. These tools are sometimes called knowledge tools. In our experiment we examined what would happen if we applied these knowledge tools to text categorization. We reasoned that improving the quality of the extracted information would give us a better document representation and therefore let us better train our classifier for text categorization. The knowledge tools that we used were provided to us by IBM as a software suite called KSS. The two tools from that package that we used were Textract, a keyphrase/keyword (term) extractor, and IQ, an algorithm that measures the importance of a term to a document.

Support Vector Machines (SVMs) are currently considered excellent classifiers, and we therefore decided to take a 'best-of-breed tools' approach to the problem of text categorization.

Using Textract and IQ together did yield better-quality terms, as evidenced by almost identical Precision and Recall measures. However, the terms extracted and weighted by these tools could not be statistically proven to be better for training the SVM classifiers than terms extracted and weighted by traditional information retrieval techniques. The traditional technique here is keyword extraction only, with no keyphrase extraction and no term ranking, and therefore no term filtering.
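
As a rough illustration of the difference between the two document representations compared above, the following Python sketch builds keyword weights for a few toy documents and then keeps only the top-ranked terms per document. It is a sketch under stated assumptions, not the thesis method: TF-IDF stands in for IQ's importance measure, a simple regex tokenizer stands in for Textract (so no keyphrases are extracted), and the function names, the toy documents, and the cutoff `k=3` are all hypothetical. Training the SVM on the resulting vectors is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keyword extraction only: lowercase alphabetic tokens, no keyphrases.
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_weights(docs):
    """Return one {term: weight} dict per document (TF-IDF as a stand-in weight)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weights


def top_k_terms(weighted, k):
    """Term filtering: keep the k highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(weighted.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


# Hypothetical toy corpus, for illustration only.
docs = [
    "support vector machines classify text documents",
    "keyphrase extraction tools rank terms in text documents",
    "football scores and match results",
]

unfiltered = tf_idf_weights(docs)                        # no ranking, no filtering
filtered = [top_k_terms(w, k=3) for w in unfiltered]     # ranked and filtered

for before, after in zip(unfiltered, filtered):
    print(len(before), "terms ->", sorted(after))
```

In this sketch, terms shared by several documents (such as "text") receive a lower weight and are the first to be dropped by the filter, which is the kind of effect a term-ranking step is meant to have on the document representation.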
Keywords/Search Tags:Text categorization, Tools, Extraction, Information, Term