
Application Of The SVM-Based Categorization Approach In Content Management

Posted on: 2007-12-04   Degree: Master   Type: Thesis
Country: China   Candidate: D Wang   Full Text: PDF
GTID: 2178360182994750   Subject: Computer software and theory
Abstract/Summary:
Automated categorization of texts into predefined categories is an important part of content management research. Because the categories in content management applications are defined over a large taxonomy (such as Yahoo) or organized hierarchically in tree-like structures, an automatic hierarchical text classification method is the more natural and appropriate choice. Unfortunately, most popular categorization techniques focus on flat classification, in which the predefined categories are treated in isolation and no structure defines the relations among them.

This thesis gives an overview of automatic text classification and, specifically, designs an SVM-based automatic hierarchical text classification approach (HTCSVM). Support vector machines (SVMs) are a relatively new class of machine learning techniques, first introduced by Vapnik in 1992 and based on the structural risk minimization principle from statistical learning theory. They are promising methods for classification because of their solid mathematical foundations, and they offer several salient properties that other methods hardly provide, yet they have scarcely been explored in the context of hierarchical classification. We also provide a formal analysis of the computational complexity of HTCSVM and show that its training phase runs in polynomial time. Furthermore, we apply our proposed performance measurement framework, PMFHC, to evaluate HTCSVM, and the experiments show that the method is effective and feasible. In addition, the successful application of HTCSVM in a real product, supported by the Foundation under grant 2003K05-G32, indicates that the algorithm is robust and practically useful.

We establish a new performance measurement framework for hierarchical classification (PMFHC) that uses category similarity and distance to capture the relationships between categories. PMFHC is a natural extension of the measures used in flat classification and is consistent with them.

This thesis also investigates several representative text feature selection strategies and compares them experimentally. The results show that the χ² statistic (CHI) generally outperforms the other feature selection measures, so it is adopted in the HTCSVM categorization method.
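As an illustration of the CHI feature selection measure mentioned in the abstract, the following minimal sketch scores each term against one category using the standard chi-square statistic over a 2x2 term/category contingency table. It is not the thesis's implementation: the function names `chi_square_scores` and `select_top_k`, the tokenized-document input format, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def chi_square_scores(docs, labels, category):
    """Score every term's association with `category` (hypothetical helper).

    docs   -- list of token lists, one per document (assumed input format)
    labels -- list of category names, parallel to docs
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == category)
    n_neg = n - n_pos

    in_cat = Counter()   # A: documents in the category that contain the term
    out_cat = Counter()  # B: documents outside the category that contain it
    for tokens, y in zip(docs, labels):
        target = in_cat if y == category else out_cat
        for term in set(tokens):  # count document frequency, not term frequency
            target[term] += 1

    scores = {}
    for term in set(in_cat) | set(out_cat):
        a = in_cat[term]
        b = out_cat[term]
        c = n_pos - a    # C: documents in the category without the term
        d = n_neg - b    # D: documents outside the category without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # chi2(t, c) = N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus, invented for this example only.
docs = [["svm", "kernel"], ["svm", "margin"], ["goal", "match"], ["goal", "team"]]
labels = ["ml", "ml", "sport", "sport"]
scores = chi_square_scores(docs, labels, "ml")
top_terms = select_top_k(scores, 2)
```

In a multi-category setting, per-category scores are commonly combined (for example by taking the maximum or a weighted average over categories) before selecting features; which combination the thesis uses is not stated in the abstract.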
Keywords/Search Tags: content management, text mining, SVM, hierarchical classification