
The Research On Text Classification Based On Clique Model

Posted on: 2009-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: X H Hu
Full Text: PDF
GTID: 2178360272480743
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of online electronic documents, automated text categorization (or text classification, TC) has become increasingly important over the last decade in information retrieval (IR), information filtering, and content management, and it has become a frontier research area of IR and machine learning (ML). As one of the most effective methods of managing text information, TC helps people organize and manage electronic text more quickly and easily. Text categorization is the procedure of automatically assigning predefined categories to free-text documents, and learning-based TC methods have become the mainstream technology.

At present, researchers have put forward many mature text classification algorithms, most of which come from pattern classification. Existing algorithms such as KNN and SVM are mostly based on the vector space model and do not consider the semantic features of the documents. Starting from this inadequacy of traditional classification, the author of this thesis studies text classification and its related technologies and presents several methods and techniques.

The main contributions of this thesis are as follows:

1) A clique-based text classification method is put forward. A context similarity matrix is computed from the training texts and used to construct a context similarity graph; the cliques (complete subgraphs) of the contexts are then extracted from this graph. The classifier is constructed from the clique information of each category and combined with an SVM or KNN classifier. Experiments on the 20NewsGroups corpus and the Fudan University corpus show that the method improves classification performance.

2) With the rapid growth of website information, especially online information, it is unrealistic to rely on humans to process it. Automatic classification has therefore become a critical technology of great practical value and a powerful tool for managing and organizing data. For effectively organizing the extremely rich information resources of the Internet, automatic Web page categorization has become an increasingly important area of study, and because of the particular characteristics of Web documents, their classification has attracted attention from many scholars in recent years. Building on a traditional classifier, we make use of the rich link information of Web documents. Experiments on the SEWM corpus show that combining the method proposed in this thesis with the link information of Web documents improves classification performance.
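
The graph-construction and clique-extraction steps named in contribution 1 can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the threshold, or what exactly the graph nodes are, so everything here is an assumption made for illustration: items are represented as sparse term-count dictionaries, similarity is cosine similarity, two items are linked when their similarity reaches a hypothetical `threshold`, and maximal cliques are enumerated with the basic Bron-Kerbosch algorithm. The function names are invented for this sketch and are not taken from the thesis.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def similarity_graph(vectors, threshold):
    """Build adjacency sets, linking i and j when similarity >= threshold.

    The threshold rule is an assumption; the abstract only says the graph
    is derived from a similarity matrix.
    """
    adj = {i: set() for i in range(len(vectors))}
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy data (hypothetical): three related items and one unrelated item.
items = [
    {"graph": 1, "clique": 1},
    {"graph": 1, "clique": 1, "node": 1},
    {"graph": 1, "node": 1},
    {"football": 1, "goal": 1},
]
graph = similarity_graph(items, threshold=0.4)
cliques = maximal_cliques(graph)
print(sorted(cliques))
```

On this toy input the three related items form one clique and the unrelated item remains a singleton. How the clique information of each category is then turned into a classifier and combined with SVM or KNN is not described in the abstract, so the sketch stops at clique extraction.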
Keywords/Search Tags: Text classification, Text clique, Graph model, Link, Web page classification