Text Representation And Algorithms For Chinese Text Classification

Posted on: 2008-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: H Jiang
GTID: 2178360242971975
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology and the growing popularity of the Internet, the number of Web pages has increased enormously. Because the main content of a Web page is text, classifying Web pages automatically by their content has become an important research subject. Moreover, text classification has broad prospects as the technical basis of information filtering, information retrieval, search engines, digital libraries and so on, and it is expected to produce considerable social and economic benefits.

Text classification is a complicated process that includes text preprocessing, text representation, the classification algorithm and performance evaluation. Among these, text representation is the foundation, while the classification algorithm is the core of the process.

This thesis focuses on text representation and classification algorithms in text classification tasks. It first surveys the basic concepts and background knowledge. It then analyses the representational effectiveness of the Vector Space Model (VSM) and the factors that influence classification performance. On this basis, improvements to VSM and corresponding classifiers are put forward. The main contents are as follows:

(1) Because individual words have limited representational ability, this thesis takes the order and co-occurrence of terms within a sentence into account by introducing a term association graph, and designs a method for constructing association term sets. The experimental results show that this technique can improve the performance of the naive Bayes classifier.

(2) Dimension reduction is an important research direction and one of the major topics of this thesis. The AdaBoost algorithm is adopted to select features and to enhance the classifier according to the ability of each feature. Based on the experimental results, a two-phase combined feature selection algorithm is designed and shown to be feasible for text classification.

(3) Ensemble learning over selected groups of features is a newly active research area, because it contributes to dimension reduction, usability, diversity and algorithm performance. In this thesis, the role of part of speech is taken into full consideration, and a novel method that constructs different feature groups by part of speech, named POSAdaBoost, is proposed. This integrated algorithm can compensate for the drawback of VSM, which relies only on word forms. Finally, the results of this algorithm are compared with those of the Random Subspace Method.
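
The abstract does not spell out how AdaBoost is used to select features, so the following is only a minimal sketch of one common reading: each boosting round picks the single term whose presence/absence stump has the lowest weighted error, and the chosen terms form the reduced feature set. The function names, the toy documents, and the rule that a term is used at most once are all assumptions made for illustration, not the thesis's two-phase algorithm.

```python
import math

def adaboost_select(docs, labels, vocab, rounds):
    """Pick up to `rounds` terms with AdaBoost over one-term decision stumps.

    docs: list of sets of terms; labels: +1 or -1 per document.
    Returns a list of (term, polarity, alpha) triples.
    Assumption: each term may be chosen only once, so the result
    doubles as a feature subset.
    """
    n = len(docs)
    weights = [1.0 / n] * n
    chosen, used = [], set()
    for _ in range(rounds):
        best = None  # (weighted error, term, polarity)
        for term in vocab:
            if term in used:
                continue
            # Error of the stump "predict +1 if the term is present, else -1".
            err = sum(w for d, y, w in zip(docs, labels, weights)
                      if (1 if term in d else -1) != y)
            # Polarity -1 flips the stump, so its error is 1 - err.
            for polarity, e in ((1, err), (-1, 1.0 - err)):
                if best is None or e < best[0]:
                    best = (e, term, polarity)
        if best is None:
            break  # vocabulary exhausted
        err, term, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        if err >= 0.5:
            break  # no remaining stump is better than chance
        alpha = 0.5 * math.log((1 - err) / err)
        chosen.append((term, polarity, alpha))
        used.add(term)
        # Increase the weight of documents this stump misclassifies.
        weights = [w * math.exp(-alpha * y * polarity * (1 if term in d else -1))
                   for d, y, w in zip(docs, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return chosen

def predict(chosen, doc):
    """Weighted vote of the selected one-term stumps."""
    score = sum(alpha * polarity * (1 if term in doc else -1)
                for term, polarity, alpha in chosen)
    return 1 if score >= 0 else -1

# Toy corpus (hypothetical): +1 = sports, -1 = finance.
docs = [{"goal", "match", "team"}, {"team", "coach", "win"},
        {"stock", "market", "win"}, {"bank", "market", "loan"}]
labels = [1, 1, -1, -1]
vocab = sorted(set().union(*docs))

selected = adaboost_select(docs, labels, vocab, rounds=2)
print([term for term, _, _ in selected])
```

In this sketch the stump weight `alpha` serves both as the vote weight in `predict` and as a rough measure of how discriminative each selected term is, which is one way to read "according to the ability of each feature".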
Keywords/Search Tags: Text Categorization, Text Representation, Machine Learning, Feature Selection, AdaBoost