Text Representation And Algorithms For Chinese Text Classification

Posted on:2008-05-18

Degree:Master

Type:Thesis

Country:China

Candidate:H Jiang

GTID:2178360242971975

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of information technology and the spread of the Internet, the number of Web pages has grown enormously. Because the main content of a Web page is text, classifying Web pages automatically by their content has become an important research subject. Moreover, text classification has broad prospects as the technical basis of information filtering, information retrieval, search engines, digital libraries and similar applications, and it promises considerable social and economic benefits.

Text classification is a complicated process that includes text preprocessing, text representation, the classification algorithm and performance evaluation. Among these, text representation is the foundation, while the classification algorithm is the core of the process.

This thesis focuses on text representation and classification algorithms in text classification tasks. It first surveys the basic concepts and background knowledge. It then analyses the representational effectiveness of the Vector Space Model (VSM) and the factors that influence classification performance. On this basis, improvements to VSM and corresponding classifiers are put forward. The main contents are as follows:

(1) Because individual words have limited representational ability, this thesis takes the order and co-occurrence of terms within a sentence into account by introducing a term association graph, and designs a method for constructing association term sets. The experimental results show that this technique can improve the performance of the naive Bayes classifier.

(2) Dimension reduction is an important research direction and one of the major topics of this thesis. The AdaBoost algorithm is adopted to select features and to strengthen the classifier according to the ability of each feature. Based on the experimental results, a two-phase combined feature selection algorithm is designed and shown to be feasible for text classification.

(3) Ensemble learning based on selecting groups of features is a newly active research area, because it contributes to dimension reduction, usability, diversity and algorithm performance. This thesis takes the function of Part-Of-Speech into full consideration and proposes a novel method, named POSAdaBoost, that constructs different feature groups by Part-Of-Speech. This integrated algorithm compensates for a drawback of VSM, which relies only on word forms. Finally, the results of this algorithm are compared with those of the Random Subspace Method.
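Item (2) of the abstract mentions using AdaBoost to select features, but gives no implementation details. The following is only a minimal sketch of one common way this can be done, not the thesis's own method. It assumes binary term-presence features, two classes labelled +1 and -1, and one-term decision stumps as weak learners, and it lets each term be chosen at most once so that every boosting round yields a distinct feature. The names `adaboost_select_terms` and `predict`, and the toy documents, are hypothetical.

```python
import math


def adaboost_select_terms(docs, labels, rounds=2):
    """Pick one term per boosting round using one-term decision stumps.

    docs:   list of sets of terms (binary term presence; an assumption)
    labels: list of +1 / -1 class labels
    Returns a list of (term, polarity, alpha) tuples, one per round.
    """
    n = len(docs)
    vocab = sorted(set().union(*docs))  # sorted for deterministic tie-breaking
    weights = [1.0 / n] * n             # start with uniform document weights
    chosen, used = [], set()

    for _ in range(rounds):
        best = None  # (weighted error, term, polarity)
        for term in vocab:
            if term in used:            # assumption: each term is picked once
                continue
            for polarity in (+1, -1):
                # Stump: predict `polarity` if the term is present,
                # otherwise predict the opposite class.
                err = sum(
                    w for w, d, y in zip(weights, docs, labels)
                    if (polarity if term in d else -polarity) != y
                )
                if best is None or err < best[0]:
                    best = (err, term, polarity)
        if best is None:                # vocabulary exhausted
            break

        err, term, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # stump's vote in the ensemble
        chosen.append((term, polarity, alpha))
        used.add(term)

        # Re-weight: misclassified documents gain weight, then normalise.
        weights = [
            w * math.exp(-alpha * y * (polarity if term in d else -polarity))
            for w, d, y in zip(weights, docs, labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]

    return chosen


def predict(chosen, doc):
    """Classify a document by the weighted vote of the selected stumps."""
    score = sum(a * (p if t in doc else -p) for t, p, a in chosen)
    return 1 if score >= 0 else -1


# Toy corpus (hypothetical): +1 = finance, -1 = sport.
docs = [
    {"stock", "market", "the"},
    {"market", "bank", "the"},
    {"match", "goal", "the"},
    {"goal", "team", "the"},
]
labels = [1, 1, -1, -1]

selected = adaboost_select_terms(docs, labels, rounds=2)
selected_terms = [term for term, _, _ in selected]
```

In this toy corpus the uninformative term "the" is never selected, because a stump on it cannot separate the two classes. The reduced vocabulary in `selected_terms` could then be passed to any classifier, while `predict` shows how the same stumps already form a boosted classifier on their own.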

Keywords/Search Tags:

Text Categorization, Text Representation, Machine Learning, Feature Selection, AdaBoost

Related items

1	The Research And Implementation Of Automatic Text Categorization For Chinese Web Documents
2	A Study On Text Categorization Based On Machine Learning
3	The Research Of Text Representation And Feature Selection In Text Categorization
4	Studies On Some Essential Problems In Automatic Text Categorization
5	Multi-class Scientific Literature Automatic Categorization System
6	Research On High-Performance Text Categorization
7	Text Categorization Algorithm Based On Machine Learning
8	Research On High Performance Chinese Text Classification Based On Machine Learning
9	A Study On Chinese Text Categorization
10	Normal Weight Based Feature Selection Method In SVM Text Categorization