
The Research Of Semi-Supervised Chinese Document Classification Algorithm

Posted on: 2006-10-10
Degree: Master
Type: Thesis
Country: China
Candidate: Y He
Full Text: PDF
GTID: 2178360182468925
Subject: Computer application technology
Abstract/Summary:
Text classification is the supervised learning task of assigning natural-language documents to one or more predefined categories according to their content. It has attracted increasing attention because of the ever-expanding amount of text available in digital form. Text classification is widely applied across text processing and information retrieval, has become a key technique for processing and organizing large-scale text collections, and pushes information processing toward automation.

This thesis first introduces the general development of information classification and some of its techniques. It then analyzes and compares the performance of typical algorithms for feature selection, feature extraction, weight calculation, and classification.

Second, high classification accuracy demands a large labeled training set, yet labeled documents are scarce. To address this conflict, the thesis focuses on improving semi-supervised classification algorithms. An analysis of the existing semi-supervised classification algorithms finds that, while unlabeled samples can improve the accuracy of trained models to a certain extent, existing methods still face difficulties when the labeled data is extremely small (e.g., fewer than 10 labeled examples per class) and biased against the underlying data distribution. The thesis presents a clustering-based classification approach. In this approach, the training data, both labeled and unlabeled, is first clustered under the guidance of the labeled data. Some of the unlabeled samples are then labeled based on the clusters obtained. Discriminative classifiers can subsequently be trained on the expanded labeled dataset. The effectiveness of the proposed method is justified analytically.

Finally, a document classification system is designed and comprehensive experiments are conducted to validate the approach and study related issues. The experiments show that the method outperforms existing methods such as TSVM and Co-Training when the labeled data is extremely small. When sufficient labeled data is available, the method is comparable to TSVM and Co-Training.
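
The three steps described in the abstract (cluster all data under the guidance of the labeled data, label some unlabeled samples from the clusters, then train on the expanded set) can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, so this sketch assumes a seeded k-means in which each cluster is initialized from the labeled examples of one class. The function names `seeded_kmeans` and `expand_labels`, the `keep` fraction, and the use of plain Euclidean distance on dense vectors are all illustrative assumptions, not the thesis's actual method.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def seeded_kmeans(labeled, unlabeled, iters=10):
    """Cluster all data, one cluster per class (assumed variant of k-means).

    labeled: list of (vector, label); unlabeled: list of vectors.
    Labeled points always stay in their own class's cluster.
    Returns a dict mapping each label to its final centroid.
    """
    classes = sorted({y for _, y in labeled})
    # Seed each cluster with the mean of that class's labeled examples.
    cents = {c: centroid([x for x, y in labeled if y == c]) for c in classes}
    for _ in range(iters):
        members = {c: [x for x, y in labeled if y == c] for c in classes}
        for x in unlabeled:
            nearest = min(classes, key=lambda c: dist(x, cents[c]))
            members[nearest].append(x)
        cents = {c: centroid(members[c]) for c in classes}
    return cents

def expand_labels(labeled, unlabeled, cents, keep=0.5):
    """Pseudo-label the `keep` fraction of unlabeled points nearest a centroid."""
    scored = []
    for x in unlabeled:
        c = min(cents, key=lambda k: dist(x, cents[k]))
        scored.append((dist(x, cents[c]), x, c))
    scored.sort(key=lambda t: t[0])  # most confident (closest) first
    n_keep = int(len(scored) * keep)
    return labeled + [(x, c) for _, x, c in scored[:n_keep]]

# Toy data: one labeled example per class, four unlabeled points.
labeled = [([0.0, 0.0], "a"), ([10.0, 10.0], "b")]
unlabeled = [[0.5, 0.2], [9.5, 9.8], [1.0, 1.0], [5.2, 5.0]]
cents = seeded_kmeans(labeled, unlabeled)
expanded = expand_labels(labeled, unlabeled, cents, keep=0.75)
```

The expanded set would then be passed to any discriminative classifier, for example an SVM. Keeping only the points closest to a centroid is one possible way to avoid pseudo-labeling ambiguous samples that lie between clusters.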
Keywords/Search Tags: information classification, text classification, semi-supervised learning, clustering