Research On A Semi-Supervised Chinese Document Classification Algorithm

Posted on:2006-10-10

Degree:Master

Type:Thesis

Country:China

Candidate:Y He

GTID:2178360182468925

Subject:Computer application technology

Abstract/Summary:

Text classification is the supervised learning task of assigning natural-language documents to one or more predefined categories according to their content. It has attracted increasing attention because of the ever-expanding volume of text available in digital form. Text classification is widely applied across text processing and information retrieval, has become a key technique for processing and organizing large-scale text collections, and pushes information processing toward automation. This thesis first introduces the general development of information classification and some of its techniques. It then analyzes and compares the performance of typical algorithms for feature selection, feature extraction, weight calculation, and classification. Second, high classification accuracy requires a large labeled training set, yet labeled documents are scarce; to address this conflict, the thesis focuses on improving semi-supervised classification algorithms. An analysis of the existing semi-supervised classification algorithms finds that, while unlabeled samples can improve the accuracy of trained models to a certain extent, existing methods still face difficulties when the labeled data is extremely small (e.g., fewer than 10 labeled examples per class) and biased against the underlying data distribution. This thesis presents a clustering-based classification approach. In this approach, the training data, both labeled and unlabeled, is first clustered with the guidance of the labeled data. Some of the unlabeled samples are then labeled based on the clusters obtained. Discriminative classifiers can subsequently be trained on the expanded labeled dataset.
The effectiveness of the proposed method is justified analytically. Finally, a document classification system is designed and comprehensive experiments are conducted to validate the approach and study related issues. The experiments show that the method outperforms existing methods such as TSVM and Co-Training when the labeled data is extremely small. When sufficient labeled data is available, the method is comparable to TSVM and Co-Training.
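
The three steps named in the abstract (cluster with guidance from the labeled data, label some unlabeled samples from the clusters, train a discriminative classifier on the expanded set) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: documents are assumed to be already converted to numeric feature vectors, the guided clustering is approximated by a seeded k-means, the "some samples" rule is assumed to be "the half of each cluster closest to its centroid", and a simple multiclass perceptron stands in for whatever discriminative classifier the thesis used. All function names, parameters, and the toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def seeded_kmeans(labeled, unlabeled, iters=20):
    """Cluster all data; centroids start from the labeled examples of each
    class, and labeled points always stay in their own class's cluster."""
    classes = sorted({y for _, y in labeled})
    centroids = {c: mean([x for x, y in labeled if y == c]) for c in classes}
    assign = []
    for _ in range(iters):
        assign = [min(classes, key=lambda c: sq_dist(x, centroids[c]))
                  for x in unlabeled]
        updated = {}
        for c in classes:
            members = [x for x, y in labeled if y == c]
            members += [x for x, a in zip(unlabeled, assign) if a == c]
            updated[c] = mean(members)
        if updated == centroids:  # assignments no longer move the centroids
            break
        centroids = updated
    return centroids, assign

def expand_labels(labeled, unlabeled, centroids, assign, keep=0.5):
    """Assumed selection rule: label the `keep` fraction of each cluster
    that lies closest to the cluster centroid."""
    expanded = list(labeled)
    for c, center in centroids.items():
        members = sorted((x for x, a in zip(unlabeled, assign) if a == c),
                         key=lambda x: sq_dist(x, center))
        expanded += [(x, c) for x in members[:int(len(members) * keep)]]
    return expanded

def train_perceptron(data, epochs=50):
    """Multiclass perceptron (stand-in discriminative classifier).
    Each weight vector stores the bias as its last component."""
    classes = sorted({y for _, y in data})
    dim = len(data[0][0])
    w = {c: [0.0] * (dim + 1) for c in classes}

    def score(c, x):
        return w[c][-1] + sum(wi * xi for wi, xi in zip(w[c], x))

    def predict(x):
        return max(classes, key=lambda c: score(c, x))

    for _ in range(epochs):
        errors = 0
        for x, y in data:
            pred = predict(x)
            if pred != y:
                errors += 1
                for i, xi in enumerate(x):
                    w[y][i] += xi
                    w[pred][i] -= xi
                w[y][-1] += 1.0
                w[pred][-1] -= 1.0
        if errors == 0:
            break
    return predict

# Toy data: one labeled example per class plus eight unlabeled vectors.
labeled = [([0.0, 0.0], "sports"), ([5.0, 5.0], "finance")]
unlabeled = [[0.5, 0.2], [-0.3, 0.4], [0.1, -0.6], [1.0, 0.8],
             [4.5, 5.2], [5.3, 4.6], [4.9, 5.5], [4.0, 4.2]]

centroids, assign = seeded_kmeans(labeled, unlabeled)
expanded = expand_labels(labeled, unlabeled, centroids, assign)
predict = train_perceptron(expanded)
print(len(labeled), "labeled examples expanded to", len(expanded))
print(predict([0.1, -0.6]), predict([4.0, 4.2]))
```

The `keep` fraction controls how aggressively cluster membership is trusted: a smaller value adds fewer but more reliable pseudo-labels, which is one plausible way to limit the label noise passed on to the classifier.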

Keywords/Search Tags:

information classification, text classification, semi-supervised learning, clustering

Related items

1	Based On The Positive And Unlabeled Samples, Semi-supervised Classification
2	Research On Semi-supervised Clustering And Classification Algorithm
3	Research On Text Classification Algorithms Based On Semi-supervised Learning
4	Text Classification Based On Semi-supervised Learning
5	Research On Chinese Short Text Classification Based On Semi-Supervised Clustering
6	Research On Semi-supervised Classification Of Data Stream Based On Adaptive Density Clustering
7	Research On Semi-supervised Classification Algorithm Based On Clustering Ensemble
8	Research On Partially Supervised Classification
9	Research On Semi-supervised Text Classification Method Based On Deep Learning
10	Research On Noisy Semi-Supervised Text Classification Method Based On BERT