Research On Document Classification Algorithm Based On Semi-Supervised Learning

Posted on:2011-02-11

Degree:Master

Type:Thesis

Country:China

Candidate:F Qin

GTID:2178360305960849

Subject:Computer application technology

Abstract/Summary:

With the development of information technology, people handle more and more text in their daily work, and document classification, as a key technology, has received growing attention in recent years. However, classical document classification methods require a large number of labeled documents to build a classifier. In practice, only a small number of labeled samples may be available, alongside a large number of unlabeled samples. If only the few labeled samples are used to build the classifier, the results are limited and the information carried by the unlabeled samples goes unused, which wastes resources. Semi-supervised learning is a learning mode between supervised learning and unsupervised learning: it combines a small set of labeled samples with unlabeled samples to build the classifier.

The thesis systematically introduces current semi-supervised classification algorithms and proposes two methods: a semi-supervised classification method based on majority voting, and a novel document classification method that expands the labeled sample set. The main work of the thesis is as follows:

1. The key technologies of document classification are discussed systematically, including document representation, document preprocessing, feature selection, feature weight calculation, the classical classification methods and classification performance evaluation.

2. Current semi-supervised learning concepts and methods are introduced and analyzed, and a novel semi-supervised classification algorithm based on the nearest neighbors' majority voting rule is put forward. The experiments show that the proposed method is effective and practical.

3. Inspired by the idea of semi-supervised classification, a semi-supervised learning method for settings with very few labeled samples is proposed, which adds similar samples according to the features of the documents. This method extracts the representative features of each category from the labeled set, then selects similar samples from the unlabeled set according to those features, thereby expanding the labeled set. Experiments on a standard Chinese classification dataset show that the new method achieves superior performance.
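
The abstract does not give the details of either proposed method, so the following is only a minimal sketch of the two general ideas it names, not the thesis's actual algorithms. It assumes bag-of-words term counts as the document representation, cosine similarity as the similarity measure, the most frequent terms per category as the "representative features", and a fixed similarity threshold for accepting unlabeled samples. All function names, parameters (`k`, `top_n`, `threshold`) and the toy documents are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as a dict of term counts (assumed representation)."""
    return dict(Counter(text.lower().split()))


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_majority_vote(doc, labeled, k=3):
    """Label `doc` by majority vote among its k most similar labeled documents."""
    neighbors = sorted(labeled, key=lambda pair: cosine(doc, pair[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def category_profiles(labeled, top_n=5):
    """Assumed 'representative features': the top_n most frequent terms per category."""
    totals = {}
    for vec, label in labeled:
        totals.setdefault(label, Counter()).update(vec)
    return {label: dict(counts.most_common(top_n)) for label, counts in totals.items()}


def expand_labeled_set(labeled, unlabeled, threshold=0.5, top_n=5):
    """Move unlabeled documents that closely match a category profile into the labeled set."""
    profiles = category_profiles(labeled, top_n)
    expanded = list(labeled)
    remaining = []
    for vec in unlabeled:
        # Pick the category whose profile is most similar to this document.
        label, score = max(
            ((lab, cosine(vec, prof)) for lab, prof in profiles.items()),
            key=lambda item: item[1],
        )
        if score >= threshold:
            expanded.append((vec, label))
        else:
            remaining.append(vec)  # not similar enough to any category; stays unlabeled
    return expanded, remaining


# Toy data (hypothetical): two labeled documents and three unlabeled ones.
labeled = [
    (bag_of_words("stock market shares price"), "finance"),
    (bag_of_words("football match goal team"), "sports"),
]
unlabeled = [
    bag_of_words("market price rises shares"),
    bag_of_words("team wins football match"),
    bag_of_words("weather sunny today"),
]

expanded, remaining = expand_labeled_set(labeled, unlabeled, threshold=0.5)
predicted = knn_majority_vote(bag_of_words("shares price fall"), expanded, k=3)
print(len(expanded), len(remaining), predicted)
```

In this sketch the expansion step runs first, so the nearest-neighbor vote draws on the enlarged labeled set. A real system would more likely use TF-IDF weights and a tuned threshold, since a threshold that is too low adds mislabeled samples and one that is too high adds none.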

Keywords/Search Tags:

Document Classification, Semi-supervised learning, Nearest neighborhood, Similar samples

Related items

1	The Research On Semi-supervised Classification Algorithm Based On Two Different Composition Method
2	Chinese Question Classification, Based On Semi-supervised Learning
3	Based On The Positive And Unlabeled Samples, Semi-supervised Classification
4	SAR Image Semi-supervised Learning Classification Based On Superpixels And Samples Selective Strategies
5	The Research Of Semi-Supervised Chinese Document Classification Algorithm
6	Coordinate Descent Method For Semi-supervised Learning And Application To Document Classification
7	Research Of Reliable Semi-supervised Classification
8	The Web Pages Classification Method Based On Semi-supervised Support Vector Machine
9	Research On Semi-supervised Clustering And Classification Algorithm
10	Topic Modeling Approaches For Supervised Document Classification