Research On Document Classification Algorithm Based On Semi-Supervised Learning

Posted on: 2011-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: F Qin
Full Text: PDF
GTID: 2178360305960849
Subject: Computer application technology
Abstract/Summary:
With the development of information technology, people handle more and more text in their daily work, and document classification, as a key technology, has received increasing attention in recent years. However, classical document classification methods require a large number of labeled texts to build a classifier. In practice, often only a small number of labeled samples and a large number of unlabeled samples are available. If only the few labeled samples are used to build the classifier, the results are limited and the information contained in the unlabeled samples goes unused, which wastes resources. Semi-supervised learning is a learning mode between supervised learning and unsupervised learning: it combines a small set of labeled samples with the unlabeled samples to build the classifier.

This thesis systematically reviews current semi-supervised classification algorithms and proposes two methods: a semi-supervised classification method based on majority voting, and a novel document classification method that expands the labeled sample set. The main work of the thesis is as follows:

1. The key technologies of document classification are discussed systematically, including document representation, document preprocessing, feature selection, feature weight calculation, classical classification methods, and classification performance evaluation.

2. Current semi-supervised learning concepts and methods are introduced and analyzed, and a novel semi-supervised classification algorithm based on the nearest-neighbor majority voting rule is then put forward. The experiments show that the proposed method is effective and practical.

3. Inspired by the idea of semi-supervised classification, a semi-supervised learning method for the case of very few labeled samples is proposed, which adds similar samples according to the features of the documents. The method extracts the representative features of each category from the labeled set and then selects similar samples from the unlabeled set according to these features, thereby expanding the labeled set. Experiments on a standard Chinese classification dataset show that the novel method achieves better classification performance.
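
The abstract does not specify how documents are represented, how similarity is measured, or what thresholds are used, so the following is only a minimal sketch of the two ideas named above, not the thesis's actual algorithms. It assumes bag-of-words term counts, cosine similarity, the most frequent terms of each category as its "representative features", and a fixed similarity threshold. All function names and parameters (`category_profiles`, `expand_labeled`, `knn_vote`, `top_k`, `threshold`, `k`) are hypothetical and chosen for illustration.

```python
from collections import Counter
import math


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def category_profiles(labeled, top_k=3):
    """Assumption: a category's representative features are its top_k most frequent terms."""
    totals = {}
    for tokens, label in labeled:
        totals.setdefault(label, Counter()).update(tokens)
    return {label: Counter(dict(counts.most_common(top_k)))
            for label, counts in totals.items()}


def expand_labeled(labeled, unlabeled, threshold=0.5, top_k=3):
    """Move unlabeled documents that resemble a category profile into the labeled set."""
    profiles = category_profiles(labeled, top_k)
    expanded, remaining = list(labeled), []
    for tokens in unlabeled:
        vec = Counter(tokens)
        best_label, best_sim = max(
            ((label, cosine(vec, profile)) for label, profile in profiles.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            expanded.append((tokens, best_label))  # similar enough: adopt the label
        else:
            remaining.append(tokens)               # too dissimilar: stays unlabeled
    return expanded, remaining


def knn_vote(tokens, labeled, k=3):
    """Classify a document by majority vote among its k most similar labeled documents."""
    vec = Counter(tokens)
    nearest = sorted(labeled,
                     key=lambda item: cosine(vec, Counter(item[0])),
                     reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy data, invented purely for illustration.
labeled = [
    (["goal", "match", "team"], "sports"),
    (["stock", "market", "bank"], "finance"),
]
unlabeled = [
    ["team", "goal", "coach"],
    ["bank", "stock", "loan"],
    ["weather", "rain"],
]

expanded, remaining = expand_labeled(labeled, unlabeled)
print(len(expanded), remaining)
print(knn_vote(["goal", "team", "referee"], expanded))
```

In this toy run the first two unlabeled documents share enough terms with a category profile to be added to the labeled set, while the third shares none and stays unlabeled. The voting step then classifies a new document using the expanded set. A real system would likely replace the raw counts with weighted features (for example TF-IDF) and tune the threshold on held-out data.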
Keywords/Search Tags: Document Classification, Semi-supervised learning, Nearest neighborhood, Similar samples