
Semi-supervised Classification Based On KL Divergence

Posted on: 2011-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: Z Xu
Full Text: PDF
GTID: 2178360305497831
Subject: Computer software and theory
Abstract/Summary:
With the rapid progress of networking, biology, and other fields, large volumes of data are being produced. Storing and managing these data costs considerable manpower and financial resources, so how to generate economic benefit from static data has become a problem to be addressed. Classification and other data mining techniques were proposed for this purpose. However, traditional classification methods can improve accuracy only by adding manually labeled training samples, which is very costly. In recent years, therefore, the use of unlabeled samples to improve classification performance has drawn increasing attention, especially semi-supervised classification based on labeled and unlabeled samples.

This thesis mainly focuses on how to find hidden negative instances among the unlabeled data when the training set contains no negative samples at all. Existing semi-supervised classification methods based on known positive and unlabeled samples mostly assume a balanced data set and do not deal effectively with imbalanced data, i.e., a highly skewed class distribution or very few unlabeled negative instances. To address this problem, this thesis proposes a direct and efficient method. Traditional classifiers assign the final class of an unlabeled example from its posterior probability alone; instead, this thesis proposes a semi-supervised classification algorithm based on KL divergence, which uses the relative entropy between the posterior probability of an unlabeled instance and the prior probability of the training set to measure the confidence of the classification result, thereby reducing the influence of class imbalance and improving classification accuracy. For a balanced training set, the entropy of the posterior distribution is used directly to measure confidence: the smaller the entropy, the more uneven (peaked) the posterior distribution, and hence the more credible the classification, and vice versa. (A brief sketch of both confidence measures follows the contribution list below.)

The main contributions of this thesis are:
1. The method can be used with any specific classification technique, avoiding the influence of different classifiers on the classification results and reducing the dependence between data types and classifiers.
2. A simple and flexible approach: for an imbalanced training set, a semi-supervised learning algorithm based on KL divergence; for a balanced training set, an entropy-based semi-supervised learning algorithm.
3. Evaluation on different types of data sets, both text and non-text, together with performance comparisons across different parameters and factors.
4. A large number of experiments verifying the usefulness and efficiency of the proposed method. On both text and non-text data sets, the proposed method outperforms previous work in the literature.
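The two confidence measures described above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names kl_confidence and entropy_confidence and the example numbers are hypothetical, and a classifier exposing posterior class probabilities is assumed.

    import numpy as np

    def kl_confidence(posterior, prior, eps=1e-12):
        # KL divergence (relative entropy) between an unlabeled instance's
        # posterior distribution and the prior class distribution of the
        # training set. A larger divergence means the output departs more
        # from the (possibly imbalanced) prior, so the prediction is taken
        # as more confident.
        p = np.asarray(posterior, dtype=float) + eps
        q = np.asarray(prior, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    def entropy_confidence(posterior, eps=1e-12):
        # For a balanced training set: the entropy of the posterior itself.
        # Smaller entropy means a more peaked (uneven) posterior, i.e. a
        # more credible classification.
        p = np.asarray(posterior, dtype=float) + eps
        p = p / p.sum()
        return float(-np.sum(p * np.log(p)))

    # Hypothetical example: imbalanced prior (95% positive, 5% negative).
    prior = [0.95, 0.05]
    print(kl_confidence([0.10, 0.90], prior))  # large KL: confident negative
    print(kl_confidence([0.94, 0.06], prior))  # near zero: uninformative
    print(entropy_confidence([0.50, 0.50]))    # max entropy: least credible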
Keywords/Search Tags: data mining, classification, semi-supervised learning, KL divergence, entropy