
Semi-supervised Classification Based On KL Divergence

Posted on: 2011-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: Z Xu
Full Text: PDF
GTID: 2178360305497831
Subject: Computer software and theory
Abstract/Summary:
With the rapid progress of networking, biology, and other fields, large volumes of data are being produced. Storing and managing these data costs considerable manpower and financial resources, so how to generate economic benefit from static data has become a problem to be addressed. Classification and other data mining techniques were proposed for this purpose. However, traditional classification methods can improve accuracy only by adding manually labeled training samples, which is very costly. In recent years, therefore, the use of unlabeled samples to improve classification performance has drawn increasing attention, especially semi-supervised classification based on labeled and unlabeled samples.

This thesis mainly focuses on how to find hidden negative instances among the unlabeled data when the training set contains no negative samples at all. Existing semi-supervised classification methods based on known positive and unlabeled samples mostly assume a balanced data set and do not deal effectively with imbalanced data, i.e., a highly skewed class distribution or very few unlabeled negative instances. To address this problem, this thesis proposes a direct and efficient method. Traditional classifiers assign the final class of an unlabeled example from its posterior probability alone; instead, this thesis proposes a semi-supervised classification algorithm based on KL divergence, which uses the relative entropy between the posterior probability of an unlabeled instance and the prior probability of the training set to measure the confidence of the classification result, thereby reducing the influence of class imbalance and improving classification accuracy. For a balanced training set, the entropy of the posterior distribution is used directly to measure confidence: the smaller the entropy, the more uneven (peaked) the posterior distribution, and hence the more credible the classification, and vice versa. (A brief sketch of both confidence measures follows the contribution list below.)

The main contributions of this thesis are:
1. The method can be used with any specific classification technique, avoiding the influence of different classifiers on the classification results and reducing the dependence between data types and classifiers.
2. A simple and flexible approach: for an imbalanced training set, a semi-supervised learning algorithm based on KL divergence; for a balanced training set, an entropy-based semi-supervised learning algorithm.
3. Evaluation on different types of data sets, both text and non-text, together with performance comparisons across different parameters and factors.
4. A large number of experiments verifying the usefulness and efficiency of the proposed method. On both text and non-text data sets, the proposed method outperforms previous work in the literature.
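The two confidence measures described above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names kl_confidence and entropy_confidence and the example numbers are hypothetical, and a classifier exposing posterior class probabilities is assumed.

    import numpy as np

    def kl_confidence(posterior, prior, eps=1e-12):
        # KL divergence (relative entropy) between an unlabeled instance's
        # posterior distribution and the prior class distribution of the
        # training set. A larger divergence means the output departs more
        # from the (possibly imbalanced) prior, so the prediction is taken
        # as more confident.
        p = np.asarray(posterior, dtype=float) + eps
        q = np.asarray(prior, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    def entropy_confidence(posterior, eps=1e-12):
        # For a balanced training set: the entropy of the posterior itself.
        # Smaller entropy means a more peaked (uneven) posterior, i.e. a
        # more credible classification.
        p = np.asarray(posterior, dtype=float) + eps
        p = p / p.sum()
        return float(-np.sum(p * np.log(p)))

    # Hypothetical example: imbalanced prior (95% positive, 5% negative).
    prior = [0.95, 0.05]
    print(kl_confidence([0.10, 0.90], prior))  # large KL: confident negative
    print(kl_confidence([0.94, 0.06], prior))  # near zero: uninformative
    print(entropy_confidence([0.50, 0.50]))    # max entropy: least credible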
Keywords/Search Tags: data mining, classification, semi-supervised learning, KL divergence, entropy