Some Data Mining Algorithms Based On Information Theory

Posted on: 2009-06-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C F Sha
GTID: 1118360272989285
Subject: Computer software and theory
Abstract/Summary:
Several notions from information theory can be used to measure the correlation and diversity of the objects under study, as well as the distance between probability distributions. These techniques have found many applications in computer science. In this thesis, we formulate several data mining problems based on information theory and develop techniques for solving them. The problems we address include mining correlation patterns and diversity patterns, feature selection, and correlation clustering. We also discuss privacy preservation in public data publishing for data mining applications, where we focus on the t-closeness privacy preservation model.

The main contributions of this thesis can be summarized as follows:

1. Based on conditional entropy, we introduce a symmetric information distance that satisfies the triangle inequality, define the problems of finding novel dependency trees and correlation patterns, and propose algorithms for these mining tasks. We also propose a feature selection algorithm based on this new information distance, which measures the correlation between features.

2. Based on the joint entropy of random variables, we introduce the problem of finding entropy diversity patterns. By establishing several bounds relating the entropies of different random variables, we propose efficient algorithms to find these diversity patterns. We also develop an improved mining algorithm for non-redundant interacting feature subsets.

3. Based on the Kullback-Leibler divergence between continuous distributions, we develop a novel nonlinear correlation clustering algorithm.

4. Based on the Kullback-Leibler divergence between discrete distributions, we introduce a novel t-closeness privacy preservation model, which addresses a drawback of previous approaches. We also discuss the relationship between our new model and semantic privacy.

For each of these works, we in turn present the problem definition, analyze the problem or the properties of the objects under study, and develop the mining or implementation algorithms. The efficiency and effectiveness of each technique are verified using simulations over both synthetic and real data sets.
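
The quantities the abstract builds on can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not state the exact form of its information distance, so the sketch assumes the well-known conditional-entropy distance d(X, Y) = H(X|Y) + H(Y|X), which is symmetric and satisfies the triangle inequality. All function names are hypothetical, and the estimators are simple plug-in (frequency-based) estimates.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy(xs, ys):
    """Estimate H(X | Y) = H(X, Y) - H(Y) from paired samples."""
    return entropy(list(zip(xs, ys))) - entropy(list(ys))


def information_distance(xs, ys):
    """Assumed distance d(X, Y) = H(X | Y) + H(Y | X).

    This is one standard symmetric distance built from conditional
    entropy; the thesis's own definition may differ.
    """
    return conditional_entropy(xs, ys) + conditional_entropy(ys, xs)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits.

    p and q are aligned lists of probabilities over the same outcomes.
    Returns infinity if P puts mass where Q has none.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


if __name__ == "__main__":
    x = [0, 0, 1, 1]
    y = [0, 1, 0, 1]  # empirically independent of x
    print(information_distance(x, x))  # identical variables
    print(information_distance(x, y))  # independent variables
    print(kl_divergence([1.0, 0.0], [0.5, 0.5]))
```

In a t-closeness setting, `kl_divergence` would compare the distribution of a sensitive attribute inside one group of published records with its distribution over the whole table. How the thesis sets and enforces the threshold t is not described in this abstract.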
Keywords/Search Tags: Information theory, diversity patterns, correlation patterns, feature selection, nonlinear correlation clustering, privacy preservation