Some Data Mining Algorithms Based On Information Theory

Posted on:2009-06-12

Degree:Doctor

Type:Dissertation

Country:China

Candidate:C F Sha

GTID:1118360272989285

Subject:Computer software and theory

Abstract/Summary:

Several notions from information theory can be used to measure correlation and diversity among the objects under study, as well as the distance between probability distributions. These techniques have found many applications in computer science. In this thesis, we formulate several data mining problems based on information theory and develop techniques for solving them. The problems we address include mining correlation patterns and diversity patterns, feature selection, and correlation clustering. We also discuss privacy preservation in data publishing for data mining applications, focusing on the t-closeness privacy preservation model.

The main contributions of this thesis can be summarized as follows:

1. Based on conditional entropy, we introduce a symmetric information distance that satisfies the triangle inequality, define the problems of finding novel dependency trees and correlation patterns, and propose algorithms for these mining tasks. We also propose a feature selection algorithm based on this new information distance, which measures the correlation between features.
2. Based on the joint entropy of random variables, we introduce the problem of finding entropy diversity patterns. By establishing several bounds between the entropies of different random variables, we propose efficient algorithms to find these diversity patterns. We also develop an improved mining algorithm for non-redundant interacting feature subsets.
3. Based on the Kullback-Leibler divergence between continuous distributions, we develop a novel nonlinear correlation clustering algorithm.
4. Based on the Kullback-Leibler divergence between discrete distributions, we introduce a novel t-closeness privacy preservation model, which addresses the drawback of previous approaches. We also discuss the relationship between our new model and semantic privacy.

In each of these works, we in turn present the problem definition, analyze the problem or the properties of the objects under study, and develop the mining or implementation algorithms. The efficiency and effectiveness of each technique are verified using simulations over both synthetic and real data sets.
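As a rough illustration of two quantities the abstract relies on, the sketch below computes an entropy-based distance between two discrete variables and a Kullback-Leibler check in the spirit of t-closeness. The abstract does not give the thesis's exact definitions, so everything here is an assumption: the distance is taken to be H(X|Y) + H(Y|X), a well-known symmetric metric built from conditional entropy, and the privacy check assumes the convention D(P||Q) <= t, with P the sensitive-value distribution of one group and Q that of the whole table. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def information_distance(xs, ys):
    """Assumed distance: H(X|Y) + H(Y|X) = 2*H(X,Y) - H(X) - H(Y)."""
    joint = entropy(list(zip(xs, ys)))
    return 2 * joint - entropy(xs) - entropy(ys)


def empirical_distribution(values):
    """Map each observed value to its relative frequency."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


def kl_divergence(p, q):
    """D(P||Q) in bits for discrete distributions given as value -> probability."""
    total = 0.0
    for value, pv in p.items():
        if pv == 0:
            continue
        qv = q.get(value, 0.0)
        if qv == 0:
            return math.inf  # P puts mass where Q has none
        total += pv * math.log2(pv / qv)
    return total


def satisfies_t_closeness(group_values, table_values, t):
    """Assumed check: the group's distribution is within KL divergence t of the table's."""
    p = empirical_distribution(group_values)
    q = empirical_distribution(table_values)
    return kl_divergence(p, q) <= t
```

Under these assumptions, two identical variables are at distance 0, while two independent fair binary variables are at distance 2 bits. A group whose sensitive values are all identical, drawn from a table split evenly between two values, has divergence 1 bit from the table distribution.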

Keywords/Search Tags:

Information theory, diversity patterns, correlation patterns, feature selection, nonlinear correlation clustering, privacy preservation

Related items

1	Research On Data Privacy Preserving Method For Clustering Based On Neighborhood Correlation
2	A Study On Composite Service Selection Method Based On Service Correlation Patterns
3	Video Underlying Feature Selection And Evaluation Of Correlation Analusis With The Audience
4	Correlation Visualization Of Time-varying Patterns For Multi-variable Data
5	Design of an electrochemical cognitive system: A study and application of emergent spatio-temporal patterns in far from equilibrium nonlinear systems
6	Research On The Key Problems Of Canonical Correlation Analysis For Multidimensional Data Streams
7	Research And Application Of Max-Correlation And Mix-Redundancy Unsupervised Feature Selection
8	The Research And Application Of Clustering Feature Selection Methods
9	Research On Attribute Selection Algorithm Based On Analysis Of Correlation Between Attributes
10	Label-specific Feature Multi-label Learning Based On The Combination Of Multiple Correlation Information