Evaluating And Discovering Correlations In Data Sets

Posted on: 2015-12-13
Degree: Master
Type: Thesis
Country: China
Candidate: M Li
GTID: 2298330434954285
Subject: Computer Science and Technology
Abstract/Summary:
The most important issue in the era of big data is not the sheer quantity of data but a change in thinking, one aspect of which is the shift from causal relationships to correlation. A causal relationship answers "why is it", namely the root cause of things, which is usually elusive, obscure, and sometimes impossible to learn. Correlation answers "what is it", namely the dependencies between things, which are comparatively easy to obtain and can take the place of causal relationships in most cases. The domains that have been or are being affected by correlation measurement and discovery include Data Mining, Machine Learning, and Knowledge Discovery.

The traditional methods for measuring correlation include the Pearson Correlation Coefficient, Mutual Information, and the relevance evaluation methods used in Machine Learning and Data Mining. These methods have significant limitations: they cannot measure more general correlations, such as nonlinear relationships, on equal terms with linear relationships. The recently proposed statistical measure MIC measures the correlation between two variables well, but it cannot be computed exactly in polynomial time, and so far no effective method exists for measuring and discovering multi-variable correlations.

In response to these problems, a statistical measure called ARTMIC (the Alternant Recursive Topology Maximum Information Coefficient) is proposed to evaluate the strength of the correlation between two variables, together with several other statistics that measure the nature of a bi-variable correlation. These statistics can evaluate a wide range of relationships, both linear and nonlinear, efficiently and equitably, and they compensate for two disadvantages of Reshef’s MIC: that it cannot be computed exactly in polynomial time, and that it is incapable of identifying the "local random" phenomenon.
ARTMIC and the other statistics are applied to a dataset of 19 classical American indexes with data collected since 1959, and as a result a large number of bi-variable correlations are found.

In addition, drawing on the idea of a chemical system, the "ideal correlation system" is proposed as a framework for multi-variable correlation, and the mapping between the two systems is examined. By proving three mutual information decomposition theorems, the decomposability of multi-variable correlations is demonstrated to a large extent. Methods for measuring and discovering multi-variable correlations are proposed separately for the ideal and non-ideal situations, and the effectiveness of these methods is verified by simulations and by experiments on real data.
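
The limitation of linear measures mentioned in the abstract can be illustrated with a minimal sketch. It is not ARTMIC or MIC, whose definitions are not given on this page. It only compares the Pearson coefficient with a simple plug-in mutual information estimate on a fixed equal-width grid. The function names, the grid size of 8, the noise levels, and the synthetic data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def binned_mutual_information(xs, ys, bins=8):
    """Plug-in mutual information estimate in bits on an equal-width grid.

    Illustrative only: the grid is fixed, not optimized as in MIC.
    """
    def to_bins(vs):
        lo, hi = min(vs), max(vs)
        width = (hi - lo) / bins or 1.0  # guard against constant input
        return [min(int((v - lo) / width), bins - 1) for v in vs]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    # Sum over occupied cells of p(i,j) * log2(p(i,j) / (p(i) * p(j))).
    return sum(
        (c / n) * math.log2(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
linear = [v + rng.gauss(0.0, 0.1) for v in x]        # noisy linear relation
parabola = [v * v + rng.gauss(0.0, 0.05) for v in x]  # noisy nonlinear relation
noise = [rng.uniform(-1.0, 1.0) for _ in x]           # independent of x

r_linear = pearson(x, linear)
r_parabola = pearson(x, parabola)
mi_parabola = binned_mutual_information(x, parabola)
mi_noise = binned_mutual_information(x, noise)

print(f"Pearson, linear:   {r_linear:.3f}")
print(f"Pearson, parabola: {r_parabola:.3f}")
print(f"MI, parabola:      {mi_parabola:.3f} bits")
print(f"MI, independent:   {mi_noise:.3f} bits")
```

In this sketch the Pearson coefficient is close to zero for the parabola even though the dependence is strong, while the binned mutual information separates the parabola clearly from independent noise. The estimate depends on the chosen grid, which is the kind of sensitivity that grid-searching measures such as MIC are designed to address.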
Keywords/Search Tags: ARTMIC, Correlation Mining, Non-Linear Correlation, Multi-variable Correlation