
Research on a Dissimilarity Measure Based on Parameter-Free Data Mining

Posted on: 2008-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: J J Wen
Full Text: PDF
GTID: 2208360215961241
Subject: Computer software and theory
Abstract/Summary:
In recent years, data mining techniques have developed rapidly, and clustering and classification methods have been applied in many fields. However, all of these methods require parameters to be set, and the problems caused by parameter setting have attracted considerable attention from researchers. To address these problems, some researchers have proposed the idea of parameter-free data mining.

In this work, we present the theory of data mining, analyze the influence of parameter setting, and show that it is one of the factors that can lead data mining to incorrect results. To avoid this problem, parameters must be removed from every step of the data mining procedure. We focus on the dissimilarity measure. We propose a new method, SCDM (Symmetrical Compression-Based Dissimilarity Measure), which is based on Kolmogorov complexity theory coupled with compression. SCDM uses a compression algorithm to estimate Kolmogorov complexity. Because compression algorithms are typically space- and time-efficient, SCDM inherits these advantages.

In this paper, SCDM was implemented using MATLAB, with standard compressors and, for DNA sequences, GenCompress. Many experiments were performed on DNA and time-series sequences, and the results were compared with those obtained using Euclidean distance. SCDM was then applied to hierarchical clustering.

As the experimental results show, SCDM does not require the two instances being compared to have exactly the same dimensionality, and it tolerates missing individual data points. It requires no parameter setting and, owing to its high efficiency, can readily handle high-dimensional instances. SCDM avoids the influence of parameters, makes the algorithm more lightweight, and gives good results when used in hierarchical clustering.
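
The abstract does not give the SCDM formula, so the following is only a minimal sketch of how a symmetric compression-based dissimilarity might look. It assumes a CDM-style ratio, C(xy) / (C(x) + C(y)), made symmetric by averaging the two concatenation orders. That is an assumption for illustration and may differ from the definition used in the thesis. Python's `zlib` stands in for the "standard compressor"; GenCompress and MATLAB are not used here, and the names `compressed_size` and `symmetric_cdm` are hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Approximate Kolmogorov complexity by the zlib-compressed length."""
    return len(zlib.compress(data, 9))


def symmetric_cdm(x: bytes, y: bytes) -> float:
    """Assumed symmetric compression-based dissimilarity (not the thesis's formula).

    Averages the compressed sizes of both concatenation orders so that
    symmetric_cdm(x, y) == symmetric_cdm(y, x). Smaller values mean the
    two sequences share more structure.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    cyx = compressed_size(y + x)
    return (cxy + cyx) / (2.0 * (cx + cy))


# Synthetic DNA-like sequences (illustrative data only, not from the thesis).
rng = random.Random(0)
base = "".join(rng.choice("ACGT") for _ in range(2000)).encode()
# Same sequence with one inserted base, so the two lengths differ.
near = base[:1000] + b"A" + base[1000:]
# An unrelated random sequence.
far = "".join(rng.choice("ACGT") for _ in range(2000)).encode()

print("base vs near:", symmetric_cdm(base, near))
print("base vs far: ", symmetric_cdm(base, far))
```

The measure takes no tuning parameters and accepts sequences of different lengths, which is the property the abstract emphasizes. A pairwise matrix of such values could then be passed to any standard hierarchical clustering routine.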
Keywords/Search Tags: parameter-free data mining, Kolmogorov complexity, SCDM dissimilarity measure, GenCompress