Font Size: a A A

Study Of Kolmogorov Complexity Based Clustering Algorithms

Posted on:2014-08-15Degree:MasterType:Thesis
Country:ChinaCandidate:X L TaoFull Text:PDF
GTID:2268330422453101Subject:Management Science and Engineering
Abstract/Summary:PDF Full Text Request
Clustering is an important branch of Data Mining, which has gotten more depth research becauseof its unique knowledge discovery functions. Today, there are lots of efficient clustering algorithmswhich have been widely used in many area services. However, with expansion of areas, clusteringalgorithms encountered a number of universal problems. For instance, Text clustering algorithmcannot be used in picture clustering area and music clustering area. Because of differents kinds of areaknowledg needed to extract subject-sepcific features, they cannot be universal.We present a new method for clustering based on Kolmogorov complexity. The method doesn’tuse subject-sepcific features or background knowledge, and works as follows: First we determine aparameter-free, universal, similarity distance, the normalized compression distance or NCD,computerd from the lengths of compressed data files (singly and in pairwise concatenation). Second,we apply a K-means clustering method(In fact, any other kind of fine clustering method is OK). TheNCD is not restricted to a specific application area, and works across application area boundaries. Atheoretical precursor, the normalized information distance, is provably optimal. However, theoptimality comes at the price of using the non-computable notion of Kolmogorov complexity. Wepropose axioms to capture the real-world setting, and show that the NCD approximates optimality. Totest the K-means clustering algorithm from the distance matrix, the algorithm is implemented andavailable as public software on Github.com, and is robust under choice of different kinds of subjects.To substantiate our claims of university and robustness, we report evidence of successful applicationin areas as diverse as literature, pictures, broken pictures and scientific documents.
Keywords/Search Tags:Clustering Algorithm, Kolomogrov Complexity, Normalized Compression Distance, Data Preprocess
PDF Full Text Request
Related items