A New High-dimensional Data Clustering Algorithm Based On GAs

Posted on: 2012-06-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Wang
Full Text: PDF
GTID: 2218330338453285
Subject: Computer application technology
Abstract/Summary:
Data mining is the non-trivial process of extracting valid, novel, potentially useful, and ultimately understandable patterns from large amounts of data. It applies mathematical methods to analyze large volumes of data and find partitions of the data that support decision making. Data mining is one of the most active research frontiers in the information industry, and cluster analysis is a very active research topic within it.

Cluster analysis divides data into a number of meaningful classes, or clusters, according to similarity. Objects in the same cluster are highly similar to one another, while objects in different clusters are dissimilar. High-dimensional data clustering is an important and difficult problem in cluster analysis. Clustering algorithms for low-dimensional data are by now fairly mature. High-dimensional data, however, are distributed very differently from low-dimensional data, which causes many algorithms designed for the low-dimensional case to fail. Research on high-dimensional data clustering is therefore of considerable importance. Two approaches are commonly used for this problem: subspace clustering and dimensionality reduction.

This thesis proposes a new subspace clustering algorithm based on a genetic algorithm. Its fitness evaluation function scores a candidate subspace using distance together with the contribution rate of information entropy. Because the fitness value determines the quality of the clustering result and is the basis for evaluating it, the design of this function has theoretical value and significance.

The main work and innovations of this thesis are as follows:
(1) The chromosome encoding and the search space of the genetic algorithm are defined. Each chromosome consists of a feature-selection subspace part and a class-center part. Real-valued encoding is adopted because it offers a larger search space and is more convenient to work with.
(2) A new fitness evaluation function is designed; this is the core of the thesis. It combines the within-class distance of objects, the distance between class centers, and the contribution rate of information entropy to the subspace clustering, which greatly improves accuracy and robustness.
(3) The efficiency and robustness of the algorithm are validated on artificial and real data sets, and its merits are assessed by comparison with other clustering algorithms.
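
The encoding and fitness ideas in items (1) and (2) can be illustrated with a minimal sketch. It is not the thesis's actual method: the abstract gives no formulas, so the chromosome layout (`d` feature weights followed by `k` centers of `d` values each), the 0.5 selection threshold, and the ratio-style score are all assumptions made here for illustration. The information-entropy term is left out entirely because the abstract does not define it. The names `decode`, `subspace_distance`, and `fitness` are hypothetical.

```python
import math


def decode(chromosome, k, d, threshold=0.5):
    """Split a real-coded chromosome into a feature mask and k class centers.

    Assumed layout (not from the thesis): the first d genes are feature
    weights, the remaining k*d genes are the coordinates of k centers.
    """
    if len(chromosome) != d + k * d:
        raise ValueError("chromosome length must be d + k*d")
    # A feature belongs to the subspace when its weight exceeds the threshold.
    mask = [gene > threshold for gene in chromosome[:d]]
    centers = [chromosome[d + i * d: d + (i + 1) * d] for i in range(k)]
    return mask, centers


def subspace_distance(a, b, mask):
    """Euclidean distance restricted to the selected features."""
    return math.sqrt(sum((x - y) ** 2 for x, y, m in zip(a, b, mask) if m))


def fitness(chromosome, data, k, d):
    """Illustrative score: between-center distance over within-class distance.

    Higher is better. This stands in for the thesis's fitness function,
    whose exact form (including the entropy term) is not given in the abstract.
    """
    if k < 2:
        raise ValueError("need at least two classes")
    mask, centers = decode(chromosome, k, d)
    if not any(mask):
        return 0.0  # an empty subspace cannot separate anything
    # Mean distance from each object to its nearest center.
    within = sum(
        min(subspace_distance(point, c, mask) for c in centers) for point in data
    ) / len(data)
    # Mean pairwise distance between centers.
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    between = sum(
        subspace_distance(centers[i], centers[j], mask) for i, j in pairs
    ) / len(pairs)
    return between / (within + 1e-9)
```

With a data set whose third feature is pure noise, a chromosome that deselects that feature scores higher under this sketch than one that keeps it, which is the behavior a subspace-clustering fitness function is meant to reward.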
Keywords/Search Tags: data mining, high-dimensional data, cluster analysis, genetic algorithm, feature subspace