A New High-dimensional Data Clustering Algorithm Based On GAs

Posted on:2012-06-09

Degree:Master

Type:Thesis

Country:China

Candidate:Z F Wang

Full Text:PDF

GTID:2218330338453285

Subject:Computer application technology

Abstract/Summary:


Data mining is the non-trivial process of extracting valid, novel, potentially useful, and ultimately understandable patterns from large amounts of data. It applies mathematical methods to analyze large volumes of data and to find partitions of the data that can support decision making. Data mining is one of the leading research areas in the information industry, and cluster analysis is one of its most active research topics.

Cluster analysis divides data into a number of meaningful classes, or clusters, according to similarity. Objects in the same cluster are highly similar to one another, while objects in different clusters are dissimilar. Clustering high-dimensional data is an important and difficult problem in cluster analysis. Clustering algorithms for low-dimensional data are by now relatively mature, but high-dimensional data is distributed very differently from low-dimensional data, which causes many low-dimensional algorithms to fail when applied to it. Research on high-dimensional data clustering is therefore of considerable significance. Two approaches are commonly adopted for the problem: subspace clustering and dimension reduction.

This thesis proposes a new subspace clustering algorithm based on a genetic algorithm. Its fitness evaluation function scores a candidate subspace by combining distance measures with the contribution rate of information entropy. Because the fitness value determines the quality of the clustering result and is the basis on which clusters are evaluated, the design of this function has both theoretical value and practical significance.

The main work and innovations of this thesis are as follows:

(1) Determine the chromosome coding and the search space of the genetic algorithm. The coding consists of two parts: a feature-selection subspace and a class-center space. Because real-number coding gives a larger search space and is more convenient to work with, this thesis adopts real-number coding.

(2) Design a new fitness evaluation function, which is the core of this thesis. The function combines the within-class distance of the objects, the distance between class centers, and the contribution rate of information entropy to the subspace clustering, which greatly improves accuracy and robustness.

(3) Validate the efficiency and robustness of the algorithm on artificial and real data sets, and compare it with other clustering algorithms to assess its merits.
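The abstract does not give the formulas behind points (1) and (2), so the following is only a minimal sketch of how such a real-coded chromosome and fitness function might look, not the thesis's actual method. Everything specific here is an assumption made for illustration: the names decode, entropy_gain, and fitness, the 0.5 threshold for selecting a feature, the histogram-based entropy estimate, and the way the three terms are combined into one score. Data is assumed to be scaled to roughly [0, 1], with at least two clusters.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def decode(chromosome, n_features, n_clusters):
    """Split a real-coded chromosome into a feature mask and class centers.

    Assumed layout: the first n_features genes are feature-selection
    weights, the remaining n_clusters * n_features genes are class centers.
    """
    genes = np.asarray(chromosome, dtype=float)
    mask = genes[:n_features] > 0.5  # assumed selection threshold
    centers = genes[n_features:].reshape(n_clusters, n_features)
    return mask, centers


def entropy_gain(data, bins=10):
    """Per-feature gap between the maximum histogram entropy and the
    observed one. A large gap means the feature's values are concentrated,
    which this sketch treats as a sign of cluster structure."""
    max_entropy = np.log(bins)
    gains = []
    for column in data.T:
        counts, _ = np.histogram(column, bins=bins)
        p = counts[counts > 0] / counts.sum()
        gains.append(max(0.0, max_entropy + (p * np.log(p)).sum()))
    return np.array(gains)


def fitness(chromosome, data, n_clusters, bins=10):
    """Hypothetical fitness: entropy contribution rate of the selected
    subspace, times between-center distance, over within-class distance."""
    data = np.asarray(data, dtype=float)
    n_features = data.shape[1]
    mask, centers = decode(chromosome, n_features, n_clusters)
    if n_clusters < 2 or not mask.any():
        return 0.0  # no usable subspace or nothing to separate

    sub_data, sub_centers = data[:, mask], centers[:, mask]

    # Within-class distance: mean distance of each object to its nearest center.
    dists = np.linalg.norm(sub_data[:, None, :] - sub_centers[None, :, :], axis=2)
    within = dists.min(axis=1).mean()

    # Between-center distance: mean pairwise distance between class centers.
    pairwise = np.linalg.norm(sub_centers[:, None, :] - sub_centers[None, :, :], axis=2)
    between = pairwise[np.triu_indices(n_clusters, k=1)].mean()

    # Contribution rate: share of the total entropy gain held by the subspace.
    gains = entropy_gain(data, bins)
    contribution = gains[mask].sum() / (gains.sum() + EPS)

    return float(contribution * between / (within + EPS))
```

In a genetic algorithm, a function like fitness would be evaluated for every chromosome in the population, so that selection favors chromosomes whose chosen features and centers give compact, well-separated clusters. Under the assumptions above, a chromosome that selects informative features should score higher than one that selects noise features.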

Keywords/Search Tags:

data mining, high-dimensional data, cluster analysis, genetic algorithm, feature subspace


Related items

1	Research On High-dimensional Data Clustering Based On Genetic Algorithm
2	Research On Subspace Clustering Algorithm For High Dimensional Data
3	Research On Clustering Algorithms For High-Dimensional Data
4	The Research On A Few Key Issues In High Dimensional Data Mining
5	Research On Improved Subspace Clustering Algorithm
6	Study On High-dimensional Data Subspace Clustering Analysis And Application
7	The Application Of Cluster Analysis Algorithm In HMIS
8	The Research On Common Subspace Recognition Method For High Dimensional Data
9	Data Mining Technology And Its Application In Supermarket CRM
10	Research On Subspace Clustering Algorithms Based On Density