Research On High-dimensional Data Clustering Based On Genetic Algorithm

Posted on:2011-11-18

Degree:Master

Type:Thesis

Country:China

Candidate:L H Xiong

Full Text:PDF

GTID:2178360308485154

Subject:Computer application technology

Abstract/Summary:

Data mining is an active research area in the information technology industry, and cluster analysis is one of its most important topics. Clustering is the process of grouping data into clusters according to a similarity metric, and it has a wide range of real-world applications. Many classical clustering algorithms work well on low-dimensional data but often break down on high-dimensional data because of "the curse of dimensionality". Real applications, however, frequently involve high-dimensional data, for example gene expression data, financial data, multimedia data, and web data. The prevalence of such data makes research on clustering algorithms for high-dimensional data very important.

The direct approach to high-dimensional data clustering is feature transformation, which maps the high-dimensional space into a low-dimensional one so that traditional clustering algorithms can then be applied. In a high-dimensional data space, however, not all dimensions are relevant to clustering. Finding the most appropriate feature subspace by exhaustive search would require testing every feature subset, which is computationally very expensive for high-dimensional data. Traditional search algorithms such as greedy algorithms may find only locally optimal solutions, so this study uses a genetic algorithm (GA) to search for the feature subspace. Genetic algorithms are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and natural genetics. In our approach, the search capability of the GA is exploited to find appropriate feature subsets for clustering. To capture the role that individual features (or dimensions) play in clustering, a fitness function based on the degree to which features contribute to subspace clustering is proposed.
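The feature-subset search described above can be sketched as a genetic algorithm over bit masks, where bit j says whether feature j belongs to the candidate subspace. This is only an illustration under stated assumptions, not the implementation from the thesis: the function names, the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), the parameter values, and the toy fitness are all hypothetical. With d features there are 2^d - 1 non-empty subsets, which is why testing all of them is impractical.

```python
import random

def num_feature_subsets(d):
    """Number of non-empty feature subsets of d features."""
    return 2 ** d - 1

def ga_search_mask(fitness, d, pop_size=30, generations=60,
                   p_cross=0.9, p_mut=0.02, seed=0):
    """Search binary feature masks of length d that maximize `fitness`.

    Hypothetical sketch: operators and defaults are assumptions,
    not the settings used in the thesis.
    """
    rng = random.Random(seed)

    def repair(mask):
        # A subspace must contain at least one feature.
        if not any(mask):
            mask[rng.randrange(d)] = 1
        return mask

    def tournament(pop, scores):
        # Binary tournament: pick two individuals, keep the fitter one.
        a, b = rng.randrange(len(pop)), rng.randrange(len(pop))
        return pop[a] if scores[a] >= scores[b] else pop[b]

    pop = [repair([rng.randint(0, 1) for _ in range(d)])
           for _ in range(pop_size)]
    best, best_score = None, float("-inf")
    for _ in range(generations):
        scores = [fitness(m) for m in pop]
        for m, s in zip(pop, scores):
            if s > best_score:
                best, best_score = m[:], s
        next_pop = [best[:]]  # elitism: carry the best mask forward
        while len(next_pop) < pop_size:
            p1, p2 = tournament(pop, scores), tournament(pop, scores)
            child = p1[:]
            if d > 1 and rng.random() < p_cross:
                cut = rng.randrange(1, d)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            next_pop.append(repair(child))
        pop = next_pop
    return best, best_score

# Toy fitness (an assumption for demonstration only): reward the
# features in RELEVANT and penalize every other selected feature.
RELEVANT = {0, 3, 5}

def toy_fitness(mask):
    return sum(1 if j in RELEVANT else -1
               for j, g in enumerate(mask) if g)

best_mask, best_score = ga_search_mask(toy_fitness, d=10)
print(num_feature_subsets(10), best_mask, best_score)
```

The toy fitness only stands in for a real clustering-quality measure; in practice each mask would be scored by clustering the data in the selected subspace.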
The research in this thesis has both theoretical and practical significance. The main contributions of the study are summarized as follows: (1) The search space and encoding method are determined. Traditional GA-based clustering encodes only the cluster centers, whereas in this study the encoding space is made up of the feature subspace and the cluster centers, with some conditions attached to limit the length of the encoded string. (2) A new fitness function is proposed, based on the degree to which features contribute to subspace clustering. As an evaluation function for subspace clustering, it can compare the clustering results of different subspaces; that is, it evaluates the clustering result and the features included in the subspace at the same time. (3) A high-dimensional data clustering algorithm using genetic algorithms, called GA-HDclustering, is designed and implemented. (4) Experiments on a computer-generated artificial data set and on real-life data sets obtained from UCI and from the literature of Brian Tjaden indicate the feasibility and efficiency of GA-HDclustering.
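The combined encoding in contribution (1) might look like the following sketch, in which one chromosome holds a feature mask followed by k cluster centers. The abstract does not give the actual chromosome layout, the length-limiting conditions, or the proposed fitness function, so everything here is an assumption: the flat layout, the names `decode`, `assign`, and `subspace_fitness`, and the stand-in fitness (negative mean squared distance to the nearest center, divided by the number of selected features so that subspaces of different sizes are roughly comparable).

```python
def decode(chromosome, d, k):
    """Split a flat chromosome into a feature mask and k centers.

    Assumed layout: d mask genes, then k centers of d coordinates each.
    """
    mask = [int(g) for g in chromosome[:d]]
    flat = chromosome[d:]
    centers = [flat[i * d:(i + 1) * d] for i in range(k)]
    return mask, centers

def _sq_dist(p, c, dims):
    # Squared Euclidean distance restricted to the selected dimensions.
    return sum((p[j] - c[j]) ** 2 for j in dims)

def assign(points, mask, centers):
    """Label each point with its nearest center inside the subspace."""
    dims = [j for j, g in enumerate(mask) if g]
    labels = []
    for p in points:
        dists = [_sq_dist(p, c, dims) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels

def subspace_fitness(points, mask, centers):
    """Stand-in fitness (NOT the thesis's function): higher is better."""
    dims = [j for j, g in enumerate(mask) if g]
    if not dims:
        return float("-inf")  # an empty subspace is invalid
    total = sum(min(_sq_dist(p, c, dims) for c in centers)
                for p in points)
    return -total / (len(points) * len(dims))

# Feature 0 separates two groups; feature 1 is noise.
points = [[0.0, 5.0], [0.1, -3.0], [1.0, 4.0], [0.9, -4.0]]
chromosome = [1, 0, 0.05, 0.0, 0.95, 0.0]  # d=2 mask genes, k=2 centers
mask, centers = decode(chromosome, d=2, k=2)
print(assign(points, mask, centers))  # [0, 0, 1, 1]
print(subspace_fitness(points, mask, centers) >
      subspace_fitness(points, [1, 1], centers))  # True
```

Dividing by the number of selected features is a crude normalization: it still tends to favor small, low-variance subspaces, which is presumably the kind of bias a fitness based on feature contribution is meant to address.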

Keywords/Search Tags:

cluster analysis, genetic algorithm, high-dimensional data, feature subspace

Related items

1	A New High-dimensional Data Clustering Algorithm Based On GAs
2	Research On Subspace Clustering Algorithm For High Dimensional Data
3	Research On Clustering Algorithms For High-Dimensional Data
4	Study On High-dimensional Data Subspace Clustering Analysis And Application
5	Research On Subspace Clustering Algorithms Based On Density
6	Research On Projective Clustering Algorithms With Applications For High-dimensional Data
7	The Research On Common Subspace Recognition Method For High Dimensional Data
8	A Study Of Soft Subspace Cluster Based On Natural Computation
9	Research On Clustering Methods For High Dimensional Data And Their Applications
10	Research On Clustering Algorithm Based On Subspace In High-dimensional Data Streams