Research On High-dimensional Data Clustering Based On Genetic Algorithm

Posted on: 2011-11-18, Degree: Master, Type: Thesis
Country: China, Candidate: L H Xiong, Full Text: PDF
GTID: 2178360308485154, Subject: Computer application technology
Abstract/Summary:
Data mining is an active research area in information technology, and cluster analysis is one of its most important topics. Clustering groups data into clusters according to a similarity metric and has a wide range of real-world applications. Many classical clustering algorithms work well on low-dimensional data but often fail on high-dimensional data because of "the curse of dimensionality". Real applications, however, often involve high-dimensional data, for example gene expression data, financial data, multimedia data and web data. The prevalence of such data makes research on clustering algorithms for high-dimensional data very important.
The direct approach to high-dimensional data clustering is feature transformation, which maps the high-dimensional space into a low-dimensional space, after which traditional clustering algorithms can be applied. In a high-dimensional data space, however, not all dimensions are relevant to clustering. Finding the most appropriate feature subspace by exhaustive search would require testing every feature subset, which is computationally very expensive for high-dimensional data. Traditional search algorithms such as greedy algorithms may find only locally optimal solutions, so in this study a genetic algorithm (GA) is used to search the feature subspace. Genetic algorithms are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics. In our approach, the search capability of GA is exploited to find appropriate feature subsets for clustering. To capture the role that individual features (or dimensions) play in clustering, a fitness function based on the degree to which features contribute to subspace clustering is proposed.
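The idea of using a GA to search feature subsets can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the thesis's algorithm: each individual is a binary mask over the dimensions, and the fitness is a stand-in (the mean fraction of per-feature variance explained by a small k-means partition in the selected subspace). The fitness function proposed in the thesis is not given in the abstract and is not reproduced here. The names `kmeans_labels`, `fitness` and `ga_feature_search`, and all parameter defaults, are hypothetical.

```python
import random

def kmeans_labels(rows, k, iters=10):
    """Plain k-means with a deterministic init (evenly spaced rows as seeds)."""
    step = max(1, len(rows) // k)          # assumes len(rows) >= k
    centers = [list(rows[i * step]) for i in range(k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        for i, r in enumerate(rows):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centers[c])),
            )
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def fitness(mask, data, k):
    """Stand-in fitness (an assumption): mean over the selected features of
    1 - within-cluster scatter / total scatter. Higher is better."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return float("-inf")               # an empty subspace is invalid
    rows = [[r[j] for j in cols] for r in data]
    labels = kmeans_labels(rows, k)
    score = 0.0
    for j in range(len(cols)):
        col = [r[j] for r in rows]
        mean = sum(col) / len(col)
        total = sum((v - mean) ** 2 for v in col)
        within = 0.0
        for c in range(k):
            vals = [v for v, l in zip(col, labels) if l == c]
            if vals:
                m = sum(vals) / len(vals)
                within += sum((v - m) ** 2 for v in vals)
        score += (1.0 - within / total) if total > 0 else 0.0
    return score / len(cols)

def ga_feature_search(data, k, pop_size=12, generations=15, p_mut=0.1, seed=0):
    """Evolve binary feature masks; returns (best_mask, best_fitness_history)."""
    rng = random.Random(seed)
    d = len(data[0])                       # assumes at least 2 dimensions
    cache = {}

    def fit(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = fitness(mask, data, k)
        return cache[key]

    pop = [[rng.randint(0, 1) for _ in range(d)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fit, reverse=True)
        history.append(fit(pop[0]))
        nxt = [pop[0][:]]                  # elitism: keep the best mask unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 2), key=fit)   # binary tournament selection
            b = max(rng.sample(pop, 2), key=fit)
            cut = rng.randrange(1, d)              # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt
    best = max(pop, key=fit)
    history.append(fit(best))
    return best, history
```

Note that this stand-in fitness tends to favour very small subspaces, which is one reason a fitness function designed to compare subspaces of different sizes, as the thesis proposes, would be needed in practice.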
The research in this thesis has both theoretical and practical significance. The main contributions of the study are summarized as follows: (1) The search space and encoding method are determined. Traditional GA-based clustering encodes only cluster centers, whereas in this study the encoding combines the feature subspace with the cluster centers, and constraints are attached to limit the length of the encoded string. (2) A new fitness function is proposed, based on the degree to which features contribute to subspace clustering. As the evaluation function for subspace clustering, it can compare the clustering results of different subspaces; that is, it evaluates the clustering result and the features included in the subspace at the same time. (3) A high-dimensional data clustering algorithm using genetic algorithms, called GA-HDclustering, is designed and implemented. (4) Experiments on a computer-generated artificial data set and on real-life data sets obtained from UCI and from the literature of Brian Tjaden indicate the feasibility and efficiency of GA-HDclustering.
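One possible reading of the encoding in contribution (1) is sketched below. The abstract does not specify the actual layout of the encoded string or the conditions that limit its length, so everything here is an assumption made for illustration: the `Chromosome` class, the `max_features` cap used as the length constraint, and the helper names are all hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Chromosome:
    features: list   # indices of the selected dimensions (the feature subspace)
    centers: list    # k cluster centers, each with len(features) coordinates

def random_chromosome(data, k, max_features, rng):
    """Build a random individual; max_features is an assumed length constraint."""
    d = len(data[0])
    m = rng.randint(1, min(max_features, d))
    features = sorted(rng.sample(range(d), m))
    seeds = rng.sample(data, k)            # use k data rows as initial centers
    centers = [[row[j] for j in features] for row in seeds]
    return Chromosome(features, centers)

def encoded_length(ch):
    """Number of genes: one per selected feature plus one per center coordinate."""
    return len(ch.features) + len(ch.centers) * len(ch.features)

def assign(ch, data):
    """Label each row with its nearest center, measured in the subspace only."""
    labels = []
    for row in data:
        proj = [row[j] for j in ch.features]
        dists = [sum((a - b) ** 2 for a, b in zip(proj, c)) for c in ch.centers]
        labels.append(dists.index(min(dists)))
    return labels
```

Under this assumed layout, capping the number of selected features at `max_features` bounds the encoded length at `max_features * (k + 1)` genes, which keeps the chromosome short even when the data has many dimensions.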
Keywords/Search Tags:cluster analysis, genetic algorithm, high-dimensional data, feature subspace