
The Research Of The Feature Selection And Cluster Algorithms In Data Mining

Posted on: 2011-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
Full Text: PDF
GTID: 2178330332460931
Subject: Computer application technology
Abstract/Summary:
As data acquisition techniques improve, data sets are growing in dimensionality and increasingly contain noise and redundant information. Simple statistical methods can no longer meet the demands of knowledge discovery. Data mining shows its advantages precisely where data is abundant but information is comparatively scarce, and it is becoming a powerful and effective analysis tool.

Feature selection and cluster analysis are two major areas of data mining. Feature selection aims to retain the useful information and improve accuracy. Cluster analysis aims to give an overall evaluation of the data without interference from human factors. In recent years, genetic algorithm-based feature selection and affinity propagation (AP) clustering have received widespread attention. In this thesis, we present a new genetic algorithm-based feature selection method that builds on the multi-population agent genetic algorithm by changing the encoding strategy and incorporating ensemble ideas. The method keeps the advantages of the multi-population agent genetic algorithm while reducing the number of features in the result. The frequency with which each feature is selected yields a ranking of feature importance, which helps in choosing important features for analysis. Inspired by the multi-population agent genetic algorithm, we also propose a chain-like multi-population genetic algorithm for feature selection, which improves population diversity by constructing a new population structure and selection strategy. For clustering, we propose a feature-weighted clustering method based on affinity propagation. By accounting for the different roles that different features play in clustering, it reflects the information in the data more accurately than the traditional method.

Feature selection results on liver disease data show that the short encoding-based multi-population genetic algorithm and the chain-like multi-population genetic algorithm avoid the shortcoming of retaining too many features in the result and improve classification accuracy. For clustering, experiments on three public data sets from UCI show that the feature-weighted AP method achieves higher accuracy than traditional AP clustering.
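The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it mentions: ranking features by how often they are selected, and letting features contribute unequally to a clustering similarity. The function names, the input format (one list of selected feature indices per run), and the choice of a negative weighted squared Euclidean distance as the similarity are all assumptions for illustration, not the thesis's actual method.

```python
from collections import Counter


def rank_features_by_frequency(selected_subsets):
    """Rank features by how often they appear across selection runs.

    `selected_subsets` is a hypothetical input format: one iterable of
    selected feature indices per run of a feature selection algorithm.
    """
    counts = Counter()
    for subset in selected_subsets:
        counts.update(set(subset))  # count each feature at most once per run
    # Most frequently selected first; ties broken by feature index.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def weighted_similarity(x, y, weights):
    """Negative weighted squared Euclidean distance between two points.

    Assumed form of a feature-weighted similarity; with all weights equal
    to 1 it reduces to the negative squared Euclidean distance commonly
    used as the similarity in affinity propagation.
    """
    return -sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights))


# Three hypothetical runs, each returning a subset of feature indices.
runs = [[0, 3, 5], [3, 5], [1, 3]]
ranking = rank_features_by_frequency(runs)
print(ranking)
print(weighted_similarity([1.0, 2.0], [3.0, 2.0], [0.5, 1.0]))
```

In this sketch a feature chosen in every run ranks first, and a larger weight makes differences along that feature count more heavily when two points are compared. How the thesis actually derives the weights is not stated in the abstract.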
Keywords/Search Tags: Feature Selection, Genetic Algorithm, Cluster, Data Mining