Research On Feature Selection, Classification And Clustering For Data Mining Based On Swarm Intelligence

Posted on: 2011-09-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Xiong
GTID: 1118330335492251
Subject: Signal and Information Processing
Abstract/Summary:
Data mining (DM) and knowledge discovery in databases (KDD) are active research areas in intelligence science and technology. They have considerable economic and social value and a large body of algorithms, but their results degrade when data sets are noisy, redundant, or high-dimensional. Feature selection (FS) mitigates this to some extent, yet mining and discovery still fall short of the global optimum because effective methods for the FS problem are lacking. Swarm intelligence (SI), inspired by the collective behavior of organisms in nature, is an important outcome of artificial intelligence research. To approach the global optimum in mining and discovery, this dissertation studies SI-based FS, classification, and clustering for DM. Its main contributions are summarized as follows:
1) SI-based DM was investigated, and a framework for SI-based FS, classification, and clustering was proposed to address the problem of obtaining optimal feature subsets and mining results, for which no SI-based framework existed; strategies for handling the framework's maladjustment problems were also discussed. In addition, a theoretical analysis of the improved adaptive ant colony optimization (ACO) algorithm proposed in this dissertation proved that the improved algorithm converges and that, under certain conditions, it converges in probability faster than the original.
2) Different FS models and their evaluation measures were investigated. A strategy was proposed by which the evaluation measures of the filter model can be exploited by the wrapper model within the above SI-based framework: feature importance measures serve as heuristic information, while stochastic factors and evaluation feedback correct the imprecision of those measures and help the search converge to globally optimal feature subsets.
In addition, the integration of multiple sources of heuristic information for FS was investigated, and a linear weighted combination of the extended t-statistic, Fisher's discriminant ratio (FDR), and random forest (RF) feature importance was presented as an integrated feature importance. This addresses the slow convergence of FS and the degraded mining results that occur when SI-based methods lack appropriate heuristic information.
3) The exploitation of sensitivity in machine learning algorithms was investigated. For example, the support vector machine (SVM) is sensitive to linear feature transformations of the data, and there is no approach for obtaining the best transformation factors other than simply normalizing the data. A hybrid method based on modified particle swarm optimization (PSO) and SVM was therefore presented for feature transformation and classification. It uses novel heuristic information to guide the swarm toward optimal linear feature transformation factors, and the features of the transformed data are further refined by discrete binary PSO to produce an optimal SVM classifier. Experiments on the Madelon data set from Neural Information Processing Systems (NIPS) 2003 and on ten data sets from the University of California, Irvine (UCI) show higher accuracy and smaller feature subsets on three UCI data sets and on Madelon, verifying the feasibility and effectiveness of the method.
4) Hybrid and combined algorithms for improving classification accuracy were investigated, and a hybrid method based on the improved adaptive ACO and RF was proposed for selecting a small set of marker genes from microarray data to build a high-accuracy combined cancer classifier. In pre-processing, features are pre-selected by importance ranking because of its low computational cost; in the main search, heuristic information is used to refine the pre-selection and accelerate the search.
Finally, in post-processing, restricted sequential forward selection (SFS) is used to construct optimal solutions from near-optimal ones. Experiments on two microarray gene expression data sets show the highest accuracy with fewer features, and yield three different groups of marker genes on one of the data sets.
5) SI-based clustering was investigated, and an ACO-based clustering method for DM with adaptive pheromone and adaptive cluster partitioning was proposed. It divides clustering into three phases (entering, increasing, and unimproved N times) and adaptively applies a different cluster partition method and pheromone volatility factor in each phase to improve the clustering result. Experiments on the Iris data set distributed with Weka show that the method obtains a smaller intra-cluster distance and 90% to 94% accuracy, verifying its feasibility.
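The integrated feature importance described in contribution 2 (a linear weighted combination of a t-statistic, FDR, and RF feature importance) can be illustrated with a minimal sketch. Several choices here are assumptions and are not taken from the dissertation: the abstract does not define the "extended t-statistic", so a plain Welch t-statistic stands in for it; the RF importances are passed in as a precomputed list; and the equal weights, the min-max scaling, and all function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def fisher_ratio(a, b):
    """Two-class Fisher's discriminant ratio for one feature."""
    denom = variance(a) + variance(b)
    return (mean(a) - mean(b)) ** 2 / denom if denom > 0 else 0.0


def welch_t(a, b):
    """Absolute Welch t-statistic (stand-in for the 'extended t-statistic')."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def min_max(scores):
    """Scale scores to [0, 1] so the three measures are comparable (assumption)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def integrated_importance(X, y, rf_importance, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Linear weighted combination of t-statistic, FDR, and RF importance.

    X: list of samples (each a list of feature values); y: 0/1 labels.
    Each class needs at least two samples for the variance estimates.
    rf_importance: one precomputed importance value per feature.
    """
    n_features = len(X[0])
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]

    t_scores, fdr_scores = [], []
    for j in range(n_features):
        a = [row[j] for row in pos]
        b = [row[j] for row in neg]
        t_scores.append(welch_t(a, b))
        fdr_scores.append(fisher_ratio(a, b))

    parts = [min_max(t_scores), min_max(fdr_scores), min_max(rf_importance)]
    return [
        sum(w * part[j] for w, part in zip(weights, parts))
        for j in range(n_features)
    ]


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 1.0], [0.1, 0.0], [0.2, 1.0], [1.0, 0.0], [1.1, 1.0], [1.2, 0.0]]
y = [0, 0, 0, 1, 1, 1]
scores = integrated_importance(X, y, rf_importance=[0.8, 0.2])
```

In an SI-based search such as ACO or PSO, a score vector like this could serve as the heuristic information that biases which features an agent picks, with the classifier's evaluation feedback correcting the ranking over time, as the abstract describes.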
Keywords: data mining, swarm intelligence, feature selection, feature transformation, classification, clustering