
Clustering Methods And Applications For High-dimensional Data Based On K-harmonic Means

Posted on: 2013-10-15; Degree: Doctor; Type: Dissertation
Country: China; Candidate: J J Chen; Full Text: PDF
GTID: 1228330395953626; Subject: Computer application technology
Abstract/Summary:
Advances in information technology and data storage have made it possible to acquire large amounts of high-dimensional data. Large data sets with high-dimensional features are now common in many domains, including financial analysis, genomics, sensors, web documents, and satellite imagery. Cluster analysis is an important means of mining interesting knowledge from such data. However, many clustering methods that work well in low-dimensional space perform poorly in high-dimensional space because of the curse of dimensionality. Clustering of high-dimensional data is therefore both an important research direction and a difficult research problem.

An important way to address the high-dimensionality problem is to apply dimension reduction so that the data have lower-dimensional features, and then to process them with methods designed for low-dimensional data. This preserves the efficiency and effectiveness of those methods. Traditional clustering methods have successfully solved the problem of clustering data with low-dimensional features. Among them, partitional clustering is widely used for its simplicity and low time complexity. Its shortcomings are equally clear: it is sensitive to noise and to initialization, requires the number of clusters to be specified in advance, and is easily trapped in local optima, all of which degrade performance on high-dimensional data. Classical partitional algorithms include fuzzy c-means (FCM), K-means (KM), and K-harmonic means (KHM).
Among these, KHM is the most robust, being less sensitive to initial values than FCM and KM.

To address the problem of clustering high-dimensional data, a filter-based two-step combinational feature selection algorithm, RF (ReliefF-FCBF), is proposed, based on FCBF (Fast Correlation-Based Filter) and ReliefF (Relief-F). Building on data preprocessing with the RF algorithm, the dissertation focuses on partitional clustering and the KHM algorithm. It further studies automatic clustering algorithms based on KHM, proposes a series of effective automatic clustering algorithms, and applies them to the analysis of gene expression data. The main contributions of the research are as follows:

(1) The dissertation proposes a filter-based two-step combinational feature selection algorithm, RF (ReliefF-FCBF). RF preprocesses the data and efficiently and effectively removes noisy, irrelevant, and redundant features, thereby reducing the data dimension. Experimental results on UCI Machine Learning data sets and gene expression data sets show that RF finds a more compact and discriminative gene subset, which supports the effective application of partitional clustering algorithms to high-dimensional data.

(2) A general scheme is proposed for AKHM (automatic KHM) clustering based on a CVI (clustering validity index). On this basis, a PBMF-based AKHM algorithm is designed, which removes the need to predefine the number of clusters in KHM. Experimental results on UCI Machine Learning data sets and gene expression data sets preprocessed by RF show that the PBMF-based AKHM algorithm performs well and accurately captures the inherent clusters of the data sets in most cases.

(3) A hybrid meta-heuristic automatic clustering method is proposed that combines the PBMF-based AKHM algorithm with the PSO (Particle Swarm Optimization) algorithm.
On this basis, two concrete algorithms, PSOAKHM and DAPSOAKHM (Dynamic Adaptive PSO AKHM), are designed to address the local optima problem of the KHM clustering algorithm. Experimental results on UCI Machine Learning data sets and gene expression data sets preprocessed by RF show that the two automatic clustering algorithms find approximately globally optimal solutions and perform especially well on multiclass data sets.

(4) The dissertation proposes a hybrid meta-heuristic automatic clustering method based on HS (Harmony Search) and KHM. On this basis, the GDACHSKHM (Global Dynamic Adaptive Clustering HSKHM) algorithm is proposed, which applies HS to the analysis of gene expression data and thus extends and enriches the practical applications of HS. HS is a newer meta-heuristic algorithm than PSO and has many advantages. GDACHSKHM automatically captures the inherent clusters of the data through its HS component and adaptively finds an approximately globally optimal solution without manual parameter tuning. Experimental results on UCI Machine Learning data sets and gene expression data sets preprocessed by RF demonstrate the effectiveness of GDACHSKHM; moreover, on some data sets it performs better and is more robust than PSOAKHM or DAPSOAKHM.

To address the problem of clustering high-dimensional data, the dissertation studies automatic clustering methods based on the KHM clustering algorithm and their application to the analysis of gene expression data. The results have both theoretical significance and practical application value.
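For readers unfamiliar with KHM, the following is a minimal sketch of one commonly cited formulation of the K-harmonic means objective and center update, written in plain Python. It is an illustration only and is not taken from the dissertation: the function names `khm_objective` and `khm`, the exponent `p=3.5`, the iteration count, and the distance floor `eps` are all assumptions made for this example, and the dissertation's own variant may differ.

```python
import math


def _distances(x, centers, eps):
    """Euclidean distances from point x to each center, floored at eps
    so that a point coinciding with a center does not divide by zero."""
    return [
        max(eps, math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c))))
        for c in centers
    ]


def khm_objective(points, centers, p=3.5, eps=1e-8):
    """Assumed KHM objective: sum over points of k / sum_j d_ij^(-p),
    i.e. the harmonic mean of the p-th powers of the distances."""
    k = len(centers)
    total = 0.0
    for x in points:
        d = _distances(x, centers, eps)
        total += k / sum(dj ** -p for dj in d)
    return total


def khm(points, centers, p=3.5, iters=50, eps=1e-8):
    """Run a fixed number of KHM center updates (hypothetical helper).

    Each point contributes to center j with coefficient
    d_ij^(-p-2) / (sum_l d_il^(-p))^2, which is the product of a soft
    membership and a per-point weight in the usual presentation.
    """
    centers = [list(c) for c in centers]
    dim = len(points[0])
    for _ in range(iters):
        num = [[0.0] * dim for _ in centers]   # weighted coordinate sums
        den = [0.0] * len(centers)             # sums of coefficients
        for x in points:
            d = _distances(x, centers, eps)
            s_p = sum(dj ** -p for dj in d)
            for j, dj in enumerate(d):
                coef = dj ** (-p - 2) / (s_p * s_p)
                den[j] += coef
                for t in range(dim):
                    num[j][t] += coef * x[t]
        centers = [[n / den[j] for n in num[j]] for j in range(len(centers))]
    return centers
```

The initial centers are passed in explicitly rather than drawn at random so that the sketch stays deterministic. A call such as `khm(points, initial_centers)` returns the updated centers as a list of coordinate lists, and `khm_objective` can be evaluated before and after to compare the two sets of centers.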
Keywords/Search Tags: Feature Selection, K-harmonic Means, Automatic Clustering, Particle Swarm Optimization, Harmony Search