
A New Consensus Method For Cluster Ensembles To Improve Clustering Accuracy And Stability

Posted on: 2012-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: Happe Clement Deus D S
GTID: 2178330335989453
Subject: Computer Science and Technology
Abstract/Summary:
Organizing data into sensible groups is one of the most fundamental modes of understanding and learning: similar patterns are placed in the same group and dissimilar patterns in different groups. This thesis presents a Statistical Consensus method for cluster ensembles that improves clustering accuracy and stability. It can also serve distributed data mining, where privacy constraints, a massive volume of data, or both prevent the data from being pooled into a single location for processing. Ensemble methods have been popular in supervised learning, where they have been shown to reduce predictive error considerably compared with the single best classical predictor or learning model. Likewise, intensive research is now under way on ensembles for unsupervised learning (cluster ensembles), with promising results. Our proposed clustering ensemble technique comprises four parts that lead to the final consensus clustering result.
The first part of our approach generates the partitions: the k-means clustering algorithm is run several times with different initializations. K-means is sensitive to its initial parameters, so different initializations lead to diverse clustering results on the same dataset. The second part selects the single best clustering among the generated partitions. This is done with the objective function of the k-means algorithm (the sum of squared distances from each point to its cluster center), which is treated as an error in this case. Under the k-means criterion, a smaller error implies more compact and better separated clusters, which is the essence of clustering. Because labels are unavailable, this error is the only well-proven mathematical measure for clustering quality analysis.
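The first two parts described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`kmeans`, `generate_partitions`, `select_best`), the seeding scheme, the iteration cap, and the toy data are assumptions made for illustration. Each run starts k-means from a different random initialization, and the run with the smallest sum of squared errors is kept as the single best clustering.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed, max_iter=100):
    """Plain Lloyd-style k-means from a seeded random initialization.

    Returns (labels, centers, sse), where sse is the k-means objective.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k sampled data points as initial centers
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse

def generate_partitions(points, k, n_runs):
    """Part 1 (hypothetical helper): one k-means run per seed."""
    return [kmeans(points, k, seed) for seed in range(n_runs)]

def select_best(runs):
    """Part 2 (hypothetical helper): the run with the minimum error."""
    return min(runs, key=lambda run: run[2])

# Toy data: two small groups of 2-D points (illustrative only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
runs = generate_partitions(points, k=2, n_runs=10)
best_labels, best_centers, best_sse = select_best(runs)
print(best_sse)
```

Seeding each run separately keeps the ensemble reproducible while still giving every run a different starting point.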
The third part of the consensus method selects consistent clusterings: inconsistent partitions are filtered out of the ensemble, and only consistent partitions take part in the consensus process. Here Mutual Information (MI), a measure from information theory, is used as the criterion for selecting consistent partitions. The fourth part of our method is the consensus function: the final clustering result is obtained by combining the consistent partitions in the ensemble with a Statistical Consensus function.
Our research is based on cluster ensembles, with the focus on improving the accuracy and stability of the clustering result. Ensemble techniques are among the most influential developments in data mining and machine learning; they combine multiple models into one that is usually of higher quality than the best of its components. Most data mining and knowledge discovery techniques put more effort into model building than into accuracy, for instance in marketing and network intrusion detection. In contrast, complex business intelligence systems such as auditing, fraud detection, and criminal detection require more attention to clustering accuracy than to the model. Any business intelligence system needs high-quality clustering as its core technique, and in most cases it involves huge volumes of data that are sometimes held in a distributed environment. The issue is that the available classical clustering algorithms are not stable.
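As an illustration of the third part (selecting consistent partitions), the sketch below filters partitions by their mutual information with a reference partition. The abstract does not state how MI is normalized, which partition serves as the reference, or what cutoff is used, so the normalization by the geometric mean of the entropies, the `reference` argument, and the `threshold` default of 0.5 are all assumptions, as are the function names.

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same points."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum((nij / n) * log(nij * n / (count_a[i] * count_b[j]))
               for (i, j), nij in joint.items())

def normalized_mi(a, b):
    """MI scaled to [0, 1] by the geometric mean of the entropies (assumed)."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0.0 or hb == 0.0:  # at least one labeling has a single cluster
        return 1.0 if ha == hb else 0.0
    return mutual_information(a, b) / sqrt(ha * hb)

def filter_consistent(partitions, reference, threshold=0.5):
    """Keep partitions that agree with the reference (hypothetical rule)."""
    return [p for p in partitions if normalized_mi(p, reference) >= threshold]

# Label vectors over the same four points (illustrative only).
a = [0, 0, 1, 1]
b = [1, 1, 0, 0]   # same grouping as a, with the label names swapped
c = [0, 1, 0, 1]   # grouping that shares no information with a
kept = filter_consistent([a, b, c], reference=a)
print(len(kept))
```

Because mutual information depends only on how points are grouped, partition `b` counts as fully consistent with `a` even though its label names are permuted.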
The instability of classical clustering algorithms leads to inaccurate clustering results. These algorithms are also unsuitable for distributed data environments in which the data cannot be pooled into a single location for processing, because classical clustering techniques assume that the data resides at a single location.
Apart from producing stable and accurate clustering results, the proposed new consensus approach for cluster ensembles offers the capability of clustering distributed datasets. Distributed data mining is one of the interesting aspects of data mining, especially when a dataset cannot be pooled into a single location because of storage issues (data mining typically involves massive data) or privacy reasons. As described above, a single classical clustering technique cannot handle such situations. Our consensus method represents each cluster by its cluster center and its number of patterns; this distinguishes our technique from existing cluster ensemble methods, which tend to use the label of each pattern or data point. Representing clusters by their centers and pattern counts resolves the label correspondence problem directly, without the additional technique that most existing cluster ensemble methods require. It also saves time and storage, since the only information needed for the consensus is the cluster centers with their numbers of data points, which is much smaller than the actual number of data points in the dataset. This makes our consensus method suitable for clustering massive volumes of data in parallel and/or distributed environments.
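The center-and-count representation described above can be sketched as follows. The abstract does not define the Statistical Consensus function, so the merging rule used here (match every cluster to its nearest reference center, then take the count-weighted mean of the matched centers) is a hypothetical stand-in, and `consensus_centers`, `summaries`, and `reference` are illustrative names. The sketch only shows why exchanging centers and counts is enough to line up clusters across partitions without per-point labels.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def consensus_centers(summaries, reference):
    """Merge cluster summaries into one set of centers (assumed rule).

    summaries: one list per partition, each holding (center, count) pairs.
    reference: list of k centers, e.g. those of the selected best partition.
    """
    k, dim = len(reference), len(reference[0])
    sums = [[0.0] * dim for _ in range(k)]
    weights = [0] * k
    for summary in summaries:
        for center, count in summary:
            # Correspondence is decided by distance, not by label name.
            j = min(range(k), key=lambda i: sq_dist(center, reference[i]))
            weights[j] += count
            for d in range(dim):
                sums[j][d] += count * center[d]
    merged = []
    for j in range(k):
        if weights[j]:
            merged.append(tuple(s / weights[j] for s in sums[j]))
        else:  # no cluster matched this reference center; keep it unchanged
            merged.append(tuple(reference[j]))
    return merged

# Two partitions, each summarized by (center, count) pairs (illustrative only).
# The second partition lists its clusters in the opposite order.
summaries = [
    [((0.0, 0.0), 4), ((10.0, 10.0), 4)],
    [((10.0, 12.0), 4), ((0.0, 2.0), 4)],
]
reference = [(0.0, 0.0), (10.0, 10.0)]
merged = consensus_centers(summaries, reference)
print(merged)  # [(0.0, 1.0), (10.0, 11.0)]
```

Each site in a distributed setting would only need to send its k centers and k counts, which is far less than its raw data points.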
Experimental results on real datasets show that our new consensus clustering produces improved results in terms of accuracy and stability compared with its component, which is the classical k-means clustering algorithm in this case, and that it can deal with massive volumes of data in a distributed environment.
The thesis is organized as follows. The first chapter introduces the idea of data mining and knowledge discovery, the techniques involved, the applications, and the challenges, together with a literature review. The second chapter focuses on ensembles and clustering ensembles, reviewing the technique and existing ensemble methods and citing their strengths and weaknesses, while the third chapter presents the proposed ensemble technique. The fourth chapter covers the experiments undertaken and the evaluation of the results, and the fifth chapter gives the conclusion. The last three sections are the Autography, the Appendices, and the Acknowledgements, respectively.
Keywords/Search Tags:Cluster Ensembles, Statistical Consensus, Unsupervised Learning, Business Intelligent System, Accuracy