I-nice: A New Approach For Data Clustering

Posted on: 2019-01-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Md Abdul Masud
Full Text: PDF
GTID: 1368330599454822
Subject: Information and Communication Engineering
Abstract/Summary:
Clustering is a general data exploration method. Among clustering methods, the k-means type algorithms are efficient for large data. However, these algorithms require the number of clusters in advance, and they need the initial cluster centers to be specified in order to improve the clustering results. In this thesis, we propose I-nice (Identifying the number of clusters and initial cluster centers), an approach for data clustering. In the I-nice approach, we consider a dataset as a terrain in which clusters are hills. We assign an observer to the terrain to observe and count the peaks of the hills, which correspond to the dense regions of clusters and reflect the number of clusters in the data. We develop two parameter-free clustering algorithms based on the I-nice concept. Using the I-nice method, we solve three clustering problems.

The major contributions of this thesis are fourfold. First, we propose the I-nice approach for data clustering. The I-nice method transforms high-dimensional data into one-dimensional distance data by computing the distances between the observation point and the objects. The distance distribution is modeled by a set of Gamma mixture models, which are fitted with the expectation-maximization algorithm. The best-fitted model is selected with a variant of the Akaike information criterion. We propose the I-nice SO (I-nice with a Single Observation) algorithm, in which the number of components in the selected model is taken as the number of clusters, and the objects in each component are analyzed with the k-nearest neighbor method to find the initial cluster centers. For complex data with many clusters, we propose the I-nice MO (I-nice with Multiple Observations) algorithm, which combines the results of multiple observation points.

Second, we formulate I-nice based semi-supervised clustering from unlabeled data. Here we propose a method for selecting pairwise constraints from unlabeled data to improve the clustering accuracy. For this purpose, we first cluster the unlabeled data with the I-nice method into a set of initial clusters. The most informative objects and informative objects are then identified among the objects in the clusters to form a set of pairwise constraints. The advantage of this method is that no label information is required for selecting the pairwise constraints.

Third, we formulate I-nice based concept drift detection for cluster survival analysis. In this approach, we propose a data stream clustering algorithm, I-nice Stream, for clustering the unlabeled load profile data stream. The concept drift detection method uses a modified Kullback-Leibler divergence to compute concept drift scores from the clustering results. We estimate the clustering patterns from the concept drift scores. We use survival analysis to categorize the clustering patterns into sustaining, fading, and emerging types, and we retrieve the representative load profiles with interesting characteristics.

Finally, to analyze the load profile data stream, we propose the I-nice based semi-supervised clustering ensemble framework. We modify the I-nice MO algorithm with weighted observation points, yielding I-nice WMO, which discovers the cluster structure on each load profile data horizon. In semi-supervised clustering, the pairwise constraints are selected from each cluster structure, and a set of solutions is obtained from several consecutive data horizons. The clustering ensemble method is then formulated to obtain an optimum clustering solution.

In the experiments, we used synthetic datasets, real-world datasets, and real-life application load profile data. The load profile data contains 21330 load profiles collected from manufacturing industries in Guangdong Province, China, in 2012. The experiments were conducted to evaluate the effectiveness of the proposed methods against competing methods in detailed cluster analysis.
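
The core idea of the first contribution (distances to an observation point, Gamma mixture candidates, and information-criterion selection) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The function names, the toy data, and the hand-picked mixture parameters are all hypothetical. The abstract does not say which Akaike variant is used, so the small-sample corrected form (AICc) is assumed here. The expectation-maximization fitting step is omitted, so the candidate models are scored as given rather than fitted.

```python
import math

def distances_to_observer(points, observer):
    """Map each multi-dimensional object to its Euclidean distance from the observer."""
    return [math.dist(p, observer) for p in points]

def gamma_logpdf(x, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at x > 0."""
    return ((shape - 1) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))

def mixture_loglik(xs, components):
    """Log-likelihood of xs under a mixture given as (weight, shape, scale) triples."""
    return sum(
        math.log(sum(w * math.exp(gamma_logpdf(x, k, s)) for w, k, s in components))
        for x in xs
    )

def aicc(loglik, n_components, n_obs):
    """Corrected AIC (assumed variant); a mixture of M Gammas has 3M - 1 free parameters."""
    p = 3 * n_components - 1
    return 2 * p - 2 * loglik + 2 * p * (p + 1) / (n_obs - p - 1)

# Toy data (hypothetical): two well-separated 2-D clusters near (1, 1) and (10, 10).
offsets = [(-0.2, 0.1), (0.1, -0.1), (0.0, 0.2), (0.2, 0.0), (-0.1, -0.2),
           (0.15, 0.15), (-0.15, 0.05), (0.05, -0.15), (0.1, 0.1), (-0.05, 0.0)]
points = [(1 + dx, 1 + dy) for dx, dy in offsets] + \
         [(10 + dx, 10 + dy) for dx, dy in offsets]

observer = (0.0, 0.0)  # observation point; must not coincide with any object
dists = distances_to_observer(points, observer)

# Hand-picked candidate models standing in for EM-fitted ones.
candidates = {
    1: [(1.0, 1.5, 5.2)],
    2: [(0.5, 50.0, 0.028), (0.5, 200.0, 0.07)],
}
scores = {m: aicc(mixture_loglik(dists, comps), m, len(dists))
          for m, comps in candidates.items()}
best_m = min(scores, key=scores.get)  # number of components taken as number of clusters
print(best_m, scores)
```

On this toy data the two-component candidate scores lower than the single-component one, so the sketch reports two clusters. This mirrors how the abstract describes taking the number of mixture components as the number of clusters.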
Keywords/Search Tags: Data Clustering, Number of Clusters, Semi-supervised Clustering, Concept Drift Detection, Load Profile Data