Font Size: a A A

Multi-Granular Big Data Analytics Based On Density Peak

Posted on:2018-03-23Degree:DoctorType:Dissertation
Country:ChinaCandidate:J XuFull Text:PDF
GTID:1318330566462463Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In the big data era,the collection,transmission,and storage of data have permeated into every realm of human's life.However,the value of big data needs to be revealed with the analytics and understanding for the data.Because of the 3Vs characteristics(volume,velocity,and variety)of big data,the human beings are not capable of obtaining insight into the data only with their cognitive power.But granular computing(GrC)as an emerging paradigm of approximate knowledge representation and problem solving,has the advantage of simulating the way the mankind think and is equipped with the super-fast computing power of the modern computers.So GrC can switch to the appropriate granularity when solving a specific problem,rather than sticking to the finest-grained original data.As a result,GrC usually obtains an effective solution with higher efficiency.In recent years,the clustering method based on density peaks(DPClust for short)has attracted widespread attention due to its being high in both accuracy and efficiency.We discover the philosophy behind DPClust is that the neighboring data are not in equal positions,but in a partial-ordered relation.The leading tree structure as an intermediate result is just the concrete realization of this partialordered relation.This thesis progressively leverages the partial-ordered relation and the structure of leading tree to explore the clustering and classification problems in big data.The main contributions are as follows:(1)A highly efficient multi-granularity clustering based on density peak(named as DenPEHC)is proposed for large volume static data.DenPEHC directly generates clusters on each possible clustering layer without a merging or dividing process,and introduces a grid granulation framework to enable DenPEHC to cluster large-scale and high-dimensional(LSHD)datasets.This study consists of three parts:(a)utilizing the distribution of the parameter ,and a linear fitting approach is used to select clustering centers with the clustering hierarchy decided by finding the “stairs”in the curve;(b)analyzing the leading tree as an intermediate result of DPClust,and constructing the clustering hierarchy efficiently based on the tree;and(c)designing a framework to enable DenPEHC to cluster LSHD datasets when a large number of attributes can be grouped by their semantics.The proposed method builds the clustering hierarchy by simply disconnecting the center points from their parents with a linear computational complexity (8)),where 8)is the number of clusters.Experiments on synthetic and real datasets show that the proposed method has promising efficiency,accuracy and robustness compared to state-of-the-art methods.(Chapter 3)(2)The basic idea of DenPEHC is extended to high speed evolving data stream,and a data stream clustering method is developed to detect arbitrarily shaped clusters and deliverthe clustering result on the fly.We transform the leading tree(LT)into a fat node leading tree(FNLT)in a granular computing way.FNLT is a novel interpretable synopsis of the current state of data stream for clustering.New incoming data is blended into the evolving FNLT structure quickly,and thus the clustering result of the incoming data can be delivered on the fly.During the interval between the delivery of the clustering results and the arrival of new data,the FNLT with blended data is granulated as a new FNLT with a constant number of fat nodes.The FNLT of the current data stream is maintained in a real-time fashion by the Blending-Granulating-Fading mechanism.To find the major pattern drifts,the change points are detected using the partial order relation between each pair of the cluster centers and the martingale theory.Compared to several state-of-the-art clustering methods,the presented model shows promising accuracy and efficiency.(Chapter 4)(3)The granulation methods in DenPEHC and DP-Stream are brought to be subjected to the justifiable granularity principle,and this gives birth to a Local-Density based Optimal Granulation(LoDOG)model.LoDOG exhibits evident advantages: a)it can detect arbitrarilyshaped information granules,b)it finds the optimal granulation solutions with ()complexity,once the leading tree structure has been constructed.We describe information granules of arbitrary shapes using a small collection of landmark points positioned on the skeleton of the underlying manifold,which contribute to approximate reconstruction capabilities of the original dataset.A dissimilarity metric is developed to evaluate the quality of the obtained reconstruction.The interpretability of LoDOG information granules is discussed.Theoretical analysis and empirical evaluations are covered to demonstrate the effectiveness of LoDOG and the manifold description.(Chapter 5)(4)Based on the optimal leading forest constructed by LoDOG,a non-iterative label propagation method(named as LaPOLeaF)is proposed to classify the data with only a few labeled samples,and LaPOLeaF is scaled to accommodate big data.We firstly propose a sound assumption,arguing that: the neighboring data points are not in peer-to-peer relation,but in a partial-ordered relation induced by the local density and distance between the data;and the label of a center can be regarded as the contribution of its followers.Starting from the assumption,a highly efficient noniterative label propagation algorithm based on a novel data structure named as optimal leading forest(LaPOLeaF)is proposed.The major two weaknesses of the traditional GSSL,namely,low efficiency caused by iterative optimization and inconvenience of labeling the newly arrived data,are addressed by this study.We further scale LaPOLeaF to accommodate big data by utilizing block distance matrix technique,parallel computing,and Locality-Sensitive Hashing(LSH).Experiments on large datasets have shown the promising results of the proposed methods.(Chapter 6)The research works included in this thesis emphasize on the efficiency and accuracy in dataanalytics.A salient feature shared by the works is that the proposed methods are non-iterative,and of linear complexity except for the distance matrix computation,which makes the series of researches highly applicable in big data analytics.And another highlight is the idea of multi-granularity has very simple implementations in every piece of work in this thesis,which makes it convenient to trade off between the time constraint and precision of the solution.
Keywords/Search Tags:Big Data, Granular Computing, Density Peak, Hierarchical Clustering, Data Stream Clustering, Information Granules, Semi-supervised Learning
PDF Full Text Request
Related items