
Study On Clustering For Large Data Sets And Its Applications

Posted on: 2012-08-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P J Qian
Full Text: PDF
GTID: 1118330332991559
Subject: Light Industry Information Technology and Engineering
Abstract/Summary:
Clustering analysis has long been a central topic in the field of Pattern Recognition, and many clustering methods have emerged. Most of these methods perform well on the small datasets they were designed for; however, they are often inefficient, impractical, or even invalid on large ones. Motivated by the challenges of clustering large datasets, this study addresses several issues, organized into the following five parts.

In Chapter 2, the Constrained Graph-based Relaxed Clustering (CGRC) algorithm is first presented by introducing a constraint condition and a linear term into the objective function of the original Graph-based Relaxed Clustering (GRC) algorithm. Because CGRC can be regarded as a Center-Constrained Minimal Enclosing Ball (MEB) problem, the Fast Graph-based Relaxed Clustering (FGRC) algorithm is then derived using the core-set based MEB approximation trick. The chief merit of FGRC is that its asymptotic time complexity is linear in the data size.

Probability density estimation is a theoretical foundation of Pattern Recognition, and many follow-up works build on it. So do the Fast Adaptable Similarity-Based Clustering Method (FASCM) proposed in Chapter 3 and the Fast Mean Shift Spectral Clustering (FMSSC) algorithm presented in Chapter 4; both are based on Fast Reduced Set Density Estimation (FRSDE).

In Chapter 3, it is first shown that the similarity measure function of the Similarity-Based Clustering Method (SCM) equals a Gaussian kernel density function, so a similarity function with sparse weight coefficients, which sharply reduces the computational cost of the SCA phase embedded in SCM, can be obtained quickly via FRSDE. Then, by replacing the Agglomerative Hierarchical Clustering (AHC) algorithm with GRC, the newly proposed method becomes independent of manual experience and self-adaptive.
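The identity underlying Chapter 3, that a Gaussian similarity measure agrees with a Parzen-window (Gaussian kernel) density estimate up to a constant factor, can be sketched as follows. This is an illustrative toy example, not the thesis's actual SCM formulation; the dataset, kernel width `h`, and function names are all assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))   # toy 2-D dataset (illustrative only)
h = 0.5                         # kernel width, chosen arbitrarily

def total_similarity(x, X, h):
    # Sum of Gaussian similarities between x and every data point
    d2 = np.sum((X - x) ** 2, axis=1)
    return np.exp(-d2 / (2 * h ** 2)).sum()

def parzen_density(x, X, h):
    # Parzen-window estimate with a Gaussian kernel in 2-D
    n, dim = X.shape
    d2 = np.sum((X - x) ** 2, axis=1)
    norm = n * (2 * np.pi * h ** 2) ** (dim / 2)
    return np.exp(-d2 / (2 * h ** 2)).sum() / norm

x0 = np.zeros(2)
ratio = total_similarity(x0, X, h) / parzen_density(x0, X, h)
# The ratio equals the constant n * (2*pi*h^2)^(dim/2), independent of x0:
# the similarity measure and the Gaussian KDE agree up to scaling.
print(ratio)
```

Because the two quantities differ only by a constant, density-estimation machinery such as FRSDE can be applied directly to the similarity computation.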
The resulting method is FASCM.

In Chapter 4, it is pointed out that the Parzen Window (PW) density estimator is responsible for the heavy time cost of the original Mean Shift Spectral Clustering (MSSC) algorithm. To address this, the FMSSC algorithm is presented within a new framework in which FRSDE substitutes for PW and the CGRC algorithm from Chapter 2 replaces the simple mode-merging step. This makes FMSSC more practical than MSSC, with a time complexity roughly linear in the total data size.

In Chapter 5, because the objective function of GRC can be decomposed into two parts, a "weighted sum of PW" term and a "quadratic entropy" term, GRC can be viewed as a Kernel Density Estimation (KDE) problem. Accordingly, the Scaling Up GRC by KDE Approximation (SUGRC-KDEA) method is proposed based on the KDE approximation theorem. The key issue in SUGRC-KDEA is the sample size. To ensure that the sample size is suitable and the sampled subset reflects the underlying distribution of the whole dataset, the Hypersphere Segmentation Based Random Sampling (HSBRS) algorithm is presented alongside it.

In Chapter 6, the Fast Kernel Density Estimation (FKDE) theorem, a fundamental theoretical result of this study, is proved via the Cauchy–Schwarz inequality. It shows that the Integrated Squared Error (ISE) between the KDE built from all data points and the KDE built from a sampled subset depends only on the sample size and the kernel width. That is, a subset of appropriate size can be employed for KDE in place of the whole dataset. This theorem provides new theoretical support for data-sampling based methods and technologies in Pattern Recognition, including all of the research in this thesis.
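The practical message of the FKDE theorem, that a modest random subset can stand in for the full dataset in a KDE, can be illustrated with a toy sketch. Here a plain uniform-weight subsample is used for simplicity; the actual FRSDE reduced set would carry optimized sparse weights, and the dataset, sizes, and kernel width are all assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=5000)              # "full" dataset
subset = rng.choice(data, size=200, replace=False)  # sampled subset (25x smaller)

def kde(x, pts, h=0.3):
    # 1-D Gaussian kernel density estimate at x, uniform weights
    z = (x - pts) / h
    return np.exp(-0.5 * z ** 2).sum() / (len(pts) * h * np.sqrt(2 * np.pi))

grid = np.linspace(-3.0, 3.0, 61)
full = np.array([kde(x, data) for x in grid])
sampled = np.array([kde(x, subset) for x in grid])

# The squared-error gap between the two estimates stays small even though
# the subset is far smaller than the full dataset, in the spirit of the
# FKDE theorem's bound on the ISE.
print(np.mean((full - sampled) ** 2))
```

The gap shrinks further as the subset grows or the kernel width increases, which mirrors the theorem's statement that the error depends only on the sample size and the kernel width.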
Keywords/Search Tags:Clustering, Large data set, Time complexity, Spectral clustering, Similarity-based clustering, Mean shift, Kernel density estimation, Minimal enclosing ball, Parzen Window, Reduced set density estimation