Font Size: a A A

Uncertainty Data Clustering Algorithm

Posted on:2012-08-03Degree:DoctorType:Dissertation
Country:ChinaCandidate:W L LiuFull Text:PDF
GTID:1118330368477372Subject:Management Science and Engineering
Abstract/Summary:PDF Full Text Request
Clustering, as a approach to data mining / knowledge discovery, is applied in the engineering (e.g., machine learning, pattern recognition, signal processing, data compression), computer science (e.g., Web mining, information retrieval, image segmentation, etc.), gene function identification and disease diagnosis in biomedical field, astronomy and earth science (star classification, geographical landscape analysis, etc.), social sciences (human behavior analysis, social network analysis, criminal psychology, archaeological findings, etc.), as well as customers'characteristics analysis , buying patterns analysis, business classification and stock trend analysis in the economic. Due to inaccurate measurement, sampling error, outdated data source and people's lack of awareness lead to data inherently with uncertainties in the clustering application, e.g., ambiguity and random. Data with uncertainties has brought enormous challenges in clustering analysis of data. On the one hand, the data used to eliminate the uncertainty components in the traditional data pre-processing, affect the quality of clustering results, on the other hand, the existing clustering algorithms for introducing uncertain features of the data will be brought complexity in algorithm.In recent years, Clustering, as important areas of data mining, data clustering techniques in uncertain has also been extensively studied. through the use of probability density function to model the uncertain objects and extend existing clustering algorithm, Scholars has proposed several algorithms, including UK-means, an improved version of K-means, and improvement of the EM algorithm, and FDBSCAN algorithm based on density and hierarchy oriented FOPTICS algorithm; Benjamin et. al., combing Monte Carlo methods in database systems, brought a kind of data clustering in uncertain world; Aggarwal & Yu proposed UMicro algorithm for uncertain data stream. Chau used UK-means algorithm to solve the uncertain moving objects clustering, and produce better results.But above mentioned algorithm were based on the common foundation introducing elements of uncertainty (probability density function that) into distance metric, which will inevitably lead to increase the algorithm's time complexity, and also restricting the distance approximation algorithm scalability during the desired distance calculation. Computing formula transformation (similar to the mechanics of the parallel axis theorem), min - max pruning method and the section function method to simplify the computational complexity of distance, due to the constraints relaxation method to obtain calculate reduction in the amount of the computation leads to weak scalability. At present, instance aggregate queries (equivalent to clustering) studies, mainly related to Top-k aggregation algorithm, which used branch and bound, calculation relaxation to reduce computational complexity. Due to development in data acquisition technology, database technology and the Internet and other technology, the cluster analysis for huge amounts of data highlights its importance.In this paper, for uncertain data environment,we study how to represent of uncertain data and similarity measure between uncertain data, on this basis, proposed several algorithms for clustering uncertain data sets, and verify the effectiveness of the algorithm. Main work includes:(1) The concept of uncertain region and several uncertain region-based clustering algorithms were proposed. Two types of clustering algorithms and uncertain region-based cluster validity measure are proposed based on establishment the concept of domain. The first category is hard c-means algorithm based on uncertain region, including the U-aHCM and U-sqHCM. U-aHCM algorithm is offline update cluster centers (i.e., batch update cluster centers), and U- sqHCM is the online update cluster centers (i.e., when there is one data object distributed from one to another cluster, then update the two clusters'center for their the data object changing); second category is fuzzy c-means algorithm based on the uncertain region, also includes two: U-sFCM and U-eFCM. These algorithms use the concept of uncertain region to better deal with uncertain data.(2)The concept of super-rectangular uncertain region and clustering algorithms based on super-rectangular uncertain region are proposed. Clustering algorithms based on super-rectangular uncertain region can be more flexibly to deal with uncertain data and find clusters of different shapes and sizes (clusters). This paper puts forward three types of clustering algorithms based on based on super-rectangular uncertain region: SU-aHCM & SU-sHCM, SU-sFCM & SU-eFCM and SU-sPCM & SU-ePCM.(3) In order to solve ill-posed problems in clustering algorithms based on based on super-rectangular uncertain region, we proposed concept of uncertain region based on super-rectangular uncertain region and regularization. One is fuzzy c-means clustering algorithm based on super-rectangular uncertain region and L2 regularization (L2-SU-sFCM with L2-SU-eFCM), the other is fuzzy c-means clustering algorithm based on super-rectangular uncertain region and L1 regularization(L1-SU-sFCM and L1-SU-eFCM), reflecting the sparse data objects, thus and finding better data relationship structures in the algorithm.(4) To express the uncertainty of membership function in the fuzzy c-means algorithm, combined with intuitionistic fuzzy set theory, this paper proposed fuzzy clustering algorithms based on intuitionistic fuzzy set, including fuzzy C-Means algorithm based on intuitionistic fuzzy sets (IFS -sFCM) and fuzzy c-means algorithm based on entropy and intuitionistic fuzzy sets (IFS-eFCM).
Keywords/Search Tags:Uncertain Data, Clustering Algorithms, Fuzzy Clustering, Intuitionistic Fuzzy Set
PDF Full Text Request
Related items