
Researches On Clustering Results Quality Improvement

Posted on: 2011-03-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Zong
Full Text: PDF
GTID: 1118360305955966
Subject: Computer software and theory
Abstract/Summary:
Clustering analysis is one of the important tasks in data mining and a principal means of reasonably classifying the original data objects. Generally, clustering is the process of partitioning a set of data objects into several subsets, or clusters, such that intra-cluster similarity is maximized while inter-cluster similarity is minimized. Over the past decades, clustering analysis has become one of the most popular research topics in data mining. Because clustering is a technology deeply tied to its applications, researchers use different models to define the clustering problem for different application goals, and a wide range of clustering algorithms has been proposed to solve these problems in different ways. In practice, however, existing clustering algorithms cannot guarantee the quality of the clustering result because of initialization sensitivity and parameter-restriction problems. In this thesis, we focus on improving clustering result quality: we first analyze, from the perspective of the model and its solving method, the factors that degrade clustering result quality, and then introduce new methods that address those factors. The research outcomes of this thesis have important theoretical and practical significance.

We attempt to improve clustering result quality with the following methods:

(1) We improve the clustering result quality of heuristic clustering algorithms by using space smoothing search methods. For clustering problems described by a combinatorial optimization model, heuristic clustering algorithms are an efficient approach based on a heuristic search strategy. They have the merit of fast convergence; however, the quality of their clustering results cannot be guaranteed because of initialization sensitivity and premature convergence.
By analyzing the solving process of heuristic clustering algorithms, we find that the local suboptimal minimum 'traps' embedded in the search space are the real cause of degraded clustering result quality. To reduce the impact of these multiple minima, this dissertation introduces a multi-search-space strategy into the design of heuristic clustering algorithms. Three smoothing operators are proposed to "fill in" the minimum "traps" without changing the construction of the clustering results; running a smoothing operator yields a reconstructed search space with fewer local minima. The fewer local minima a search space contains, the smoother it is. To control the smoothing level, i.e., the number of local minima, we introduce a parameter called the smoothing factor into the smoothing operator. A series of reconstructed search spaces with different smoothing levels is built by applying the smoothing operator with different smoothing factors, and in each space the search is seeded with the clustering result obtained in the previous space. Since the multi-smoothing-space technique reduces the number of local suboptimal minima embedded in the search space, their impact is eliminated and the clustering result quality is therefore improved, at the cost of the extra time needed to reconstruct the search spaces.

(2) We use the approximate backbone to capture the common optimal information of a given data set, and then use it to improve the clustering result quality of heuristic clustering algorithms. A heuristic clustering algorithm solves clustering problems described by a combinatorial optimization model with a local search strategy, generating a local suboptimal clustering result that minimizes the objective function.
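The space-smoothing search of method (1) can be illustrated with a minimal sketch. The smoothing operator below, which pulls every coordinate toward the global mean under a smoothing factor `alpha`, is a hypothetical form chosen for illustration only; the dissertation's three operators are not reproduced here. `alpha = 1` recovers the original search space, and each space is searched starting from the result found in the previous, smoother space:

```python
import numpy as np

def smooth_points(X, alpha):
    # Hypothetical smoothing operator: pull each coordinate toward the
    # global mean; alpha >= 1 is the smoothing factor, and alpha == 1
    # leaves the data (and hence the search space) unchanged.
    mu = X.mean(axis=0)
    return mu + np.sign(X - mu) * np.abs(X - mu) ** (1.0 / alpha)

def kmeans(X, centers, iters=50):
    # Plain Lloyd iterations started from the given centers.
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            if (labels == j).any() else centers[j]
                            for j in range(len(centers))])
    return centers, labels

def smoothing_search(X, k, alphas=(4.0, 2.0, 1.0), seed=0):
    # Search a series of reconstructed spaces, smoothest first; each
    # run is seeded with the clustering result of the previous space.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for alpha in alphas:
        centers, labels = kmeans(smooth_points(X, alpha), centers)
    return centers, labels
```

The sequence of smoothing factors decreases to 1, so the final search always runs in the original space, seeded by results carried over from the smoother reconstructions.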
To improve clustering result quality, researchers either use random-restart methods, running a clustering algorithm several times and keeping the best result, or they extract a consensus clustering result from the several results so obtained. Although the former can obtain a good result, it ignores the relationships among the different clustering results; the latter yields a consensus result for the data set but discards information from the original data set. As is well known, the clustering result of a heuristic clustering algorithm converges to a local minimum and reflects one aspect of the data set. The common intersection of several local suboptimal clustering results reflects the mutual optimal information of the data set and is significant for capturing its real clustering structure. In this dissertation, we introduce the approximate backbone to capture the common part of several local suboptimal clustering results, and then use it to design heuristic clustering algorithms. We first define the backbone and the approximate backbone of the heuristic clustering problem; we then study the factors that affect approximate-backbone quality and propose an algorithm that extracts the approximate backbone from several local suboptimal clustering results; finally, we propose two algorithms based on the approximate backbone: BK-means (clustering algorithm based on approximate backbone) and ABG_RC (Approximate Backbone Guided Reduction Clustering).
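A minimal sketch of the approximate-backbone idea in method (2): align the labels of several local suboptimal clustering results, then keep the objects on whose assignment all results agree. The brute-force label alignment below is an illustrative assumption (practical only for small k); the dissertation's extraction algorithm and the BK-means / ABG_RC designs built on it are not reproduced here.

```python
import numpy as np
from itertools import permutations

def align(ref, labels, k):
    # Relabel `labels` so it agrees with `ref` on as many objects as
    # possible (brute force over label permutations; small k only).
    best, best_hits = labels, -1
    for perm in permutations(range(k)):
        mapped = np.array([perm[c] for c in labels])
        hits = int((mapped == ref).sum())
        if hits > best_hits:
            best, best_hits = mapped, hits
    return best

def approximate_backbone(results, k):
    # Objects whose aligned cluster assignment is identical across
    # every local suboptimal result form the approximate backbone.
    ref = results[0]
    aligned = np.vstack([ref] + [align(ref, r, k) for r in results[1:]])
    in_backbone = (aligned == aligned[0]).all(axis=0)
    return in_backbone, ref
```

A BK-means-style algorithm could then, for example, seed k-means with the centroids of the backbone objects; that seeding step is omitted from this sketch.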
Since several local suboptimal clustering results must be derived by running a heuristic clustering algorithm several times, the algorithms proposed in this dissertation need more running time.

(3) We design a Local Significant Unit (LSU) structure, based on kernel density estimation and spatial statistical theory, to capture the data distribution in high-dimensional space and thereby improve clustering result quality. For high-dimensional clustering problems described by a spatial density estimation model, hyper-rectangle structures of equal or random width are often used to capture the data distribution. Kernel density estimation is an important spatial density estimation method, but the kernel width that reflects the true data distribution is hard to set in practice, so most subspace clustering algorithms restrict the kernel width to equal or random values and use the density of the resulting hyper-rectangles to approximate the kernel density. Although this restriction sidesteps the width-setting problem, equal or random widths cannot reflect the real distribution of the data set, so the clustering result cannot be guaranteed. In this dissertation, a local significant unit, consisting of a dense hyper-rectangle and its corresponding attribute subset, is designed to capture the local density distribution based on local kernel estimation and spatial statistical theory. The width of the hyper-rectangle and its attribute subset are obtained by running the Rodeo algorithm, which can find the real kernel width. To quickly find the LSUs that cover the distribution of the data set, we propose a greedy algorithm named GA_LSU (Greedy Algorithm for LSU). Dense hyper-rectangles sharing the same attribute subset are then merged by a single-linkage algorithm.
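The density test behind an LSU can be sketched as follows. The Rodeo-based width selection and the GA_LSU search are beyond a short example, so the sketch uses an assumed spatial-statistics criterion: a candidate hyper-rectangle over an attribute subset counts as "locally significant" when its point count exceeds, by z standard deviations, the count expected under a uniform spread (one-sided normal approximation to the binomial). Function and parameter names are illustrative, not the dissertation's.

```python
import numpy as np
from math import sqrt

def is_significant_unit(X, lo, hi, dims, z=2.33):
    # Count the points falling inside the hyper-rectangle [lo, hi]
    # restricted to the attribute subset `dims`.
    P = X[:, dims]
    inside = ((P >= lo) & (P <= hi)).all(axis=1)
    count = int(inside.sum())
    # Expected count if the data were spread uniformly over its
    # bounding box in those attributes.
    span = P.max(axis=0) - P.min(axis=0)
    p = float(np.prod((hi - lo) / np.where(span > 0, span, 1.0)))
    p = min(max(p, 0.0), 1.0)
    n = len(X)
    mean, sd = n * p, sqrt(n * p * (1.0 - p)) or 1.0
    # One-sided test: is the unit significantly denser than uniform?
    return count > mean + z * sd
```

Hyper-rectangles passing such a test, together with their attribute subsets, would be the candidates later merged by single linkage.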
Keywords/Search Tags:Heuristic clustering, Multi space, Backbone, Approximate backbone, Local significant unit