A Novel Probability-based Clustering Method

Posted on: 2015-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: J G Qiao
GTID: 2268330431451851
Subject: Computer software and theory
Abstract/Summary:
Cluster analysis divides data into groups that are meaningful and easy to interpret. It plays a significant role in many fields, such as image processing, medical information processing, computer vision, statistical analysis, and the biological and psychological sciences.

Many algorithms have been proposed for different clustering problems. For example, k-means, k-means++, Mean Shift, CURE and PROCLUS are classical clustering algorithms, each proposed for certain applications. In this thesis, we study the advantages and disadvantages of these algorithms, briefly introduce their ideas, and propose a novel probability-based clustering method called k-normal, together with a second method aimed at high-dimensional clustering.

We begin with the background and significance of cluster analysis, followed by the development and applications of clustering. We then introduce related concepts, such as the definition of clustering, proximity measures and the classification of clustering methods, followed by five common clustering methods: k-means, Murat’s method, k-means++, Mean Shift and CURE.

K-means is a classical clustering method. Owing to its efficiency and ease of implementation, it is widely applied in many areas. However, it has three defects. First, the number of clusters must be determined beforehand. Second, the choice of initial cluster centers strongly influences the final clustering result. Third, it only finds clusters of similar size; in other words, k-means may assign many data points that actually belong to a larger cluster to a smaller one. We propose a novel method for initializing cluster centers to overcome the second defect, and we use a probability-based model to deal with the third.
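
To illustrate why the choice of initial centers matters, here is a minimal Python sketch of k-means++ style seeding, one of the algorithms reviewed in this abstract. It is not the initialization method proposed in the thesis, whose details are not given here. The function names `squared_distance` and `kmeanspp_seeds` are hypothetical, chosen only for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers by k-means++ style D^2 sampling.

    Illustrative only: this is the well-known k-means++ seeding idea,
    not the initialization scheme proposed in the thesis.
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            nxt = rng.choice(points)
        else:
            # Points far from all current centers are more likely to be picked.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(old, squared_distance(p, nxt)) for p, old in zip(points, d2)]
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
    print(kmeanspp_seeds(data, 2, random.Random(42)))
```

Because each new center is sampled with probability proportional to its squared distance from the centers already chosen, the seeds tend to spread across well-separated groups rather than land in the same one.
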
The newly proposed method retains the efficiency of k-means while improving clustering accuracy.

Many traditional clustering methods can be applied to clustering problems with low-dimensional data. However, because of the complexity of real data sets, these methods often do not work well on high-dimensional or large data sets. The study of high-dimensional clustering remains an important area of cluster analysis and a challenging problem in clustering research. When traditional clustering methods are applied to a high-dimensional data set, two problems arise. First, in most cases the high-dimensional space contains many irrelevant dimensions. Second, because data are sparsely distributed in high-dimensional space, the distances between pairs of data points become similar to one another. Reducing the number of dimensions, or clustering in subspaces, can solve the first problem effectively. Other (non-distance) proximity measures may be employed to deal with the second. When cluster analysis is applied to large or high-dimensional data sets, two issues need to be considered: on the one hand, improving accuracy is the goal of all clustering methods; on the other hand, the speed of the clustering algorithm should be acceptable. In this thesis, we propose a subspace-based clustering method with a probability-based model. By improving the k-normal method, we have made progress on clustering high-dimensional data sets.
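
The second problem above, that pairwise distances become similar in high-dimensional space, can be seen in a small experiment. The sketch below assumes points drawn uniformly from the unit hypercube and uses a "relative contrast" measure, (farthest - nearest) / nearest. Both are choices made for this illustration and are not taken from the thesis. The function name `relative_contrast` is hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for distances from one random
    query point to n_points random points in the unit hypercube [0, 1]^dim.

    Assumption for illustration: data are uniform and independent per
    dimension, which is the setting where concentration is most visible.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # Plain Euclidean distance between the query and the sampled point.
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point))))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    for dim in (2, 10, 100, 500):
        print(dim, round(relative_contrast(dim), 3))
```

As the dimension grows, the printed contrast shrinks: the nearest and farthest points end up at nearly the same distance from the query. This is why distance-based methods such as k-means lose discriminating power in this setting.
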
Keywords/Search Tags: cluster analysis, probability-based model, high-dimensional clustering, subspace clustering