Research on Key Technologies for Clustering in High-Dimensional Data Space

Posted on: 2006-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y L He
Full Text: PDF
GTID: 2168360152475183
Subject: Computer applications
Abstract/Summary:
With the widespread use of information technology, the volume of data generated by various information systems keeps growing, and more efficient data mining tools are needed to find valuable knowledge patterns. Cluster analysis is an important method in data mining. It is a discovery process that groups a set of data so that intra-cluster similarity is maximized and inter-cluster similarity is minimized. Clustering data in a high-dimensional space is of great interest in many data mining applications. In high-dimensional data sets, finding the latent, natural clusters is more difficult and remains a problem to be solved.

This dissertation studies key technologies for clustering in high-dimensional data space. It focuses on efficient clustering algorithms, outlier detection algorithms, and methods for presenting clustering results, among others. Defining similarity for high-dimensional data objects is the foundation of this work, and the key technologies are studied on the basis of an improved similarity definition. The main innovative contributions are as follows:

1. To address the distribution properties of high-dimensional data sets, a projected clustering method is proposed for subspace clusters. Projected clustering uses the definition of subspace clustering to find natural clusters in local regions of the full-dimensional space. Bernoulli distributions are used to model the properties of binary data sets, and a projected clustering algorithm is proposed for binary data sets with a large number of attributes, based on finite mixtures of Bernoulli distributions and the EM algorithm. The algorithm finds a series of clusters in subspaces together with suitable attribute subsets, and thus achieves the goal of clustering in various subspaces.

2. An outlier detection method for high-dimensional data space is derived from the projected clustering algorithm. Detecting outliers is important in many data mining applications. The new projected outlier detection algorithm builds on the idea of subspace clustering. First, subspaces containing relatively dense units are found using a projected clustering method; this clustering step can be sped up if the original data can be preprocessed into binary form. Second, the degree of dispersion of each attribute in the subspace is computed from the definition of attribute entropy. Third, the attribute sets with a higher degree of dispersion are identified, and outliers are detected on the basis of these attribute sets.

3. A method for presenting clustering results in high-dimensional data space is proposed, based on rough set theory. Because the internal structure of a data set is unknown before clustering, the clusters must be presented properly so that users can fully understand the result and complete the knowledge discovery task. The presentation and explanation of clustering results therefore play an important role in clustering technology. Building on a study of rough sets, rough set theory is applied to the attribute space, and a new presentation method is proposed that takes into account the different properties of the object space and the attribute space of high-dimensional data sets. The method provides relatively comprehensive information about the clustering result from both the object space and the attribute space, expresses the clustering knowledge as rules, and helps users capture more useful patterns and grasp the internal structure of the data.
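The abstract gives no algorithmic detail for the first contribution, so the following is only a minimal sketch of the general technique it names: fitting a finite mixture of Bernoulli distributions to binary data with EM. It is not the dissertation's algorithm. The names `fit_bernoulli_mixture` and `relevant_attributes`, the random initialization, the iteration count, and the 0.9 threshold are all assumptions introduced for illustration. In particular, treating attributes whose estimated Bernoulli parameter lies near 0 or 1 as a cluster's attribute subset is one plausible reading of "suitable attribute subsets", not a rule stated in the source.

```python
import math
import random


def fit_bernoulli_mixture(data, n_clusters, n_iter=50, seed=0, eps=1e-6):
    """Fit a finite mixture of Bernoulli distributions to binary rows with EM.

    Returns (weights, probs, resp): mixing weights, per-cluster attribute
    probabilities, and the responsibility of each cluster for each row.
    """
    rng = random.Random(seed)
    n_rows, n_attrs = len(data), len(data[0])
    weights = [1.0 / n_clusters] * n_clusters
    # Random start away from 0 and 1 so the logarithms stay finite.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_attrs)]
             for _ in range(n_clusters)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each row.
        resp = []
        for row in data:
            log_post = []
            for k in range(n_clusters):
                ll = math.log(weights[k])
                for x, p in zip(row, probs[k]):
                    ll += math.log(p) if x else math.log(1.0 - p)
                log_post.append(ll)
            top = max(log_post)  # subtract the max for numerical stability
            unnorm = [math.exp(v - top) for v in log_post]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate mixing weights and Bernoulli parameters.
        for k in range(n_clusters):
            mass = sum(r[k] for r in resp)
            weights[k] = max(mass / n_rows, eps)
            for j in range(n_attrs):
                hits = sum(r[k] * row[j] for r, row in zip(resp, data))
                p = hits / mass if mass > 0 else 0.5
                # Clip so that log(p) and log(1 - p) are always defined.
                probs[k][j] = min(max(p, eps), 1.0 - eps)
    return weights, probs, resp


def relevant_attributes(cluster_probs, threshold=0.9):
    """Hypothetical helper: attributes whose parameter is close to 0 or 1.

    Such attributes take a nearly constant value inside the cluster, so
    they can be read as the cluster's subspace (an illustrative assumption).
    """
    return [j for j, p in enumerate(cluster_probs)
            if p >= threshold or p <= 1.0 - threshold]
```

Each row of `data` is a list of 0/1 values. A hard cluster label for a row can be taken as the index of its largest responsibility in `resp`, and calling `relevant_attributes(probs[k])` lists the attributes that are nearly constant within cluster `k`. EM converges only to a local optimum, so in practice several seeds would normally be tried.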
Keywords/Search Tags:Data mining, High-dimensional clustering, Projected clustering, Outlier detecting, Presentation of clustering result, Rough set