
Theoretical Analysis And Effective Algorithms Of Cluster Learning

Posted on: 2013-04-06  Degree: Doctor  Type: Dissertation
Country: China  Candidate: L Bai  Full Text: PDF
GTID: 1228330374992479  Subject: Computer application technology
Abstract/Summary:
Cluster learning is a branch of machine learning with extensive applications in many domains, including financial fraud detection, medical diagnosis, image processing, information retrieval, and bioinformatics. A wide variety of clustering models and algorithms have been developed in the literature. With the rapid development of information and sampling technologies, data structures have become increasingly complex, exhibiting diverse attribute types, high dimensionality, large scale, imbalanced distributions, and dynamic characteristics. Since cluster analysis is data-driven, different data characteristics often call for different clustering models and algorithms. As complex data has become the main body of data sources in modern society, finding hidden cluster structures in complex data has become an important research topic in cluster learning and has attracted wide attention.

In this thesis, we study how to build clustering models and design efficient algorithms for complex data. The main contributions are summarized as follows:

(1) We propose a new attribute-weighting clustering algorithm for high-dimensional categorical data. The algorithm introduces a new attribute-weighting method into the clustering process, forming a new clustering optimization problem. The update formulas for the partition matrix, cluster centers, and attribute weights in the iterative process are rigorously derived, guaranteeing that the algorithm converges to a local solution in a finite number of iterations; the convergence of the algorithm is also strictly proved. Experimental results show that the proposed algorithm not only retains the simplicity of Chan et al.'s attribute-weighting algorithm but also overcomes the failure of their weighting scheme on categorical data.

(2) We propose an accelerating mechanism for the fast global k-means clustering algorithm (FGKM).
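The attribute-weighting idea of contribution (1) can be illustrated with a rough sketch. The code below is a hypothetical weighted k-modes loop, not the algorithm derived in the thesis: the weight-update rule (inverse within-cluster mismatch count, exponent 1/(β−1)) and the parameter `beta` are common choices in the attribute-weighting literature and are assumptions here.

```python
import numpy as np

def weighted_kmodes(X, k, beta=2.0, n_iter=20, seed=0):
    """Sketch of attribute-weighted k-modes for categorical data.

    Each attribute j gets a weight w_j; the dissimilarity between an
    object and a cluster mode is the weighted count of mismatches.
    Weights are updated from within-cluster mismatch counts, so noisy
    attributes receive smaller weights. Illustrative only, not the
    thesis's derivation.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    modes = X[rng.choice(n, size=k, replace=False)]  # initial modes
    w = np.full(d, 1.0 / d)                          # uniform weights
    for _ in range(n_iter):
        # Assignment step: weighted mismatch distance to each mode.
        dist = np.array([[np.sum(w ** beta * (x != m)) for m in modes]
                         for x in X])
        labels = dist.argmin(axis=1)
        # Mode update: per-attribute majority category in each cluster.
        for c in range(k):
            members = X[labels == c]
            if len(members) == 0:
                continue
            for j in range(d):
                vals, counts = np.unique(members[:, j], return_counts=True)
                modes[c, j] = vals[counts.argmax()]
        # Weight update: attributes with fewer within-cluster mismatches
        # get larger weights (assumed inverse-dispersion rule, beta > 1).
        D = np.array([np.sum(X[:, j] != modes[labels, j])
                      for j in range(d)]) + 1e-9
        w = (1.0 / D) ** (1.0 / (beta - 1))
        w = w / w.sum()
    return labels, modes, w
```

The mismatch-count dissimilarity is the standard k-modes distance; the weighting merely rescales each attribute's contribution before the assignment step.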
The mechanism uses the local geometrical information of each data object to reduce the number of unnecessary distance calculations, which improves the efficiency of FGKM on large-scale data sets while producing exactly the same clustering result as FGKM. The space and time complexities of the approach are analyzed. Experimental results show that, compared with FGKM and its modified versions, the proposed approach requires less computing time and fewer distance calculations. Notably, the advantage becomes more pronounced when a high-dimensional data set is divided into many clusters.

(3) We present a systematic study of the effect of imbalanced data distributions on the performance of k-means-type algorithms. The causes of the "uniform effect" in the clustering results are analyzed. We show that the fuzzy k-means algorithm is more likely than the hard k-means algorithm to produce clusters of relatively uniform sizes, and that the effect becomes more evident as the fuzzy index m increases. To avoid the "uniform effect", we propose a multi-center algorithm. The algorithm first uses the fast global fuzzy k-means algorithm to obtain several reliable cluster centers and partition the data set into sub-clusters. The number of clusters is then determined from the fuzzy index m and the max-min distances between the selected centers. Finally, a separation measure is proposed to evaluate how well two sub-clusters are separated, and sub-clusters with small separations are merged agglomeratively so that each cluster is modeled by multiple centers rather than a single one. Experimental results show that the proposed algorithm effectively clusters both balanced and imbalanced data.

(4) We present a new clustering framework for dynamic categorical data.
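A rough sketch of the multi-center idea in contribution (3): overcluster the data into sub-clusters, then agglomeratively merge sub-clusters whose separation is small. The plain k-means routine, the separation measure used here (center distance divided by the summed mean radii), and the threshold `sep_threshold` are illustrative stand-ins; the thesis uses the fast global fuzzy k-means algorithm and its own separation measure.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (stand-in for fast global fuzzy k-means)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

def multicenter_cluster(X, n_centers, sep_threshold=2.0):
    """Overcluster, then merge sub-clusters with small separation."""
    labels, centers = kmeans(X, n_centers)
    # Mean radius of each sub-cluster, used in the (assumed) separation measure.
    radii = np.array([
        np.linalg.norm(X[labels == c] - centers[c], axis=1).mean()
        if np.any(labels == c) else 0.0
        for c in range(n_centers)])
    # Union-find over sub-cluster pairs whose separation is below threshold.
    parent = list(range(n_centers))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(n_centers):
        for j in range(i + 1, n_centers):
            sep = np.linalg.norm(centers[i] - centers[j]) / (radii[i] + radii[j] + 1e-9)
            if sep < sep_threshold:
                parent[find(i)] = find(j)
    merged = np.array([find(c) for c in labels])
    # Relabel merged groups to 0..m-1.
    _, final = np.unique(merged, return_inverse=True)
    return final
```

Because a small, well-separated cluster keeps its own center(s) during overclustering, the merge step can recover it even when a single-center k-means run would absorb it into a larger cluster.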
Within this framework, we propose a new method based on within-cluster and between-cluster information to describe the characteristics of clusters. Building on it, a new data-labeling technique is proposed to assign each unlabeled data point to the appropriate cluster; this technique weakens the effect of the cluster sizes in the sliding windows on the labeling results. Furthermore, a validity function is defined to measure the difference between the clustering structures of two sliding windows. Based on this function, we transform the problem of detecting drifting concepts into a quadratic programming problem and use its optimal solution to help users discover the drifting concepts. This makes the drifting-concept detection results independent of the labeling results produced by the data-labeling technique. Experimental results illustrate that, compared with other frameworks, the proposed framework not only detects drifting concepts accurately but also obtains clustering results of better quality.

(5) We present a general validity function for categorical data. Based on it, we theoretically analyze the relations among three widely used cluster validity functions: the k-modes objective function, the category utility function, and the information entropy function. The analysis shows that the category utility function is equivalent to the information entropy function in evaluating clustering results, and that both outperform the k-modes objective function under the given conditions. In addition, we theoretically address the question of whether using only within-cluster information can effectively evaluate clustering results. Finally, we highlight the importance of normalization when applying a validity function to compare the clustering results of different data sets.
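To make the validity functions of contribution (5) concrete, the sketch below computes the information entropy function (lower is better) and the category utility function (higher is better) for a partition of categorical data. The exact normalizations here (weighting each cluster by its size, dividing category utility by the number of clusters) follow common textbook conventions and are assumptions; they may differ from the thesis's definitions.

```python
import numpy as np

def information_entropy(clusters):
    """Size-weighted sum of within-cluster attribute entropies (bits)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        c = np.asarray(c)
        h = 0.0
        for j in range(c.shape[1]):
            _, counts = np.unique(c[:, j], return_counts=True)
            p = counts / len(c)
            h -= np.sum(p * np.log2(p))  # entropy of attribute j in cluster c
        total += len(c) / n * h
    return total

def category_utility(clusters):
    """Expected gain in guessing attribute values given the cluster,
    relative to guessing without it, averaged over clusters."""
    data = np.vstack([np.asarray(c) for c in clusters])
    n, d = data.shape
    # Baseline: sum of squared marginal category probabilities.
    base = 0.0
    for j in range(d):
        _, counts = np.unique(data[:, j], return_counts=True)
        base += np.sum((counts / n) ** 2)
    cu = 0.0
    for c in clusters:
        c = np.asarray(c)
        within = 0.0
        for j in range(d):
            _, counts = np.unique(c[:, j], return_counts=True)
            within += np.sum((counts / len(c)) ** 2)
        cu += len(c) / n * (within - base)
    return cu / len(clusters)
```

On a partition into perfectly pure clusters the entropy is zero and the category utility is maximal; on a partition that ignores the attribute structure, both scores degrade together, which is the kind of agreement the equivalence analysis formalizes.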
Along this line, we provide a theoretical analysis of the bounds of the general validity function in a special case.

The contributions above further enrich the cluster analysis of complex data and provide technical support for studies of biological information data, Web data, and customer transaction data.
Keywords/Search Tags: Machine learning, Unsupervised learning, Cluster analysis, Attribute weighting, Accelerating mechanism, Uniform effect, Drifting-concept detection, Cluster validation