
Initializing the EM Algorithm for Data Clustering and Sub-population Detection

Posted on: 2016-01-10
Degree: Ph.D
Type: Thesis
University: The Ohio State University
Candidate: Hu, Zhengyu
Full Text: PDF
GTID: 2478390017976509
Subject: Statistics
Abstract/Summary:
This thesis presents research on two loosely related topics: initialization of the EM algorithm for model-based data clustering, and sub-population detection.

Clustering is the task of putting objects into groups in such a way that the observations in the same group are more "similar" to each other than to those in other groups. Clustering is a useful tool for summarizing data, and hence can greatly reduce the complexity of a data set. Moreover, data can be compressed using clustering so that they take less space to store.

There is no universally accepted definition of a cluster, so many different clustering algorithms exist. Among these algorithms, model-based clustering using the Gaussian mixture model is built on a sound mathematical foundation and is widely applied in many areas. However, the prevailing method for finding the maximum likelihood estimator (MLE), i.e., the expectation maximization (EM) algorithm, is very sensitive to initialization. As a result, the EM algorithm is very likely to get stuck in a local maximum of the likelihood function, especially when the number of clusters in the data is large. Nevertheless, the problem of initializing the EM algorithm has not been well studied.

A comprehensive review of the literature on the initialization and improvement of the EM algorithm is presented in this thesis. In addition, a novel method for initializing the EM algorithm is proposed. The method uses a new framework that differs substantially from existing methods, which enables it to estimate the number of clusters efficiently and to provide a high-quality initialization for the EM algorithm. Its performance is studied on simulated data sets and compared with that of some existing methods.

The second part of this thesis focuses on sub-population detection. In data clustering, different clusters can be considered to be generated by different sub-populations, which are identified primarily by differences in their centers. However, different sub-populations can have overlapping centers but different dependence structures. The task of identifying sub-populations of this kind, usually known as sub-population detection, is quite different from conventional cluster analysis and requires techniques that take the specific features of this problem into account.

Sub-population detection is a relatively new area in statistics, so there is little existing work on the topic. The tau-path test is the only existing method that directly addresses this problem, so a thorough review of this test is included in this thesis. A new rank-based method for sub-population detection is proposed. Its performance is studied through simulation and compared against other existing tests. An application to the TG-GATEs database is also discussed.
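
The sensitivity of the EM algorithm to initialization, noted in the first part of the abstract, can be illustrated with a minimal sketch. The code below is not the initialization method proposed in the thesis. It is a plain EM fit of a one-dimensional Gaussian mixture, written for illustration only. The function names (`em_gmm_1d`, `mixture_log_likelihood`), the toy data, the starting means, and the variance floor are all assumptions introduced here. The data consist of three well-separated groups. One run starts with a mean near each group, and the other starts with two means inside the first group.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def component_logs(x, means, variances, weights):
    """Log of weight times density for each mixture component at x."""
    return [math.log(w) + log_normal_pdf(x, m, v)
            for m, v, w in zip(means, variances, weights)]


def mixture_log_likelihood(data, means, variances, weights):
    """Total log-likelihood, using log-sum-exp for numerical stability."""
    total = 0.0
    for x in data:
        logs = component_logs(x, means, variances, weights)
        top = max(logs)
        total += top + math.log(sum(math.exp(v - top) for v in logs))
    return total


def em_gmm_1d(data, init_means, n_iter=200, var_floor=1e-2):
    """Plain EM for a 1-D Gaussian mixture (hypothetical helper).

    Starts from the given means with unit variances and equal weights.
    The variance floor is an assumption added to avoid degenerate fits.
    """
    k, n = len(init_means), len(data)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            logs = component_logs(x, means, variances, weights)
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            s = sum(probs)
            resp.append([p / s for p in probs])
        # M-step: weighted updates of means, variances, and weights.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against division by zero
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_floor, var)
            weights[j] = nj / n
    loglik = mixture_log_likelihood(data, means, variances, weights)
    return means, variances, weights, loglik


# Toy data (an assumption): three groups centered at 0, 10, and 20.
offsets = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
data = [c + d for c in (0.0, 10.0, 20.0) for d in offsets]

# One starting mean near each group.
good_means, _, _, good_ll = em_gmm_1d(data, [1.0, 9.0, 21.0])
# Two starting means inside the first group, one between the other two.
bad_means, _, _, bad_ll = em_gmm_1d(data, [-1.0, 1.0, 15.0])

print("good start:", sorted(good_means), good_ll)
print("bad start: ", sorted(bad_means), bad_ll)
```

In this toy setting the second start is expected to leave one component spread over two groups while the other two components share a single group, which gives a lower log-likelihood than the first start. A common generic remedy is to run EM from several starting points and keep the fit with the highest log-likelihood.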
Keywords/Search Tags:EM algorithm, Data, Sub-population detection, Clustering, Method, Existing, Initializing, Initialization