
Research on an Automatic Identification Algorithm for Flow Cytometry Cell Populations Based on Skew t Mixture Models

Posted on: 2016-11-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X W Wang
Full Text: PDF
GTID: 1108330461996606
Subject: Biomedical engineering
Abstract/Summary:
Objective: Flow cytometry (FCM) is a high-throughput technology that rapidly measures a set of physical and chemical characteristics for a large number of cells in a sample. By staining cells with fluorochrome-bound antibodies and analyzing the optical signals of the cells under laser irradiation, FCM has become widely used in health research and treatment, for example in the diagnosis of cancer cells, monitoring of tumor progression and treatment, diagnosis of HIV infection, cell phenotype analysis, evaluation of peripheral blood hematopoietic stem cell grafts, and vaccine development. However, in current practice the analysis of FCM data still depends on manual analysis, which is subjective, time-consuming, non-reproducible, and error-prone. As FCM technology moves toward more channels and higher throughput, a fast automatic data analysis platform has become an urgent need.

A major component of FCM data analysis is gating, the process of identifying homogeneous groups of cells. In manual analysis, software is used to apply a series of hand-drawn gates that select regions of similar cells in 2D graphical representations. Manual gating is based largely on intuition rather than standardized statistical inference, and it ignores the high dimensionality of FCM data, which may convey information that cannot be displayed in 2D projections. Automatic gating of cells, known in machine learning as unsupervised clustering, has been an active research area for the past several years, and several algorithms have been developed. However, FCM data typically include rare populations with highly asymmetric distributions, which those algorithms cannot analyze accurately. Moreover, most FCM data are multidimensional because of multi-parameter detection.
To analyze multidimensional FCM data, current algorithms rely on data projection or dimension reduction, which can degrade gating quality because some biological information may be lost. Moreover, because their data processing requires manual operation, they are in effect semi-automatic clustering algorithms.

In summary, the present study was undertaken to propose an algorithm that automatically identifies cell populations directly in multidimensional FCM data, especially rare populations with highly asymmetric distributions.

Methods and Contents: To automatically identify cell populations in FCM data, particularly rare populations with highly asymmetric distributions, a series of studies was carried out following the process of algorithm design and experimental verification. The main work of this dissertation is outlined as follows:

(1) To identify rare populations with highly asymmetric distributions, we proposed an algorithm based on skew t mixture models. After studying mixture models, we selected the skew t distribution as the component density. We then defined a skew t density by analyzing the definition of the skew normal distribution and the relationship between the t distribution and the normal distribution. After studying the Expectation-Maximization (EM) algorithm for maximum likelihood (ML) estimation of mixture models, we derived the ML estimates for skew t mixture models and obtained the E-step and M-step expressions. To keep the EM algorithm from converging to local optima, we proposed a parameter initialization algorithm. This algorithm, which is based on the K-means algorithm and the maximum likelihood of mixture models, is designed to make the EM algorithm converge to the global optimum.

(2) To address low computational efficiency and cluster-shape bias, we proposed a hierarchical clustering algorithm based on skew t mixture models.
This algorithm consists of three main processes: estimating the number of clusters, running the EM algorithm for skew t mixture models, and merging redundant clusters. Because the number of initial clusters needs to be kept within a reasonable range, we proposed a histogram-based algorithm that calculates the number of bins via maximum a posteriori estimation and identifies the histogram peaks from the trend of frequency changes between bins. For the redundant clustering results, we defined a similarity criterion that takes into account both the spatial distance between groups and the dispersion of the populations. From the merging results, we obtained the optimal clustering by a two-segment regression algorithm.

(3) Simulation experiments. Because the results of each process affect the computing time of the next, we first simulated 3 groups of data based on FCM data attributes (namely, the number of events, the number of populations, and the number of dimensions) and analyzed them to find the dominant computing process. Based on these results, we simulated another 3 groups of data to find the main factors affecting the time complexity of the algorithm. After this work, we simulated two data sets that imitate real FCM data, to evaluate how well skew t mixture models fit clusters of various shapes and how well the algorithm identifies a rare population with a highly asymmetric distribution. We also compared the results with those of other mixture models and algorithms. Finally, we simulated a data set containing two concave clusters to evaluate the algorithm's performance in identifying irregularly shaped populations.

(4) Evaluation by biological experiments. First, an experiment was conducted to analyze yeast cell activity. By analyzing the FCM data from this experiment, we evaluated the performance of the algorithm on this type of data.
Second, we performed lymphocyte subpopulation experiments. We evaluated the algorithm using two data sets, one from an experiment on relative counts of CD8+ T cells and one from an experiment on relative counts of NK cells and B cells. We also compared the results with those of other probabilistic and non-probabilistic clustering algorithms.

Results: (1) Simulation results: By analyzing three sets of data (30 in total) with different attributes and recording the computing time of the three processes, we found that the EM algorithm for the skew t mixture models accounts for about 97% of the total computing time. We therefore used skew t mixture models to analyze another three sets of data (60 in total) with different attributes, and found that the computing time of the EM algorithm was linear in the number of events and the number of clusters, and quadratic in the number of dimensions. Moreover, for common FCM data (p < 20, g < 20, n < 50000), the computing time was determined mainly by the number of populations and events. Because other probabilistic clustering algorithms rely on criteria to find the number of populations, the results of the two experiments indicate that our algorithm is more efficient. In the model-fitting evaluation, the F-measure of the skew t mixture model was 0.99234, higher than those of the other three methods (0.98281, 0.97989, and 0.98302). This result shows that skew t mixture models perform well in fitting populations of various shapes in FCM data. When the algorithm was used to analyze the simulated data containing rare populations with highly asymmetric distributions, the F-measure was 0.99899, higher than those of the other methods (0.98002, 0.98395, and 0.99264).
This result verified that the algorithm performs well in identifying rare populations with highly asymmetric distributions. The analysis of the simulated data containing concave populations showed that the algorithm can identify irregularly shaped clusters.

(2) Biological experimental results: In all experiments, we analyzed the data directly. In the yeast cell activity experiment, the F-measure of our algorithm was 0.91637, higher than those of the other four methods (0.78126, 0.81928, 0.89472, and 0.76438). These results verified the effectiveness of our algorithm on this kind of data. In the experiment on relative counts of CD8+ T lymphocyte subsets, the F-measure of our algorithm was 0.95642, higher than those of the other four methods (0.78453, 0.88642, 0.89013, and 0.89691). In the experiment on relative counts of B cells and NK cells, the F-measure of our algorithm was 0.95807, higher than those of the other four methods (0.80149, 0.90826, 0.92682, and 0.93041). The results of these two experiments verified the effectiveness of our algorithm on lymphocyte subset data. Together, the three experiments verified that the algorithm can analyze multidimensional FCM data directly.

Conclusion: Compared with other probabilistic soft clustering algorithms, the proposed algorithm performs better in identifying concave and irregularly shaped populations, and its analysis time is far shorter. Moreover, compared with non-probabilistic hard clustering algorithms, the proposed algorithm not only identifies rare populations with highly asymmetric distributions but also analyzes high-dimensional data directly. Therefore, in terms of efficiency and accuracy, our approach compares favorably with current state-of-the-art automated gating algorithms.
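
The Results above compare methods by F-measure, but the abstract does not say which definition was used. As an illustration only, the sketch below implements one variant that is common when scoring automated gating against manual gates: each reference population is matched to the cluster that gives it the highest F score (harmonic mean of precision and recall), and these best scores are averaged with weights proportional to population size. The function name `clustering_f_measure` and the toy labels are assumptions made for this example, not taken from the dissertation.

```python
from collections import Counter


def clustering_f_measure(true_labels, pred_labels):
    """Size-weighted best-match F-measure between a reference labelling
    (e.g. manual gates) and a clustering result.

    Assumption: this is one common variant; the abstract does not state
    which F-measure definition the dissertation used.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("label sequences must not be empty")

    class_sizes = Counter(true_labels)      # events per reference population
    cluster_sizes = Counter(pred_labels)    # events per predicted cluster
    overlap = Counter(zip(true_labels, pred_labels))  # joint counts

    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap.get((c, k), 0)
            if n_ck == 0:
                continue  # no shared events, F score is zero
            precision = n_ck / n_k
            recall = n_ck / n_c
            f_score = 2 * precision * recall / (precision + recall)
            best = max(best, f_score)
        # Weight each population's best match by its share of all events.
        total += (n_c / n) * best
    return total


# Toy data: cluster names need not match the reference label names.
reference = [0, 0, 0, 0, 1, 1]
predicted = ["a", "a", "a", "b", "b", "b"]
score = clustering_f_measure(reference, predicted)
```

Because every reference population is matched independently to its best cluster, a result that splits a rare population across several clusters is penalized through recall, while the size weighting means large populations dominate the overall score.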
Keywords/Search Tags: flow cytometry, clustering analysis, mixture models, skew t distribution, EM algorithm