Research On Mixture Model Based Clustering Of Cancer Omics Data

Posted on: 2017-01-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Xiong
Full Text: PDF
GTID: 1224330482488114
Subject: Statistics
Abstract/Summary:
Tumor classification aims to identify the different subtypes of a given kind of tumor. Because tumors are clinically heterogeneous, different subtypes receive different treatments and can have very different prognoses. Identifying the correct tumor subtype therefore has direct implications for treatment and prognosis. However, clinical subtype identification based on pathology is limited to the cellular level and often leads to misdiagnosis. There is thus an urgent need to identify tumor subtypes from more accurate characteristics.

In recent years, the development of high-throughput technologies such as microarrays and next-generation sequencing has made it possible to study cancer comprehensively across the entire genome. Genomic data can characterize tumors more accurately and more comprehensively than pathological parameters. Identifying tumor subtypes from tumor genomic data could therefore provide more information for tumor classification and could guide diagnosis, treatment, and prognosis.

Cluster analysis is a useful exploratory technique for classifying tumor genomic data. It partitions a set of objects or observations into several smaller, more compact groups, such that objects in the same group are more similar to each other than to objects in different groups. Distance-based classic clustering methods such as hierarchical and k-means clustering are easy to use and widely implemented, which has made them very popular in the biological and medical community. Although these classic algorithms have been applied successfully in many areas, their statistical properties are largely unclear, which impedes statistical inference based on them.

In recent years, clustering algorithms based on probability models have offered a principled alternative to heuristic algorithms.
In particular, the model-based approach assumes that the data are generated by a finite mixture of underlying probability distributions, such as multivariate normal distributions. Unlike heuristic clustering algorithms, model-based clustering treats the choice of the number of clusters as a statistical model selection problem. However, clustering high-dimensional genomic data suffers from the so-called “curse of dimensionality”, because too many parameters need to be estimated. It is therefore necessary to reduce the dimensionality of the data to obtain a more compact clustering.

Based on this idea, this work proposes penalizing the common factor loadings in the framework of the mixture of common factor analyzers. An EM (Expectation-Maximization) algorithm for parameter estimation in the proposed penalized mixture of common factor analyzers (PMCFA) is presented, together with the associated R code. We use simulated data and real cancer expression data to illustrate the utility of the proposed method and its advantages over several existing methods in terms of both variable selection and clustering performance. We also analyzed a miRNA (microRNA) sequencing dataset of cervical cancers in detail with PMCFA and, by selecting 16 miRNAs, found two groups with different prognoses. A literature search showed that, among the 16 miRNAs, the function and molecular mechanism of hsa-miR-140-5p in cervical cancer cells remained unclear.

To highlight the significance of variable selection in PMCFA, we conducted molecular, cellular, and animal assays to investigate the potential function and molecular mechanisms of hsa-miR-140 in cervical cancer. The results showed that hsa-miR-140-5p suppressed proliferation, invasion, and metastasis of cervical cancer by directly targeting IGF2BP1 (insulin-like growth factor 2 mRNA-binding protein 1).
These results shed light on the discovery of novel genes or molecules from publicly available cancer genomic datasets, and provide further scientific evidence for targeted cancer treatment.

The dissertation is organized into six chapters. Chapter 1 introduces the background of the Gaussian mixture model (GMM) and the difficulties GMM faces with high-dimensional data, and reviews techniques that aim to overcome the “curse of dimensionality”. Chapter 2 proposes the PMCFA model and presents the associated EM algorithm for parameter estimation. Simulations and real data are used to illustrate the usefulness of the proposed method. The chapter also analyzes a publicly available miRNA sequencing dataset from The Cancer Genome Atlas (TCGA) with PMCFA, finds two groups with different prognoses by selecting 16 miRNAs, and then determines which miRNA is suited for further functional analysis. Chapter 3 presents the materials and methods for the assays. Chapter 4 demonstrates that hsa-miR-140-5p suppresses proliferation, invasion, and metastasis of cervical cancer by directly targeting IGF2BP1. Chapter 5 analyzes a publicly available messenger RNA (mRNA) and miRNA expression dataset of glioblastoma multiforme (GBM) and identifies an integrated mRNA and miRNA expression signature for GBM prognosis. Chapter 6 presents conclusions and an outlook.
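
As a rough illustration of the model-based clustering idea summarized in the abstract (a finite Gaussian mixture fitted by EM, with the number of clusters chosen by a statistical criterion), the following Python sketch fits one-dimensional Gaussian mixtures and compares them by BIC. It is not the PMCFA method and not the author's R code. The function names, the simulated two-group data, the quantile-based initialization, and the use of BIC as the selection criterion are all assumptions made for illustration.

```python
import math
import random


def log_likelihood(x, weights, means, variances):
    """Log-likelihood of data x under a 1-D Gaussian mixture."""
    total = 0.0
    for v in x:
        dens = sum(
            w * math.exp(-0.5 * (v - m) ** 2 / s) / math.sqrt(2 * math.pi * s)
            for w, m, s in zip(weights, means, variances)
        )
        total += math.log(max(dens, 1e-300))  # guard against log(0)
    return total


def fit_gmm_1d(x, k, n_iter=200, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative only)."""
    n = len(x)
    xs = sorted(x)
    # Assumed initialization: means at evenly spaced quantiles of the data.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(x) / n
    variances = [sum((v - grand_mean) ** 2 for v in x) / n] * k
    weights = [1.0 / k] * k
    prev = log_likelihood(x, weights, means, variances)
    cur = prev
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for v in x:
            dens = [
                w * math.exp(-0.5 * (v - m) ** 2 / s) / math.sqrt(2 * math.pi * s)
                for w, m, s in zip(weights, means, variances)
            ]
            norm = max(sum(dens), 1e-300)
            resp.append([d / norm for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, x)) / nj
            var = sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, x)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid collapsing components
        cur = log_likelihood(x, weights, means, variances)
        if cur - prev < tol:
            break
        prev = cur
    return weights, means, variances, cur


def bic(loglik, k, n):
    """BIC for a 1-D mixture with k components: 3k - 1 free parameters."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)


def select_k_by_bic(x, k_max):
    """Fit mixtures with 1..k_max components; return (best k, BIC per k)."""
    bics = {}
    for k in range(1, k_max + 1):
        loglik = fit_gmm_1d(x, k)[3]
        bics[k] = bic(loglik, k, len(x))
    return min(bics, key=bics.get), bics


def simulate_two_groups(n_per_group=150, seed=0):
    """Hypothetical data: two well-separated groups, N(0, 1) and N(10, 1)."""
    rng = random.Random(seed)
    return ([rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            + [rng.gauss(10.0, 1.0) for _ in range(n_per_group)])


if __name__ == "__main__":
    data = simulate_two_groups()
    best_k, bics = select_k_by_bic(data, k_max=3)
    print("BIC by number of components:", bics)
    print("selected number of components:", best_k)
```

In this toy setting every parameter can be estimated comfortably. With thousands of genomic features per sample, the covariance parameters of each component multiply quickly, which is the dimensionality problem that motivates the factor-analytic and penalized approach described in the abstract.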
Keywords/Search Tags: clustering analysis, mixture model, penalized mixture of common factor analyzers, variable selection, cervical cancer, hsa-miR-140-5p, tumor suppression