
Statistical Modeling For Analysis Of Biological High-throughput Data And Its Application

Posted on: 2010-08-04 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: W H Wang | Full Text: PDF
GTID: 1100360302483781 | Subject: Probability theory and mathematical statistics
Abstract/Summary:
With the rapid development of modern biology, it is generally accepted that research at the molecular level is essential for uncovering the essence of biological phenomena and, in particular, for understanding the pathogenesis of human disease. Spurred by advances in high-throughput data collection techniques, such as microarrays [78;115;135], yeast two-hybrid assays [58;130], mass spectrometry [40;54], and chromatin immunoprecipitation [59;109], data on thousands of molecules and their interactions in humans and most model species have become available. This flood of information presents exciting new opportunities for understanding cellular biology and disease. At the same time, high-throughput data are characterized by a number of predictors that far exceeds the number of samples, complex data structure, substantial noise, and uncertain or missing values. These properties cause most traditional statistical tools either to fail or to produce results of limited usefulness, so the great challenge is to develop new statistical models that explore, analyze, and interpret this information effectively and efficiently. In this thesis, we mainly analyze high-throughput data by establishing statistical models in the following aspects:

1. Prediction of functionally homogeneous modules in biological networks with dK random graph models. Many aspects of biological function can be modeled by biological networks, such as protein interaction networks, metabolic networks, and gene coexpression networks. Studying the statistical properties of these networks in turn allows us to infer biological function. More complex statistical network models can potentially describe the networks more accurately, but it is not clear whether such models are better suited to finding biologically meaningful subnetworks.
Recent studies have shown that the degree distribution of the nodes is not an adequate statistic in many molecular networks. In chapter 2, we sought to extend this statistic with 2nd and 3rd order degree correlations and developed a pseudo-likelihood approach to estimate the parameters. The approach was used to analyze the MIPS and BioGRID yeast protein interaction networks and two yeast coexpression networks. We showed that 2nd order degree correlation information gave better predictions of gene interactions in both protein interaction and gene coexpression networks. However, in the biologically important task of predicting functionally homogeneous modules, degree correlation information performs marginally better for the MIPS and BioGRID protein interaction networks, but worse for the gene coexpression networks. Our use of dK models showed that incorporating degree correlations could increase predictive power in some contexts, albeit sometimes marginally, but in all contexts the use of 3rd order degree correlations decreased accuracy. It is possible, however, that other parameter estimation methods, such as maximum likelihood, will show the usefulness of incorporating 2nd and 3rd order degree correlations in predicting functionally homogeneous modules.

2. Recovering associations between protein domains and complex diseases based on a domain-domain interaction network. It is of vital importance to find the genetic variants that underlie human complex diseases and to locate the genes responsible for these diseases. Since proteins are typically composed of several structural domains, it is reasonable to assume that harmful genetic variants may alter the structures of protein domains, affect the functions of proteins, and eventually cause disorders. With this understanding, in chapter 3 we explore the possibility of recovering associations between protein domains and complex diseases with the use of domain-domain interaction networks. We define associations between protein domains and disease families on the basis of associations between nonsynonymous single nucleotide polymorphisms (nsSNPs) and complex diseases, similarities between diseases, and relations between proteins and domains. Based on a domain-domain interaction network, we propose a "guilt-by-proximity" principle that ranks candidate domains according to their average distance to a set of seed domains in the network. We validate the method through large-scale cross-validation experiments on simulated linkage intervals, random controls, and the whole genome, and we evaluate it in terms of the AUC score and the mean rank ratio of disease domains. Results show that the AUC scores can be as high as 77.90% and the mean rank ratios can be as low as 21.82%. We further calculate a genome-wide landscape of associations between domains and disease families and offer a freely accessible web interface for this landscape. The landscape can potentially be used together with existing methods for determining disease genes, thereby providing useful information for localizing the genetic risk factors underlying complex diseases.

3. Verification of functional loci when candidate loci are in strong linkage disequilibrium. In large-scale genetic association studies, multiple markers in strong linkage disequilibrium (LD) within a single genomic region often show very compelling statistical associations with a phenotype of interest. LD, especially strong LD, between variations at neighboring loci not only makes it difficult to discern which markers are associated with the phenotype, but also makes it difficult to distinguish functionally relevant variations from nonfunctional ones. In chapter 4, we used simulation to compare five methods (boosting, lasso, ridge regression, stepwise regression, and single-locus analysis) for identifying the truly functional variations when LD exists among variations at different loci. We found that, with 100 samples, ridge regression performs best under strong LD among 20 loci, whereas boosting outperforms the other methods under degenerated LD among 500 and 1000 loci.
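
As an illustration of the "guilt-by-proximity" ranking described for chapter 3, the following is a minimal sketch and not the thesis implementation. It assumes that "distance" means the shortest-path hop count in an undirected network; the thesis may use a different measure (the keywords mention a diffusion kernel). The network, the domain names D1 to D6, and the function names are all hypothetical.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def rank_by_proximity(graph, seeds, candidates):
    """Rank candidates by mean hop distance to the seeds, closest first."""
    seed_dists = [shortest_path_lengths(graph, s) for s in seeds]
    # Assumption: unreachable nodes get a penalty larger than any real path.
    unreachable = len(graph)

    def mean_distance(candidate):
        total = sum(d.get(candidate, unreachable) for d in seed_dists)
        return total / len(seed_dists)

    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(candidates, key=lambda c: (mean_distance(c), c))


def rank_ratio(ranking, target):
    """Rank of target divided by the number of candidates (lower is better)."""
    return (ranking.index(target) + 1) / len(ranking)


# Toy, made-up domain-domain interaction network (undirected adjacency lists).
toy_network = {
    "D1": ["D2", "D3"],
    "D2": ["D1", "D4"],
    "D3": ["D1", "D4"],
    "D4": ["D2", "D3", "D5"],
    "D5": ["D4", "D6"],
    "D6": ["D5"],
}
seeds = ["D1", "D2"]                  # domains assumed to be disease-associated
candidates = ["D3", "D4", "D5", "D6"]

ranking = rank_by_proximity(toy_network, seeds, candidates)
print(ranking)
print(rank_ratio(ranking, "D4"))
```

Averaging `rank_ratio` over held-out disease domains in a cross-validation loop would give a quantity analogous to the mean rank ratio reported above, under the same assumptions.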
Keywords/Search Tags: Random network model, protein interaction network, gene coexpression network, pseudo likelihood, complex disease, protein domain interaction network, diffusion kernel, haplotype, linkage disequilibrium