Study on Statistical Methods for Metagenomic Data Analysis

Posted on: 2013-11-11  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Q Chang  Full Text: PDF
GTID: 1220330395470278  Subject: Probability theory and mathematical statistics
Abstract/Summary:
Metagenomics, now broadly understood as the analysis of genes obtained directly from the environment, is a revolution in microbiology. It enables the study not only of uncultured microbes, which constitute a high proportion (more than 99%) of the microbial world, but also of the interactions among microorganisms sharing the same environment and of the whole microbial community in situ. With the rapid development of sequencing technology, large amounts of metagenomic sequence data have become available, including tag genes such as 16S rRNA genes and randomly sequenced genes from mixed whole genomes.

Several ongoing large-scale metagenomics projects related to human and marine life, as well as pedology studies, have generated enormous amounts of data, posing a key challenge for efficient analysis as we try to 1) understand microbial organism assemblage under different conditions, 2) compare different communities, and 3) understand how microbial organisms associate with each other and with the environment.

In this dissertation, we briefly introduce metagenomics, including some basic concepts, research objects, and the main problems that call for efficient new methods or computational tools. We concentrate on two problems: community comparison, and the detection of operational taxonomic units (OTUs) whose abundance differs significantly between two specified sample groups.

1. Community Comparison

Beta diversity, the assessment of differences between communities, is an important problem in many research fields, especially in ecological studies. Many statistical methods have been developed to quantify beta diversity; we review some of them in Chapter 2.

Among these methods, UniFrac and weighted UniFrac (W-UniFrac) have been widely used in recent years. Based on a phylogenetic tree, UniFrac measures the distance between two communities as the fraction of the tree's branch length that leads to descendants from one community or the other, but not from both.
It is meant to capture the total amount of evolutionary history that is unique to each community, presumably reflecting adaptation to one environment but not to the other. W-UniFrac takes abundance information into account by weighting each branch length by the difference between the fractions of sequences belonging to that branch in the two communities. However, W-UniFrac does not consider the variation of the weights under random sampling, which results in lower power to detect differences between communities.

Consider the i-th branch of the phylogenetic tree. We show that A_i, the number of individuals in community A that belong to the i-th branch, follows a hypergeometric distribution with parameters (m_i, m, A_T) under the null hypothesis that the labels of the individuals are randomly distributed on the phylogenetic tree, where m_i = A_i + B_i is the total number of individuals belonging to the i-th branch and m = A_T + B_T is the total number of individuals in communities A and B. From this we derive a variance adjusted weight (VAW) for the length of the i-th branch, and we standardize the resulting statistic so that its value lies between 0 and 1; the result is the VAW-UniFrac distance. Both simulations and applications to real data show that VAW-UniFrac can satisfactorily measure differences between communities, considering not only species composition but also abundance information.

2. Detecting Operational Taxonomic Units That Have Significantly Different Abundance Between Two Specified Sample Groups

One important problem in microbiome data analysis is to identify the bacterial taxa that are differentially abundant (DA) between different environmental or biological conditions. Methods for identifying DA taxa are very limited. The usual approach applies the two-sample t-test or the Wilcoxon rank sum test to the mean difference of a given operational taxonomic unit (OTU) between two conditions.
Because some taxa are very rare, Fisher’s exact test can alternatively be applied to test for presence or absence of the taxon. White et al. proposed combining Fisher’s exact test and the t-test by first classifying the taxa into rare and common groups using an arbitrary cutoff. These existing methods test each taxon separately, without considering the constraint that the taxon compositions sum to one.

Identifying differentially abundant taxa resembles differential expression analysis in gene expression studies. However, the characteristics of the data are very different and therefore require new statistical methods. First, the counts of a given OTU can vary greatly among samples, and most OTUs appear in only a very small fraction of the samples, which leads to a count table with many zeros. Second, the data are not independent column-wise, because the sum of the counts in each column, which represents the total OTU count of an individual sample, is fixed in advance and determined by the sequencing process and depth. Third, the data are often zero-truncated, since many rare OTUs may never be observed in the samples because of limited sequencing depth. In addition, since different samples have different total OTU counts, the data in each row are not on the same scale and therefore cannot be compared directly.

In this dissertation, we propose an empirical Bayesian method for detecting differentially abundant OTUs between two conditions. To account for over-dispersion and the large number of rare OTUs, we model the observed OTU counts with a Beta-Beta-Binomial model. To deal with truncation in the observed OTU counts, we use the truncated Beta-Beta-Binomial model in the empirical Bayes calculation.
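As a hedged illustration of why a binomial model is too narrow for such counts, the following Python sketch compares a binomial with a plain beta-binomial that has the same mean, and shows how zero truncation renormalizes the probabilities. It covers only this standard building block, not the Beta-Beta-Binomial model of the dissertation, whose details are not given in this abstract. The function names and the values of n, alpha and beta are assumptions chosen for the example.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(X = k) when X | p ~ Binomial(n, p) and p ~ Beta(alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


def zero_truncated_pmf(k, n, alpha, beta):
    """P(X = k | X > 0): the distribution of counts that were actually observed."""
    if k < 1:
        return 0.0
    p_zero = beta_binomial_pmf(0, n, alpha, beta)
    return beta_binomial_pmf(k, n, alpha, beta) / (1.0 - p_zero)


def mean_and_variance(pmf_values):
    """Mean and variance of a distribution on 0, 1, ..., len(pmf_values) - 1."""
    mean = sum(k * p for k, p in enumerate(pmf_values))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf_values))
    return mean, variance


# Illustrative values (assumptions): sequencing depth n and a rare OTU
# whose expected proportion is alpha / (alpha + beta) = 0.05.
n, alpha, beta = 50, 0.5, 9.5
p = alpha / (alpha + beta)

bb_pmf = [beta_binomial_pmf(k, n, alpha, beta) for k in range(n + 1)]
bb_mean, bb_variance = mean_and_variance(bb_pmf)
binomial_variance = n * p * (1 - p)  # variance of Binomial(n, p) at the same mean

truncated_pmf = [zero_truncated_pmf(k, n, alpha, beta) for k in range(n + 1)]

print(f"mean count: {bb_mean:.3f}")
print(f"binomial variance: {binomial_variance:.3f}")
print(f"beta-binomial variance: {bb_variance:.3f}")
print(f"P(count = 0): {bb_pmf[0]:.3f}")
print(f"sum of zero-truncated pmf: {sum(truncated_pmf):.3f}")
```

With these values the beta-binomial variance exceeds the binomial variance at the same mean, which is the over-dispersion referred to above; the zero-truncated probabilities are the original ones rescaled by 1 / (1 - P(count = 0)).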
Extensive simulations show that the new method achieves better power and false discovery rate (FDR) control than naive applications of the two-sample t-test, the Wilcoxon rank sum test, or Fisher’s exact test. We demonstrate the method on a throat microbiome data set of smokers and non-smokers and obtain some biologically meaningful results.

The outline of the dissertation is as follows.

In Chapter 1, we briefly introduce metagenomics, explain some basic concepts, in particular how operational taxonomic units (OTUs) are defined, and list the main research areas in metagenomics.

In Chapter 2, we focus on community comparison. Dividing current comparison methods into two groups, "OTU-based" and "phylogeny-based", we review several classic methods for community comparison, examine UniFrac and weighted UniFrac in more depth, and develop a new statistic, variance adjusted weighted UniFrac (VAW-UniFrac), to compare two communities based on the phylogenetic relationships of the individuals. To test the power of VAW-UniFrac, we first ran a series of simulations, in which it always outperformed W-UniFrac and also outperformed UniFrac when the individuals were not uniformly distributed. Next, all three methods were applied to three large 16S rRNA gene sequence collections (human skin bacteria, mouse gut microbial communities, and microbial communities from hypersaline soil and sediments) and to tropical forest census data. Both simulations and applications to real data show that VAW-UniFrac can satisfactorily measure differences between communities, considering not only species composition but also abundance information.

In Chapter 3, we focus on detecting OTUs that have significantly different abundance between two sample groups. We propose an empirical Bayesian method for detecting differentially abundant OTUs between two conditions.
To account for over-dispersion, the large number of rare OTUs, and truncation in the observed OTU counts, we use a truncated Beta-Beta-Binomial model in the empirical Bayes calculation. Extensive simulations show that the new method achieves better power and false discovery rate control than naive applications of the two-sample t-test, the Wilcoxon rank sum test, or Fisher’s exact test, and it yields some biologically meaningful results in real data analysis.
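The community-comparison quantities described in the abstract can be sketched on a toy example. The Python code below is a minimal illustration under stated assumptions: the four-branch tree, its branch lengths and counts, and all function names are hypothetical. It computes UniFrac, the unnormalized W-UniFrac sum, and the variance of the per-branch weight under the hypergeometric null. How the dissertation turns that variance into its VAW weight is not stated in the abstract, so the sketch stops at the variance.

```python
# Hypothetical toy tree, one record per branch:
# (branch length b_i, A_i, B_i), where A_i and B_i count the individuals of
# communities A and B that descend from the branch.
branches = [
    (1.0, 4, 0),  # leaf branch seen only in community A
    (0.5, 0, 3),  # leaf branch seen only in community B
    (1.5, 4, 3),  # internal branch above the two leaves
    (2.0, 2, 2),  # leaf branch shared by both communities
]
A_T, B_T = 6, 5   # total individuals in communities A and B
m = A_T + B_T


def unifrac(branches):
    """Fraction of branch length leading to descendants of exactly one community."""
    unique = sum(length for length, a_i, b_i in branches if (a_i > 0) != (b_i > 0))
    total = sum(length for length, a_i, b_i in branches if a_i > 0 or b_i > 0)
    return unique / total


def raw_weighted_unifrac(branches, a_total, b_total):
    """Branch lengths weighted by the difference in sequence fractions (unnormalized)."""
    return sum(length * abs(a_i / a_total - b_i / b_total)
               for length, a_i, b_i in branches)


def null_variance_of_count(m_i, m, a_total):
    """Var(A_i) under the hypergeometric null: a_total of the m individuals carry
    label A, and m_i of the m individuals sit on branch i."""
    fraction = m_i / m
    return a_total * fraction * (1 - fraction) * (m - a_total) / (m - 1)


def null_variance_of_difference(m_i, m, a_total, b_total):
    """Var(A_i/A_T - B_i/B_T) under the same null, using B_i = m_i - A_i,
    so that the difference equals A_i * m / (A_T * B_T) minus a constant."""
    scale = m / (a_total * b_total)
    return scale ** 2 * null_variance_of_count(m_i, m, a_total)


print("UniFrac:", unifrac(branches))
print("raw W-UniFrac:", raw_weighted_unifrac(branches, A_T, B_T))
for length, a_i, b_i in branches:
    m_i = a_i + b_i
    print("m_i =", m_i, "null variance of weight:",
          null_variance_of_difference(m_i, m, A_T, B_T))
```

Under this null the variance of the weight simplifies to m_i (m - m_i) / (A_T B_T (m - 1)), so it depends on the branch only through m_i (m - m_i); this is the sampling variation that W-UniFrac leaves unaccounted for.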
Keywords/Search Tags: Metagenomics, UniFrac, VAW-UniFrac, Microbial community comparison, Empirical Bayes, Operational taxonomic units (OTUs), Differentially abundant, Phylogeny