Design and Implementation of a Clustering Algorithm for Operational Taxonomic Units (OTUs) of the Microbial 16S rRNA Gene

Posted on: 2017-04-09
Degree: Master
Type: Thesis
Country: China
Candidate: F L Deng
GTID: 2310330512956566
Subject: Animal breeding and genetics and breeding
Abstract/Summary:
Owing to rapid advances in next-generation sequencing technology, high-throughput sequencing of the microbial 16S rRNA gene has been used extensively in both human health and animal husbandry research. Although this method largely overcomes the shortcomings of traditional technologies, analyzing the resulting large-scale sequencing data effectively remains challenging. Clustering short sequences into operational taxonomic units (OTUs) is the most critical step in the prevalent bioinformatics pipelines, and it has already been addressed by several tools, such as Mothur and UPARSE. However, the accuracy and reliability of OTU clustering still leave considerable room for improvement. In the present study, we therefore designed an improved algorithm for OTU clustering of 16S rRNA gene sequences that performs tag annotation before clustering, and implemented it as a piece of software, bioOTU, written in C and Python. The main results are listed below:

(1) Algorithm design of bioOTU: The first step of the clustering algorithm is to pool all samples together and remove redundancy among the clean tags (tags that have already passed quality control) to obtain the unique tags. At the same time, we record for each unique tag its "abundance" (the total number of copies across all samples) and its "sample size" (the number of samples in which the unique tag is observed). All unique tags are searched for homology against a reference database and then annotated at the genus level by a Bayesian algorithm. After this step, the unique tags fall into two classes: annotated tags and unannotated tags. Subsequently, the tags annotated to the same genus are subjected to pairwise alignment to calculate distances (both the k-mer distance and the genetic distance), and OTU clustering is then initiated according to a threshold value specified by the user (such as 0.03).
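The dereplication and k-mer distance steps described above can be sketched as follows. This is a minimal illustration and not the bioOTU source code: the input layout (a dictionary mapping sample names to lists of tag sequences), the function names, the choice of k = 4, and the particular k-mer distance formula (one minus the fraction of shared k-mers relative to the shorter sequence) are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def dereplicate(samples):
    """Pool tags from all samples and collapse identical sequences.

    `samples` maps a sample name to a list of tag sequences
    (hypothetical input layout, assumed for this sketch).
    Returns {sequence: (abundance, sample_size)}.
    """
    abundance = defaultdict(int)   # total copies across all samples
    seen_in = defaultdict(set)     # samples in which the tag occurs
    for sample_name, tags in samples.items():
        for seq in tags:
            abundance[seq] += 1
            seen_in[seq].add(sample_name)
    return {seq: (abundance[seq], len(seen_in[seq])) for seq in abundance}


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=4):
    """Alignment-free distance in [0, 1].

    Assumed definition: 1 - (shared k-mers / k-mers in the shorter
    sequence). Other definitions exist; the thesis abstract does not
    specify which one bioOTU uses.
    """
    n_kmers = min(len(a), len(b)) - k + 1
    if n_kmers <= 0:
        raise ValueError("sequences must be at least k characters long")
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / n_kmers


if __name__ == "__main__":
    toy_samples = {"s1": ["ACGT", "ACGT", "TTTT"], "s2": ["ACGT"]}
    for seq, (abund, n_samples) in dereplicate(toy_samples).items():
        print(seq, abund, n_samples)
```

In such a scheme, a cheap k-mer distance would typically serve as a pre-filter so that the more expensive alignment-based genetic distance is only computed for pairs that are plausibly within the cutoff.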
For each unannotated tag, we then search for its nearest neighbor among the already clustered tags and decide whether it can be appended to that neighbor's OTU by comparing their pairwise distance with the user-defined cutoff. All pending tags that are not clustered by this taxonomy-guided method are finally clustered by a de novo algorithm combined with a heuristic search; before this step, potential chimeras are detected and removed with the UCHIME algorithm.

(2) Software implementation of bioOTU: To combine the flexibility of Python with the high efficiency of C, both languages are used to develop bioOTU. The main framework of bioOTU is written in Python, so the software can be used simply by running its scripts. The core processes that require massive computation, such as the calculation of genetic distance, are rewritten in C, which substantially improves the computational efficiency of bioOTU. The abundance and annotation information of the constructed OTUs is output directly after OTU clustering. bioOTU runs on Unix-like systems and its source code is freely available.

(3) Comparative analysis of OTU clustering by bioOTU: The 16S rRNA gene sequencing dataset of a mock community (consisting of 21 known organisms), retrieved from the Human Microbiome Project (HMP), was clustered into OTUs at a distance cutoff of 0.03 by bioOTU, Mothur, and UPARSE. All three tools were run with default parameters. The results showed that bioOTU, Mothur, and UPARSE produced 74, 311, and 28 OTUs and recovered 18, 15, and 18 of the expected species, respectively. The abundances of all OTUs were also calculated and compared with the expected values, and the relative abundances agreed well with expectation.
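The abundance comparison mentioned above can be illustrated with a short sketch. The function names and the toy counts are hypothetical and are not taken from the thesis; the example only shows how observed OTU counts might be converted to relative abundances and compared with expected proportions.

```python
def relative_abundance(otu_counts):
    """Convert raw OTU read counts into fractions that sum to 1."""
    total = sum(otu_counts.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {otu: count / total for otu, count in otu_counts.items()}


def abundance_deviation(observed, expected):
    """Absolute difference between observed and expected relative
    abundance for every OTU present in either table."""
    otus = set(observed) | set(expected)
    return {
        otu: abs(observed.get(otu, 0.0) - expected.get(otu, 0.0))
        for otu in otus
    }


if __name__ == "__main__":
    # Toy numbers for illustration only, not results from the thesis.
    observed = relative_abundance({"OTU_1": 50, "OTU_2": 30, "OTU_3": 20})
    expected = {"OTU_1": 0.4, "OTU_2": 0.4, "OTU_3": 0.2}
    for otu, diff in sorted(abundance_deviation(observed, expected).items()):
        print(otu, round(diff, 3))
```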
A 16S rRNA gene sequencing dataset from a real human stool community was also retrieved from NCBI. On this dataset, bioOTU produced the fewest OTUs (624), compared with Mothur (5,268) and UPARSE (922). A comparison of OTU abundances revealed that the abundance patterns of bioOTU and UPARSE were similar, and that both showed significantly higher abundances than Mothur. Using a gold standard generated by BLAST searches of the sequences against a reference database, the normalized mutual information (NMI) index was calculated for each tool to evaluate the accuracy of OTU clustering. The NMI value of bioOTU (0.914) was lower than that of Mothur (0.922) but higher than that of UPARSE (0.903).
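For readers unfamiliar with NMI, the following sketch shows one common way to compute it from two label assignments over the same set of sequences. The abstract does not state which normalization was used; this example assumes the arithmetic-mean form, NMI = 2 I(U;V) / (H(U) + H(V)), and the function name is hypothetical.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_true, labels_pred):
    """NMI between a reference labelling and a clustering result.

    Assumed normalization: 2 * I(U;V) / (H(U) + H(V)).
    Returns a value in [0, 1]; 1 means the two partitions are
    identical up to a renaming of the clusters.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n == 0:
        raise ValueError("label lists must not be empty")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    # I(U;V) = sum over (t, p) of p(t,p) * log(p(t,p) / (p(t) * p(p)))
    mutual_info = sum(
        (c / n) * log((c * n) / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )
    denom = entropy(count_true) + entropy(count_pred)
    if denom == 0:
        return 1.0  # both partitions consist of a single cluster
    return 2 * mutual_info / denom


if __name__ == "__main__":
    reference = ["genusA", "genusA", "genusB", "genusB"]
    clustering = ["OTU_2", "OTU_2", "OTU_1", "OTU_1"]
    print(normalized_mutual_information(reference, clustering))
```

Because NMI compares partitions rather than cluster names, a tool is not penalized for numbering its OTUs differently from the reference labels.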
Keywords/Search Tags: Microbes, 16S rRNA, OTU clustering, Next-generation sequencing