An Ensemble Approach for 16S rRNA Classification in Metagenomics

Posted on: 2019-09-21
Degree: Master
Type: Thesis
Country: China
Candidate: R Y Shuai
Full Text: PDF
GTID: 2370330566998508
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of next-generation high-throughput sequencing technologies, researchers can obtain large amounts of biological sequencing data quickly and at low cost while sequencing multiple microbial genomes at once. Metagenomics researchers extract all DNA directly from an environmental sample and use high-throughput sequencing to obtain the genetic information of all the microorganisms in it. They can then analyse the distribution and abundance of species in the microbial community, as well as the community's functions. Because 16S rRNA gene sequences are both conserved and universal, 16S rRNA has gradually become a powerful molecular marker for microbial detection, classification, and identification.

This thesis examines existing analysis workflows for metagenomic sequencing data, identifies their deficiencies, and builds a new analytical workflow that improves both classification accuracy and classification efficiency. Two problems motivate the work. First, when whole-genome analysis is applied to microbial communities, the data are so large that only a very small percentage of species can be analysed. Second, once the number of sequencing fragments to be classified reaches a certain order of magnitude, the running time of the algorithm can become a bottleneck. To address these problems, we designed a 16S rRNA sequencing fragment classification algorithm based on ensemble learning and evaluated it experimentally.

To classify 16S rRNA sequencing fragments from metagenomic samples, this thesis proposes using a family of hash functions to extract features from the sequencing reads, which makes sequence clustering more efficient by reducing alignments between dissimilar sequences. The dataset is pre-partitioned into blocks according to the similarity of the sequences' feature vectors, and clustering is performed within each block. This reduces pairwise alignment between dissimilar sequences, greatly reducing unnecessary computation in the clustering process and improving its computational efficiency. For the pre-partitioning step, we select hash features based on the k-mer distribution so that the sequencing data within each block are highly similar.

The work consists mainly of the following parts: obtaining metagenomic 16S rRNA sequencing fragments, extracting feature vectors from the sample dataset, choosing a clustering algorithm, extracting and selecting features from the reference genomes, training models on the reference-genome features, and designing the ensemble algorithm. The experimental results show that the ensemble-learning-based classification of metagenomic 16S rRNA fragments achieves higher classification accuracy when dealing with large amounts of data.
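The pre-partitioning idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the hash family or the blocking rule, so this sketch assumes a MinHash signature over k-mers with banded bucketing, which is one common way to realise such a scheme. All function names and parameter values here (`kmers`, `signature`, `partition`, `candidate_pairs`, `k=8`, `num_hashes=16`, `rows_per_band=4`) are hypothetical and are not taken from the thesis.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k=8):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    # One member of the (assumed) hash family: BLAKE2b salted with the seed.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def signature(seq, k=8, num_hashes=16):
    """MinHash signature: the smallest hash value under each hash function.

    Assumes len(seq) >= k, otherwise there are no k-mers to hash.
    """
    ks = kmers(seq, k)
    return tuple(min(_hash(km, seed) for km in ks) for seed in range(num_hashes))


def partition(seqs, k=8, num_hashes=16, rows_per_band=4):
    """Map each (band index, band values) key to the ids that share it.

    Sequences whose signatures agree on at least one band land in a
    common block and become candidates for alignment.
    """
    blocks = defaultdict(set)
    for sid, seq in seqs.items():
        sig = signature(seq, k, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            blocks[(start, sig[start:start + rows_per_band])].add(sid)
    return blocks


def candidate_pairs(blocks):
    """Collect the id pairs that share at least one block."""
    pairs = set()
    for members in blocks.values():
        ordered = sorted(members)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    reads = {
        "read1": "ACGTACGTTAGCCGATAGCTAGGCTA",
        "read2": "ACGTACGTTAGCCGATAGCTAGGCTA",
        "read3": "T" * 26,
    }
    print(candidate_pairs(partition(reads)))
```

Only the pairs returned by `candidate_pairs` would then be passed to a pairwise aligner or clustering routine, so reads that share no block are never compared. The band width trades recall for speed: wider bands produce fewer, purer blocks but may separate moderately similar reads.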
Keywords/Search Tags: metagenomics, 16S rRNA, ensemble learning, clustering, classification