
Taxonomic Classification And Analysis Based On 16S rRNA Variable Regions

Posted on: 2019-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: S Liu
Full Text: PDF
GTID: 2370330572952117
Subject: Computer application technology
Abstract/Summary:
Microorganisms are ubiquitous: they are found in the intestines, the oral cavity, food, air, and elsewhere. Identifying microbial species is important both for analyzing microorganisms in marine, soil, and atmospheric environments and for improving human health. The 16S rRNA gene is the most commonly used molecular marker for the systematic classification of prokaryotic microorganisms. The objective of identifying microbial species from 16S rRNA sequences is to assign sequences from unknown species to taxa at the domain, phylum, class, order, family, and genus levels. There are two approaches to microbial species identification: one based on homology analysis and one based on machine learning algorithms.

This paper proposes a method for identifying microbial species based on machine learning algorithms. Because 16S rRNA sequences span more than 2,000 genera, the task is a typical classification problem with a very large number of classes. Neural network algorithms are used as the models, and the variable regions of 16S rRNA sequences, extracted with primer pairs, are used to train species identification models that identify species from those variable regions. Because the sequence labels in the trainset No16 dataset and the Greengenes dataset differ in completeness, we propose two schemes: species identification based on data with complete labels and species identification based on data with incomplete labels. Finally, the two schemes are integrated to identify microbial species from the variable regions of 16S rRNA sequences.

In the first scheme, a hierarchical classifier exploits the hierarchical relationship between the labels: a sequence is first classified at the phylum level and then at the genus level. Four models were designed: a multilayer perceptron, a hierarchical multilayer perceptron, a convolutional neural network (CNN), and a hierarchical convolutional neural network. In the second scheme, the training set comes from the Greengenes database, whose labels are incomplete. The data are divided into groups according to the lowest taxonomic level annotated for each sequence, and the classification model is a convolutional neural network. The third scheme integrates the first two using ensemble learning to improve model performance.

The experiments first test the models from the three schemes on real data and compare their performance with existing machine-learning-based tools for microbial identification, namely RDP Classifier and 16S Classifier. Finally, the community composition of each sample is analyzed according to the identification results of the integrated model. In the first scheme, the four species identification models designed in this paper were tested using 10-fold cross-validation, and the hierarchical CNN model performed best. The hierarchical CNN model was then tested on real data; compared with RDP Classifier and 16S Classifier, it achieved the best identification at the genus level. In the second scheme, the proposed model outperforms RDP Classifier and, at the genus level, outperforms 16S Classifier in every variable region except V2. The integrated model outperforms 16S Classifier in all variable regions except V6 and V8; the gain is largest in V3, where its accuracy is up to 33.78%, 51.08%, 42.21%, 46.76%, and 36.92% higher than that of 16S Classifier at the phylum, class, order, family, and genus levels, respectively. Overall, the proposed algorithms identify microbial species better than RDP Classifier and 16S Classifier.
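
The two-stage routing of the first scheme (phylum first, then genus within the predicted phylum) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the multilayer perceptron and CNN models are replaced here by a simple nearest-centroid classifier over 3-mer counts, and the class names, toy sequences, and taxon labels are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from math import sqrt


def kmer_profile(seq, k=3):
    """Count overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(v * b.get(key, 0) for key, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class CentroidClassifier:
    """Stand-in for a neural network: nearest k-mer centroid (assumption)."""

    def fit(self, seqs, labels):
        self.centroids = defaultdict(Counter)
        for seq, label in zip(seqs, labels):
            self.centroids[label].update(kmer_profile(seq))
        return self

    def predict(self, seq):
        profile = kmer_profile(seq)
        return max(self.centroids,
                   key=lambda label: cosine(profile, self.centroids[label]))


class HierarchicalClassifier:
    """One phylum-level model plus one genus-level model per phylum."""

    def fit(self, seqs, phyla, genera):
        self.phylum_model = CentroidClassifier().fit(seqs, phyla)
        # Group the training data by phylum so that each genus-level model
        # only has to separate the genera inside its own phylum.
        by_phylum = defaultdict(lambda: ([], []))
        for seq, phylum, genus in zip(seqs, phyla, genera):
            by_phylum[phylum][0].append(seq)
            by_phylum[phylum][1].append(genus)
        self.genus_models = {
            phylum: CentroidClassifier().fit(s, g)
            for phylum, (s, g) in by_phylum.items()
        }
        return self

    def predict(self, seq):
        phylum = self.phylum_model.predict(seq)            # stage 1
        genus = self.genus_models[phylum].predict(seq)     # stage 2
        return phylum, genus


# Toy training data (made up; real variable regions are far longer).
train_seqs = ["AAAAAAAACCCC", "AAAAAAAAGGGG", "TTTTTTTTCCCC", "TTTTTTTTGGGG"]
train_phyla = ["Firmicutes", "Firmicutes", "Proteobacteria", "Proteobacteria"]
train_genera = ["Bacillus", "Clostridium", "Escherichia", "Salmonella"]

model = HierarchicalClassifier().fit(train_seqs, train_phyla, train_genera)
print(model.predict("AAAAAAAAGGGG"))
print(model.predict("TTTTTTTCCCC"))
```

Restricting the second stage to the genera of the predicted phylum shrinks each genus-level decision from thousands of classes to a much smaller set, at the cost that a phylum-level error cannot be recovered at the genus level.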
Keywords/Search Tags:16S rRNA gene sequence, Microbial identification, Convolutional Neural Networks, Multilayer perceptron, Taxon, Ensemble Learning