
Microbial Species Classification: A Method Based on 16S rRNA Variable Regions and Neural Networks

Posted on: 2021-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: W J Jing
GTID: 2480306050467384
Subject: Computer Science and Technology
Abstract/Summary:
Microorganisms are invisible life forms found almost everywhere in the world, including mountains, deserts, the air and the human body. They can be used to ferment food, treat sewage, and produce fuel, enzymes and other biologically active compounds, so the classification of microorganisms matters for many aspects of human life and health. Because the 16S rRNA gene is present in all prokaryotic microorganisms, its sequence is commonly used as a molecular genetic marker for microbial species classification. Species classification based on the 16S rRNA gene means analyzing a 16S rRNA sequence to obtain the taxonomic rank information (phylum, class, order, family and genus) of an unknown sequence.

This thesis proposes a machine-learning method for microbial species classification that takes the variable region sequences of 16S rRNA as target sequences and uses neural networks as the basic classification model. Primer pairs are used to extract the variable region sequences from the full-length 16S rRNA sequence. A sliding-window k-mer method segments each sequence into words, and the segmented sequence is vectorized by an embedding layer and fed to the neural network as input. Based on the label information available in the data set, two classification schemes are designed. The first is direct classification, in which sequences are classified directly to the genus level. The second is hierarchical classification, in which sequences are first classified to the phylum level, and the sequences assigned to each phylum are then passed to the genus-level classification model for that phylum, which classifies them to the genus level. Under both schemes, LSTM and BiLSTM networks are trained as classifiers and their structures and parameters are optimized, giving four classification models: LSTM, hierarchical LSTM, BiLSTM and hierarchical BiLSTM. The four models are then compared in repeated cross-validation experiments, and the BiLSTM-based model, which gives the best classification results, is selected as the final classification model for each variable region.

To verify the effectiveness of the proposed model, SRA data sets from the NCBI database are used as test data, and BLAST is used to obtain the label information of the sequences in the test data. The model is compared with two existing machine-learning classifiers for microbial species classification, RDP Classifier and 16S Classifier. The results show that the BiLSTM-based model is more accurate than both RDP Classifier and 16S Classifier, and that its running time falls between those of 16S Classifier and RDP Classifier. Taking both classification accuracy and time performance into account, the BiLSTM-based model has an advantage over 16S Classifier and RDP Classifier.

Because classification models built on different variable regions may perform differently, the thesis also analyzes the BiLSTM-based model across variable regions. At the phylum level, the model for the V5 variable region achieves the best classification accuracy; at every level other than phylum, the model for the V2 variable region achieves the highest accuracy. Finally, based on the classification results of the BiLSTM model, the thesis analyzes the flora composition of each sample in the test data set.
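The sliding-window k-mer segmentation and the hierarchical (phylum, then genus) scheme described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the window size `k=3`, the stride, the padding id `0`, and the callable stand-ins for the trained phylum and genus models are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Slide a window of width k over seq and return the k-mers in order.

    k and stride are illustrative defaults, not values from the thesis.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(token_lists: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign each distinct k-mer an integer id, starting at 1.

    Id 0 is reserved here for padding and unseen k-mers (an assumption).
    """
    vocab: Dict[str, int] = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def encode(tokens: Sequence[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Turn k-mers into a fixed-length id list, as an embedding layer expects."""
    ids = [vocab.get(tok, 0) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


# A "model" here is any callable from an id list to a label; in practice it
# would wrap a trained LSTM or BiLSTM classifier.
Model = Callable[[List[int]], str]


def classify_hierarchical(
    ids: List[int], phylum_model: Model, genus_models: Dict[str, Model]
) -> Tuple[str, str]:
    """Predict the phylum first, then route to that phylum's genus model."""
    phylum = phylum_model(ids)
    genus = genus_models[phylum](ids)
    return phylum, genus


if __name__ == "__main__":
    tokens = kmer_tokens("ACGTAC", k=3)
    vocab = build_vocab([tokens])
    ids = encode(tokens, vocab, max_len=6)
    # Toy stand-ins for trained classifiers (hypothetical, fixed outputs).
    toy_phylum: Model = lambda _ids: "Firmicutes"
    toy_genus: Dict[str, Model] = {"Firmicutes": lambda _ids: "Bacillus"}
    print(tokens, ids, classify_hierarchical(ids, toy_phylum, toy_genus))
```

In the direct scheme, a single genus-level model would be applied to `ids` with no routing step. The hierarchical scheme trades one large label space for several smaller ones, with the drawback that a phylum-level error cannot be recovered at the genus level.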
Keywords/Search Tags: 16S rRNA Variable Regions, Microbial Species Classification, Long Short-Term Memory, Bidirectional Long Short-Term Memory