
Research On Classification Of Metagenomic Sequencing Fragment

Posted on: 2019-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: J Ma
Full Text: PDF
GTID: 2370330566499004
Subject: Computer Science and Technology
Abstract/Summary:
With the development of next-generation sequencing technology, large numbers of metagenomic sequencing fragments can be generated quickly and at low cost, which has greatly advanced the study of microbial communities. Classifying these fragments is an important research topic in metagenomics: it is a prerequisite for studying species diversity in a microbial community and is also highly significant for functional analysis of the community. Accurate classification remains difficult because species abundance varies widely within a community, sequencing technology limits fragment length, and the number of available reference genomes is limited. The accuracy of existing metagenomic classification methods still needs improvement; in particular, accuracy drops sharply when the reference database contains no closely related reference genome. How to classify massive metagenomic sequencing data accurately is the main research question of this work.

This thesis focuses on short sequencing fragments. Because existing methods have low accuracy and cannot guarantee stable classification when no similar reference genome is present in the reference database, we study an unsupervised algorithm based on sequence composition characteristics, use a weighted coding region similarity measurement algorithm to search for the most similar reference genome, and build a classification model from that reference genome to determine the classification level and the taxonomy identifier of each metagenomic sequencing fragment.

To obtain higher-purity binning results and reduce the time spent on similarity matching, we studied the application of metagenomic assembly algorithms to metagenomic classification and selected the best assembly algorithm to preprocess the short sequencing fragments. To make the classification method applicable to both even and uneven abundance, we studied how to handle the varying abundance in a metagenome and applied that processing to the classification algorithm. To improve clustering purity, we studied the distribution of similarity between metagenomic sequencing fragments based on sequence composition characteristics, analyzed the results of several clustering algorithms in light of that distribution, and chose the clustering method with the highest purity. To improve the accuracy of querying for similar reference genomes, we analyzed the importance of subsequences for different reference genomes and designed a weighted coding region similarity matching algorithm. Based on the distribution of similarity between pairs of reference genomes, we designed a multi-level classification algorithm based on machine learning, so that sequencing reads without a close reference genome in the reference database can also be classified accurately.

The algorithm consists of five parts: assembly preprocessing, abundance partition of sequencing fragments, spectral clustering, weighted coding region similarity matching, and construction of an SVM-based classification model. Experimental results show that the classification method improves accuracy when species abundance is uneven and when the reference database contains no similar reference genome.
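
The abstract refers to similarity between fragments "based on sequence composition characteristics" without stating which features are used. As a rough illustration only, the sketch below assumes normalized k-mer frequencies (tetranucleotides, k = 4, by default) compared with cosine similarity. The function names `kmer_profile` and `cosine_similarity` are hypothetical, and this is not the thesis's weighted coding region similarity algorithm, whose weighting scheme the abstract does not describe.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=4):
    """Normalized k-mer frequency vector over the ACGT alphabet.

    Assumption for illustration: composition is captured by plain
    k-mer frequencies; k-mers containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]  # fixed order
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical toy fragments, not data from the thesis.
    read_a = "ACGTACGTACGTTTGACCA"
    read_b = "ACGTACGTACGATTGACCA"
    print(cosine_similarity(kmer_profile(read_a), kmer_profile(read_b)))
```

A pairwise similarity matrix built this way is the kind of input a spectral clustering step could take, although the thesis may construct its similarity measure differently.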
Keywords/Search Tags: metagenome, abundance partition, spectral clustering, weighted similarity matching, classification model