Detection Of Microbe Composition And Abundance Using Next-generation Sequencing Data

Posted on: 2022-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: S Wang
GTID: 2504306602494854
Subject: Computer Science and Technology
Abstract/Summary:
A key goal of metagenomic studies is to accurately detect the microbial composition and abundance of a sample, which plays an important role in disease prevention and treatment, especially in precision medicine. Next-generation sequencing (NGS) technology makes it possible to obtain large batches of sequencing samples at low cost, which provides unprecedented opportunities for microbiomics. 16S rDNA is present in nearly all prokaryotic genomes, and the sequences of its highly variable regions are commonly used in microbial taxonomy. However, the high similarity among sequences of different microbes, the diverse forms of alignment, unavoidable sequencing errors, and a fixed species database all pose challenges. Algorithms for discovering microbial composition and abundance continue to emerge, but there is still room to improve their performance on complex samples that contain a mixture of noise. After analyzing the drawbacks of existing methods, we propose a new approach, PGMicroD, for detecting microbial composition and abundance in a sample from NGS data. The main innovations of this thesis are as follows:

(1) To address sequencing disturbance, three factors (sequencing error, alignment form, and an indicator of the highly variable region) are unified into a "read-reference" belonging score, which measures the confidence that a read aligns to a given species reference. We design an experiment to find a belonging-score threshold, and correct the alignment result by removing reads whose belonging score falls below this threshold.

(2) We design an algorithm for identifying microbial composition based on a support vector machine. It extracts alignment features, including quantitative, spatial, and biological genetic characteristics, and then simulates a large batch of samples to train a microbial composition classifier. When a new sample arrives, the classifier judges, one species at a time, whether each species in the microbe database is present in the sample.

(3) We design an algorithm for estimating microbial abundance based on species similarity. Because similar species contribute many sequencing reads to each other, we define a similarity factor between two species and build a similarity matrix over all 16S sequences. From the similarity matrix and the number of sequencing reads aligned to each species, we establish a linear programming model of microbial abundance. The optimal solution of this model gives the abundance of each microbe.

To analyze the range of application of PGMicroD, we design simulation experiments that explore the influence of sequencing depth, read length, and sequencing error on PGMicroD. The performance of PGMicroD is evaluated on both simulated and real samples, and it is compared with five peer methods on the same data. The results demonstrate that the proposed method can be applied on current sequencing platforms and performs well.
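The abstract does not give the exact form of the linear programming model, so the following is only a minimal sketch of how a similarity matrix and per-species read counts could be combined into such a model. Everything in it is an assumption made for illustration: the function name `estimate_abundance`, the column normalization of the similarity matrix, and the choice of an L1 residual as the objective are not taken from the thesis. It uses `scipy.optimize.linprog`.

```python
import numpy as np
from scipy.optimize import linprog


def estimate_abundance(similarity, read_counts):
    """Illustrative abundance estimate; not the PGMicroD formulation.

    similarity  : n x n matrix of hypothetical similarity factors, where
                  entry (i, j) weights how strongly reads from species j
                  also align to the reference of species i.
    read_counts : number of reads aligned to each of the n references.
    """
    S = np.asarray(similarity, dtype=float)
    counts = np.asarray(read_counts, dtype=float)
    n = S.shape[0]

    # Assumption: normalize each column so that column j describes how the
    # reads of species j are spread over all references.
    M = S / S.sum(axis=0, keepdims=True)
    observed = counts / counts.sum()

    # Variables are [a_1..a_n, t_1..t_n]: abundances a and slacks t.
    # Minimizing sum(t) with |M a - observed| <= t gives an L1 fit.
    c = np.concatenate([np.zeros(n), np.ones(n)])
    I = np.eye(n)
    A_ub = np.block([[M, -I], [-M, -I]])
    b_ub = np.concatenate([observed, -observed])

    # Abundances must sum to one; all variables are non-negative.
    A_eq = np.concatenate([np.ones(n), np.zeros(n)]).reshape(1, -1)
    b_eq = [1.0]

    result = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                     bounds=[(0, None)] * (2 * n), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.x[:n]


if __name__ == "__main__":
    # Toy data: three species, the first two moderately similar.
    S = [[1.0, 0.2, 0.0],
         [0.2, 1.0, 0.1],
         [0.0, 0.1, 1.0]]
    print(estimate_abundance(S, [4600, 3200, 2200]))
```

In this sketch the similarity matrix acts as a mixing model: reads that a species "lends" to its close relatives are reassigned when the program is solved, which is the intuition the abstract describes. A real model would need the thesis's own definition of the similarity factor and constraints.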
Keywords/Search Tags: Microbe composition, Microbe abundance, Next-generation sequencing data, Machine learning, Linear programming