Detection Of Microbe Composition And Abundance Using Next-generation Sequencing Data

Posted on:2022-03-25

Degree:Master

Type:Thesis

Country:China

Candidate:S Wang

Full Text:PDF

GTID:2504306602494854

Subject:Computer Science and Technology

Abstract/Summary:

The key goal of metagenomic studies is to accurately detect the microbial composition and abundance in a sample, which plays an important role in disease prevention and treatment, especially in precision medicine. Next-generation sequencing (NGS) technology makes it possible to obtain large batches of sequencing samples at low cost, which provides unprecedented opportunities for microbiomics. 16S rDNA is present across microbial genomes, and its hypervariable regions are commonly used in microbial taxonomy. However, high similarity among the sequences of different microbes, diverse forms of alignment, unavoidable sequencing errors, and a fixed species database pose various challenges. Algorithms for discovering microbial composition and abundance continue to emerge, but there is still room to improve their performance on complex, noisy samples. After analyzing the drawbacks of existing methods, we propose a new approach, PGMicroD, for detecting microbial composition and abundance in a sample from NGS data. The main innovations of this thesis are as follows:

(1) To address sequencing disturbance, three factors (sequencing error, alignment form, and a hypervariable-region indicator) are unified to calculate a "read-reference" belonging score, which measures the confidence that a read aligns to a given species reference. We design an experiment to find a belonging-score threshold and correct the alignment result by removing the reads whose belonging score falls below this threshold.

(2) We design an algorithm for identifying microbial composition based on a support vector machine. It extracts alignment features, including quantitative, spatial, and biological-genetic characteristics, and then simulates a large batch of samples to train a microbial composition classifier. When a new sample arrives, the classifier judges, one species at a time, whether each species in the microbe database is present in the sample.

(3) We design an algorithm for estimating microbial abundance based on species similarity. Because similar species contribute many sequencing reads to each other, we define a similarity factor between two species and build a similarity matrix over all the 16S sequences. From the similarity matrix and the numbers of sequencing reads aligned to each species, a linear programming model of microbial abundance is established. The optimal values of this model are taken as the abundance of each microbe.

To analyze the application range of PGMicroD, we design simulation experiments to explore the influence of sequencing depth, read length, and sequencing error on PGMicroD. The performance of PGMicroD is evaluated on both simulated and real samples, and it is compared with five peer methods on the same data. The results demonstrate that the proposed method can be applied on current sequencing platforms and performs remarkably well.
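The read-correction step in innovation (1) can be illustrated with a minimal Python sketch. It only shows the thresholding itself: alignments with a belonging score below the threshold are dropped, and the surviving reads are counted per species. The abstract does not give the belonging-score formula or the threshold value, so the scores are assumed to be precomputed, and the `Alignment` type, the function names, the species labels, and the threshold of 0.5 are all hypothetical choices for illustration rather than PGMicroD's actual implementation.

```python
from collections import Counter
from typing import NamedTuple


class Alignment(NamedTuple):
    """One read-to-reference alignment (hypothetical record layout)."""
    read_id: str
    species: str
    belonging_score: float  # assumed precomputed; the abstract gives no formula


def filter_alignments(alignments, threshold):
    """Keep only alignments whose belonging score is at or above the threshold."""
    return [a for a in alignments if a.belonging_score >= threshold]


def reads_per_species(alignments):
    """Count the distinct reads that support each species reference."""
    seen = {(a.read_id, a.species) for a in alignments}
    return Counter(species for _, species in seen)


# Made-up toy data: read r1 aligns to two similar references.
alignments = [
    Alignment("r1", "E. coli", 0.92),
    Alignment("r1", "S. enterica", 0.41),
    Alignment("r2", "E. coli", 0.88),
    Alignment("r3", "S. enterica", 0.35),
    Alignment("r4", "B. subtilis", 0.97),
]

THRESHOLD = 0.5  # illustrative value only, not taken from the thesis

kept = filter_alignments(alignments, THRESHOLD)
counts = reads_per_species(kept)
print(sorted(counts.items()))  # [('B. subtilis', 1), ('E. coli', 2)]
```

In this toy input the two low-scoring alignments to the second reference are removed, so that species no longer receives any reads. Per-species counts of this kind are the sort of quantity that the composition classifier and the abundance model described above would take as input.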

Keywords/Search Tags:

Microbe composition, Microbe abundance, Next-generation sequencing data, Machine learning, Linear programming

Related items

1	Research Of Rapid Microbe Detection Computational Methods Based On NGS
2	Study On Relevant Problems Of Biomedical Data Mining Based On Machine Learning
3	High Throughput Sequencing For Clinical Microbe Screening
4	Research On The Prediction Method Of Microbe-drug Association Based On Similarity Information
5	Structural Characterization And Antioxidant Properties Of Exopolysaccharides Produced By 4 Marine Microbe With Different Sources
6	A Study On Disease Prediction Model Based On Small Sample Medical Data And Its Privacy Preserving Technologies
7	Discovery Of Non-Invasive Diagnostic Markers For Cancer Based On CfDNA Sequencing And Machine Learning
8	Research On Genotyping Method Of Third-Generation Sequencing Data Based On Dynamic Programming
9	Microbe-Disease Association Prediction Based On Multi-Data Fusion And Graph Neural Network
10	Effects Of Rotavirus Infection On Intestinal Microbial Composition And Intestinal Barrier Function In Neonatal Mice