Research On Component Identification Algorithm Based On Microbial Sequencing Data

Posted on: 2024-06-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Wang
GTID: 1520306923957819
Subject: Operational Research and Cybernetics
Abstract/Summary:
Rapidly advancing sequencing technologies have generated massive and complex biological data. These data have made it possible for scientists to study biological questions related to human diseases computationally. Numerous studies have shown that human microbes are closely related to the development of human infectious diseases, liver diseases, metabolic diseases, respiratory diseases, psychological diseases, and autoimmune diseases. The strong link between microorganisms and human diseases has motivated computational biologists to study a range of computational problems related to microorganisms. The main topic of this thesis is component identification based on microbial sequencing data. We specifically focus on designing combinatorial optimization algorithms for identifying bacterial transcription components and microbial community strain components.

Identifying bacterial transcription components can help to understand the mechanisms of transcriptional regulation in bacteria and to explore the relationship between the dynamic transcription of bacteria and human diseases. Alternative transcription units (ATUs) are the units of bacterial transcription that are dynamically encoded under different conditions and display overlapping patterns (sharing one or more genes) under a specific condition. Experimental methods can identify complex ATUs, but they are time-consuming, laborious, and expensive, which makes them not universally applicable to bacterial transcription component identification under different conditions. Although current computational methods overcome the applicability limitations of experimental methods, they are still unable to identify ATUs with overlapping patterns. Therefore, the development of computational methods for identifying ATUs is necessary to study the transcriptional regulation of bacteria.

Meanwhile, diverse microbial communities of bacteria, archaea, and viruses have crucial roles in the environment and human health. Understanding microbial community components lays a solid foundation for exploring community functions. The widely available computational methods mainly identify microbial community components at the genus or species level. Methods for identifying microbial community strain components are still in their infancy and have shortcomings in applicability and identification accuracy. Therefore, it is still challenging to develop a computational method that can accurately identify the strain components of microbial communities.

To address the above challenges in component identification research, two new component identification algorithms, SeqATU and metaStrain, were developed in this study to identify bacterial transcription components and microbial community strain components, respectively. The main work of this thesis is as follows:

(1) We developed the first computational method, SeqATU, for the inference of ATUs with dynamic composition and overlapping patterns based on next-generation RNA-Seq data. ATUs are bacterial transcription components with overlapping patterns, which means that different ATUs can share genes under a specific condition. However, reconstructing overlapping ATUs based on the expression levels of genic and intergenic regions is a difficult problem. To this end, SeqATU utilizes a convex quadratic programming model to find the optimal combination of candidate ATUs that fits the expression levels of genic and intergenic regions. The goal of this mathematical programming model is to minimize the squared error, where the error is the gap between the expression level of ATUs and the expression values of the genic and intergenic regions. To handle the non-uniform distribution of RNA-Seq reads along mRNA transcripts, we construct a bias rate function that provides linear constraints for the quadratic program and helps us to better characterize the complex structure of ATUs. The performance of the SeqATU algorithm was evaluated on different datasets. Results showed that ATUs predicted by SeqATU achieve satisfactory performance. In addition, we further evaluated the predicted ATUs by Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses and found that gene pairs frequently encoded in the same ATUs are more functionally related than those that can belong to two distinct ATUs. We expect that the new insights derived by SeqATU will not only improve the understanding of the transcription mechanism of bacteria but also guide the reconstruction of a genome-scale transcriptional regulatory network.

The SeqATU algorithm has the following innovations: 1) It is the first computational method for predicting bacterial ATUs based on next-generation RNA-Seq data; 2) It utilizes a convex quadratic programming model to find the optimal combination of candidate ATUs, which solves the problem of ATUs with overlapping patterns; 3) The bias rate function constructed by SeqATU can effectively handle non-uniform read distribution, providing strong linear constraint information to the model.

(2) We developed a computational method, metaStrain, for identifying microbial community strain components based on metagenomic data. Conspecific strains can differ widely in phenotypes such as virulence and pathogenicity, so identifying the strain components of microbial communities is important in applications such as treating patients and containing outbreaks. However, designing genetic patterns that effectively characterize the differences between strains is a difficult problem in strain component identification. To this end, metaStrain designs genetic patterns based on differences in the order of gene arrangement between strains, and then identifies strain components based on alignments between metagenomic sequencing reads and the designed genetic patterns. The two parts of this method are constructing a genetic pattern database and building a strain component identification framework. To construct the genetic pattern database, metaStrain designs unique patterns and combination-unique patterns based on differences in the order of gene arrangement between strains. Specifically, the design of unique patterns is achieved by solving the minimal unique substrings problem. In the strain component identification framework, the strain components of metagenomic samples are identified by aligning metagenomic reads to the patterns and solving a quadratic programming model to calculate strain abundance. When evaluated on test datasets, metaStrain greatly outperforms existing algorithms at strain component identification. Notably, the runtime of metaStrain is substantially lower than that of current methods, demonstrating the excellent applicability of metaStrain. In addition, 60% of the high-abundance strains found by metaStrain in real colorectal cancer (CRC) tissue samples are reported in published literature to be linked to CRC, supporting the reliability of metaStrain in identifying strains. In summary, metaStrain can perform accurate and rapid strain component identification of metagenomic samples.

The metaStrain algorithm has the following innovations: 1) metaStrain designs unique patterns and combination-unique patterns based on differences in the order of gene arrangement between strains, enabling fast and accurate identification of the strain components of metagenomic samples; 2) Finding unique patterns is formulated as solving the minimal unique substrings problem, which helps to identify the critical unique patterns between strains and avoids storing redundant genetic patterns; 3) metaStrain constructs a quadratic programming model based on the designed combination-unique patterns to identify strains without unique patterns; 4) metaStrain provides a method to determine the abundance of strains using unique and combination-unique patterns and a computational framework to identify the strain components of metagenomic samples.
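The least-squares idea behind the SeqATU model in (1) can be illustrated with a small sketch. It is not the thesis implementation. It assumes a hypothetical 0/1 coverage matrix (rows are genic or intergenic regions, columns are candidate ATUs), keeps only a non-negativity constraint, and omits the bias rate function entirely. The function name `fit_atu_expression` and the toy data are invented for illustration, and projected gradient descent is used here simply as one way to solve this kind of convex quadratic program.

```python
from typing import List


def fit_atu_expression(coverage: List[List[int]],
                       observed: List[float],
                       iterations: int = 5000) -> List[float]:
    """Minimise ||coverage @ x - observed||^2 subject to x >= 0.

    Solved by projected gradient descent; x[j] is the estimated
    expression level of candidate ATU j (hypothetical formulation).
    """
    n_regions, n_atus = len(coverage), len(coverage[0])
    # Safe step size: the gradient's Lipschitz constant is at most
    # twice the squared Frobenius norm of the coverage matrix.
    step = 1.0 / (2.0 * sum(v * v for row in coverage for v in row))
    x = [0.0] * n_atus
    for _ in range(iterations):
        # Gap between the summed ATU expression and the observed value per region.
        residual = [
            sum(coverage[r][j] * x[j] for j in range(n_atus)) - observed[r]
            for r in range(n_regions)
        ]
        gradient = [
            2.0 * sum(coverage[r][j] * residual[r] for r in range(n_regions))
            for j in range(n_atus)
        ]
        # Gradient step followed by projection onto x >= 0.
        x = [max(0.0, x[j] - step * gradient[j]) for j in range(n_atus)]
    return x


# Toy operon: regions are g1, intergenic g1-g2, g2, intergenic g2-g3, g3.
# Candidate ATUs: g1-g2-g3, g1-g2, g2-g3 (the last two overlap on g2).
coverage = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [1, 0, 1],
]
observed = [15.0, 15.0, 18.0, 13.0, 13.0]
levels = fit_atu_expression(coverage, observed)
```

In this toy case the observed values were generated from ATU levels of 10, 5, and 3, so `levels` should come out close to those numbers. Overlapping ATUs are handled naturally because a region's expected expression is the sum over every candidate ATU that covers it.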
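The minimal unique substrings formulation mentioned for metaStrain in (2) can also be sketched on gene orders. This is a brute-force illustration under stated assumptions, not the thesis algorithm: each strain is assumed to be a list of gene identifiers, a "unique pattern" is taken to be a contiguous gene run that occurs in no other strain, and the names `gene_runs`, `minimal_unique_patterns`, and the `max_len` cap are hypothetical.

```python
from typing import Dict, List, Set, Tuple

GeneRun = Tuple[str, ...]


def gene_runs(genes: List[str], length: int) -> Set[GeneRun]:
    """All contiguous runs of `length` genes in one strain's gene order."""
    return {tuple(genes[i:i + length]) for i in range(len(genes) - length + 1)}


def minimal_unique_patterns(strains: Dict[str, List[str]],
                            max_len: int = 4) -> Dict[str, Set[GeneRun]]:
    """For each strain, collect gene runs found in no other strain,
    keeping only runs that contain no shorter unique run."""
    result: Dict[str, Set[GeneRun]] = {name: set() for name in strains}
    for name, genes in strains.items():
        others = [g for n, g in strains.items() if n != name]
        # Shorter lengths first, so minimal patterns are recorded before
        # any longer run that contains them is examined.
        for length in range(1, max_len + 1):
            other_runs = set().union(*(gene_runs(g, length) for g in others))
            for run in gene_runs(genes, length) - other_runs:
                contains_shorter = any(
                    run[i:i + len(p)] == p
                    for p in result[name]
                    for i in range(len(run) - len(p) + 1)
                )
                if not contains_shorter:
                    result[name].add(run)
    return result


# Three toy strains with the same genes in different orders.
strains = {
    "A": ["g1", "g2", "g3", "g4"],
    "B": ["g1", "g3", "g2", "g4"],
    "C": ["g1", "g2", "g4", "g3"],
}
patterns = minimal_unique_patterns(strains)
```

Keeping only minimal runs is what avoids storing redundant patterns: in the toy data, any longer run of strain A that contains `("g2", "g3")` is also unique but adds no information. A production version would need an index such as a suffix array rather than this enumeration, which is only practical for tiny inputs.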
Keywords/Search Tags: Bioinformatics, Microorganisms, Component identification, RNA-Seq, Metagenome, Combinatorial optimization