Prediction Of Cis-Regulatory Motifs And Functional Modules Based On ChIP-Seq And Microarray Data

Posted on: 2020-06-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Li
GTID: 1360330602454659
Subject: Operational Research and Cybernetics
Abstract/Summary:
The development of biological technologies, especially high-throughput sequencing, has made it convenient to obtain massive amounts of biological data. This explosive growth brings not only vast opportunities but also the challenge of mining large-scale data. In response, interdisciplinary fields such as bioinformatics have come into being and play an increasingly important role in the life sciences. Within bioinformatics, theoretical tools from mathematics, statistics and computer science are combined with high-performance computing platforms and databases, making it possible to solve a range of data mining problems on large-scale biological data.

Omics is a product of a particular stage in the development of molecular biology. With the spread of systems theory, researchers no longer analyze a single molecule or a single kind of genetic material in isolation. Instead, the entities that carry out a specific function, together with the relationships among them, are treated as a system, so that valuable information can be mined systematically. Genomics, transcriptomics, proteomics and metabonomics are representative examples, of which genomics is the most widely used and most influential branch.

Although nearly all cells of an organism carry the same genes, they differentiate into different forms and play different roles. This is because not every gene is expressed in every cell, and the switch that controls gene expression is the transcription factor. A transcription factor is a protein that directly regulates gene expression by binding to the promoter, enhancer and silencer regions of genes, and it therefore plays a vital role in the genome. It is thus critical to identify the binding sites (motifs) of transcription factors on DNA, which enables a series of downstream analyses. Although biological experiments are the most credible way to identify transcription
factor binding sites, the binding sites of only a few transcription factors in a minority of model organisms have been validated, because experiments are complicated and costly. Alternatives are therefore needed. The spread of next-generation sequencing has produced a large quantity of sequencing data, and the combination of chromatin immunoprecipitation with next-generation sequencing, i.e. ChIP-seq, can generate abundant potential transcription factor binding sites. Developing computational tools to mine the underlying motifs from ChIP-seq data can therefore overcome the efficiency and cost limitations of traditional experimental methods.

Identifying motifs computationally is still challenging, owing to the short length and degeneracy of motifs, compounded by technical sequencing errors. In addition, because ChIP-seq datasets are large, most traditional motif finding algorithms are not applicable to them. Current motif finding algorithms for ChIP-seq data usually determine motif length by enumeration and, limited by data size, tend to detect short motifs. Designing a highly efficient motif finding algorithm that accurately identifies motif length on large-scale ChIP-seq data is therefore an extremely challenging task.

In this dissertation, in light of the features of ChIP-seq data and the shortcomings of mainstream motif finding algorithms, we design a novel motif finding algorithm named ProSampler, based on statistical tests on k-mers and Gibbs sampling on the set of motif profile matrices. To compare ProSampler with mainstream algorithms, we ran seven motif finding algorithms, including ProSampler, on six simulated datasets as well as 3 x 105 MNChIP-seq datasets. The test results show that ProSampler can not only accurately reconstruct the motif profile
matrices, but also sensitively identify the motif sites and effectively determine the motif lengths. Moreover, the superior performance of ProSampler on large-scale test datasets reflects its robustness. ProSampler has three main innovations:

1) Multiple thresholds are used in the two-proportion z-test to select k-mers at different significance levels. This increases running efficiency, owing to the reduced data size, and lowers the false negative rate, owing to the sensitive capture of subtle motif signals in the sequences. In addition, the two-proportion z-test requires little computation, which saves considerable running time.
2) The Gibbs sampling algorithm uses the set of preliminary motif profile matrices as the sampling pool. Since each preliminary motif profile matrix is constructed from a significant k-mer, the set is limited in size, which allows Gibbs sampling to converge quickly and further increases running efficiency. In addition, the k-mers within a motif profile matrix can be adjusted during iteration, so that the motif profile matrices are reconstructed precisely.
3) The two-proportion z-test is used, for the first time, to determine the motif length. Because no exhaustive enumeration of motif lengths is needed, this method is highly efficient.

With these innovations, ProSampler can handle motif finding on large-scale ChIP-seq datasets, accurately obtaining the motif profile matrices and their site information while precisely determining the motif lengths. ProSampler is implemented in C++, and its source code, together with executables for Windows, Mac OS and Unix, can be downloaded from https://github.com/zhengchangsulab/prosampler.

Transcription factors do not regulate downstream genes in the genome by recognizing and binding to the corresponding
DNA segments in isolation. In fact, most transcription factors participate in gene regulation jointly, through physical interactions that are synergistic or antagonistic, so complex binding patterns exist between transcription factors and DNA. Analyzing these binding patterns contributes to further exploration of the mechanisms of transcriptional regulation, which lays the basis for downstream genomic analyses such as gene expression analysis. In this study, based on 159 MNChIP-seq datasets from human embryonic stem cells, we use ProSampler to analyze the binding patterns between 32 transcription factors and DNA, as well as the causes of these patterns. This study has four main innovations:

1) Four binding patterns between transcription factors and DNA are discovered, i.e. 1-0, 1-1, 0-1 and 0-0.
2) For the 1-0 and 0-1 patterns, we infer that the target transcription factor binds DNA indirectly, through physical interactions with other transcription factors.
3) For the 0-0 pattern, we infer that the datasets are of low quality, which makes it difficult to identify known motifs in them.
4) Using ProSampler and statistical analysis, we identify 21 known and 98 unknown motifs that occur in a large proportion of cell lines, which are called non-targeted zinger motifs. Unlike previous methods for analyzing non-targeted zinger motifs, which scan DNA sequences with known motif profile matrices to identify motif sites, we detect motifs directly with a de novo motif finding algorithm. As a result, a large proportion of the identified non-targeted zinger motifs are discovered for the first time.

One transcription factor tends to regulate multiple genes in the genome. Genes regulated by the same transcription factor usually show correlated expression trends and may also be functionally homogeneous. Sets of genes, or of the corresponding proteins, with functional homogeneity are called biological functional
modules. The main tool for describing large-scale relationships in biology is network biology, yet it is difficult to predict biological functional modules accurately from a single network because of extensive noise. A feasible option is to integrate multiple networks and predict the functional modules that recur across them, which are called frequent dense sub-networks. In this study, in light of the shortcomings of state-of-the-art prediction algorithms, we put forward a novel mathematical model named the compatible network. The compatible network model integrates two characteristics, frequency and density, into the edge weight, and then optimizes them simultaneously. Based on this model, we design an accurate and efficient algorithm for predicting biological functional modules, named MiMod.

To test and evaluate MiMod, we downloaded 43 gene microarray datasets from the GEO database and constructed 43 gene co-expression networks from them. We also downloaded 13 protein interaction networks related to human blood from the SNAP database. We ran both MiMod and NetsTensor on the two sets of networks for comparison. The results indicate that, by comparison, the biological functional modules predicted by MiMod are more significant, both statistically and biologically.

As the core component of MiMod, the compatible network model has two advantages:

1) Small size. Since the compatible network model is built from a subset of nodes, it is limited in size, which saves considerable running time.
2) Balance between frequency and density. Integrating frequency and density into the edge weight avoids the bias caused by optimizing the two characteristics separately, which can improve accuracy.

MiMod also has two further innovations:

1) It uses the sparse summary network model, which further saves running time.
2) It uses a biclustering algorithm, which can strike a balance
between module size and frequency without requiring additional user-set parameters.

With these advantages and innovations, MiMod can sensitively predict biologically significant functional modules. MiMod is implemented in C++, and its source code, together with executables for Windows, Mac OS and Unix, can be downloaded from https://github.com/LiYanSDU/SYSTEMS.

To sum up, to address the problems of motif finding and biological functional module prediction, we design the ProSampler and MiMod algorithms, respectively. Test results indicate that these algorithms handle the corresponding problems effectively and overcome shortcomings of state-of-the-art algorithms. In addition, we use ProSampler and statistical methods to analyze the binding patterns between transcription factors and DNA, and infer the probable causes of these patterns.
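
As an illustration of the k-mer selection step mentioned above, the following Python sketch applies a standard pooled two-proportion z-test to decide which k-mers are enriched in a set of ChIP-seq sequences relative to a background set. It is not the ProSampler implementation. The abstract does not specify how proportions or the background are defined, so the sketch assumes that the proportion is the fraction of sequences containing the k-mer; the function names, toy sequences and the two cutoffs are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def kmer_sequence_counts(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for s in seqs:
        # A set is used so a k-mer is counted once per sequence.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for H0: p1 == p2.

    z = (p1 - p2) / sqrt(p * (1 - p) * (1/n1 + 1/n2)), with p the pooled proportion.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (p1 - p2) / se


def significant_kmers(foreground, background, k, z_cutoff):
    """Return {k-mer: z} for k-mers whose z statistic reaches z_cutoff."""
    fg = kmer_sequence_counts(foreground, k)
    bg = kmer_sequence_counts(background, k)
    selected = {}
    for kmer, x1 in fg.items():
        z = two_proportion_z(x1, len(foreground), bg.get(kmer, 0), len(background))
        if z >= z_cutoff:
            selected[kmer] = z
    return selected


# Toy data (hypothetical): every foreground sequence contains GTGAC.
FOREGROUND = ["ACGTGACC", "TTGTGACA", "GTGACGGA", "CCAGTGAC"]
BACKGROUND = ["ACGTACGT", "TTTTAAAA", "GGCCGGCC", "CATGCATG"]

# Two cutoffs (assumed values) give a strict and a more permissive k-mer set.
strict = significant_kmers(FOREGROUND, BACKGROUND, k=5, z_cutoff=2.0)
loose = significant_kmers(FOREGROUND, BACKGROUND, k=5, z_cutoff=1.0)
```

With these toy sequences, only GTGAC passes the strict cutoff, while the permissive cutoff also admits k-mers that occur in a single foreground sequence. This mirrors the idea of using several thresholds to obtain k-mers at different significance levels.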
Keywords/Search Tags:Combinatorial Algorithms, Bioinformatics, Motif Finding, Dense Sub-Network, Network Biology