
A Study On Eukaryotic Gene Expression Regulatory Systems Based On Statistical Modeling Methods

Posted on: 2014-02-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y M Lu
GTID: 1220330398989915
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
Life is a multi-level complex system that carries out biological functions through intricate interactions among many kinds of molecules, and this is especially true of the way it controls the temporal and spatial specificity of gene expression. The aim of this thesis is to clarify a series of questions about eukaryotic gene expression regulation at the systems and network levels by applying multivariable statistical and/or machine learning methods to high-throughput, large-sample experimental data. In this work, we not only developed bioinformatics methods and tools that can accurately identify gene expression regulatory relationships, but also obtained results that provide important clues to the inherent rules behind the complex systems of gene expression and regulation. We anticipate that the methods used and the conclusions drawn here could serve as a valuable reference for further uncovering the mechanisms of gene expression and regulation and for understanding the genesis and development of complex diseases.

As the core control system of the cell, the gene expression regulatory system plays a crucial role in the efficient execution of biological functions in a temporally and spatially specific manner, and it has therefore long been one of the research topics of greatest concern to molecular biologists and bioinformaticians. In earlier studies, regulation by transcription factors was the keystone of the field, and numerous prediction methods and tools were developed. However, because most of these predictions depend only on genomic DNA sequences and multiple sequence alignment, their value as theoretical and experimental guidance is very limited. As studies of gene expression and regulation have deepened, many important regulatory factors other than transcription factors have been discovered, such as non-coding RNAs and epigenetic modifications of chromosomes.
In the new century, high-throughput experimental techniques in transcriptomics, proteomics and epigenetics have developed rapidly, providing a huge amount of information on the actual state of intracellular molecules for studies of gene expression and regulation. Developing methods that mine gene expression regulatory relationships from high-throughput omics data, and discovering the inherent patterns of gene regulation, are the main motivations of this work.

This thesis studies eukaryotic gene expression and regulation from two aspects: transcriptional regulation based on the sequential structure of genes, and expression regulation based on molecular interactions. For the first aspect, we focused on a special gene structure, the ‘bicistronic’ gene. Similar to the operon structure in prokaryotic cells, a bicistronic gene yields two different proteins from one transcript, so the two proteins share the same regulatory signal. This kind of gene structure is very common in prokaryotic cells, but only a few instances have been found in eukaryotic organisms. The bicistronic gene is an interesting form of regulatory gene organization; however, its frequency in eukaryotic genomes is still unknown, and the relationship between its function and structure is also unclear. To clarify these issues, we accurately predicted bicistronic genes in the human genome using comparative genomics and machine learning methods that take their structural characteristics into account. We identified 30 conserved bicistronic genes in the human genome, which gives an accurate estimate of the frequency of this kind of gene structure in mammalian genomes. After this identification, we constructed a domain-domain interaction network for the proteins of bicistronic genes based on their protein domain information.
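
Once a domain-domain interaction network is available, the closeness of two proteins in it can be quantified. The sketch below is one possible reading of an averaged network distance, assuming it means the mean shortest-path length between protein pairs; the abstract does not give the exact definition. The toy graph, the protein names (`P1a`, `P1b`, `P2a`, `P2b`, `X`, `Y`) and the helper functions are all hypothetical and chosen for illustration only.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from a list of (node, node) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(graph, source, target):
    """Breadth-first search distance in an unweighted graph; None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def mean_distance(graph, pairs):
    """Average shortest-path length over reachable pairs; None if none are connected."""
    lengths = [shortest_path_length(graph, a, b) for a, b in pairs]
    reachable = [d for d in lengths if d is not None]
    return sum(reachable) / len(reachable) if reachable else None


# Hypothetical toy network: a simple chain P1a - P1b - X - Y - P2a - P2b.
edges = [("P1a", "P1b"), ("P1b", "X"), ("X", "Y"), ("Y", "P2a"), ("P2a", "P2b")]
graph = build_graph(edges)

# Hypothetical mapping from each protein to the bicistronic gene that encodes it.
gene_of = {"P1a": "gene1", "P1b": "gene1", "P2a": "gene2", "P2b": "gene2"}

all_pairs = list(combinations(sorted(gene_of), 2))
same_gene_pairs = [(a, b) for a, b in all_pairs if gene_of[a] == gene_of[b]]
cross_gene_pairs = [(a, b) for a, b in all_pairs if gene_of[a] != gene_of[b]]

same_gene_mean = mean_distance(graph, same_gene_pairs)
cross_gene_mean = mean_distance(graph, cross_gene_pairs)
print("same gene:", same_gene_mean, "different genes:", cross_gene_mean)
```

In this toy graph the same-gene pairs are direct neighbours, while pairs from different genes are several steps apart. On real data the comparison would need a proper background, for example randomly drawn protein pairs, which this sketch does not attempt.
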
Using an averaged network distance method, we found that proteins produced by the same bicistronic gene tend to interact directly and may therefore take part in the same pathway or carry out related biological functions. This finding explains the conservation of eukaryotic bicistronic genes in terms of the functions of their products. Despite the greater evolutionary pressure involved in maintaining two open reading frames (ORFs) in one transcript, this is a very effective form of gene organization when the two protein products can share the same regulatory signal and have similar expression profiles.

Although gene organization is a relatively efficient means of regulation, the more common mode of gene regulation in eukaryotic cells is regulation by various regulatory factors, whose complex interactions with genes ultimately enable cells to control the temporal and spatial specificity of gene expression accurately. The major part of this thesis therefore focuses on the complex regulatory relationships between regulatory factors and genes. Eukaryotic cells can regulate gene expression at several levels, including the transcriptional, RNA splicing, mRNA stability, translational and post-translational levels. In this thesis, we studied gene expression regulatory systems at two of them: the transcriptional level and the mRNA stability level.

For regulation at the mRNA stability level, our work focused on an important type of regulatory non-coding RNA, the microRNA. We studied the complex regulatory relationships between microRNAs and their target genes by statistically modeling the quantitative relationship between microRNA concentration and gene mRNA level. Compared with other state-of-the-art methods, the Lasso regression model we constructed identified microRNA-gene targeting relationships with higher accuracy.
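
The general idea of using Lasso regression for this kind of problem can be sketched as follows. This is a minimal illustration, not the model built in the thesis: it assumes one gene's centred mRNA level is regressed on the centred levels of several microRNAs, and that a negative non-zero coefficient is read as a candidate targeting relationship. The data, the microRNA names and the penalty value are invented, and the solver is a plain coordinate descent written out so the block has no dependencies.

```python
# Toy sketch: Lasso regression of one gene's mRNA level on microRNA levels.
# All values below are invented for illustration; they are not thesis data.


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100):
    """Minimise (1/2n)*||y - X b||^2 + penalty*||b||_1 by cyclic coordinate descent.

    X is a list of samples (rows), each a list of centred microRNA levels;
    y is the centred mRNA level of one gene in the same samples.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X b while every coefficient is zero
    for _ in range(n_sweeps):
        for j in range(p):
            column = [row[j] for row in X]
            scale = sum(v * v for v in column) / n
            if scale == 0.0:
                continue  # a constant (all-zero after centring) feature carries no signal
            # Covariance of feature j with the residual that leaves feature j out.
            rho = sum(v * (r + v * coef[j]) for v, r in zip(column, residual)) / n
            new_value = soft_threshold(rho, penalty) / scale
            delta = new_value - coef[j]
            residual = [r - v * delta for v, r in zip(column, residual)]
            coef[j] = new_value
    return coef


# Hypothetical microRNA names; columns of X are centred and mutually orthogonal.
mirna_names = ["miR-A", "miR-B", "miR-C"]
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
# Constructed as y = -2 * miR-A + 0.1 * miR-C: one strong repressor, one tiny effect.
y = [-1.9, -2.1, 1.9, 2.1]

coef = lasso_coordinate_descent(X, y, penalty=0.5)
# Assumption: a negative non-zero coefficient marks a candidate repressing microRNA.
candidates = [name for name, b in zip(mirna_names, coef) if b < 0]
print(dict(zip(mirna_names, coef)), candidates)
```

The L1 penalty sets the coefficients of weakly associated microRNAs to exactly zero, which is what makes the fitted model readable as a list of candidate regulators. For real expression data, a library implementation such as scikit-learn's `Lasso` (whose `alpha` plays the role of `penalty` here) with a cross-validated penalty would normally be preferred over this hand-written loop.
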
Based on this Lasso model, we then used expression data from clinical samples of prostate cancer (PCa) to construct microRNA regulatory networks for two stages of PCa: the primary stage and the metastatic stage. By comparing these two networks, we identified several functional network modules and their corresponding hub microRNAs, which literature analysis shows to be closely associated with PCa genesis and metastasis. This study not only provided useful tools for accurately identifying microRNA-gene targeting relationships from the expression data of clinical samples, but also uncovered the modular nature of microRNA regulatory networks. It also showed the power of statistical modeling methods for mining high-throughput biological experimental data.

For regulation at the transcriptional level, we focused on regulation by epigenetic modifications, which has been a rising research area in recent years. First, we developed a novel computational method named ‘DELTA’, which uses epigenomics data to identify DNA regulatory elements. Starting from the probability distribution theory of random variables, DELTA uses a support vector machine to identify DNA regulatory elements, taking into account the shape of the distribution of epigenetic modifications around the elements. In tests on multiple real data sets, DELTA identified DNA regulatory elements with significantly higher accuracy than other state-of-the-art methods. In addition, we carried out a quantitative study of the regulatory relationships between histone modifications at promoters and gene expression. Using a Lasso regression model, we demonstrated that histone modifications at promoters can accurately predict changes in gene expression among various cell types, which suggests that histone modifications play important roles in maintaining cell identity.
Meanwhile, by constructing a histone modification regulatory network, we found that cell-type-specific genes tend to be regulated by multiple histone modifications, suggesting that histone modifications are a major source of the complex expression profiles of cell-type-specific genes across diverse cell types.

Because the study of gene expression regulatory systems rests on the accurate identification of regulatory relationships between regulatory factors and genes, we focused on using multivariable statistical models and machine learning methods to mine such relationships from high-throughput experimental data. Specifically, we used a Lasso multivariable regression model to identify targeting relationships between microRNAs and genes; the probability distribution theory of random variables together with a support vector machine to identify DNA regulatory elements; and a Lasso regression model together with a correlation test to identify regulatory relationships between histone modifications and gene expression. Across these projects, we found that the performance of statistical models in gene regulation studies is determined by how well the computational model fits the biological facts, not by the complexity of the algorithm. After identifying the regulatory relationships, we constructed different types of complex gene regulatory networks, including a microRNA-gene regulatory network and a histone modification regulatory network, and analyzed the relationships between their topological structures and biological functions, from which several interesting conclusions were obtained.

Gene expression and regulation is a complex biological process, so studies from a network and systems perspective can contribute to understanding its functional mechanisms and can uncover the coupling between network topology and biological function. In conclusion, network-based methods are powerful techniques for studying complex biological systems.
In future work, we will consider integrating different types of interaction networks into a single network to simulate and analyze biological systems, in the hope that our studies will contribute to the ultimate understanding of the still-hot question ‘what is life’.
Keywords/Search Tags: Gene expression and regulation, regulatory relationship identification, epigenomics, statistical modeling