
Research On Feature Analysis And Computational Identification Of Transcriptional Regulatory Elements In Genomes

Posted on: 2007-09-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y H Du
GTID: 1118360215970506
Subject: Control Science and Engineering
Abstract/Summary:
One of the core tasks of bioinformatics in the post-genome era is to understand the complex regulatory mechanisms of gene expression. Because transcription is the first step of gene expression, regulation of transcription is an important means of regulating expression. DNA segments in genomic sequences that carry out a regulatory function in transcription are called transcriptional regulatory elements. Identifying and annotating these elements is central to deciphering transcriptional regulatory mechanisms and to constructing transcriptional regulatory networks. With advances in biological research and computer science, computational identification methods have become powerful auxiliary tools alongside traditional experimental methods. However, most current computational methods use only the one-dimensional content features of primary sequences and ignore much other important information, which results in poor specificity and a large number of false positives. Taking feature analysis and computational identification of transcriptional regulatory elements as its research topic, this dissertation therefore develops an identification framework based on information integration, consisting of three main steps: feature selection, feature calculation and integrated recognition. The framework is applied to recognize three common kinds of transcriptional regulatory elements and their related signals: promoters, intrinsic terminators and transcription factor binding sites (TFBSs). The main contents and original contributions of the dissertation are summarized as follows:

(1) Feature analysis and a computational identification algorithm for promoters. Promoters are sequence elements that regulate the initiation of transcription. Based on a feature analysis of prokaryotic and eukaryotic promoters, a discriminant analysis algorithm for promoter identification based on feature selection and combination is proposed.
This algorithm takes the sequence content, dimensional conformation and energy distribution features of promoters as candidate features and calculates their significance using appropriate characteristic models. The significance of each feature is estimated by the squared Mahalanobis distance between the two classes. Through a stepwise feature selection procedure, the discriminating features are chosen from the candidates and combined into a multidimensional vector. This combined feature vector is then used by quadratic discriminant analysis (QDA) to predict potential promoters. To make the characterization more accurate, an iterative search algorithm called OCMISA is designed to find and calculate the dual composite motifs in the local signal features of prokaryotic promoters. A similar iterative search algorithm is used to calculate conserved motifs whose locations are not clearly known in eukaryotic promoters. The proposed promoter identification algorithm is trained and tested on actual datasets consisting of E. coli σ70 promoters, B. subtilis σA promoters and human pol II promoters, respectively. The results show that it achieves more competitive performance than several other current algorithms.

(2) A localization algorithm for transcription start sites (TSSs). Because TSSs are closely related to promoters, the promoter identification algorithm above is extended to locate TSS positions. The localization algorithm first restricts the search ranges in genomic sequences to reasonable ones, using prior information about where TSSs occur. The search ranges for TSSs in prokaryotic genomes are usually fairly small, so it is natural to scan them base by base with a sliding window that has the same format as the fixed promoter regions used for promoter identification. The locations of TSSs are then determined from the likelihood scores of the potential positions.
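The base-by-base sliding-window scan described for prokaryotic search ranges can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy A/T-fraction score and the threshold are hypothetical stand-ins, not the likelihood score produced by the dissertation's promoter identification algorithm.

```python
from typing import Callable, List, Optional, Tuple


def scan_window(sequence: str, window: int,
                score: Callable[[str], float]) -> List[Tuple[int, float]]:
    """Score every window of length `window`, shifting one base at a time."""
    return [(start, score(sequence[start:start + window]))
            for start in range(len(sequence) - window + 1)]


def best_position(scores: List[Tuple[int, float]],
                  threshold: float) -> Optional[int]:
    """Return the start of the top-scoring window, or None if it fails the threshold."""
    if not scores:
        return None
    start, top = max(scores, key=lambda item: item[1])
    return start if top >= threshold else None


def at_fraction(window_seq: str) -> float:
    # Hypothetical placeholder score: fraction of A/T bases in the window.
    # A real scan would plug in the classifier's likelihood score here.
    return sum(base in "AT" for base in window_seq) / len(window_seq)


search_range = "GCGCGCTATAATGCGCGC"  # made-up search range, not real data
scores = scan_window(search_range, window=6, score=at_fraction)
tss_window_start = best_position(scores, threshold=0.9)
```

Separating the scan from the scoring function keeps the window format fixed while allowing any per-window score, including a trained classifier, to be substituted.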
To enhance the signal-to-noise ratio, two components are specially designed: a group of overlapping content variables in the window sequence, based on the resonance principle, and threshold filtration rules for finding the predicted positions. The empirical distribution of distances between TSSs and translation start sites (TLSs) is also used to adjust the likelihood scores. For TSSs in eukaryotic genomes, however, the search ranges are generally too large for the sliding window method. In that situation, the localization algorithm is restricted to candidate sites within the search ranges, which are determined according to the actual content of known TSSs; localization is then performed directly by the promoter identification algorithm. Experimental results on actual datasets show that the proposed localization algorithm finds TSSs effectively and greatly improves specificity compared with other algorithms.

(3) Feature analysis and a computational identification algorithm for intrinsic terminators. Intrinsic terminators are sequence elements that can terminate transcription in the absence of any additional factors. Based on a thorough feature analysis, a more comprehensive feature set for intrinsic terminators is selected by combining existing features with newly introduced ones. This feature set contains 5 variables covering sequence content, local conformation and energy distribution information. Based on the feature set, two intrinsic terminator identification algorithms are proposed, one using QDA and one using a support vector machine (SVM). Both perform favorably in a 6-fold cross-validation test on E. coli and B. subtilis datasets. The proposed identification algorithm is then used to scan the whole genome of E. coli for putative intrinsic terminators.
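As a rough illustration of how QDA classifies a feature vector, the sketch below fits one Gaussian per class and assigns a vector to the class with the larger quadratic discriminant score. It is a generic textbook formulation, not the dissertation's implementation; the 5-dimensional feature vectors are synthetic random numbers, and the labels (1 for terminator, 0 for background) are assumptions made for the example.

```python
import numpy as np


def fit_qda(X: np.ndarray, y: np.ndarray) -> dict:
    """Estimate the mean, covariance and prior of each class."""
    params = {}
    for label in np.unique(y).tolist():
        Xk = X[y == label]
        params[label] = (Xk.mean(axis=0),
                         np.cov(Xk, rowvar=False),
                         len(Xk) / len(X))
    return params


def qda_score(x: np.ndarray, mean: np.ndarray,
              cov: np.ndarray, prior: float) -> float:
    """Quadratic discriminant: Gaussian log-density (up to a constant) plus log prior."""
    diff = x - mean
    # Squared Mahalanobis distance of x from the class mean.
    maha = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * logdet - 0.5 * maha + np.log(prior)


def predict(params: dict, x: np.ndarray):
    """Return the label whose discriminant score is highest."""
    return max(params, key=lambda label: qda_score(x, *params[label]))


# Synthetic stand-in data: 5 features per example, two classes.
rng = np.random.default_rng(0)
terminators = rng.normal(1.5, 1.0, size=(200, 5))   # assumed label 1
background = rng.normal(-1.5, 2.0, size=(200, 5))   # assumed label 0
X = np.vstack([terminators, background])
y = np.array([1] * 200 + [0] * 200)

params = fit_qda(X, y)
label_near_terminators = predict(params, np.full(5, 1.5))
label_near_background = predict(params, np.full(5, -1.5))
```

Because each class keeps its own covariance matrix, the decision boundary is quadratic rather than linear, which is what distinguishes QDA from linear discriminant analysis.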
Compared with other typical methods, the total number of scanning hits decreases greatly while most known intrinsic terminators are still retrieved, so the specificity of the predictions is effectively improved.

(4) Feature analysis and a computational identification algorithm for TFBSs. As basal regulatory elements, TFBSs are the target sites where transcription factors bind to genomic sequences. Following a review of existing algorithms, a new search-based identification algorithm for TFBSs that integrates conserved motifs and local conformational knowledge is proposed. The algorithm uses maximal dependence scoring matrices (MDSMs) to model TFBS motifs and adopts dinucleotide step parameters to calculate the local conformations of TFBSs. The two kinds of feature scores are then combined into multidimensional vectors, on which a QDA classifier is trained to predict putative TFBSs through a sliding window. As an extension of the position specific scoring matrix (PSSM), the MDSM rearranges the positions in a TFBS motif and finds a new ordering that maximizes the overall dependence among all neighboring positions. Through this rearrangement, long-range correlation is converted to neighboring correlation as far as possible, so the MDSM can characterize the dependence between motif positions more comprehensively at a fairly low level of model complexity. The local conformations of TFBSs are introduced as epigenetic features and are an effective supplement to the content information of primary sequences. Experimental results on actual datasets of E. coli CRP and Fis TFBSs and human HNF4α TFBSs demonstrate the effectiveness of the proposed algorithm, whose predictions achieve higher specificity than those of other typical methods.
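The rearrangement idea behind the MDSM, reordering motif positions so that dependent positions become neighbors, might be sketched as below. The abstract does not say how dependence is measured or how the ordering is searched, so both choices here are assumptions: mutual information between aligned columns as the dependence measure, and an exhaustive search over permutations, which is only feasible for very short motifs. The four aligned sites are invented for the example.

```python
import itertools
import math
from collections import Counter
from typing import List, Sequence, Tuple


def mutual_information(col_a: Sequence[str], col_b: Sequence[str]) -> float:
    """Mutual information (in bits) between two aligned motif columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    return sum((c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
               for (a, b), c in count_ab.items())


def best_ordering(sites: List[str]) -> Tuple[int, ...]:
    """Find the position ordering that maximizes summed dependence of neighbors."""
    columns = list(zip(*sites))
    width = len(columns)
    mi = [[mutual_information(columns[i], columns[j]) for j in range(width)]
          for i in range(width)]

    def chain_dependence(order: Tuple[int, ...]) -> float:
        # Total dependence between consecutive positions in the new ordering.
        return sum(mi[a][b] for a, b in zip(order, order[1:]))

    # Brute force over all permutations: an assumption for illustration only.
    return max(itertools.permutations(range(width)), key=chain_dependence)


# Invented aligned sites: positions 0 and 3 co-vary, positions 1 and 2 do not.
sites = ["AGAT", "ATAT", "CGAG", "CTAG"]
columns = list(zip(*sites))
order = best_ordering(sites)
```

In this toy alignment the long-range pair (positions 0 and 3) ends up adjacent in the returned ordering, which is the effect the abstract describes as converting long-range correlation into neighboring correlation.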
Keywords/Search Tags:bioinformatics, transcriptional regulatory element, computational identification, information integration, composite motif, feature selection, quadratic discriminant analysis (QDA), support vector machine (SVM)