
Studies On The Application Of Maximum Information Principle, Energy And Selection Constraints To The Prediction And Analysis Of Splice Sites In Genes

Posted on: 2010-06-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Y Jin
Full Text: PDF
GTID: 1100360278468081
Subject: Theoretical Physics
Abstract/Summary:
Recognizing gene sequences in a genome and clarifying the functions of all genes requires not only experimental approaches but also theoretical methods to guide the experiments. The maximum information principle is a fundamental principle of non-equilibrium statistical theory; it provides a good model for simulating the mutation-selection mechanism of biological evolution and can serve as an important basis for extracting information in bioinformatics. Predicting the complete gene structure is an important subject of current research, and a crucial part of it is the accurate identification of splice sites (both constitutive and alternative) and of all kinds of alternative splicing events. For predicting alternative 5' or 3' splice site events, the key step is to predict the flanking competitors of given splice sites.

In this dissertation, the maximum information principle is applied to the theoretical analysis of the splicing reaction, and an expression for the reaction free energy associated with a donor or acceptor site segment is derived. By introducing the concept of a selection pressure index and a corresponding constraint, an expression for the selection pressure index of a k-mer in a sequence is derived. When the theory is used to predict splice sites and their flanking competitors, higher prediction accuracy is obtained. The main contributions are summarized as follows:

1. Based on the basic physical principle of the splicing reaction, the traditional maximum information principle is used to analyze the conserved segments around splice sites. By introducing the concept of the reaction free energy associated with a splice site segment in the splicing reaction, together with a corresponding constraint, and assuming that the reaction free energy is additive, an approximate expression for the reaction free energy associated with a splice site segment is derived. As a simplified model, the expression can be used to estimate the free energy change associated with a donor or acceptor site segment during the splicing reaction. When it is applied to the prediction of splice sites in a test set, the results show high accuracy, which suggests that the expression represents the actual splicing reaction well.

2. As a first theoretical estimate of the splicing reaction free energy, the accuracy of this expression still needs to be improved. We therefore extend the additivity assumption to include the dependencies among bases in splice site segments, and modify the traditional maximum information principle to include the background probability. From this we derive a more accurate approximate expression for the reaction free energy, which contains not only the background probability factors but also all kinds of dependencies among bases. When it is used to predict splice sites, the prediction accuracy is clearly improved compared with the unmodified expression, which indicates that the improved expression describes the splicing reaction process more accurately.

3. The improved expression for the reaction free energy is used to predict alternative and constitutive splice sites and their flanking competitors in human and mouse genes, with satisfactory results. The prediction ability of the expression is comparable with that of currently popular methods such as the maximum entropy model. For the prediction of flanking competitors of given splice sites, the reaction free energy of the candidate competitor itself outperforms another measure, the difference in reaction free energy between a given splice site and its candidate competitor segment. This implies that, as far as the overall behavior of the many splice sites is concerned, reaction free energy competition between a given splice site segment and its flanking competitor segment is not the only primary factor in alternative splice site selection.

4. To quantify the intensity of natural selection on a sequence segment or on the k-mers in it, we introduce the concept of a selection pressure index and the corresponding constraint condition, and use the maximum information principle to derive an expression for the selection pressure index of a k-mer in a sequence segment. The expression can readily be linked to function and can then be used to estimate certain physical quantities quantitatively; the foregoing method for estimating the splicing reaction free energy can also be included in the framework of the selection pressure index theory. When the theory is applied to the prediction of constitutive and alternative splice sites in human and mouse, the prediction ability of an integrated method that combines three measures (the estimated reaction free energy and the average selection pressure indexes of k-mers in the two flanking sequences) is clearly improved compared with the reaction free energy measure alone.

5. Based on the information content of sequences, an information discrepancy index that can be used to predict coding regions is devised. The prediction ability of the index is comparable with that of the heterogeneity index. The selection status of k-mers in the flanking sequences of splice sites is analyzed using the selection pressure index, and some interesting conclusions are drawn: for example, the GT dinucleotide on the left side of the 5' splice site is under negative selection, as is AG on the left and right sides of the 3' splice site. The selection status of k-mers is found to differ considerably between the left and right flanking sequences of a splice site, and two prediction measures are designed based on this result. By selecting seven measures, including the estimated reaction free energy, and using quadratic discriminant analysis to integrate them into a coherent method, we predict the flanking competitors of given splice sites. The prediction accuracy is higher than that of the other methods in the current literature, making this the most accurate method for flanking competitor prediction reported to date.
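To make the additivity assumption of contribution 1 and the background probability of contribution 2 more concrete, the following is a minimal sketch in Python. The abstract does not state the dissertation's actual expression, so this is not the author's formula: it is a generic position-independent log-odds score in which each base contributes one additive term. The function names, the pseudocount, the uniform background, and the toy donor segments are all assumptions made for illustration only.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_probabilities(segments, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned training segments.

    The pseudocount is an illustrative smoothing choice, not from the source.
    """
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("all segments must have the same length")
    total = len(segments) + pseudocount * len(BASES)
    probs = []
    for i in range(length):
        counts = Counter(s[i] for s in segments)
        probs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return probs

def energy_like_score(segment, probs, background):
    """Additive, energy-like score: one log-odds term per position.

    Lower values mean the segment looks more like the training sites.
    This is a hypothetical stand-in for a reaction free energy estimate.
    """
    return -sum(
        math.log(probs[i][b] / background[b]) for i, b in enumerate(segment)
    )

# Toy, made-up donor-like segments (not real data).
donors = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
probs = position_probabilities(donors)
background = {b: 0.25 for b in BASES}  # assumed uniform background

consensus_like = energy_like_score("AGGTAAGT", probs, background)
random_like = energy_like_score("TCCATTCA", probs, background)
print(consensus_like < random_like)
```

In this sketch a segment resembling the training sites receives a lower score than an unrelated one. Modeling dependencies among bases, as described in contribution 2, would require terms beyond the independent per-position sum used here.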
Keywords/Search Tags: Maximum information principle, Reaction free energy, Selection pressure index, Information discrepancy index, Quadratic discriminant analysis, Splice site prediction, Alternative splicing, Competitor of splice site