Research On Identification Of Gene Splice Site Based On Sequential Pattern Mining

Posted on: 2017-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y S Sun
Full Text: PDF
GTID: 2180330485463999
Subject: Computer application technology
Abstract/Summary:
Bioinformatics is an emerging interdisciplinary field that combines computer science and the life sciences. It has become a foundational discipline for biomedical research and an increasingly important area of computer science and technology research. Understanding gene expression is one of the core questions in biology. Gene splicing is a highly regulated process and a pivotal step between gene transcription and translation: DNA is transcribed into pre-RNA, and splicing produces different mature RNAs, which serve as templates for proteins. Studying the splicing mechanism in depth is therefore essential for understanding gene expression, and it is already a research focus. Many types of regulatory elements are involved in gene splicing, among which splice sites are the core signals for splicing regulation. Many studies have shown that numerous human diseases are associated with mutations around splice sites, implying that dysregulation of gene splicing is an important pathogenic mechanism. Correct identification of splice sites is thus a prerequisite for studying the splicing mechanism and the mutations around splice sites. The human genome contains a huge number of pseudo splice sites, so identifying canonical splice sites remains a difficult problem.

In this study, we developed a novel algorithm that integrates a position-specific scoring matrix (PSSM) model with sequential features mined from splice site sequences, in order to quantitatively analyze splice site signals and identify splice sites. Because gene splicing is regulated combinatorially by different cis-elements and proteins, we further studied the combinatorial regulation mechanism of gene splicing.

The research contents and innovations of this thesis are as follows:

(1) We proposed a model for splice site identification and quantitative modeling of signal strength.
In this study, we integrated sequential pattern mining with PSSM to propose a sequential pattern mining model based on the abundant taxonomic information and conserved features in gene sequences, and used it to quantitatively analyze 5' and 3' splice site sequences. All experimental data were downloaded from the UCSC databases in accordance with biological theory. The sequential pattern model effectively discriminates canonical splice sites from pseudo splice sites; it is robust and outperforms the Maximum Entropy Model, which is considered the best model for splice sites. We further applied this model to pathogenic mutations around splice sites, and it effectively discriminates pathogenic mutations from wild-type SNPs.

(2) Study of the combinatorial regulation of splice sites. The conservation of splice site sequences is essential for identifying splice sites, and identification algorithms and models are also based on this conservation. Gene splicing is a highly regulated process involving different types of signals, including 5' and 3' splice sites, splicing regulatory elements (SREs), and branch sites. Understanding the relationships among 5' and 3' splice sites and splicing regulatory elements remains a major challenge. We explored the mutual effects among these splicing signals using the sequential pattern model. The 5' splice sites significantly affect the diversity of 3' splice sites. We also counted the distribution density of SREs. The results show that ESE, ESS, and ISE elements compensate for weak splice sites.
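The combination of PSSM scoring with mined sequence patterns described in item (1) can be sketched roughly as follows. This is a minimal illustration only, not the thesis's actual model: the function names (`build_pssm`, `score_pssm`, `mine_patterns`, `pattern_hits`), the uniform background, the pseudocount, the gap limit, and the four toy 9-mer donor-like sequences are all assumptions made up for the example. Here a "pattern" is simplified to an ordered pair of positioned bases that co-occurs in enough training sequences.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pssm(sequences, pseudocount=1.0):
    """Build a log2-odds PSSM (one dict per position) from aligned sequences.

    Assumes a uniform background of 0.25 per base (an illustrative choice).
    """
    length = len(sequences[0])
    total = len(sequences) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        column = {
            base: math.log2(((counts[base] + pseudocount) / total) / 0.25)
            for base in ALPHABET
        }
        pssm.append(column)
    return pssm


def score_pssm(pssm, seq):
    """Sum the per-position log-odds scores of a candidate sequence."""
    return sum(column[base] for column, base in zip(pssm, seq))


def mine_patterns(sequences, min_support=1.0, max_gap=2):
    """Return ordered pairs (i, base_i, j, base_j), i < j, with at most
    max_gap positions between them, found in at least min_support of the
    sequences. A deliberately simplified stand-in for sequential pattern mining.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            for j in range(i + 1, min(len(seq), i + max_gap + 2)):
                counts[(i, seq[i], j, seq[j])] += 1
    threshold = min_support * len(sequences)
    return {pattern for pattern, n in counts.items() if n >= threshold}


def pattern_hits(patterns, seq):
    """Count how many mined patterns a candidate sequence matches."""
    return sum(1 for i, a, j, b in patterns if seq[i] == a and seq[j] == b)


# Hypothetical toy training set: 3 exonic + 6 intronic bases around a donor site.
true_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]

pssm = build_pssm(true_sites)
patterns = mine_patterns(true_sites)

candidate = "CAGGTAAGT"  # resembles the training sites
decoy = "CAGCTTACG"      # hypothetical pseudo site lacking the GT dinucleotide

for name, seq in [("candidate", candidate), ("decoy", decoy)]:
    print(name, round(score_pssm(pssm, seq), 2), pattern_hits(patterns, seq))
```

In this sketch the PSSM captures per-position conservation, while the pattern count captures dependencies between positions that a PSSM treats as independent. How the two signals are actually weighted and combined in the thesis is not stated in the abstract.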
Keywords/Search Tags: Bioinformatics, Splice Site Identification, Sequence Pattern, Pathogenic Mutation