Feature Extraction And Identification Algorithm On Arabidopsis Poly(A) Sites

Posted on: 2007-08-17 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Lin | Full Text: PDF
GTID: 2178360212477622 | Subject: Systems Engineering
Abstract/Summary:
The 3'-UTR processing of eukaryotic mRNA is an important part of gene expression regulation. Messenger RNA (mRNA) polyadenylation is a crucial step in the maturation of most eukaryotic mRNAs, in which a polyadenine [poly(A)] tract is added post-transcriptionally to the cleaved 3' end of a precursor mRNA. This modification of the 3'-UTR ensures the mRNA's functionality, including its translatability, stability, and translocation to the cytoplasm. More importantly, a poly(A) site marks the end of a mature mRNA; hence it can be used as a criterion to identify a gene.

The consensus hexamer AATAAA serves as the main poly(A) signal in about 55% of mammalian mRNAs. In plants, however, only 10% of mRNAs contain this hexamer, and alternative polyadenylation (the use of poly(A) sites other than the normal one) is common. Current protocols for identifying plant poly(A) sites rely heavily on expressed sequence tags (ESTs) that happen to carry a poly(A) tract. However, because of differential expression and incomplete EST data, many poly(A) sites cannot be positively identified and in many cases are mis-annotated. To date, prediction of animal poly(A) sites has been reported, but no prediction of plant poly(A) sites by a computer algorithm has been reported.

Building on the previous model from our lab, I continued the research on feature extraction and identification algorithms for Arabidopsis poly(A) sites. Using an entropy-based algorithm and entropy analysis methods, features around the poly(A) sites were extracted. A poly(A) site classification method based on the SVM (Support Vector Machine) was studied. Based on the features obtained, the model settings were optimized, two first-order inhomogeneous Markov models were added to the original GHMM, and the score formula was improved. To reflect the real characteristics of poly(A) sites, the identification range was extended from TA and CA to all possible dinucleotides at the cleavage sites.
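The entropy analysis mentioned above can be pictured with a minimal sketch. It computes the Shannon entropy of each column in a set of equal-length sequences aligned on the poly(A) site; low-entropy positions are candidates for conserved signal elements. The function name `positional_entropy`, the toy sequences, and the choice of plain per-position Shannon entropy are assumptions made for illustration, not the algorithm used in the thesis.

```python
import math
from collections import Counter


def positional_entropy(seqs, alphabet="ACGT"):
    """Shannon entropy (in bits) of each column of equal-length sequences.

    Hypothetical helper: the sequences are assumed to be already aligned
    on the poly(A) site. Characters outside `alphabet` are ignored.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must have equal length")

    entropies = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts[b] for b in alphabet)
        h = 0.0
        for b in alphabet:
            if counts[b]:  # skip absent bases: 0 * log(0) is taken as 0
                p = counts[b] / total
                h -= p * math.log2(p)
        entropies.append(h)
    return entropies


# Toy input: three fully conserved columns, then one uniform column.
toy = ["AATA", "AATC", "AATG", "AATT"]
print(positional_entropy(toy))
```

A fully conserved column has entropy 0 bits and a column with all four bases equally frequent has entropy 2 bits, so a profile of these values along the region around the cleavage site highlights where the sequence composition is constrained.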
In addition, HMM scaling was applied to our GHMM to solve the problem of numerical precision in the calculation. To support experiments on multiple sites, I devised a program that distinguishes and integrates multiple poly(A) sites from ESTs. Finally, combining the original model with the improvements above, a computer identification system for poly(A) sites, called Poly(A) Sleuth (PAS), was built.

With these enhancements to the algorithm, the optimal sensitivity and specificity...
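The precision problem that HMM scaling addresses can be illustrated with a small sketch. Forward probabilities shrink geometrically with sequence length and underflow to zero in floating point; rescaling them at every step and accumulating the logarithms of the scale factors avoids this. The code below uses an ordinary HMM, not the thesis's GHMM, and the function name `forward_scaled` and all model parameters are invented for illustration.

```python
import math


def forward_scaled(obs, start, trans, emit):
    """Return log P(obs) from the forward algorithm with per-step scaling.

    start[s]    : initial probability of state s
    trans[r][s] : transition probability from state r to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    if not obs:
        return 0.0
    n = len(start)
    log_prob = 0.0
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for t in range(len(obs)):
        if t > 0:
            # Standard forward recursion on the rescaled values.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # observation impossible under the model
        # Normalise so alpha sums to 1, and remember the factor in log space.
        alpha = [a / scale for a in alpha]
        log_prob += math.log(scale)
    return log_prob


# Hypothetical two-state model: state 0 is A/T-rich, state 1 is uniform.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [
    {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print(forward_scaled("ACGT" * 1000, start, trans, emit))
```

The product of the per-step scale factors equals the unscaled sequence probability, so summing their logarithms gives the same log-likelihood that the unscaled recursion would produce if it did not underflow.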
Keywords/Search Tags: Poly(A) site identification, Entropy, SVM, Markov Model