
Analysis Of Post-transcriptional Regulatory Elements In 5’UTR Of Human Transcripts Based On GenBank And RefGene Datasets

Posted on: 2015-01-22  Degree: Master  Type: Thesis
Country: China  Candidate: Y H Ye  Full Text: PDF
GTID: 2180330431469274  Subject: Genetics
Abstract/Summary:
Background and purpose

According to the central dogma of molecular biology, transcription is the synthesis of RNA using the corresponding genomic DNA as a template. Messenger RNA (mRNA) thus carries information from genomic DNA, and this information is translated into protein, the main carrier of life activities and metabolism. Unlike in prokaryotes, mRNA production in eukaryotes proceeds through the following steps:

1. Transcription: The mRNA precursor (pre-mRNA) is transcribed from the genomic DNA template.
2. 5’ capping: A guanine nucleotide is added to the 5’ end of the pre-mRNA and methylated. The 5’ cap plays an important role in translation initiation: it participates in the binding of the ribosomal complex to the mRNA. After binding, the ribosomal complex scans along the mRNA and selects a suitable AUG codon at which to initiate peptide synthesis. This process is called "cap-dependent translation initiation". In addition, the 5’ cap enhances mRNA stability and protects the mRNA from degradation by exonucleases.
3. Splicing: Introns are removed from the pre-mRNA and exons are retained. One pre-mRNA may have more than one splicing pattern, resulting in multiple mature mRNAs for a single gene.
4. Polyadenylation: A chain of adenylate residues is added to the 3’ end of the pre-mRNA by poly(A) polymerase, and poly(A)-binding protein binds the chain to protect it.

In terms of sequence features, a typical mature human protein-coding mRNA can be divided, from the 5’ end to the 3’ end, into the 5’ cap; the 5’ untranslated region (5’UTR); the coding sequence (CDS), also called the open reading frame; the 3’ untranslated region (3’UTR); and the poly(A) tail. The 5’UTR contains regulatory elements that modulate translation of the downstream open reading frame. The length of the 5’UTR varies from tens to thousands of nucleotides.
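The division of a mature mRNA into 5’UTR, CDS and 3’UTR can be sketched in a few lines of Python. This is only an illustration, not code from the thesis (which reports using Perl): the function name split_mrna is hypothetical, and the coordinates are assumed to be 1-based and inclusive, as in GenBank feature locations such as "CDS 4..12".

```python
def split_mrna(seq, cds_start, cds_end):
    """Split a mature mRNA sequence into (5'UTR, CDS, 3'UTR).

    Assumption: cds_start and cds_end are 1-based inclusive positions,
    the convention of GenBank feature locations. The 5' cap and the
    poly(A) tail are not part of the stored sequence and are ignored.
    """
    if not (1 <= cds_start <= cds_end <= len(seq)):
        raise ValueError("CDS coordinates fall outside the sequence")
    five_utr = seq[:cds_start - 1]       # everything before the start codon
    cds = seq[cds_start - 1:cds_end]     # start codon through stop codon
    three_utr = seq[cds_end:]            # everything after the stop codon
    return five_utr, cds, three_utr


if __name__ == "__main__":
    # Toy sequence, not a real transcript.
    print(split_mrna("GGCATGAAATAGCC", 4, 12))
```

Only records with complete CDS coordinates can be split this way, which is why incomplete entries have to be excluded before any 5’UTR analysis.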
The regulatory elements in the 5’UTR include the uAUG (upstream AUG), the uORF (upstream open reading frame), the IRES (internal ribosome entry site) and hairpin structures. Translation of an mRNA begins when the 40S ribosomal subunit recognizes and binds the 5’ cap of the mRNA. The ribosome then scans from the 5’ end toward the 3’ end to find a suitable initiation codon and start protein synthesis. AUGs may therefore exist between the 5’ cap and the main initiation codon; we call them uAUGs. When an open reading frame is formed in the 5’UTR, it is called a uORF. Previous studies of the 5’UTRs of eukaryotic mature mRNAs indicate that uAUGs and uORFs are not rare: 12%-50% of genes contain uAUGs or uORFs. uORFs play an important role in regulating the translation efficiency of the main open reading frame, either by triggering mRNA decay or by modulating the translation level of the downstream open reading frame. Some studies indicate that disruption of a uORF can lead to genetic diseases, including cancer and metabolic or neural disorders. For example, the transcripts of HR and TPO both contain uORFs. When these are disrupted (in the former by a mutation of the uAUG, in the latter by the creation of a new termination codon), Marie Unna hereditary hypotrichosis (MUHH) and thrombocytosis occur, respectively.
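The definitions of uAUG and uORF given above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions rather than the thesis's own procedure: the function names are hypothetical, the sequence is taken in the DNA alphabet (ATG for AUG) as stored in sequence records, and a uORF is counted only when an in-frame stop codon also lies inside the 5’UTR.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uaugs(utr):
    """Return 0-based positions of every ATG in a 5'UTR sequence."""
    utr = utr.upper().replace("U", "T")
    return [i for i in range(len(utr) - 2) if utr[i:i + 3] == "ATG"]


def find_uorfs(utr):
    """Return (start, end) pairs of uORFs, 0-based with end exclusive.

    Assumption: a uORF is a uAUG followed by an in-frame stop codon
    that falls entirely within the 5'UTR.
    """
    utr = utr.upper().replace("U", "T")
    uorfs = []
    for start in find_uaugs(utr):
        # Walk codon by codon downstream of the uAUG.
        for pos in range(start + 3, len(utr) - 2, 3):
            if utr[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


if __name__ == "__main__":
    toy_utr = "CCATGGCTTAAGGATGCC"  # toy sequence, not a real 5'UTR
    print(find_uaugs(toy_utr))
    print(find_uorfs(toy_utr))
```

In the toy sequence the first ATG is closed by an in-frame TAA and therefore forms a uORF, while the second ATG has no in-frame stop before the end of the 5’UTR and counts only as a uAUG.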
Conversely, a mutation that creates a uORF in the 5’UTR may also significantly affect the physiological expression of the downstream main open reading frame. To date, 14 mutations acting through this mechanism have been reported, including mutations in HBB and POMC, which cause β-thalassemia and proopiomelanocortin deficiency, respectively.

How does a ribosome bound to the mRNA selectively bypass a uAUG or uORF and choose the proper initiation codon to start translation? How is translation of the downstream open reading frame affected when a uORF exists in the 5’UTR? Across the currently available databases, how many genes have a uORF, and in how many of them does it act as a functional regulatory element? What sequence features do the genes involved in this regulatory mechanism share? Finally, how can this mechanism be linked to the mutations currently reported in ClinVar and the cancer databases? Our study attempts a systematic investigation of these questions.

Materials and Methods

This study is divided into 3 parts: 1. public dataset collection; 2. data mining and statistical analysis; 3. experimental validation.

1. Public dataset collection: This study is based on the datasets listed below:
1.1 RefGene: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/
1.2 GenBank format of human RNA data (human.rna.gbff.gz): ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/
1.3 Variation data related to human diseases, ClinVar: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/
1.4 Variation dataset related to human cancer, TCGA: https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp
1.5 Variation dataset related to human cancer, COSMIC: http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/download
1.6 Gene ontology annotation sets: http://www.geneontology.org/GO.downloads.ontology.shtml
1.7 Animal transcription factor collection: http://www.bioguo.org/AnimalTFDB/download/gene_list_of_Homo_sapiens.txt

2. Data mining and statistical analysis:
2.1 After downloading the datasets listed above, mRNA entries with complete NM accession IDs and complete CDS information are chosen for further analysis, because complete CDS information lets us locate the exact extent of the 5’UTR of each mRNA. The data filtering and statistical analysis are programmed in Perl (version 5.16.3).
2.2 Position annotation of each 5’UTR by hg19 coordinates: The filtered file is linked with the RefGene dataset, which records the hg19 coordinates of the transcription start site, translation initiation site and exons of each known gene. Linking the two files facilitates the subsequent variation studies and the 5’UTR deduplication.
2.3 A single gene may have more than one transcript, and different transcripts may share the same 5’UTR sequence. To avoid redundant counting, we deduplicate the 5’UTRs by filtering out those that share the same hg19 start and end coordinates. After filtering, we obtain 26902 unique 5’UTR sequences for the subsequent uAUG and uORF analysis.
2.4 Basic statistical analysis is performed on the 26902 unique 5’UTR sequences, including the distribution of uORF counts, the distribution of the distance between the uORF initiation codon and the transcription start site, the distribution of the distance between the uORF termination codon and the translation initiation site, and the length distribution of the uORFs.
2.5 GO terms are divided into three categories: biological process, molecular function and cellular component. The gene ontology file contains all terms for each known gene. Using this file, we annotate the genes containing uORFs with GO terms and determine whether they are enriched in a specific subset. To validate the functional enrichment, we use the online tool DAVID: http://david.abcc.ncifcrf.gov/
2.6 Integration of the ClinVar, TCGA and COSMIC databases: The variations recorded in the three databases are annotated against GRCh37/hg19, which enables us to relate these variations to uORF disruption or creation. We then select some of these variations for experimental validation and determine whether they are related to disease.
2.7 Based on the Kozak rule, we evaluate the context of the uORF initiation codon and of the main ORF initiation codon. The sequence context around the true initiation codon of each mRNA is relatively conserved and is called the Kozak consensus sequence. The optimal translation initiation context is GCC[A/G]CCaugG[not U], while the normal strong context is [A/G]NNaugG[not U]. Any other context is regarded as a weak translation initiation signal.

Results and discussion

We find 15815 5’UTRs with uAUGs and 13618 5’UTRs with uORFs (50.62% of the 26902 unique 5’UTRs), corresponding to 9066 unique genes. The distributions of uORF distance and length are shown in the main text. For the gene ontology analysis, a total of 40620 terms and 18986 genes are extracted from the full ontology file, and each gene is annotated with 21.3 GO terms on average. This study combines 7 public datasets to systematically analyze mature human mRNAs with uORFs, and completes the functional enrichment and multi-species alignment. We then compare the context of the uORF initiation codon with that of the main ORF initiation codon. Finally, we annotate the variations in ClinVar and the cancer databases that may be associated with uORF creation or disruption and choose two potential targets for experimental validation.
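The Kozak-context rule used to grade initiation codons (optimal GCC[A/G]CCaugG[not U], strong [A/G]NNaugG[not U], otherwise weak) can be illustrated with a short Python sketch. This is not the thesis's Perl code. The function name kozak_strength is hypothetical, "[not U]" is read here as a constraint on the base immediately after the +4 G, and a codon too close to either end of the sequence to fill the pattern is treated as weak; both readings are assumptions.

```python
import re

# Patterns in the DNA alphabet (T for U). "[ACG]" encodes "[not U]".
OPTIMAL = re.compile(r"GCC[AG]CCATGG[ACG]")     # positions -6 .. +5
STRONG = re.compile(r"[AG][ACGT]{2}ATGG[ACG]")  # positions -3 .. +5


def kozak_strength(seq, aug_pos):
    """Classify the context of the ATG at 0-based index aug_pos.

    Returns "optimal", "strong" or "weak". A window truncated by the
    sequence boundary cannot match and falls through to "weak".
    """
    seq = seq.upper().replace("U", "T")
    if seq[aug_pos:aug_pos + 3] != "ATG":
        raise ValueError("no ATG at the given position")
    if aug_pos >= 6 and OPTIMAL.fullmatch(seq[aug_pos - 6:aug_pos + 5]):
        return "optimal"
    if aug_pos >= 3 and STRONG.fullmatch(seq[aug_pos - 3:aug_pos + 5]):
        return "strong"
    return "weak"


if __name__ == "__main__":
    # Toy sequences, each with an ATG at index 6.
    for s in ("GCCACCATGGCG", "TTTAAAATGGCA", "TTTTTTATGAAA"):
        print(s, kozak_strength(s, 6))
```

Applying the same function to a uORF initiation codon and to the main ORF initiation codon of one transcript gives the pair of labels that the comparison in the abstract relies on.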
Keywords/Search Tags: upstream open reading frame, 5’ untranslated region, functional enrichment, variant annotation