Research On Prediction Method For Protein Coding Small Open Reading Frame

Posted on: 2023-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: W W Jiang
Full Text: PDF
GTID: 2530306836472354
Subject: Electronic and communication engineering
Abstract/Summary:
A small open reading frame (sORF) is an open reading frame that encodes a peptide of fewer than 100 amino acids, and it is a current research focus in protein science. sORFs are often ignored in genome annotation because of their short length and low expression level. With the rapid development of sequencing technology, more and more sORFs have been shown to encode proteins and have been found in all regions of the genome. Protein-coding sORFs have gradually become a research hotspot in biology, and they also pose a challenge for genome annotation. In this context, we systematically analyzed the sequence characteristics of protein-coding sORFs using existing sORF data resources, and further developed a protein-coding sORF prediction method, named sORFPredict, which provides new methods and ideas for the future research and identification of protein-coding sORFs. The work comprised the following three aspects:

1. Codon preference analysis of protein-coding sORFs in different genomic regions

Protein-coding sORFs occur in all regions of the genome in different species. In different organisms, and even in different tissues of the same species, synonymous codons are not used with equal frequency but show a certain preference. Codon preference is closely related to gene expression. To reveal the similarities and differences of protein-coding sORFs in different genomic regions, this study systematically analyzed genomic regions in Arabidopsis, human and mouse, mainly the ncRNA, 3’UTR, 5’UTR, coding, pseudogene and intronic regions. We statistically analyzed the length distribution and GC content distribution of protein-coding sORFs in these regions, characterized their codon usage preference, and performed correspondence analysis (COA). Our results showed that Axis1 was positively correlated with GC3s in pseudogene regions; in human, they were negatively correlated in the 3’UTR, 5’UTR, coding and intronic regions, and positively correlated in the ncRNA and pseudogene regions; in mouse, they were negatively correlated in the ncRNA, 5’UTR, pseudogene and coding regions, and positively correlated in the 3’UTR and intronic regions. Further analysis showed that Axis1 was negatively correlated with CBI in the coding region of mouse, while GC3s was positively correlated with CBI. These differing sequence characteristics across the genomic regions of Arabidopsis, human and mouse indicate a relationship between base composition and gene expression, which could be used to distinguish protein-coding sORFs in different genomic regions and provides a reference for future studies of sORFs.

2. Research on a prediction method for protein-coding sORFs

Short sequence length and low expression level make protein-coding sORFs very difficult to recognize. Traditional sequence analysis and experimental methods, mainly genome sequencing, transcriptome sequencing and mass spectrometry analysis, struggle to yield effective results in sORF recognition. Classifying and identifying protein-coding sORFs by traditional sequencing methods alone is therefore far from sufficient, and it is of great significance to develop effective computational techniques for sORF classification and recognition. We first constructed two training sets and seven independent test sets based on a random sequence strategy, and obtained a supplementary test set from reported experimentally verified data. We then systematically evaluated the current methods for predicting sORF coding ability. The existing methods performed poorly on sORFs, and a large gap remains before they can be applied in practice. On this basis, we constructed a prokaryotic prediction model based on codon usage frequency and a eukaryotic prediction model based on the 3-mer value of the sequence. In specific applications, this method clearly outperformed the existing methods: the prediction accuracy for prokaryotic sequences reached 91%, an increase of about 25%, and the prediction accuracy for eukaryotic sequences was about 83% to 87%, an increase of about 29% to 34%. These results provide a new method and idea for the future research and recognition of protein-coding sORFs.

3. Development of a protein-coding sORF prediction platform

Based on the above results, we built an online platform for predicting sORF coding potential, sORFPredict (http://www.tmliang.cn/sORFPredict). Users can quickly predict whether a DNA sequence can encode a protein by entering the sequence, selecting the corresponding prediction model and submitting the sequence. Users can also view prediction results online, search for the results of a specific sequence by entering its sequence ID, and sort results by coding label.

Taken together, protein-coding sORFs play an important role in life activities and gene expression, while also posing great challenges to genome annotation and gene sequencing. Effective classification and recognition of sORFs therefore has important scientific significance and practical value. In this study, bioinformatics and computational analysis methods were used to systematically analyze the sequence characteristics of protein-coding sORFs across the genomes of multiple species and to study methods that can effectively identify protein-coding sORFs, which will provide theoretical support for the future research and annotation of sORFs.
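The sequence features named in the abstract (GC content, GC3s, codon usage frequency and 3-mer frequency) can be illustrated with a minimal Python sketch. This is not the sORFPredict implementation: all function names are hypothetical, GC3s is taken in its usual sense as the G+C fraction at the third position of synonymous codons (excluding Met, Trp and stop codons), and the "3-mer value" is assumed here to be the frequency of overlapping trinucleotides, which the abstract does not specify.

```python
from collections import Counter
from itertools import product

# The 64 possible trinucleotides, in a fixed order so vectors are comparable.
ALL_TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
VALID = set(ALL_TRIPLETS)
STOP_CODONS = {"TAA", "TAG", "TGA"}
SINGLE_CODON = {"ATG", "TGG"}  # Met and Trp have no synonymous codons


def split_codons(seq):
    """Split an in-frame sequence into codons, dropping a trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def gc_content(seq):
    """Fraction of G and C bases over the whole sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc3s(seq):
    """G+C fraction at the third position of synonymous codons."""
    synonymous = [
        c for c in split_codons(seq)
        if c in VALID and c not in STOP_CODONS and c not in SINGLE_CODON
    ]
    if not synonymous:
        return 0.0
    return sum(c[2] in "GC" for c in synonymous) / len(synonymous)


def frequency_vector(counts):
    """Turn triplet counts into relative frequencies over all 64 triplets."""
    total = sum(counts.values())
    return {t: (counts[t] / total if total else 0.0) for t in ALL_TRIPLETS}


def codon_usage_frequency(seq):
    """Relative frequency of each in-frame codon (non-overlapping triplets)."""
    return frequency_vector(Counter(c for c in split_codons(seq) if c in VALID))


def three_mer_frequency(seq):
    """Relative frequency of each overlapping 3-mer (assumed feature definition)."""
    seq = seq.upper()
    windows = (seq[i:i + 3] for i in range(len(seq) - 2))
    return frequency_vector(Counter(w for w in windows if w in VALID))


# Toy sequence for illustration only, not taken from the thesis data.
example = "ATGGCCTGGAAATAA"
features = {
    "gc": gc_content(example),
    "gc3s": gc3s(example),
    "codon_usage": codon_usage_frequency(example),
    "three_mer": three_mer_frequency(example),
}
```

Each frequency function returns a fixed-length, 64-entry mapping, so the values can be passed in a consistent order to any standard classifier. The abstract does not state which classifier sORFPredict uses, so none is assumed here.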
Keywords/Search Tags: small open reading frame (sORF), sequence features, codon preference, protein-coding sORF prediction