A Computational Study On The Sequence, Evolution And Clade-Specificity Of Long Non-Coding RNA

Posted on: 2016-01-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S He
Full Text: PDF
GTID: 1360330482456729
Subject: Cell biology
Abstract/Summary:
Background

Siepel et al. found that vertebrate genomes have many conserved regions distributed in the so-called "deserts" of genomes. With the development of sequencing technology, it turned out that many of these conserved regions are transcribed. The resulting transcripts, which are longer than 200 bp, carry a poly(A) tail, lack open reading frames, and were once dismissed as "transcriptional noise", are called long non-coding RNAs (lncRNAs). Recently, lncRNAs were found to exist widely in metazoans, including mammals (mouse, dog, human), other vertebrates (chicken, zebrafish), insects (Drosophila), and even the worm C. elegans. Although lncRNA sequences are poorly conserved, Ponjavic et al. found that, compared with ancestral repeats that undergo neutral evolution, lncRNAs in human, mouse and rat underwent significant positive selection.

Subsequent experimental studies revealed that lncRNAs are involved in diverse biological processes. LncRNAs were first found to play a vital role in X chromosome inactivation. In mammals, to achieve equal dosage of genes on the X chromosome, one X chromosome is randomly inactivated in female somatic cells. The lncRNA Xist is the key controller, and deletion of Xist prevents X inactivation. Gene imprinting means the permanent silencing of an allele in all somatic cells. Well-known imprinted genes include those that control embryonic growth, such as Igf2 and Igf2r, whose imprinting is also regulated by lncRNAs. Truncation of these lncRNAs or deletion of their promoters leads to loss of imprinting of nearby genes. Besides genomic imprinting and X inactivation, lncRNAs regulate the dynamic expression of many genes in a tissue-specific manner. For example, HOTAIR tissue-specifically represses HOXD expression, and ANRIL regulates CDKN2A/CDKN2B expression. LncRNAs also help reshape the 3D structure of chromosomes. For example, Firre can bring functionally related genes located on different chromosomes into the same nuclear compartment. Due to their tissue-specific expression and functions, lncRNAs have become new therapeutic targets and diagnostic markers of diseases.

While there is no doubt that lncRNAs have important functions, the mechanism of their regulatory roles is still poorly understood. In metazoans, a handful of conserved polycomb repressive complexes (PRCs) conduct histone modification and, consequently, regulate gene expression. In Drosophila, a class of specific DNA sequences called polycomb responsive elements (PREs) can bind PRC proteins and recruit them to nearby target genes. However, mammalian genomes contain few PREs. How PRC proteins, as chromatin modulators, recognize their target sites and genes is an important and unanswered question. Recently, it was demonstrated that lncRNAs can interact with both chromatin and DNA modulators and help guide the latter to target sites, but the details remain unclear. Two mechanisms are possible: lncRNAs can exploit the 3D structure of chromosomes to locate themselves at target genes, and lncRNAs can bind to DNA by forming RNA:DNA triplexes with Hoogsteen or reverse Hoogsteen base pairing. A program for predicting an ncRNA's DNA binding domain and binding sites based on canonical Hoogsteen base pairing has been reported, but no typical lncRNAs were analyzed, probably because it performs poorly on lncRNAs. Recent experimental studies reveal that, in addition to the canonical Hoogsteen base-pairing rules, further Hoogsteen base-pairing rules enable ncRNAs to bind to DNA sequences.

LncRNAs show three evolutionary features. First, they have conserved and specific structures but do not encode proteins. This allows lncRNAs to accumulate compensatory mutations that maintain structural conservation while the sequence diverges. Second, many lncRNAs contain transposons, indicating that transposons are involved in lncRNA formation and evolution. Third, lncRNAs show not only tissue-specific expression but also lineage-specific evolution. Species-specificity of lncRNAs provides a mechanism for explaining diverse gene expression in different species and phenotypic variation. These features indicate that considerable study is needed to reveal the origin, evolution, and function of lncRNAs.

As mentioned, lncRNAs regulate X inactivation, the imprinting of some genes, and the tissue-specific expression of many genes by recruiting PRC and DNMT proteins to different genomic sites. When their sequences are mutated or their expression is misregulated, lncRNAs can cause diverse aberrations in genome modification and gene expression in different cellular contexts. Cancers are notable cases. For example, HOTAIR is highly expressed in primary breast cancers, and ANRIL is found to be involved in about 30% of cancers. In addition, the considerable number of primate-specific lncRNAs indicates that lncRNAs are important for neural development and neurological diseases. So far, experimental studies have focused only on the function of particular lncRNAs in specific tissues by knocking down or silencing these lncRNAs, which is very time-consuming. Computational studies can reveal sequence features and functional domains of lncRNAs, but only a few have been reported because powerful software is lacking.

The above background raises the following important questions. (1) When did important lncRNAs emerge during mammalian evolution? (2) How did lncRNAs obtain multiple exons? (3) How can lncRNAs' DNA binding motifs and binding sites be predicted so as to predict their target genes? (4) Have lncRNAs' DNA binding motifs evolved gradually? (5) To what extent do lncRNAs show distinct species- or clade-specificity? To answer these questions, the main objectives of this study are: (1) to reveal the origin of some important lncRNAs, (2) to reveal the evolutionary features of these lncRNAs, (3) to reveal the clade-specificity of 13562 human lncRNAs, (4) to reveal the evolution of functional domains, (5) to develop an algorithm and software to predict lncRNAs' DNA binding motifs and binding sites, and (6) to analyze the DNA binding motifs and binding sites of some important lncRNAs.

Methods

1. Identify human lncRNAs' orthologs in other species
Starting from the 13562 lncRNAs reported by the GENCODE project and other experimentally identified important lncRNAs, we searched sequenced genomes to identify the orthologs of each exon of each lncRNA. Since compensatory mutations give lncRNAs conserved structures but diverged sequences, genome searches were performed using Infernal instead of BLAST/BLAT, on our local servers and on the Tianhe-2 supercomputer.

2. Analyze the sequence and evolutionary features
Phylip, MrBayes, and MEGA were used to build phylogenetic trees. PAML was used to analyze evolutionary rates. EvoNC was used to compare the evolution of an lncRNA and its neighboring protein-coding genes. Phylip, MEGA and different models were used to compute distances between lncRNA sequences (the assembled 12S and 16S rRNA sequences were used as the reference). Pmmulti and RNAalifold were used to align lncRNA sequences. RNAfold and Mfold were used to predict the structures of lncRNA exons.

3. To uncover human- and primate-specific lncRNAs among experimentally identified human lncRNAs
First, we searched for the orthologs of the 13562 human lncRNAs in 14 mammals, which uncovered human- and primate-specific lncRNAs. Second, we converted the search results into a binary representation, with 1 representing the presence and 0 representing the absence of an lncRNA in a genome. From this representation, a tree was constructed using the mix program in the Phylip package to reveal the gain and loss events of lncRNAs in the 14 mammals.

4. To develop the software LongTarget to predict lncRNAs' DNA binding motifs and binding sites
By systematically reviewing published papers, we integrated all reasonable Hoogsteen and reverse Hoogsteen base-pairing rules into 24 rule-sets. For a DNA duplex of interest, two RNA strands were constructed, one for the minus strand and one for the plus strand, under each base-pairing rule-set. We aligned the lncRNA to each of the two constructed RNAs to identify DNA binding motifs and binding sites simultaneously. A permutation test was used to evaluate the sensitivity and specificity of the algorithm.

5. The evolution of functional domains
We used LongTarget to predict not only the DNA binding motifs in human HOTAIR but also the binding motifs in HOTAIR orthologs in other species, so as to analyze the evolution of functional domains.

Results

1. Results of HOTAIR analysis
Orthologs of HOTAIR exist only in placental mammals, with some exons showing species-specific absence. HOTAIR exon2 is absent in dog, mouse, and rat, and HOTAIR exon6 has high-scoring hits in primates but low-scoring hits in other mammals. A large region of exon6, conserved in other mammals, is absent in mouse and rat. HOTAIR exons show different evolutionary features. We found that HOTAIR exon1, exon2, exon4 and exon6b show higher substitution rates in primates than in other mammals, while exon3, exon5 and exon6a show little difference in rates between primates and other mammals. In addition, HOTAIR shows more signals of positive selection in mammals than its neighboring HOXC genes. Moreover, HOTAIR exon1 has a highly conserved hairpin substructure and exon6 has a highly conserved stem-loop structure. The two substructures occur robustly in diverse predicted structures of full-length HOTAIR.

2. Results of ANRIL analysis
ANRIL orthologs first appeared in Xenarthra and Afrotheria, but ANRIL has no orthologs in non-mammalian vertebrates, monotremes, or marsupials. More exons were gained in Laurasiatheria. The number of exons increases suddenly in marmoset and decreases gradually during rodent evolution. Notably, no exons were reliably identified in mouse and rat. ANRIL in simians has 19 exons, which contain 9 transposons. Insertion of these transposons has made ANRIL exons more conserved.

3. Clade-specificity of 13562 human lncRNAs
Of the 13562 human lncRNAs, 1008 (7%) have orthologs in monotremes (platypus); 13239 (98%) have orthologs in chimpanzee; 4416 (30%) and 4099 (28%) have orthologs in mouse and rat, respectively. We used the mix program in the Phylip package to estimate the gain and loss events of lncRNAs at ancestral nodes of the phylogenetic tree built with the 14 species. The mix program reveals that the most recent common ancestor of rodents, Lagomorpha, tree shrew and primates had 7458 (55%) lncRNAs. After the divergence of rodents and Lagomorpha from tree shrew and primates, the number of lncRNAs decreased steadily in the branch of Lagomorpha and rodents but increased steadily in the branch of tree shrew and primates. In the ancestor of primates, the number of identified lncRNAs is 10498 (77%).

4. Develop LongTarget for predicting lncRNAs' DNA binding domains and binding sites
Based on all biologically reasonable Hoogsteen and reverse Hoogsteen base-pairing rules and the 24 integrated rule-sets, we developed a new algorithm and the program LongTarget to predict lncRNAs' DNA binding motifs and binding sites. The permutation test indicates that LongTarget has high sensitivity and specificity.

5. Using LongTarget, we predicted the binding motifs and binding sites of more than 20 typical lncRNAs, including ANRIL, H19, Airn, Kcnq1ot1, and others, mainly known imprinting-control lncRNAs. Many predicted binding sites are in promoter regions and CpG sites, highly consistent with experimentally identified histone methylation marks at these sites. We also used Triplexator to predict DNA binding motifs and binding sites for these lncRNAs and found that a considerable number of its predicted binding sites lie outside biologically reasonable regions.

Conclusions

1. Both HOTAIR and ANRIL originated in eutherians. During evolution HOTAIR gained functional domains, whereas ANRIL gained exons, and both show clade-specific evolution, indicating clade- or species-specificity of lncRNAs and a potential relationship between lncRNA evolution and speciation.

2. Transposons have contributed significantly to the origin and evolution of ANRIL and many other lncRNAs. The insertion and domestication of transposons influence the sequence, structure and conservation of lncRNAs.

3. Most lncRNAs may have emerged in eutherians and show distinct clade-specific features. Of the 13562 lncRNAs identified in human, only 1008 were identified in platypus, and many are specific to primates.

4. LongTarget outperforms Triplexator in predicting lncRNAs' DNA binding motifs and binding sites.
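
The presence/absence encoding described in Methods step 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline used in the dissertation: the species labels, lncRNA identifiers, and the `hits` dictionary are hypothetical toy data, and the helper names are invented here. The output is intended to follow the general PHYLIP discrete-character layout (a count line, then a 10-character name field followed by the 0/1 states) and should be checked against the PHYLIP documentation before being passed to mix.

```python
from typing import Dict, List, Set


def presence_matrix(hits: Dict[str, Set[str]], lncrnas: List[str]) -> Dict[str, str]:
    """Encode ortholog search results as one 0/1 string per species.

    hits maps a species label to the set of lncRNA IDs with an ortholog hit;
    lncrnas fixes the column order. Both are hypothetical inputs.
    """
    return {
        species: "".join("1" if rna in found else "0" for rna in lncrnas)
        for species, found in hits.items()
    }


def to_phylip(matrix: Dict[str, str]) -> str:
    """Render the matrix as text: a count line, then padded name + states."""
    n_chars = len(next(iter(matrix.values())))
    lines = [f"{len(matrix):>5}{n_chars:>5}"]
    for species, states in matrix.items():
        # Names are truncated or padded to a 10-character field.
        lines.append(f"{species[:10]:<10}{states}")
    return "\n".join(lines) + "\n"


def fraction_present(matrix: Dict[str, str], species: str) -> float:
    """Fraction of lncRNAs scored as present (1) in the given species."""
    states = matrix[species]
    return states.count("1") / len(states)


if __name__ == "__main__":
    # Toy data only; these are not results from the study.
    toy_hits = {
        "species_A": {"lnc1", "lnc2", "lnc3"},
        "species_B": {"lnc1", "lnc3"},
        "species_C": set(),
    }
    toy_matrix = presence_matrix(toy_hits, ["lnc1", "lnc2", "lnc3"])
    print(to_phylip(toy_matrix), end="")
    print(fraction_present(toy_matrix, "species_B"))
```

The per-species fraction is the same kind of quantity as the percentages reported in Results item 3 (for example, 1008 of 13562 lncRNAs with orthologs in platypus).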
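
The idea in Methods step 4 (construct, for each strand of a DNA duplex, the RNA that a given base-pairing rule-set would pair with, then compare the lncRNA against it and assess the score by permutation) can be sketched as below. This is a toy illustration and not LongTarget itself: the single rule-set shown (U opposite A, C opposite G, no partner otherwise) is one commonly cited triplex pairing pattern used here purely as an example, strand orientation is simplified, an ungapped sliding-window match stands in for the alignment step, and all function names are hypothetical.

```python
import random
from typing import Dict, Tuple

# Hypothetical example rule-set: DNA base on the scanned strand -> RNA base
# assumed to pair with it in the triplex. "N" means no partner under this rule-set.
EXAMPLE_RULESET: Dict[str, str] = {"A": "U", "G": "C", "T": "N", "C": "N"}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the minus strand (5'->3') for a plus-strand sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(dna))


def ideal_rna(strand: str, ruleset: Dict[str, str]) -> str:
    """RNA that would pair with every base of the strand under the rule-set."""
    return "".join(ruleset[b] for b in strand)


def best_ungapped_hit(lncrna: str, target: str) -> Tuple[int, int]:
    """Best (score, offset) of target placed on lncrna without gaps.

    Score counts identical positions, ignoring "N". Returns (0, -1) if no
    placement scores above zero.
    """
    best = (0, -1)
    for off in range(len(lncrna) - len(target) + 1):
        window = lncrna[off:off + len(target)]
        score = sum(1 for a, b in zip(window, target) if a == b and b != "N")
        if score > best[0]:
            best = (score, off)
    return best


def scan_duplex(lncrna: str, plus: str, ruleset: Dict[str, str]) -> Dict[str, Tuple[int, int]]:
    """Score the lncRNA against the RNAs constructed for both DNA strands."""
    minus = reverse_complement(plus)
    return {
        "plus": best_ungapped_hit(lncrna, ideal_rna(plus, ruleset)),
        "minus": best_ungapped_hit(lncrna, ideal_rna(minus, ruleset)),
    }


def permutation_p_value(lncrna: str, plus: str, ruleset: Dict[str, str],
                        n_perm: int = 200, seed: int = 0) -> float:
    """Empirical p-value: how often a shuffled lncRNA scores at least as well."""
    rng = random.Random(seed)
    observed = max(s for s, _ in scan_duplex(lncrna, plus, ruleset).values())
    letters = list(lncrna)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(letters)
        shuffled = "".join(letters)
        perm = max(s for s, _ in scan_duplex(shuffled, plus, ruleset).values())
        if perm >= observed:
            at_least += 1
    return (at_least + 1) / (n_perm + 1)


if __name__ == "__main__":
    toy_dna = "AGAGGAAG"          # toy purine-rich plus strand
    toy_rna = "GGGAUCUCCUUCAGG"   # toy RNA containing the matching stretch
    print(scan_duplex(toy_rna, toy_dna, EXAMPLE_RULESET))
    print(permutation_p_value(toy_rna, toy_dna, EXAMPLE_RULESET))
```

In a fuller implementation one would loop over all rule-sets (24 in the study) and use a gapped alignment; the sketch keeps a single rule-set so that the strand-construction and permutation steps stay visible.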
Keywords/Search Tags: LncRNA, Functional domains of lncRNA, Molecular evolution, LongTarget, Transposons, Genome modification