Identifying Orthologs of Human Long Non-coding RNAs in Mammals and Constructing the LncRNA Database LongMan

Posted on: 2017-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: X X Yang
GTID: 2180330488983901
Subject: Cell biology
Abstract/Summary:
Background:
Eukaryotic genomes were once thought to consist mainly of protein-coding genes, but recent studies have revealed that only about 2% of the human genome encodes proteins. Among the many non-coding transcripts produced by the genome, those longer than 200 nt are called long non-coding RNAs (lncRNAs). LncRNAs were first found in humans and mice, but more recent studies have shown that they exist in a wide range of multicellular organisms, including mammals, arthropods, fish, and even plants. Ponjavic et al. showed that lncRNAs have evolved more slowly than neutral sequences and have been subject to stronger purifying selection, which indicates that they are likely to be functional. Subsequent experiments gradually revealed the functions of lncRNAs in a variety of developmental and physiological processes.
During evolution, lncRNAs have shown not only conservation but also variation and even turnover. First, since lncRNAs do not encode proteins, they have accumulated more mutations. However, they have conserved structures that preserve their functionality: selective pressure appears to act more on their secondary structures than on their sequences, leading to considerable compensatory mutations at the sequence level. Second, compared with protein-coding genes, lncRNAs show marked specificity both between lineages and in expression patterns. The former is thought to be associated with the evolution of gene expression regulation and phenotypic differences in eukaryotes, whereas the latter seems to play important roles in physiological and pathological processes. Finally, the diverse classes and large numbers of transposons embedded in lncRNAs indicate that transposon activity is important for the origin and evolution of lncRNAs. A typical example is ANRIL, which shows a two-stage, clade-specific evolutionary process. After divergence from the common ancestor of rodents, Scandentia, and primates, ANRIL seems to have gradually lost exons during rodent evolution until it disappeared completely in mouse and rat. In contrast, during the evolution of Scandentia and primates, ANRIL gradually acquired up to nineteen exons, many of which are transposon-derived and significantly modified the sequence and structure of ANRIL. Because lncRNAs are so numerous, experimental studies alone cannot quickly reveal their origin, evolution, and function, making computational analysis indispensable.
Since lncRNAs are important for transcriptional regulation and epigenetic modification, disruption of their sequence or expression can cause aberrant gene expression and, as a result, disease. Many primate-specific lncRNAs are assumed to contribute to the development of the cerebral cortex, so studying them helps in understanding the mechanisms of human-specific diseases. Obtaining large numbers of homologous lncRNA sequences by computational methods makes it possible to determine directly the conservation and the lineage-specific gain and loss of sequences, which helps the analysis and prediction of lncRNA functional domains.
As more lncRNAs are found in multiple species, experimentally examining a few lncRNAs can no longer satisfactorily reveal their features and functions. Using bioinformatics methods to collect, integrate, and analyze lncRNA data will be an important direction for lncRNA studies. The database issues of Nucleic Acids Research have published several lncRNA-related databases, including lncRNAdb, LncRNADisease, ChIPBase, and deepBase. The lncRNAdb database collects experimentally verified lncRNAs. Since many lncRNAs identified by RNA-seq have no verified function, the current lncRNAdb v2.0 contains just 294 lncRNAs, which limits its usefulness for large-scale lncRNA analysis.
LncRNADisease focuses on associations between lncRNAs and diseases, containing 322 lncRNAs and 221 diseases annotated from about 500 papers. Since an lncRNA can regulate genome modification at multiple sites, the diseases it can cause depend on the target genes it regulates, so predicting the target genes of lncRNAs is essential for revealing lncRNA-disease relationships. ChIPBase predicts transcription factor binding sites and potential transcriptional regulation of lncRNAs from ChIP-seq data. Also based on deep-sequencing data, deepBase identifies and annotates ncRNAs. Notably, these databases include annotated lncRNAs but no homologous lncRNAs, and do not provide information for large-scale comparative analysis of lncRNAs.
We think that lncRNA research faces several questions that bioinformatics can help answer: How have human lncRNAs evolved? To what extent do they show lineage specificity? How can a database be built to support the large-scale study of lncRNAs? To address these questions, we set two tasks for this research: (1) to identify orthologs of human lncRNAs in 16 mammals; (2) to build an lncRNA database from these lncRNAs to support large-scale comparative studies of lncRNAs.
Methods:
1. Obtaining orthologs of human lncRNAs in 16 mammals
From the GENCODE project (version 18), we obtained 13562 human lncRNAs annotated on the human genome (hg19), then searched for their homologs in chimpanzee, macaque, marmoset, tarsier, mouse lemur, tree shrew, mouse, rat, guinea pig, rabbit, dog, cow, elephant, hedgehog, opossum, and platypus. For the genome search, we used RNAfold to predict the secondary structure of lncRNA exons and searched for orthologs using Infernal. Large-scale genome searches were performed on our local server and the supercomputer "Tianhe-2".
2. Analyzing the sequence and evolutionary features of selected lncRNAs
Phylip was used to construct phylogenetic trees and calculate distances among sequences.
3. Identifying human- and primate-specific lncRNAs
For each species, we coded a human lncRNA as 1 if it has an ortholog in that species and as 0 if it does not, converting the ortholog distribution of the 13562 human lncRNAs into numeric presence/absence data. From the resulting data, we estimated the gain and loss events on the phylogenetic tree of lncRNAs.
4. Annotating transposons in lncRNAs
Using Repbase, the transposon annotation database, we ran RepeatMasker on all exons of the lncRNAs.
5. Building a database of orthologous lncRNAs in mammals
We used MySQL 5.1 on Linux CentOS 6.5 to build a mammalian orthologous lncRNA database (LongMan). Python, the Symfony framework, and Apache were used to build the database, import data, and develop the user interface.
Results:
1. Lineage specificity of human lncRNAs
The orthologs are distributed as follows: the monotreme (platypus) has 1008 (7%) orthologous genes; mouse and rat have 4416 (30%) and 4099 (28%), respectively; and chimpanzee, the closest relative of humans, has 13239 (98%). In all, 323 (2%) lncRNAs are unique to human.
Using the mix program in the Phylip package to estimate gain and loss events at common ancestor nodes, we found that the most recent common ancestor of lagomorphs, rodents, Scandentia, and primates had 7458 (55%) lncRNAs. After the divergence, the number of lncRNAs gradually decreased in the lagomorph and rodent branches, while it increased in Scandentia and primates. At the ancestor node of primates, the number of lncRNAs had reached 10498 (77%).
2. Establishment of LongMan, a database of mammalian lncRNA orthologs
LongMan currently contains 133646 lncRNAs and also provides information such as sequence features, sequence alignments, exon information, transposon information, and species-specific insertions and deletions. It allows flexible search and display.
Conclusions:
1. The analysis of orthologous lncRNAs revealed that they exhibit distinct lineage specificity. While about 2% of them are unique to human, more than 70% are primate-specific. In addition, the monotreme platypus has 1008 orthologs, suggesting that some lncRNAs have an ancient origin.
2. LongMan is the first database of orthologous lncRNAs. It contains many lncRNAs (133646 in all), covers multiple species (17 species) and clades (monotremes, marsupials, and other mammals), and also provides secondary information. The large number of lncRNAs and the rich information in LongMan are highly valuable for comparative and functional studies of lncRNAs.
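The presence/absence coding described in step 3 of the Methods can be illustrated with a small sketch. This is not the pipeline used in the thesis: the toy ortholog table, the species labels, and the helper names (`encode_profiles`, `to_phylip_discrete`, `ortholog_counts`) are all hypothetical. The output layout is assumed to follow the PHYLIP discrete-character convention (a header with the species and character counts, then a 10-character name field followed by the 0/1 states), which is the kind of input the mix program reads.

```python
from typing import Dict, List, Set


def encode_profiles(orthologs: Dict[str, Set[str]], species: List[str]) -> Dict[str, str]:
    """Return one 0/1 string per species.

    Position i of each string refers to the i-th lncRNA (IDs sorted
    alphabetically): '1' if an ortholog was found in that species, else '0'.
    """
    lncrna_ids = sorted(orthologs)
    return {
        sp: "".join("1" if sp in orthologs[lnc] else "0" for lnc in lncrna_ids)
        for sp in species
    }


def to_phylip_discrete(profiles: Dict[str, str]) -> str:
    """Format the profiles as PHYLIP-style discrete-character text (assumed layout)."""
    n_chars = len(next(iter(profiles.values())))
    lines = [f"{len(profiles):>5}{n_chars:>5}"]
    for name, states in profiles.items():
        # Species names are truncated or padded to a 10-character field.
        lines.append(f"{name[:10]:<10}{states}")
    return "\n".join(lines) + "\n"


def ortholog_counts(profiles: Dict[str, str]) -> Dict[str, int]:
    """Count how many lncRNAs have an ortholog in each species."""
    return {sp: states.count("1") for sp, states in profiles.items()}


# Hypothetical toy data: each human lncRNA maps to the species where it was found.
toy_orthologs = {
    "lnc_A": {"human", "chimp", "mouse", "platypus"},
    "lnc_B": {"human", "chimp"},
    "lnc_C": {"human"},
    "lnc_D": {"human", "chimp", "mouse"},
}
toy_species = ["human", "chimp", "mouse", "platypus"]

profiles = encode_profiles(toy_orthologs, toy_species)
phylip_text = to_phylip_discrete(profiles)
counts = ortholog_counts(profiles)
# lncRNAs found in no species other than human.
human_only = [lnc for lnc, sps in sorted(toy_orthologs.items()) if sps == {"human"}]

print(phylip_text)
print(counts)
print(human_only)
```

In this toy table, `lnc_C` is the only human-specific entry and `lnc_A` is the only one with a platypus ortholog, mirroring on a small scale the kind of lineage distribution reported in the Results. The real analysis would apply the same coding to 13562 lncRNAs and 17 species before estimating gains and losses on the tree.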
Keywords/Search Tags:LncRNA, LongMan, Database, Orthologs