Rice (Oryza sativa) is one of the three most important food crops in the world, and about half of the world's population feeds on it. Rice is mainly cultivated in southeastern China, Japan and Southeast Asian countries. Asian cultivated rice is divided into two subspecies, indica and japonica, and indica rice is further divided into two subtypes, indica I and indica II. Rice is considered the model plant among the cereal crops, and it was the first cereal to have its complete genome sequenced. Although indica production accounts for more than 70% of the total world rice yield, indica still lacks a high-quality reference genome. In this study, the genomes of ZS97 (indica I) and MH63 (indica II), the parents of the famous hybrid rice SY63, were sequenced. To build two high-quality reference genomes with good continuity, next-generation sequencing (NGS) and third-generation sequencing (TGS) data were generated and combined with previous BAC-end data. The transposable elements, protein-coding genes (PCGs) and conserved non-coding genes were then analyzed, a comparative genomics approach was used to analyze the two genomes, and finally the transcriptome data were examined. The main results were as follows:

(1) Contamination removal and contig continuity improvement

BAC clones were used for third-generation sequencing of the ZS97 and MH63 genomes. The target sequences can become contaminated by the bacterial chromosome from generation to generation. Nucmer and BLAT were applied to detect contamination in the ZS97 and MH63 contigs: 19 ZS97 contigs and 17 MH63 contigs contained inserted Escherichia coli (E. coli) fragments. The inserted E. coli fragments were then removed using either contamination-free contigs or NGS contigs. We made full use of the complementarity between next-generation and third-generation sequencing technologies to improve the continuity of the ZS97 and MH63 contigs. The number of ZS97 contigs decreased from 318 to 237 after the
linkage, and the number of MH63 contigs decreased from 216 to 181. Finally, the collinearity between the ZS97/MH63 and Nipponbare genomes, combined with earlier BAC-end sequencing, was used to order the contigs of each genome. Of these, 235 ZS97 contigs and 179 MH63 contigs were placed on the 12 chromosomes.

(2) Transposable element annotation and gene annotation

With the two high-quality genomes assembled, the next step was a comprehensive, systematic annotation of both. Transposable elements (TEs), protein-coding genes and non-protein-coding genes were analyzed in this part. TE regions accounted for about 42% of the sequence in both genomes. In studying the TE sequences, we focused on centromeres and telomeres. The centromeres of chromosomes 8 and 10 were intact in ZS97, and as many as four centromeres were intact in MH63, those of chromosomes 6, 8, 9 and 12. The telomere regions were not complete, but they were distributed at the ends of the chromosomes. Expressed regions are always a hotspot of genome studies. We focused on the protein-coding genes and on highly conserved non-protein-coding genes such as ribosomal RNA (rRNA), transfer RNA (tRNA), small nucleolar RNA (snoRNA), small nuclear RNA (snRNA) and micro RNA (miRNA). The protein-coding genes were separated into two groups, TE-related genes and non-TE-related genes. ZS97 had 54,831 protein-coding genes, of which 34,610 were non-TE-related, while MH63 had 57,174 protein-coding genes, of which 37,324 were non-TE-related. Few of the TE-related genes were expressed, and those that were had low expression levels. The density of non-TE-related genes was not uniform along the chromosomes and was much lower near the centromeres than elsewhere. In this study, InterProScan and blastp were used to predict the
function of the protein-coding genes. Cloned genes were also used to correct the gene models and gene functions. Non-coding RNA also plays an important role in the cell. In this study, tRNAscan, RNAmmer and miRDeep2 were used for non-coding RNA gene prediction. There were 592 tRNA, 449 snoRNA, 92 snRNA and 341 miRNA genes in the ZS97 genome, and 589 tRNA, 457 snoRNA, 97 snRNA and 363 miRNA genes in the MH63 genome. Only a few rRNA genes were detected in either genome: 40 in ZS97 and 60 in MH63.

(3) Genome structure variation and its effect on coding genes

The two high-quality genomes provide a basis for a comprehensive comparison of ZS97 and MH63. This part mainly analyzed single nucleotide polymorphisms (SNPs), small insertions and deletions (InDels), presence/absence variations (PAVs), inversions and translocations. Between the ZS97 and MH63 genomes, 1,300,802 SNPs and 251,387 InDels were detected. There were 4,509 segments present in the ZS97 genome but absent from the MH63 genome, with a total length of about 21.48 Mb; conversely, 4,566 segments were present in the MH63 genome but absent from the ZS97 genome, with a total length of about 23.32 Mb. The largest PAV was on chromosome 4 of ZS97 and was more than 1 Mb long. There were 131 large inversions between the two genomes, with a total length of about 1.96 Mb in ZS97 coordinates and 1.85 Mb in MH63 coordinates. The largest inversion spanned 362,444 bp on chromosome 12 in ZS97 coordinates. There were more than five thousand translocations between the two genomes, with a total length of about 8.94 Mb. The structural variations between the ZS97 and MH63 genomes have a great influence on the coding genes. PAV regions directly determine whether a gene exists only in the ZS97 genome or only in the MH63 genome. In the analysis, 3,984 genes were located only in ZS97 presence regions, of which 1,389 were non-TE-related genes, while 4,308 were only located in
MH63 presence regions, 1,713 of which were non-TE-related genes. In addition, 7,866 ZS97 genes and 8,088 MH63 genes were heavily affected by SNPs and InDels. Inversions and translocations also affected protein-coding genes.

(4) Collinearity of non-TE-related genes

After excluding TE-related genes, genes in gap regions and genes in PAV regions, the remaining non-TE-related genes were compared by all-to-all blastp, and the results were combined with MCScanX to detect collinear genes. A total of 15,214 genes were identical between the ZS97 and MH63 genomes, and 4,174 genes had nonsynonymous mutations between the two genomes. A further 5,932 genes had more than 80% identity and more than 50% coverage and were also paired by MCScanX between the two genomes. In total, 25,320 genes were detected as collinear genes. The remaining genes were classified as "divergent genes", comprising 6,010 genes in the ZS97 genome and 7,334 genes in the MH63 genome.

(5) Transcriptome analysis

RNA-seq and Iso-seq (isoform sequencing) data were used as evidence in the preceding protein-coding gene annotation. Besides supporting gene annotation, RNA-seq also plays an important role in detecting differentially expressed genes (DEGs) and in sample cluster analysis. Clustering of 72 samples showed that gene expression corresponded most closely with tissue differences and least closely with light conditions, which agrees with the numbers of DEGs between samples. The number of DEGs was largest between tissues, reaching more than 10,000, followed by high versus low temperature and then between varieties, while long day versus short day gave the fewest DEGs, only 27. Iso-seq can detect full-length transcripts, but its error rate is relatively high, with more than half of the sequences carrying at least one insertion or deletion error. We tried to correct these errors with the high-quality
genomes. After correction using the reference genome, the length of the coding sequence could be raised from 50% to 70%, which suggests that correcting Iso-seq data with the reference genome is useful.
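
The collinearity tally in part (4) can be illustrated with a short Python sketch. It encodes only the thresholds stated above (identity above 80%, coverage above 50%, and pairing by MCScanX) and sums the three reported categories. The function name, argument names and dictionary keys are hypothetical and chosen for illustration; this is not the pipeline used in the study.

```python
# Minimal sketch, assuming identity and coverage are given as percentages
# and that MCScanX pairing has already been reduced to a boolean flag.

def passes_similarity_filter(identity_pct, coverage_pct, paired_by_mcscanx):
    """Return True if a gene pair meets the stated thresholds:
    identity > 80%, coverage > 50%, and paired by MCScanX."""
    return identity_pct > 80.0 and coverage_pct > 50.0 and paired_by_mcscanx


# Category counts as reported in part (4); the keys are illustrative labels.
REPORTED_COUNTS = {
    "identical": 15214,
    "nonsynonymous": 4174,
    "similar_and_paired": 5932,
}

# The three categories together give the reported number of collinear genes.
total_collinear = sum(REPORTED_COUNTS.values())
print(total_collinear)
```

The three categories sum to the 25,320 collinear genes reported above. The strict inequalities mirror the wording "more than 80%" and "more than 50%", so a pair at exactly 80% identity would not pass this sketch of the filter.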