| Rapid advances in high-throughput sequencing technologies have enabled biology to enter the era of Big Data. The volume, variety, and velocity of Big Data present a considerable challenge for traditional analysis strategies. In this study, large-scale maize RNA-Seq data were used for transcriptome map construction, gene expression abundance exploration, unmapped read mining, and graph genome-based applications. Corresponding integrated bioinformatics pipelines and analysis platforms were built to provide references for large-scale transcriptome analysis. The detailed research results are summarized as follows. Transcriptome map construction based on large-scale RNA-Seq data. Large-scale RNA-Seq data from the maize B73 inbred line, combined with the maize B73 reference transcriptome and maize B73 FLNC PacBio reads, were collected and used to reconstruct a B73 transcriptome map and identify 17,952 novel transcripts. Of these novel transcripts, 947 are located in intergenic regions, 73% of which contain transposable elements. As a significant step, this integrated analysis method has been incorporated into deepTS, a transcriptional switch analysis platform. Signature gene identification based on large-scale RNA-Seq data. High-dimensional gene expression matrices generated from large-scale maize B73 inbred RNA-Seq data were decomposed into AMs (amplitude matrices) and PMs (pattern matrices) through matrix factorization. Based on the PMs, sample clustering and spatial transcriptome analysis were implemented. In the sample clustering analysis, 774 seed-related signature genes were identified, including some experimentally validated genes such as ZmGRAS20, ZmZAG2, and Opaque2. In the spatial transcriptome analysis, a series of signature genes were identified by setting different numbers of metagenes. Previously unreported signature genes provide new biological knowledge for further understanding the molecular mechanisms of maize kernel development. Building on this analysis method, easyMF, a user-friendly web platform that aims to
facilitate biological discovery from large-scale transcriptome data through matrix factorization, was presented. Exploration and application of large-scale unmapped RNA-Seq reads. In traditional RNA-Seq analysis pipelines, a small but significant fraction of RNA-Seq reads usually goes unexplored because the reads cannot be mapped to the genome sequence. Here, unmapped RNA-Seq reads from large-scale B73 inbred data were de novo assembled, and 635 novel transcripts missing from the reference genome annotation were identified. At the transcript sequence level, some of these novel transcripts encode chloroplast-related proteins, transporters, and ubiquitin-protein ligases; at the transcript expression level, some are involved in seed development and drought stress response; and according to co-expression module-based GO enrichment, some are related to photosynthesis, protein translation, and chromatin composition. Based on this analysis method for unmapped RNA-Seq reads, CAFU, a framework for exploring unmapped RNA-Seq data, was constructed. Graph genome-based transcriptome analysis. To address inaccurate alignment and unmappability when RNA-Seq data are mapped against the reference genome of a non-corresponding inbred line, a graph genome-based transcriptome analysis strategy was developed, in which a graph genome is built from the B73 linear genome and Han21 genetic variants. Based on the Han21 graph genome, read-genome alignment, gene quantification, and differential expression analysis were implemented. The results indicated that graph genome-based transcriptome analysis can effectively improve the accuracy of RNA-Seq read alignment and of downstream transcriptome analysis. In summary, this research explored applications of large-scale transcriptome data, including transcript structure annotation and gene expression pattern analysis. Focusing on issues in transcriptome analysis, unmapped RNA-Seq reads and graph genome-based transcriptome analysis were further explored. Based on the above research, a series of integrated bioinformatics
pipelines and analysis platforms were built, which facilitate deep mining of large-scale transcriptome data, help solve related biological problems, and support the exploration of biological regulatory mechanisms. |
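
The decomposition of an expression matrix into an amplitude matrix and a pattern matrix, as described above, can be illustrated with a small sketch. The abstract does not state which factorization algorithm was used, so non-negative matrix factorization (NMF) with multiplicative updates is assumed here purely as a stand-in. The orientation (genes in rows, samples in columns, so that the amplitude matrix is genes × metagenes and the pattern matrix is metagenes × samples), the toy data, and all function names are likewise assumptions for illustration, not the easyMF implementation.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(x, amplitude, pattern):
    """Frobenius norm of X - A * P (reconstruction error)."""
    approx = matmul(amplitude, pattern)
    return sum(
        (xv - av) ** 2 for xr, ar in zip(x, approx) for xv, av in zip(xr, ar)
    ) ** 0.5


def nmf(x, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (genes x samples) into
    A (genes x k metagenes) and P (k metagenes x samples).

    Assumption: NMF with multiplicative updates is used as an example
    factorization; the source does not name the algorithm.
    """
    rng = random.Random(seed)
    n_genes, n_samples = len(x), len(x[0])
    a = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_genes)]
    p = [[rng.random() + 0.1 for _ in range(n_samples)] for _ in range(k)]
    for _ in range(n_iter):
        # Update the pattern matrix: P <- P * (A^T X) / (A^T A P)
        at = transpose(a)
        num = matmul(at, x)
        den = matmul(matmul(at, a), p)
        p = [
            [p[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_samples)]
            for i in range(k)
        ]
        # Update the amplitude matrix: A <- A * (X P^T) / (A P P^T)
        pt = transpose(p)
        num = matmul(x, pt)
        den = matmul(a, matmul(p, pt))
        a = [
            [a[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
            for i in range(n_genes)
        ]
    return a, p


# Hypothetical toy expression matrix: 4 genes (rows) x 4 samples (columns).
toy_expression = [
    [5.0, 5.0, 1.0, 1.0],
    [4.0, 4.0, 1.0, 1.0],
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 6.0, 6.0],
]

if __name__ == "__main__":
    amplitude, pattern = nmf(toy_expression, k=2)
    print("reconstruction error:", frobenius_error(toy_expression, amplitude, pattern))
    for row in pattern:
        print("metagene pattern across samples:", [round(v, 3) for v in row])
```

In this sketch, each row of the pattern matrix gives one metagene's profile across samples, which is the kind of low-dimensional representation that sample clustering could operate on; the number of metagenes `k` is a free parameter chosen by the user.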