| DNA sequencing technology is advancing dramatically and has yielded the complete genome sequences of nearly 2000 prokaryotes, and the widely used next-generation sequencing (NGS) technology is also promoting bacterial transcriptomic studies. However, the growing volume of massive-scale omics data urgently calls for rapid and deep mining for phenotype characterization. In this study, we discuss bioinformatic strategies for mining biological big data, with a focus on thiopeptide gene clusters and non-coding RNAs (ncRNAs) of bacteria. Thiopeptides are a growing class of sulfur-rich, highly modified heterocyclic peptides that are mainly active against Gram-positive bacteria, including various drug-resistant pathogens. Recent studies also reveal that many thiopeptides inhibit the proliferation of human cancer cells, further expanding their potential for clinical use. Thiopeptide biosynthesis shares a common paradigm, featuring a ribosomally synthesized precursor peptide and conserved posttranslational modifications that afford a characteristic core system, but differs in the tailoring steps that furnish individual members. In this study, we have developed a web-based tool, ThioFinder, to rapidly identify thiopeptide biosynthetic gene clusters and the cleavage sites of precursor peptides from DNA sequences using a profile Hidden Markov Model approach. Fifty-four new putative thiopeptide biosynthetic gene clusters were found in the sequenced genomes of bacteria not previously known to produce thiopeptides. Identification of new thiopeptide gene clusters, by taking advantage of the increasing amount of bacterial DNA sequence information, may facilitate new thiopeptide discovery and enrich the set of unique biosynthetic elements available for producing novel drug leads by applying the principle of combinatorial biosynthesis. An ncRNA is a functional RNA molecule that is not translated into a protein.
It plays important regulatory roles in a variety of cellular processes, such as bacterial pathogenesis and drug resistance. The use of RNA-Seq technology in transcriptomics has led to an explosion in the high-throughput discovery of bacterial ncRNAs. The second part of this study first collected 1490 ncRNAs in 17 bacterial strains by mining the reported RNA-Seq data, and then expanded the dataset with 878 published, experimentally verified ncRNAs. With the resulting large dataset, we identified the conserved sequences of the promoter and terminator regions of ncRNA genes, which are found to be closely related to the G+C content of bacterial genomes. The refined sequence features may facilitate prediction of ncRNA genes in bacterial nucleotide sequences. |