The Research Of Microbial Genomics Based On High-throughput Sequencing

Posted on: 2014-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Huang
GTID: 2250330398489950
Subject: Bioinformatics
Abstract/Summary:
With the rapid development of high-throughput sequencing (HTS) technologies, sequencing depth has increased while time and cost have fallen, and HTS is now widely applied in microbial genomics research. The microorganisms that have been completely sequenced are mainly special microorganisms, model microorganisms and medical microorganisms. Post-genomics research has revolutionized how microorganisms are understood and manipulated. However, the explosive growth of HTS data makes data analysis difficult, especially whole-genome assembly. Extracting the desired information from vast amounts of sequencing data is one of the most serious challenges.

Genomics research includes two aspects: structural genomics, which aims at whole-genome sequencing, and functional genomics, which aims at identifying gene function (that is, post-genomics research). HTS can be used for several kinds of sequencing, including whole-genome, transcriptome and metagenome sequencing. HTS also provides new methods for post-genomics analysis.

The main HTS platforms include Roche's 454, Illumina's HiSeq and MiSeq, and Life Technologies' Ion Torrent. Illumina's systems dominate the HTS market, with the advantages of high accuracy and high throughput and the disadvantages of long run time and short read length. Roche's 454 is known for its long read length, but has poorer accuracy and expensive sequencing reagents. Ion Torrent is the fastest sequencing platform.

Whole-genome sequencing is important for understanding the molecular evolution, genetics and gene regulation of a species. However, HTS works by randomly fragmenting the genome and sequencing the short fragments in parallel. To obtain the whole genome, HTS data must be assembled: a computer program reconstructs the original sequence from the overlaps between reads.
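The idea of reconstructing a sequence from overlapping reads can be illustrated with a toy example. The Python sketch below is not taken from the thesis and is not how any particular assembler works; it only merges two error-free reads on their longest exact suffix/prefix overlap. The function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    Assumes error-free reads on the same strand (a strong simplification).
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first and stop at the first match.
    for k in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(left, right, min_overlap=3):
    """Merge two reads on their overlap, or return None if they do not overlap."""
    k = overlap_length(left, right, min_overlap)
    if k == 0:
        return None
    # Keep `left` whole and append only the non-overlapping tail of `right`.
    return left + right[k:]
```

For example, `merge_reads("ATGCGTAC", "GTACGGA")` joins the two reads on the shared `GTAC` and returns `"ATGCGTACGGA"`. Real reads contain errors and repeats, which is why full assemblers need far more elaborate graph-based methods.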
Many assemblers based on different algorithms have been reported, but they usually generate many contigs, mainly because of repeat regions, even when paired-end (PE) or mate-pair (MP) sequencing data are used. Assembling these contigs into the whole genome sequence is a challenge for HTS. Although many gap-filling programs have been reported, they are very difficult to use. Gap-filling methods follow four main strategies: (1) using different assemblers to assemble the same data; (2) combining de novo and reference-guided assembly; (3) using different sequencing platforms to fill the gaps; (4) designing specific PCR primers flanking the gap and sequencing the PCR products. No universal program can handle all kinds of HTS data. HTS data often have to be assembled case by case, which is a difficult task for researchers without sequence assembly experience.

Therefore, obtaining the whole genome sequence and designing analytical methods have become a bottleneck in HTS-based genomics research. In this paper, we focus on whole-genome assembly: we propose several gap-filling methods and develop in-house Perl scripts to automate the analysis. In addition, we provide solutions to problems encountered in microbial genome studies, such as genome annotation, submission and MLVA (multiple-locus variable-number tandem repeat analysis) typing.

In the assembly method section, we first introduce the use of three common programs (Velvet, SOAPdenovo and Newbler), which are representative choices for different kinds of data. We analyze the significance and influence of each program's parameters on actual HTS data and, based on our experience, provide practical parameter settings for the three programs. Then, because existing assemblers generate only contigs and scaffolds rather than the whole genome, we propose three assembly methods for filling the gaps in large scaffolds, thereby complementing existing assembly software.
These three methods are contig localization (including a reference mapping strategy and a strategy based on mate-pair sequencing data), terminal extension to fill gaps, and reference-guided assembly. The methods are easy to understand and implement. We implement them as Perl scripts: (1) programs that identify the relationship between upstream and downstream contigs using paired data; (2) a terminal extension program for gap filling; (3) a gap-filling program that integrates de novo and reference-guided assembly results.

In the genome analysis section, because genomics research covers a wide range of topics, we propose solutions only for the problems we actually encountered. These solutions cover genome annotation and submission, as well as MLVA genotyping. For genome annotation and submission, the common procedures are described, together with a Perl script that generates the 5-column table required for genome sequence submission. In addition, to avoid the shortcomings of traditional MLVA analysis (length-based analysis of PCR amplicons), a sequence-based MLVA method is proposed. To implement it, we developed a program that extracts the sequence of each MLVA locus from the whole genome sequence.

The application value of the methods introduced in this paper lies mainly in the Perl scripts. The principles behind these programs are simple and easy to understand, which makes them especially suitable for non-specialists in HTS data analysis who need to perform individualized microbial genome analysis. To verify the reliability of the methods, each one is applied to a case study (for example Rickettsia rickettsii, Morganella morganii, Burkholderia pseudomallei, Bacillus anthracis and phages). The results show that the methods can fill many gaps without extra sequencing experiments, saving time and cost in whole-genome sequencing.
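The abstract does not show the terminal extension script itself, so the following Python sketch is only one plausible reading of the idea, not the author's Perl implementation. It assumes that reads matching the last few bases of a contig can supply the bases that follow. The function name `extend_right`, the `seed_len` parameter and the exact-match rule are all assumptions made for illustration.

```python
def extend_right(contig, reads, seed_len=5):
    """Extend the 3' end of `contig` using reads that span its last bases.

    A read is used only if it contains the contig's terminal seed and the
    part of the read before the seed also matches the contig exactly.
    The longest overhang found is appended. There is no consensus step
    and no error tolerance, so this is a toy version of the approach.
    """
    seed = contig[-seed_len:]
    upstream = contig[:-seed_len]
    best_overhang = ""
    for read in reads:
        pos = read.find(seed)  # first occurrence of the seed in the read
        if pos == -1:
            continue
        # Bases before the seed must agree with the contig to avoid
        # extending from a repeat copy located elsewhere in the genome.
        if not upstream.endswith(read[:pos]):
            continue
        overhang = read[pos + seed_len:]
        if len(overhang) > len(best_overhang):
            best_overhang = overhang
    return contig + best_overhang
```

Applying the function repeatedly from both sides of a gap would, under these assumptions, shrink the gap until the two contig ends overlap. A practical script would also need to handle reverse-complement reads and sequencing errors.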
The second part introduces a sequence-based genotyping method that has higher accuracy than the traditional MLVA method, indicating broad application prospects.

Bioinformatics is an application-oriented discipline, with different algorithms underlying different analytical tasks. Genomic analysis, including whole-genome sequencing and post-genomics analysis, often requires individualized analysis of specific objects. The methods introduced in this paper are of high practical value but are not necessarily applicable to every microbial species. I hope this paper provides useful references or tools for other researchers.
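The sequence-based MLVA typing described in the abstract can be sketched as follows. This is a hedged illustration in Python rather than the thesis's Perl program. It assumes each MLVA locus is defined by a forward and a reverse PCR primer, that both primers match the genome exactly, and that the locus lies on the forward strand. All function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def extract_locus(genome, fwd_primer, rev_primer):
    """Return the genome segment from the forward primer through the
    reverse primer site (both included), or None if either is missing."""
    start = genome.find(fwd_primer)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so search the
    # forward strand for its reverse complement downstream of the start.
    rev_site = reverse_complement(rev_primer)
    end = genome.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return genome[start:end + len(rev_site)]


def longest_repeat_run(seq, unit):
    """Largest number of consecutive exact copies of `unit` found in `seq`."""
    best = 0
    for i in range(len(seq)):
        run, j = 0, i
        while seq.startswith(unit, j):
            run += 1
            j += len(unit)
        best = max(best, run)
    return best
```

Unlike length-based typing, working on the extracted sequence lets the repeat copies be counted directly and any variation inside the repeats be inspected, which is the advantage the abstract attributes to the sequence-based approach.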
Keywords/Search Tags: high-throughput sequencing, assembly, gap filling, genomics