Development And Application Of Pan-genomics Analytical Method

Posted on: 2015-04-15; Degree: Doctor; Type: Dissertation
Country: China; Candidate: Y B Zhao
GTID: 1220330467480039; Subject: Bioinformatics
Abstract/Summary:
Advances in DNA sequencing and genome assembly technology have made many bacterial genomes available over the past decade. The accumulating genome data not only reveal great diversity among bacterial genomes but also provide microbiologists with abundant resources for medical studies. Since the concept of the pan-genome was proposed in 2005, it has been widely employed in studies of bacterial genomic dynamics, bacterial taxonomy, and reverse vaccinology. The pan-genome has also been used to investigate a wide range of genomes, from viral and fungal to plant genomes and, more specifically, cancer genomes. A great number of software tools have been developed to serve pan-genomics, but they still suffer from specific shortcomings, such as limited analytical functions.

In this study, we developed a pan-genome analysis pipeline (PGAP) in Perl, which analyzes multiple bacterial genomes based on protein-coding gene data (protein and nucleotide sequences as well as annotation information). PGAP integrates five analytical modules: orthologous gene identification, pan-genome analysis, genetic variation analysis, evolutionary analysis, and functional enrichment analysis. All of these modules can be run with a single command line. To keep the analysis efficient as genome data accumulate rapidly, we further developed a distance guide (DG) sampling algorithm to perform pan-genome analysis on large-scale genome data. In this algorithm, genome diversity is characterized by calculating the differences within orthologous clusters among strains, and strain combinations are then sampled according to genome diversity. To test the performance of the DG algorithm, we compared it against a totally random (TR) sampling algorithm and against real data (without sampling).
The results indicate that the DG algorithm has an overwhelming advantage over the TR algorithm in both accuracy and stability: the RMS values between simulation results obtained with the DG algorithm and the real data are lower than 0.1% of the total gene cluster number with a sample size of 500 for a 30-strain population. For large-scale populations (from 30 to 200 strains), the differences between the results of any two simulations with the DG algorithm are also lower than 0.1% of the total gene cluster number. Therefore, DG is an efficient and reliable sampling algorithm for pan-genome analysis of large-scale genome data. Users can run DG via PanGP, which was developed in Qt with an interactive graphical interface.

Using the analytical methods in these two programs, we performed a pan-genomic analysis of seven Salmonella Paratyphi A genomes as a test. Our results reveal that S. Paratyphi A genomes are highly conserved in both gene content and genome architecture. The closed functional pan-genome indicates that S. Paratyphi A has few imported genes. SNP analysis was carried out at both the protein-coding gene and whole-genome levels. The high ratio of substitutions in part of the core genome implies that homologous recombination events occurred frequently, which might be a significant factor in S. Paratyphi A genome evolution. Based on the changes in pan-genome size and cluster number (for both core functional genes and core pseudogenes), we infer that the sharply increasing number of pseudogene clusters is strongly related not only to the decreasing number of core functional gene clusters but also to the inactivation of functional genes, which suggests that the S. Paratyphi A genome is undergoing degradation.

Computational methods in comparative genomics and bioinformatics for multiple genomes motivated us to develop a more efficient algorithm for pan-genome analysis, which was validated and applied in a series of studies. This work provides an efficient analytical platform for pan-genomics, which should significantly facilitate the investigation of bacterial genome function and genome evolution.
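
The sampling problem described in this abstract can be illustrated with a small Python sketch. It is not the PGAP or PanGP implementation, and it does not reproduce the DG algorithm, whose details are not given here. It only shows, under stated assumptions, how pan-genome and core-genome sizes are averaged over strain combinations, either exhaustively (the "real data" case) or by totally random sampling. The toy `presence` table, the function names, and the use of Jaccard distance as a stand-in for the "difference within orthologous clusters" are all hypothetical.

```python
import random
from itertools import combinations

# Hypothetical toy input: strain name -> set of orthologous cluster IDs present.
presence = {
    "s1": {1, 2, 3, 4},
    "s2": {1, 2, 3, 5},
    "s3": {1, 2, 4, 6},
    "s4": {1, 2, 3, 4, 7},
}


def pan_core_sizes(strains, presence):
    """Return (pan-genome size, core-genome size) for one strain combination."""
    sets = [presence[s] for s in strains]
    pan = set().union(*sets)          # clusters found in at least one strain
    core = set.intersection(*sets)    # clusters found in every strain
    return len(pan), len(core)


def _mean_sizes(sizes):
    """Average a list of (pan, core) pairs."""
    n = len(sizes)
    return sum(p for p, _ in sizes) / n, sum(c for _, c in sizes) / n


def exact_profile(presence, n):
    """Mean pan/core sizes over ALL combinations of n strains (no sampling)."""
    combos = combinations(sorted(presence), n)
    return _mean_sizes([pan_core_sizes(c, presence) for c in combos])


def random_profile(presence, n, sample_size, rng):
    """Mean pan/core sizes over randomly drawn combinations of n strains."""
    strains = sorted(presence)
    draws = [rng.sample(strains, n) for _ in range(sample_size)]
    return _mean_sizes([pan_core_sizes(d, presence) for d in draws])


def jaccard_distance(a, b, presence):
    """Assumed diversity measure between two strains (not the DG definition)."""
    sa, sb = presence[a], presence[b]
    return 1 - len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    rng = random.Random(0)
    print(exact_profile(presence, 2))            # (5.5, 3.0)
    print(random_profile(presence, 2, 500, rng))
    print(jaccard_distance("s1", "s2", presence))
```

Exhaustive enumeration grows combinatorially with the number of strains, which is why sampling is needed for large populations. A diversity-guided sampler would use a distance such as the one above to choose which combinations to draw, rather than drawing them uniformly.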
Keywords/Search Tags: pan-genome, core gene, automatic analytical pipeline, sampling algorithm