
Reference-Free Comparative Genomics: Algorithms and Applications

Posted on: 2014-04-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H G Yi
GTID: 1220330434471179
Subject: Bioinformatics
Abstract/Summary:
We identified a universal paradigm in comparison methodology, the context-object paradigm, and showed that the context-object representation is a suitable form for comparing systems. We also showed that the degree of difference between two analogous systems does not depend on the order in which the comparison is made. This principle is useful in comparative genomics because it indicates that sequence order information is not needed to estimate the p-distance (or other genetic distances based on the p-distance) between two genomes. In principle, therefore, distance-based phylogenetic approaches require neither a complete genome sequence nor a sequence alignment.

We proposed the context-object representation of biological sequences and its generalization. This representation can quickly find the homologous sites between two genomes without using sequence order information. We proposed the co-phylog algorithm, which converts either a complete genome or unassembled sequencing data into the context-object representation, and can thus calculate the evolutionary distance between two organisms and build a phylogenomic tree without reference genomes. In our tests, the reference-free phylogenomic tree constructed by co-phylog was highly consistent with the phylogenetic tree based on complete genome sequences.

Using co-phylog, we explored the phylogenetic relationships within several genera whose phylogenetic relationships were previously unknown. We expect these results to be useful to other researchers.

We also presented a reference-free SNP calling algorithm, co-snp, which calls SNPs directly from high-throughput sequencing data. We found that the number of mutations between paralogous sequences is almost uncorrelated with their copy number. The sequencing depth distribution of paralogous sequences with few mutations can therefore be inferred from the depth distribution of highly mutated paralogous sequences, which helps exclude false-positive SNPs that come from repetitive regions.
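
The context-object idea described in the abstract can be illustrated with a small Python sketch. The abstract does not give the exact definition used in co-phylog, so everything concrete here is an assumption made for illustration: a "context" is taken to be the `flank` bases on each side of a site, the "object" is the base in the middle, and the distance is the fraction of shared, unambiguous contexts whose objects differ. The function names and the `flank` parameter are hypothetical, and reverse complements are ignored for brevity.

```python
from collections import defaultdict


def context_objects(seqs, flank=3):
    """Map each context (left flank, right flank) to the set of objects seen.

    `seqs` is an iterable of strings: one complete genome or many reads.
    Assumption: the object is the single base between the two flanks.
    """
    table = defaultdict(set)
    for seq in seqs:
        for i in range(flank, len(seq) - flank):
            left = seq[i - flank:i]
            right = seq[i + 1:i + 1 + flank]
            table[(left, right)].add(seq[i])
    return table


def co_distance(seqs_a, seqs_b, flank=3):
    """Fraction of shared contexts whose objects differ (a p-distance-like value).

    Contexts with more than one object in either input are skipped, since
    they are ambiguous (for example, repeats). Returns None when no
    unambiguous context is shared.
    """
    a = context_objects(seqs_a, flank)
    b = context_objects(seqs_b, flank)
    shared = 0
    differing = 0
    for ctx, objs_a in a.items():
        objs_b = b.get(ctx)
        if objs_b is None or len(objs_a) != 1 or len(objs_b) != 1:
            continue
        shared += 1
        if objs_a != objs_b:
            differing += 1
    return differing / shared if shared else None


if __name__ == "__main__":
    genome_a = ["AACCGGTTAC"]
    genome_b = ["AACCTGTTAC"]  # one substitution relative to genome_a
    print(co_distance(genome_a, genome_b, flank=2))
```

Because the contexts are stored in a dictionary, the result does not depend on the order in which sites or reads are visited, and no alignment step is needed. Real data would call for much longer flanks than the toy value used here.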
Keywords/Search Tags: Reference-free comparative genomics, phylogenomics, context-object