Font Size: a A A

Research On Parameterized Models And Algorithms For The Haplotype Assembly Problem

Posted on:2009-06-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:M Z XieFull Text:PDF
GTID:1118360245981911Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Haplotyping plays an important role in locating complex disease susceptibility genes.The haplotype assembly problem is a computational problem that,given a set of DNA sequence fragment data of an individual,induces the corresponding haplotypes.For the problem, based on different optimal criteria,there are many different computational models,such as Minimum SNP Removal(MSR), Minimum Fragment Removal(MFR),Minimum Error Correction (MEC),MEC with Genotype Information(MEC/GI),etc.Most of the models for the problem have been proven to be NP-hard,and there are not practical exact algorithms for them.Based on the observation that,for the real DNA sequence fragement data,the maximum number k1 of SNPs that a fragment covers is usually smaller than 10,and the maximum number k2 of the fragments covering a SNP site is usually not more than 19,we parameterize the MSR and MFR models.Based on the parameterized models,we propose two exact algorithms PMSR and PMFR to solve the gapless MSR and MFR models in the time complexity O(nk1k2+mlogm+nk2+mk1)and O(mk22+mk1k2+mlogm+nk2+mk1)respectively.To solve the MSR and MFR models with gaps,we design two exact algorithms PGMSR and PGMFR of the time complexity O(2knk1k2+mlogm+nk2+mk1)and O(2kmk1k2+23kmk22+mlogm+nk2+mk1)respectively.Extensive experiments show that,compared with Bafna et al.'s corresponding algorithms,the parameterized aglorthms above are more efficient and are practical in genome-wide haplotype assembly.To deal with long mate-pairs,in which there may be many holes and k will be very large,we propose two parameterized exact algorithms PMMSR and PMMFR to solve both models in the time complexity O(nk1k222h+k12h+nk2+mk1)and O(nk23k2+mlogm+nk2+mk1)respectively, where h is the maximum number of fragments covering a SNP site whose value is unknown at the SNP site.In real DNA sequence fragment data,k2 is usually not more than 19 and h is usually not more than 17. The theoretical complexity analysis and experimental results show that the running time of PMMSR and PMMFR is not directly related with k,and PMMSR and PMMFR algorithms can efficiently deal with MSR and MFR models with long mate-pairs respectively.Baed on the characteristics of real DNA sequence fragment data, we parameterize MEC and MEC/GI models,and introduce two exact algorithms PMEC and PMEC/GI to solve them.The time complexity of both algorithms is O(nk22k2+mlogm+mk1),and the experiments show that,when m is 100 and Wang et al.'s branch and bound algorithms are impractical,PMEC and PMEC/GI run as fast as Wang et al.'s genetic algorithms,and that,compared with Wang et al.'s genetic algorithms, PMEC and PMEC/GI are more accurate in haploype reconstruction.To improve the accuracy of haplotype reconstruction,we propose a computational model WMEC/GS for the haplotype assembly problem, which is based on weighted fragments and genotype with errors.Then the model is proved to be NP-hard even with gapless fragments.To solve this model,based on the characteristics of fragment data,the paper introduces a parameterized exact algorithm PWMEC/GS of time complexity O(nk22k2+mlogm+mk1).Extensive experiments on MEC/GI, WMLF and WMEC/GS show that WMEC/GS is more accurate in haplotype reconstruction than other models.
Keywords/Search Tags:single nucleotide polymorphisms (SNPs), haplotype, genotype, NP-hard, parameterized algorithm
PDF Full Text Request
Related items