Font Size: a A A

Algorithm Research On One-Sided Genomic Scaffold Filling Problem Based On Contig

Posted on:2024-05-02Degree:MasterType:Thesis
Country:ChinaCandidate:Y Q ZhuFull Text:PDF
GTID:2530307076974809Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the development of biological gene sequencing technology,the genome data shows an exponential growth trend,and the study of genome data has come into the public’s awareness.However,the data obtained by biological gene sequencing technology are fragmented compared with genomes.How to fill the incomplete genomic scaffold completely with computers has become a great challenge.The current biological gene sequencing technology has been developed to the fourth generation,which makes the acquisition of genome data faster and more accurate.The progress of biological gene sequencing technology has promoted the rapid development of life science,personalized medicine,genome mapping and other fields.Genome data is playing an increasingly important role.The current biological gene sequencing technology still has some limitations in obtaining complete genome sequences.Therefore,filling the incomplete genomic scaffold by computer technology can reduce the cost of biological gene sequencing and improve the integrity of genome sequence.It has certain feasibility and positive significance.The genomic scaffold filling problem is an important field of computational genomics.It refers to using the idea of combinatorial optimization to fill the missing genes into the incomplete genomic scaffold,so that the filled scaffold is similar to the reference genome.Nowadays,contigs can be directly obtained by Celera Assembler and other tools.It is more practical to study the genomic scaffold filling problem based on contig than the previous ordinary genomic scaffold filling problem.In this thesis,the one-sided genomic scaffold filling problem based on contig is studied from two aspects:maximizing the number of common adjacencies and maximizing the number of duo-preservations.The main works of this thesis are as follows.(1)According to the similarity measure of maximizing the number of common adjacencies,this thesis focuses on the one-sided genomic scaffold filling problem based on contig.At present,H.Jiang et al.proposed a 2-approximation algorithm using the maximum matching.However,this thesis finds that this algorithm has some problems such as redundant adjacencies in repeated calculations.In this thesis,the missing genes are divided into Type-1 substrings,Type-2 substrings and Type-3 substrings based on the different number of common adjacencies generated by inserting missing genes.At the same time,the insertion of Type-n(1≤n≤3)substrings with different lengths was studied,which solved the problem of redundant adjacencies in the process of genomic scaffold filling caused by repeated genes.Based on the above,this thesis proposes an improved algorithm based on maximum matching and greedy strategy,and proves that the approximate ratio of the algorithm is 15/8 and the time complexity is O(n 2.5).(2)In this thesis,duo-preservation is applied to the one-sided genomic scaffold filling problem based on contig for the first time.In this thesis,the relationship between genome sequences is redefined by using duo-preservation.Good 1-strings,middle 1-strings and bad1-strings are processed in turn by maximum matching.An approximate algorithm with time complexity of O(n 2.5)is designed.This algorithm can solve the case in genomic scaffold filling problem of maximizing common adjacency,where all adjacencies between two genome sequences are common adjacencies,but they are not the same gene sequence.(3)A visual genomic scaffold filling platform is developed to simulate the genomic scaffold filling process.The main function of the platform is to randomly generate genome sequences with different specifications,record the specific process of genomic scaffold filling and calculate the filling time.The platform can generate a dataset containing 10000genome sequences with different specifications.Based on this dataset,the above two tasks are tested on the platform respectively.For(1),this thesis proves that the improved algorithm has a certain improvement in the number of common adjacencies and running time at the levels of different gene missing rates and genome sequence lengths.For(2),this thesis verifies the feasibility of the approximate algorithm in terms of the number of duo-preservations and running time.
Keywords/Search Tags:scaffold filling, genomic, maximum matching, greedy strategy, approximation algorithm, NP-complete
PDF Full Text Request
Related items