
Study On All-to-all Comparison Problems And Parallelization Of Gene Sequence Alignment Algorithms

Posted on: 2020-07-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L X Li
Full Text: PDF
GTID: 1360330602478617
Subject: Agricultural IT
Abstract/Summary:
The all-to-all comparison problem is a special class of computing problem that arises widely in bioinformatics and data mining. A reasonable and effective data distribution strategy for this problem makes full use of the computing power of each node in a distributed system and improves the computing efficiency of multiple sequence alignment tasks. Multiple sequence alignment is a time-consuming computational task, and parallel design of the alignment algorithm is the key to improving alignment speed. Aligning sequence files in a distributed system is a typical all-to-all comparison problem. In biology, sequence alignment helps reveal the similarities and differences in nucleotide composition and gene sequences among species, uncover the potential functions of genes, and clarify the evolutionary relationships of species and the internal structure of genomes. This dissertation focuses on data distribution strategies for the all-to-all comparison problem in distributed systems, the evaluation of large sequence file segmentation, the construction of a distributed file distribution framework, and the parallelization of sequence alignment algorithms. The main work and innovations of this dissertation are as follows:

1. The data distribution strategy for the all-to-all comparison problem in distributed systems is studied. The all-to-all comparison problem is formally described, and a multi-objective optimization model for data file distribution is proposed that meets the requirements of data localization, storage balance, not exceeding the upper limit of node storage, and node load balance. A data file distribution strategy is designed, and a data distribution strategy for data files under data centralization is also proposed. The effectiveness of the model and algorithm is verified by simulation experiments.

2. Algorithms for file segmentation and merging are studied. For a large gene sequence file, the goals of load balancing, storage balancing, not exceeding the storage upper limit, and minimum average computation per node are normalized, the importance coefficient of each goal is set according to the actual application, and a file segmentation evaluation model for full (all-to-all) alignment is constructed. Drawing on the multi-objective optimal data file distribution model described above, an optimal file segmentation algorithm is given.

3. A distributed file distribution system is built on the Hadoop framework using the file distribution strategy proposed in this dissertation. A distributed cluster environment is set up in which HDFS provides file storage, the YARN framework handles resource management, and the file distribution program is implemented in Java. The experimental results show that files can be distributed to different nodes of the distributed system while satisfying data localization, balanced storage, the upper limit of node storage, and load balancing.

4. Parallelization of gene sequence alignment algorithms is studied. The Smith-Waterman algorithm is improved to increase its running speed and reduce its time complexity. The Smith-Waterman and BLAST algorithms are parallelized on the Spark platform. The accuracy, efficiency and scalability of the parallel algorithms are verified by a series of experiments: an accuracy experiment on the designed algorithm, a multi-node comparison experiment on the cluster, a comparison experiment across different cluster nodes, and a scalability experiment.

The main contributions and innovations of this dissertation are as follows: for the all-to-all comparison problem, a multi-objective optimization file distribution model is constructed that satisfies the conditions of data localization, node load balancing, storage balancing and minimum storage of nodes; a segmentation evaluation model for large sequence files is constructed; and parallel schemes for the BLAST algorithm and the Smith-Waterman (SW) algorithm are designed on the Spark platform.
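
To make the distribution constraints named in the abstract concrete, the following is a minimal greedy sketch, not the dissertation's multi-objective model or algorithm. It assumes hypothetical inputs (a dictionary of file sizes, a node count, and a single storage upper limit shared by all nodes) and a hypothetical function name `distribute_pairs`. Every unordered pair of files is assigned to exactly one node that stores both files locally (data localization), never exceeding the storage limit, with ties broken in favour of the least-loaded node.

```python
from itertools import combinations


def distribute_pairs(file_sizes, num_nodes, capacity):
    """Greedily assign every unordered file pair to one node.

    file_sizes: dict mapping file name -> size (arbitrary units)
    capacity:   storage upper limit per node (same units)
    Returns (stored, tasks): per-node set of stored files and
    per-node list of comparison pairs.
    """
    stored = [set() for _ in range(num_nodes)]
    used = [0] * num_nodes
    tasks = [[] for _ in range(num_nodes)]

    for a, b in combinations(sorted(file_sizes), 2):
        best, best_key = None, None
        for n in range(num_nodes):
            # Extra storage needed so that both files are local to node n.
            extra = sum(file_sizes[f] for f in (a, b) if f not in stored[n])
            if used[n] + extra > capacity:
                continue  # would exceed the node's storage upper limit
            # Prefer fewer assigned comparisons (load balance), then less
            # new storage (file reuse), then less used storage.
            key = (len(tasks[n]), extra, used[n])
            if best_key is None or key < best_key:
                best, best_key = n, key
        if best is None:
            raise ValueError(f"no node can hold pair ({a}, {b})")
        for f in (a, b):
            if f not in stored[best]:
                stored[best].add(f)
                used[best] += file_sizes[f]
        tasks[best].append((a, b))
    return stored, tasks


# Hypothetical example: four sequence files, three nodes, limit of 12 units.
sizes = {"s1": 4, "s2": 3, "s3": 5, "s4": 2}
stored, tasks = distribute_pairs(sizes, num_nodes=3, capacity=12)
```

A greedy pass like this gives no optimality guarantee; it only illustrates how locality, the storage limit, and load balance interact when pairs are placed one at a time.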
Keywords/Search Tags:Distributed system, All-to-all comparison problem, File distribution model, Gene sequence alignment, Algorithm parallelization
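
For readers unfamiliar with the Smith-Waterman algorithm mentioned in the abstract, here is a minimal sketch of the textbook quadratic-time version with a linear gap penalty. It is not the improved or Spark-parallelized variant developed in the dissertation; the function name `smith_waterman` and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical example: the shared region "ACG" is the best local alignment.
score, top, bottom = smith_waterman("TTACGTTT", "GGACGGG")
```

Because each pair of sequences is scored independently, calls like this one are natural units of work in an all-to-all comparison, which is one reason the problem lends itself to distribution across nodes.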