
Research On The Construction And Sequence Splicing Parallel Optimization Method Of The Second And Third Generation Genome Hybrid Assembly Process

Posted on: 2018-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: H B Wu
Full Text: PDF
GTID: 2350330515455962
Subject: Instrumentation engineering
Abstract/Summary:
With the rapid development of bioinformatics, the world has entered the age of life science and information science. Third-generation single-molecule sequencing technology is poised to revolutionize genomics by enabling the sequencing of long reads. As sequencing technology advances, bioinformatics faces growing challenges: ever more sequencing data accumulate and require more computing resources to analyze, and new sequencing technologies produce sequences with different characteristics, which places new demands on sequence assembly. This research therefore focuses on strategies for de novo assembly of hybrid sequencing data and on parallel algorithm techniques, aiming to meet the needs of scientific research on sequencing data while saving computing resources during sequence assembly. The main work is as follows.

Firstly, large and complex biological data require powerful computing resources. To meet the needs of the research group, a bioinformatics platform based on cluster techniques was needed. The method and procedure for building a high-performance biological PC cluster with the Rocks operating system (the Rocks cluster) are introduced. The Rocks cluster both integrates resources effectively through cluster techniques and provides a convenient and powerful data-processing platform for future bioinformatics research.

Secondly, the rapid development of DNA sequencing technology has promoted the development of genomics. This thesis analyzes the advantages and disadvantages of the different generations of sequencing technology: third-generation sequencing produces long reads with a high error rate, whereas second-generation sequencing produces short reads with a low error rate. To make full use of the advantages of both types of data and obtain a better genome assembly, we design a hybrid genome assembly process on the Rocks cluster. It is a hybrid-correction assembly approach that corrects the third-generation sequencing data using the second-generation sequencing data.

Finally, hybrid genome assembly consumes a large amount of memory, so the Rocks cluster does not work well when faced with the assembly of complex genomes. To address this problem, this thesis analyzes memory usage during the assembly process and attempts a solution suited to the conditions of the Rocks cluster. First, we use Global Array to manage and allocate the memory of the compute nodes, partitioning the computation and the data; second, we design a process-level parallel method to relieve memory pressure on a single node. In addition, in search of a better alternative to error correction in hybrid assembly, we put forward a new algorithmic idea for hybrid genome assembly: first use the second-generation sequencing data to construct the assembly graph, then map the third-generation sequencing data onto the graph. The purpose is to simplify the graph and to find the correct assembly path using the long third-generation reads, thus avoiding the error-correction process.
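
The graph-based idea in the final paragraph can be illustrated with a toy sketch. This is not the thesis's implementation: the abstract does not specify the graph type or the mapping method, so the code below assumes a de Bruijn graph built from error-free short reads, and a long read that contains only substitution errors (no insertions or deletions), starts on a node of the graph, and is correct at the positions where the graph branches. The function names `build_de_bruijn` and `thread_long_read` are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(short_reads, k):
    """Build a de Bruijn graph: each (k-1)-mer maps to the (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def thread_long_read(graph, long_read, k):
    """Walk the short-read graph, consulting the long read only at branches.

    Toy assumptions: substitution errors only, the first k-1 bases of the
    long read are correct, and the long read is correct at branch positions.
    """
    node = long_read[:k - 1]
    path = node
    pos = k - 1  # index in the long read of the next base to be decided
    while pos < len(long_read):
        successors = graph.get(node, set())
        if not successors:
            break  # dead end in the graph
        if len(successors) == 1:
            # Unambiguous: trust the graph built from accurate short reads,
            # even if the long read disagrees at this position.
            nxt = next(iter(successors))
        else:
            # Branch (e.g. a repeat): let the long read pick the successor.
            matching = [s for s in successors if s[-1] == long_read[pos]]
            if not matching:
                break  # the long read supports none of the branches
            nxt = matching[0]
        path += nxt[-1]
        node = nxt
        pos += 1
    return path
```

For example, with the short reads `AAGCTT`, `GCTTGC`, and `TTGCTA` and `k = 4`, the node `GCT` has two successors, so the short reads alone cannot decide the path. Threading the long read `AAGCTTACTA`, which carries a substitution error at a non-branching position, yields `AAGCTTGCTA`: the long read resolves both visits to the branch, and its erroneous base is never used, so no separate correction step is needed in this toy setting.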
Keywords/Search Tags: Bioinformatics, Rocks cluster, Hybrid genome assembly, Error correction, Parallel optimization