Font Size: a A A

Design And Optimization Of Parallel Algorithm For Biolgogical Sequences

Posted on:2017-03-30Degree:MasterType:Thesis
Country:ChinaCandidate:Q G ZhengFull Text:PDF
GTID:2308330485982225Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Parallel computing provides a best way to solve the performance bottleneck of uniprocessor. It can make full use of computer hardware to realize high performance computing. It includes three research directions:parallel computing platform, parallel software, parallel algorithm。The status quo of parallel computing is that the development of parallel algorithm falls so far behind parallel hardware that parts of parallel algorithm cannot make full use of parallel hardware effectively. It is the lack of high-performance parallel algorithm that becomes a big obstruction to the development of parallel computing. The important reason for this situation is that designing and optimizing parallel algorithm is much more complex than serial algorithm. How to design a parallel algorithm has a significant meaning.We choose two basic subjects in bioinformatics sphere:protein sequence alignment and DNA genome assembly. They are not only fundimental but also extremly difficult and important, because lots of research depend on them. DNA sequence is composed of four bases(ATCG). Protein sequence consists of 20 amino acids. Considering their features, we should use different strategies to speedup two algorithms.Sequence alignment is a basic component and important basis of bioinformatics. The basic idea is the general law that amino acid sequence of protein determines its structure, and the structure determines its function. By detecting the similarity between sequences, we can find the fucntion, structure and evolutionary information. But with the implementation of human genome project, the size of the protein sequence databases has been increasing exponentially, and serial alignment algorithm has been difficult to meet the needs of practical applications. So we need to design efficient parallel algorithm to speed up the protein alignment algorithm. By using the latest Intel MIC (Many Integrated Core) architecture Xeon Phi coprocessor, we design a heterogeneous system architecture to speedup the protein sequence alignment algorithm. We call it XPFS. In our algorithm, we implement three levels of parallelism:instruction level parallelism, thread level parallelism, and process level parallelism. In order to make efficient use of the hardware, we adopt a dynamic task allocation strategy and cut total task into small block. Because the task block can be relatively small, so we can avoid the problem of load imbalance. Compared with the traditional PFSearch algorithm, we can obtain nearly 4 times speedup.DNA sequence contains the most basic genetic information of organisms, and it is the fundamental source of molecular biological research. Due to the limitations of the sequencing technology, the current sequencing machine can only read a very short DNA fragment (75-400bp), which is far from enough for molecular biology research. Sequence alignment refers to the reconstruction of DNA sequence by the combination of a number of short sequence segments. This is an indispensable research direction for molecular biology. There are two types of sequence assembly:de novo, mapping assembly, de novo has been a hot issue in the field of molecular biology. The difficulties are:lots of fragments, sequencing errors, no reference genome, repeat regions, large memory required and so on. According to all those difficulties, we design a new de novo asssembler, ARCS. Through experiments, we have compared the results of our algorithm and the existing alignment algorithm, and ARCS has achieved a better result in the experimental species, and the performance of time and memory has been significantly improved.
Keywords/Search Tags:Genome Assembly, Sequence Alignment, Intel MIC, Parallel Computing, Biological Sequence
PDF Full Text Request
Related items