Research And Implementation Of XP-EHH Algorithm Based On Spark Platform

Posted on: 2017-01-22 | Degree: Master | Type: Thesis
Country: China | Candidate: C C Liu
GTID: 2308330488464366 | Subject: Software engineering
Abstract/Summary:
With the development of the Internet and computer technology, data in many fields is growing at high speed. To fully analyze and mine the valuable information in these data, new storage models and computing frameworks have been proposed. Cloud computing platforms based on Spark and Hadoop provide distributed storage and distributed computing for big data, and these techniques are widely used in production.

Population genetics, a branch of genetics, elucidates the mechanism of biological evolution by studying the dynamics of gene mutation within populations. A change in the proportion of a gene over the course of inheritance is called selection. Detecting selection signals among populations directly reflects the genetic diversity and evolutionary history of different groups. With the maturation of second-generation sequencing technology, genetic data is growing explosively. As a result, how to analyze and make use of these data has become a serious challenge.

XP-EHH (Cross Population Extended Haplotype Homozygosity), a method for detecting selection signals between populations, is based on EHH (Extended Haplotype Homozygosity) and adopts the comparison strategy of the iHS (Integrated Haplotype Score) method, which lets it achieve high detection efficiency for selection signals that are fixed or nearly fixed. Because the current XP-EHH program, written in C++, supports only small data sets and performs poorly in big data tests, we propose SparkXpehh, a new algorithm based on the Spark platform. The main contributions of this thesis are as follows:

(1) We propose SparkXpehh, an algorithm based on the Spark platform, and implement it in Scala. Compared with the traditional multi-threaded program, SparkXpehh makes full use of distributed parallelization, maintains good performance in big data tests, and scales well.
(2) We design a data storage system based on Hadoop HDFS, and design and implement a data caching strategy based on RDD and Redis, which lets SparkXpehh adapt to different sizes of test data and different amounts of RAM.
(3) We test SparkXpehh thoroughly, with the largest test data set being about four million in size, and run scalability tests using different numbers of Spark cluster nodes.

The tests show that our algorithm substantially improves performance over the traditional program, and it can serve as a reference for migrating other bioinformatics algorithms to distributed computing platforms.
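
For readers unfamiliar with the statistic, the following is a minimal single-machine sketch of the unstandardized XP-EHH score at one core site, written in Python rather than the Scala used in the thesis. It is not the SparkXpehh implementation. It assumes the commonly published definition: EHH is the probability that two randomly chosen chromosomes are identical over the interval from the core site to another site, EHH is integrated over position, and the score is the log ratio of the two populations' integrals. The function names and the toy haplotype data are hypothetical, and the truncation of the integral and the genome-wide standardization used by published tools are omitted.

```python
from collections import Counter
from math import comb, log


def ehh(haplotypes, core, site):
    """EHH between the core index and another site index.

    haplotypes: list of equal-length allele sequences, one per chromosome.
    Returns the fraction of chromosome pairs that are identical at every
    site in the closed interval between core and site.
    """
    lo, hi = min(core, site), max(core, site)
    n = len(haplotypes)
    if n < 2:
        return 0.0
    # Group chromosomes by their haplotype over the interval.
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


def integrated_ehh(haplotypes, positions, core):
    """Trapezoidal integral of EHH over all site positions (assumed sorted)."""
    values = [ehh(haplotypes, core, j) for j in range(len(positions))]
    total = 0.0
    for j in range(len(positions) - 1):
        width = positions[j + 1] - positions[j]
        total += (values[j] + values[j + 1]) / 2 * width
    return total


def xpehh_unstandardized(pop_a, pop_b, positions, core):
    """ln(I_A / I_B) at the core site; positive values suggest longer
    shared haplotypes in population A than in population B."""
    i_a = integrated_ehh(pop_a, positions, core)
    i_b = integrated_ehh(pop_b, positions, core)
    if i_a <= 0 or i_b <= 0:
        raise ValueError("integrated EHH must be positive in both populations")
    return log(i_a / i_b)


# Toy data (hypothetical): population A carries one haplotype, B is diverse.
pop_a = [[0, 0, 0]] * 4
pop_b = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1]]
positions = [0, 1, 2]
score = xpehh_unstandardized(pop_a, pop_b, positions, core=0)
```

In this sketch each core site is scored independently of the others, which is the kind of structure that lends itself to distribution across cluster nodes; how SparkXpehh actually partitions the work is not described in the abstract.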
Keywords/Search Tags: Selection signatures, XP-EHH, EHH, Spark