Font Size: a A A

Distributed Gene Sequence Similarity Calculation Based On Secure Multiparty Computation

Posted on:2017-01-02Degree:MasterType:Thesis
Country:ChinaCandidate:L C WangFull Text:PDF
GTID:2308330485980617Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
Genomic data sharing and analysis provides a key way of understanding biological mechanisms, carrying out high level medical diagnosis and treatments, facilitating clinical healthcare data reuse, and accelerating scientific discoveries. As one of important similarity measurements, Edit distance is widely used in human genome studies. However, because of the sensitive personal information contained in genetic data, broad dissemination without adequate protection would cause harmful effects on participants. Confronting the challenge of privacy-preserving sequence sharing and analysis over genome datasets stored in different institutions around the world, this paper proposed a distributed genome sequence similarity calculation model based on Secure Multiparty Computation(SMC) and implemented it with the latest technology advances in Goldreich-Micali-Wigderson(GMW) protocol. The main contributions of this work consist of following three points:(1) Design and implementation of the distributed secure sequence analysis model using GMW circuit. This paper introduced and contrasted current SMC privacy protection protocols to conclude that GMW has some advantages over other protocols after adopting optimization techniques like Oblivious Transfer(OT) extension and OT pre-computation. Then, we designed a distributed secure genome sequence comparison model independent of any third-party platform, which meets the requirements of realistic scenario. Furthermore, it was implemented with logic circuit based on GMW protocol.(2) Genome data preprocessing and results integration. To increase the efficiency without loss the worthy of application, an approximate Edit distance method was leveraged to measure the similarity of gene sequences. To enable the parties to share part of the heavy computation loads of the model locally, we align the test datasets to a public reference sequence and further optimizing the distance method. To minimize the circuitry complexity as much as possible, we utilize the hash algorithm DJB2 to encode genome sequences as binary string. To reduce memory consumption and network bandwidth, we propose the idea of data segmentation. Therefore, the final results should be a synthesis of secure computation outputs conducted from all data blocks.(3) Experiments and discussion. We designed experiments for specific application scenarios as well as representative parameters, and summarized the performance of the prototype in terms of efficiency, number of gates(e.g. circuit complexity), Error Rate, and network bandwidth. What’s more, we compared experiment results with other reported work to further demonstrate the advantages of our model. Compared with the existing solutions, the distributed secure sequence analysis model not only supports sequence comparison over any number of parties against a semi-host adversary corrupting setting without leaking any party’s genetic information, but also only take 8s to get the distance of two sequences that respectively belong to two parties with 5000 gene locus on each sequence, and 320 s to compute the distance in a 3-party setup with 5 sequence in each party.
Keywords/Search Tags:genome sequence Edit distance, privacy-preserving, data sharing, secure multiparty computation, distributed computation
PDF Full Text Request
Related items