Using Apache Spark for Scalable Gene Sequence Analysis

Posted on:2017-04-09

Degree:M.S

Type:Thesis

University:Texas A&M University - Commerce

Candidate:Syed, Muthahar

GTID:2468390011498750

Subject:Computer Science

Abstract/Summary:

Scientific and technological advances have made it possible to digitize genetic information, producing an enormous volume of genetic sequences. These sequences encode the details of human DNA, and analyzing sequencing data at this scale is the central challenge. This thesis introduces a scalable genome sequence analysis system built on the parallel computing features of Apache Spark and its relational processing module, Spark SQL (Spark Structured Query Language). Spark supports efficient data reuse by holding data in memory, which significantly reduces data access time and thus improves performance. To demonstrate the scalability of the proposed system, the experiments run on a Spark parallel computing cluster deployed on top of Yet Another Resource Negotiator (YARN). They use the publicly available 1000 Genomes Variant Call Format (VCF) data (1.2 TB) as input. The input data are analyzed using Spark, and the end results are evaluated to measure the scalability and performance of the system. I further implemented a web-based interface where users can specify search criteria; Spark SQL then runs the search against the data held in memory and returns the matching results.
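
The search path described above (parse VCF records into relational rows, hold them in memory, answer user-specified criteria with SQL) can be illustrated with a small sketch. This is not the thesis implementation: Python's built-in sqlite3 in-memory database stands in for Spark SQL so the example runs without a cluster, and the sample records, the `variants` table, and the function names are all hypothetical.

```python
import sqlite3

# Hypothetical sample in VCF layout (tab-separated); not real 1000 Genomes data.
SAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t10177\trs0000001\tA\tAC\t100\tPASS\tAF=0.42",
    "1\t10352\trs0000002\tT\tTA\t100\tPASS\tAF=0.44",
    "2\t10554\trs0000003\tC\tT\t100\tPASS\tAF=0.01",
])


def parse_vcf(text):
    """Yield (chrom, pos, id, ref, alt) tuples from VCF data lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip meta-information (##) and header (#) lines
        chrom, pos, vid, ref, alt = line.split("\t")[:5]
        yield chrom, int(pos), vid, ref, alt


def load_variants(text):
    """Load parsed rows into an in-memory table (stand-in for a cached Spark table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants "
        "(chrom TEXT, pos INTEGER, id TEXT, ref TEXT, alt TEXT)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", parse_vcf(text))
    return conn


def search(conn, chrom, start, end):
    """Return variants on one chromosome within a position range, ordered by position."""
    cur = conn.execute(
        "SELECT id, pos, ref, alt FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cur.fetchall()


conn = load_variants(SAMPLE_VCF)
hits = search(conn, "1", 10000, 10400)
print(hits)
```

On a Spark cluster, the analogous steps would presumably be to load the parsed rows into a DataFrame, keep it in memory with `cache()`, register it with `createOrReplaceTempView`, and issue the same kind of query through `spark.sql`; how the thesis system actually structures these steps is not stated in the abstract.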

Keywords/Search Tags:

Spark, Data

Related items

1	The Research Of Big Data Manipulating Technology Based On Spark
2	The Query Execution Optimization In Spark SQL
3	Application Research Of Real-time Data Analysis Based On Spark Computing
4	Research And Implementation Of Efficient WEB Container Log Processing System Based On Spark
5	Research And Implementation Of Data Imputation Technology Based On Spark
6	The Research And Implementation Of Mining Large Data Based On Spark
7	Research And Implementation Of Data Hybrid Computing Platform Based On Spark
8	The Design And Implement Of Data Source Connector Based On Spark SQL
9	Research On Fast Data Cube Computation Method Based On Spark Platform
10	Research On The Design Of Spark-accelerated Boosting By Majority Voting Algorithm