Research On Parallel FM-Index Algorithm Based On Spark In RNA-Seq Reads Mapping

Posted on:2019-06-12

Degree:Master

Type:Thesis

Country:China

Candidate:F Liu

Full Text:PDF

GTID:2348330566959844

Subject:Computer application technology

Abstract/Summary:

With the continuous development of second-generation high-throughput sequencing technology, the amount of data generated by RNA-Seq has steadily increased. Although this huge volume of data contains a great deal of biological information, it also confronts researchers with a difficult bioinformatics analysis problem. Analyzing these massive biological data sets rapidly and efficiently, and mining the information they contain, are urgent issues in today's bioinformatics. For the storage and processing of massive bioinformatics data, it is clearly unrealistic to rely on thousands of individual computers. Applying cloud computing technology to partition large data sets is therefore the most suitable solution for storing, processing, and analyzing them.

In the RNA-Seq data analysis process, the reads mapping (sequence alignment) step uses a reads mapping algorithm to determine, for each read fragment obtained from RNA-Seq sequencing, its coordinates in the reference genome (chromosome number and position within the chromosome). Reads mapping is the first and an important step in RNA-Seq data analysis: the quality of its results and its running time affect all subsequent analysis. With the development of high-throughput sequencing technologies, RNA-Seq data are characterized by high throughput, low cost, and a huge amount of information. This poses enormous challenges for traditional sequence alignment tools in terms of time consumption and memory requirements, so selecting an appropriate reads mapping algorithm is essential. The reads mapping process can be abstracted into a string search problem: searching for a substring in a long string to determine the position of that substring.

Commonly used reads mapping algorithms include the Hash Table algorithm, the Suffix Array algorithm, the Kart algorithm, and the FM-Index algorithm. In this paper, these four basic algorithms are briefly analyzed and compared in terms of memory requirement, run time, and sequence alignment accuracy on simulated read data sets. The FM-Index algorithm, which performs comparatively well, is selected for subsequent parallelization.

In parallelizing the FM-Index algorithm on Spark, the two parts of the reads mapping process that are parallelized are reference genome indexing and read alignment. The reference genome index is built in parallel through the Spark distributed computing framework in three steps: cutting the reference genome sequence, shuffling and sorting the key-value pairs, and persisting the RDD index. First, the large reference genome is divided into small fragments, which are distributed across different RDDs cached in memory. Then the key-value pairs are indexed and sorted. Finally, in the read alignment process, the large number of input reads is divided into different RDDs and aligned against the reference genome to determine the coordinates of the reads in the reference genome. By parallelizing and optimizing the serial algorithm, the time consumption and memory requirements of the reads mapping process are reduced.

In the era of omics big data, massive bioinformatics data make it difficult for traditional sequence alignment tools to complete reads mapping efficiently. Combining the traditional mapping algorithm with cloud computing technology to develop a reads mapping process adapted to biological big data therefore becomes an effective way to address the problem of RNA-Seq data analysis. The rapid development of big data and cloud computing technology has greatly helped to solve this biological problem. Constructing a cloud computing environment for sequence alignment, optimizing the mapping of short reads, and further advancing RNA-Seq data analysis are of great significance to bioinformatics.
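
The string-search view of reads mapping described in the abstract can be illustrated with a minimal, serial FM-Index sketch. This is not the thesis implementation and does not use Spark; the function names build_fm_index and locate, the toy reference string, and the naive sort-based suffix array are all assumptions made for illustration. The sorting of (suffix, position) pairs only loosely mirrors the key-value sorting step mentioned above.

```python
from collections import Counter


def build_fm_index(reference):
    """Build a toy FM-Index (suffix array, C table, Occ table) for a string.

    Hypothetical helper for illustration; a naive O(n^2 log n) construction
    that is only suitable for very small inputs.
    """
    text = reference + "$"  # sentinel that sorts before A, C, G, T
    # Sort (suffix, position) pairs by suffix and keep the positions.
    sa = [pos for _, pos in sorted((text[i:], i) for i in range(len(text)))]
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[ch]: number of characters in text strictly smaller than ch.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return sa, c_table, occ


def locate(read, index):
    """Return sorted 0-based positions where read occurs exactly."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # half-open range of matching suffix array rows
    # Backward search: extend the match one character at a time,
    # starting from the last character of the read.
    for ch in reversed(read):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    index = build_fm_index("ACGTACGTTAGC")
    print(locate("ACG", index))
```

For the toy reference ACGTACGTTAGC, locate("ACG", index) returns [0, 4]. The sketch handles exact matches on a single sequence only; mismatches, multiple chromosomes, and distribution across RDDs are outside its scope.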

Keywords/Search Tags:

Bioinformatics, RNA-Seq, Reads mapping, Spark distributed framework, FM-Index algorithm

Related items

1	Design And Implementation Of A Distributed Hybrid Index Structure Based On Spark
2	Compression Algorithm Of Burrows-Wheeler Transform Index Faced To Genome Re-Sequencing
3	Research On Recommendation Algorithm Based On Spark
4	Design And Implementation Of Voltage Index Management System Based On Spark Platform
5	Parallel Research On Data Mining Algorithm Based On YARN And Spark Framework
6	An Ad-hoc Query Engine Based On Spark SQL
7	Research On Analytics Of Distributed Big Temporal Data
8	A collaborative framework for knowledge acquisition and management for bioinformatics applications
9	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
10	IPTV User Complaint Prediction System Design And Implementation Based On Spark Distributed Computing Framework