Research On Parallelization Of H+H Pipeline Based On Spark Cluster

Posted on:2023-06-11

Degree:Master

Type:Thesis

Country:China

Candidate:J N Guo

GTID:2530306851989559

Subject:Computer application technology

Abstract/Summary:

With the rapid development of high-throughput sequencing technology, a large amount of gene sequencing data has been generated and accumulated in the field of bioinformatics. Quantitative analysis of gene sequences can reveal key information about nucleic acids and proteins, such as functional structure and genetic information, and is an important step in genetic data analysis. Bioinformatics researchers often lack computer expertise, which makes gene quantitative analysis difficult to operate. To address this, this thesis proposes an automated analysis method for the H+H (HISAT2+HTSeq) workflow pipeline. The method chains HISAT2 and HTSeq automatically, using the alignment output of HISAT2 directly as the input of HTSeq to complete the quantitative analysis workflow. This reduces the number of manual steps, lowers the difficulty of data analysis, and gives researchers a more convenient way to work.

Because gene sequencing data sets are very large and sequence alignment is computationally expensive, quantitative analysis of gene sequences is slow and inefficient. This thesis therefore proposes a parallel computing method for the H+H workflow pipeline based on a Spark cluster. Comparative experiments show that, without changing the accuracy of the results, the Spark-based H+H workflow pipeline is more efficient than the single-machine H+H automated workflow pipeline.

In conclusion, this study both realizes an automated workflow for gene quantitative analysis, providing a convenient analysis method, and proposes a scalable parallel computing method for the H+H automated workflow pipeline based on Spark big data technology, which can flexibly improve the analysis efficiency of the pipeline. This is of great significance for promoting the development of genetic data analysis methods in the field of bioinformatics.
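
The abstract does not say how the thesis chains the two tools or how it partitions the data, so the following Python sketch is only one plausible reading, not the author's implementation. It shows (1) a shell pipeline in which HISAT2 writes SAM to standard output and htseq-count reads it from standard input, (2) splitting a FASTQ stream on 4-line record boundaries so that each chunk could be handed to a separate worker, and (3) merging the per-chunk count tables by summation. The function names, the single-end (`-U`) read layout, and the chunk-and-sum strategy are all assumptions made for illustration. Summing counts is only valid when each read is aligned and counted independently of the others; paired-end data would need both mates kept in the same chunk.

```python
import shlex
from collections import Counter
from itertools import islice


def hh_command(index_prefix, fastq_path, gtf_path, threads=1):
    """Build a shell pipeline feeding HISAT2's SAM output into htseq-count.

    Hypothetical helper: assumes single-end reads and default counting options.
    """
    # Without -S, HISAT2 writes its SAM alignments to standard output.
    hisat2 = ["hisat2", "-p", str(threads), "-x", index_prefix, "-U", fastq_path]
    # "-" tells htseq-count to read the alignments from standard input.
    htseq = ["htseq-count", "-f", "sam", "-", gtf_path]
    return " | ".join(shlex.join(cmd) for cmd in (hisat2, htseq))


def fastq_chunks(lines, records_per_chunk):
    """Yield lists of lines, cutting only on 4-line FASTQ record boundaries."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4 * records_per_chunk))
        if not chunk:
            return
        yield chunk


def merge_counts(tables):
    """Sum per-feature counts from several two-column, tab-separated tables."""
    total = Counter()
    for table in tables:
        for line in table.splitlines():
            if not line.strip():
                continue
            feature, count = line.rsplit("\t", 1)
            total[feature] += int(count)
    return dict(total)
```

In a Spark job, each chunk would typically be processed by one task (for example by piping a partition through an external command) and the resulting tables reduced with a function like `merge_counts`; whether the thesis does this is not stated in the abstract.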

Keywords/Search Tags:

Spark, Big data, Parallelization, HISAT2, HTSeq, H+H, Pipeline

Related items

1	The Parallelization Research Of Genomics Data Comparison Algorithm And The Construction Of Comparison Platform Based On Spark
2	Research On Parallelization Of Spatial Data Mining Clustering Algorithm Based On SPARK
3	Research And Application Of Parallelization Of Community Discovery Algorithm Based On Spark
4	Research On Spatial Data Matching Of City Underground Pipelines And Its Parallelization
5	Research On Parallelization Of Improved AP Algorithm Based On Spark And Its Application In Protein Complexes Identification
6	Community Detection Algorithm Based On Seed Expansion And Its Parallelization
7	Research On Parallelization Of Single Pulse Search Based On Spark
8	Group Theory Based Data Dependence Model For Loop Parallelization
9	Research And Implementation Of Principal Component Analysis And Factor Analysis Parallelization Based On Spark
10	The Research For Key Technology Of Astronomy Big Data Integration Based On Spark