With the rapid development of high-throughput sequencing technology, a large amount of gene sequencing data has been generated and accumulated in the field of bioinformatics. Quantitative analysis of gene sequences can reveal key information, such as the functional structure and genetic information of nucleic acids and proteins, and is an important step in genetic data analysis. Because bioinformatics researchers often lack computer expertise, the operations involved in gene quantitative analysis are difficult for them to carry out. To address this, this thesis proposes an automated analysis method for the H+H (HISAT2+HTSeq) workflow pipeline. The method chains HISAT2 and HTSeq automatically, passing the alignment output of HISAT2 directly to HTSeq as input, so that quantitative analysis runs as a single automated workflow. This reduces the number of manual steps, lowers the difficulty of data analysis, and gives researchers a more convenient way to work. In addition, the large volume of gene sequencing data and the high computational complexity of sequence alignment make quantitative analysis of gene sequences inefficient and time-consuming. This thesis therefore proposes a parallel computing method for the H+H workflow pipeline based on a Spark cluster. Comparative experiments show that, without changing the calculation accuracy, the Spark-based H+H workflow pipeline is more computationally efficient than the single-machine H+H automated workflow pipeline. In conclusion, this study not only realizes an automated workflow for gene quantitative analysis, providing a convenient analysis method, but also proposes a scalable parallel computing method for the H+H automated workflow pipeline based on Spark big data technology, which can flexibly improve the analysis efficiency of the pipeline. This is of great significance for promoting the development of genetic data analysis methods in the field of bioinformatics.
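
The chaining of HISAT2 into HTSeq described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the function names `build_hh_commands` and `run_hh`, the file paths, the choice of single-end reads, the unstranded setting (`-s no`), and the thread count are all hypothetical. The sketch relies on HISAT2 writing SAM to standard output when no output file is given, and on `htseq-count` reading from standard input when the alignment file name is `-`.

```python
import subprocess


def build_hh_commands(index_prefix, reads_fastq, annotation_gtf, threads=4):
    """Build the two command lines of a hypothetical H+H pipeline.

    Assumptions (not from the thesis): single-end reads, unstranded library.
    """
    # HISAT2 writes SAM to stdout because no -S output file is given.
    hisat2_cmd = [
        "hisat2",
        "-p", str(threads),   # number of alignment threads
        "-x", index_prefix,   # prefix of a prebuilt HISAT2 index
        "-U", reads_fastq,    # unpaired (single-end) reads
    ]
    # htseq-count reads SAM from stdin when the alignment file is "-".
    htseq_cmd = [
        "htseq-count",
        "-f", "sam",          # input format
        "-s", "no",           # assumed unstranded library
        "-",                  # read alignments from standard input
        annotation_gtf,       # gene annotation in GTF format
    ]
    return hisat2_cmd, htseq_cmd


def run_hh(index_prefix, reads_fastq, annotation_gtf, counts_path, threads=4):
    """Run HISAT2 and stream its alignments straight into htseq-count."""
    hisat2_cmd, htseq_cmd = build_hh_commands(
        index_prefix, reads_fastq, annotation_gtf, threads
    )
    with open(counts_path, "w") as counts_file:
        aligner = subprocess.Popen(hisat2_cmd, stdout=subprocess.PIPE)
        counter = subprocess.Popen(
            htseq_cmd, stdin=aligner.stdout, stdout=counts_file
        )
        # Close our copy of the pipe so HISAT2 sees a broken pipe
        # if htseq-count exits early.
        aligner.stdout.close()
        counter_rc = counter.wait()
        aligner_rc = aligner.wait()
    if aligner_rc != 0 or counter_rc != 0:
        raise RuntimeError(
            f"H+H pipeline failed (hisat2={aligner_rc}, htseq-count={counter_rc})"
        )
```

Streaming the alignments through a pipe avoids writing an intermediate SAM file, which is one way to realize "alignment output as direct input". A Spark-based variant would presumably split the reads into partitions and run a step like this on each partition before merging the per-gene counts, but the partitioning and merging scheme is not specified here.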