With the growing maturity and wide adoption of third-generation sequencing technology, long-sequence alignment has become an emerging research field. In transcriptomics, mapping long RNA reads to the reference genome is a key step in analyzing third-generation RNA sequencing data, and its accuracy has a crucial impact on downstream analyses such as differential gene expression, RNA editing, and fusion gene detection. However, third-generation RNA sequencing data are characterized by long reads, high error rates, and high throughput, which poses challenges for the design of long RNA read alignment algorithms. Existing algorithms generally achieve low alignment accuracy on large genomes with complex structure. Moreover, many of the positions they report still deviate from the true locations, resulting in low accuracy of exon boundary recognition. This paper therefore designs and implements WFMap, an alignment algorithm for third-generation RNA sequencing reads. The method performs a global search using indexing and a pre-alignment strategy to find reliable approximate locations for RNA reads, and then locally refines these first-stage locations through exact alignment, so that each part of an RNA sequence is located on the genome accurately and efficiently. The main methods are as follows. First, the reference genome is indexed with minimizers: a hash index is built by computing the minimizers in a user-defined window. Second, to improve efficiency, reads are mapped to the reference by pre-alignment using a seed-and-extend strategy; the approximate position of each read in the reference genome is determined through region selection, graph mapping, anchor extension, and anchor filtering. Finally, the wavefront alignment (WFA) algorithm is adopted for exact alignment. At this stage, annotation files are first used to identify exon boundaries. WFA, an exact gap-affine algorithm that exploits homologous regions between the sequences to accelerate alignment, is then applied to the optimal sets of anchors. Applying WFA to RNA sequence alignment and accurately identifying exon boundaries using genome annotation are the two innovations of this paper. Experimental results show that WFMap outperforms existing alignment methods on the evaluation metrics across different data sets. Specifically, it achieves the best performance on small-scale data sets with simple splicing and remains applicable to large-scale data sets with complex splicing. It also performs well on data sets from different species and sequencing technologies (PacBio and ONT), indicating that the algorithm generalizes well. Visualization of the results shows that WFMap has certain advantages in handling exon boundaries. In particular, we find that the read error rate affects the results of the alignment algorithm: as read accuracy increases, alignment results improve. This paper analyzes the problems in long RNA sequence alignment and explores solutions, providing a new idea for the development of subsequent related algorithms.
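The minimizer indexing and seeding steps described above can be illustrated with a minimal Python sketch. It is not WFMap's implementation: the function names, the values of `k` and `w`, and the use of lexicographic order to pick the smallest k-mer are all assumptions made for illustration. Practical indexers usually order k-mers by a hash value and also consider the reverse-complement strand, both of which are omitted here.

```python
from collections import defaultdict


def minimizers(seq, k=5, w=4):
    """Return the set of (kmer, position) minimizers of seq.

    In every window of w consecutive k-mers, the lexicographically
    smallest k-mer is kept (leftmost on ties). k and w are
    illustrative values, not WFMap's settings.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add((kmers[best], best))
    return picked


def build_index(reference, k=5, w=4):
    """Hash index: minimizer k-mer -> list of reference positions."""
    index = defaultdict(list)
    for kmer, pos in sorted(minimizers(reference, k, w), key=lambda m: m[1]):
        index[kmer].append(pos)
    return index


def seed_hits(read, index, k=5, w=4):
    """Look up the read's minimizers in the index.

    Returns sorted (read_pos, ref_pos) pairs that could serve as
    seeds for a later seed-and-extend stage.
    """
    hits = []
    for kmer, rpos in minimizers(read, k, w):
        for gpos in index.get(kmer, []):
            hits.append((rpos, gpos))
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTTGCAAGCTTAGGCTAACGT"  # toy sequence, hypothetical
    read = reference[6:18]                 # error-free read from offset 6
    for rpos, gpos in seed_hits(read, build_index(reference)):
        print(rpos, gpos, gpos - rpos)
```

For an error-free read, every minimizer of the read yields a hit whose offset `gpos - rpos` equals the read's true start on the reference. With sequencing errors, some seeds are lost or spurious, which is why a subsequent filtering and exact-alignment stage is needed.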