Learning-Based Consensus Construction From Long Error-Prone Reads

Posted on: 2021-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: S J Wang
Full Text: PDF
GTID: 2370330611499998
Subject: Computer Science and Technology
Abstract/Summary:
Since the launch of the Human Genome Project, genome sequencing has widely influenced research methods in the life sciences, and the genomes of various model species have been analyzed continuously in laboratories around the world. In recent years, as sequencing throughput has increased and costs have fallen, genome sequencing has become a routine method in biomedicine. Third-generation sequencing, represented by the long-read technologies of Pacific Biosciences and Oxford Nanopore Technologies, can now generate sufficiently long reads, which greatly promotes the development of genome assembly, mutation detection, and other analytical fields. However, third-generation sequencing reads have a very high error rate (around 15%), which affects the accuracy of analysis results and limits their application in medical research and clinical diagnosis. Scientists are therefore working to develop more efficient analytical methods to overcome this limitation.

Genome assembly is the process of reconstructing genome sequences of several megabases, or even hundreds of megabases, from a large number of short fragments obtained by random sequencing. The ultimate goal is to generate complete and accurate consensus sequences. Although third-generation sequencing has greatly improved the completeness of genome consensus sequences, its high error rate limits their accuracy. Obtaining high-quality, accurate consensus sequences remains challenging, especially when assembling repetitive sequences and haplotypes. The key to generating consensus sequences is an accurate multiple sequence alignment. Because third-generation sequencing data combine long reads, a high error rate, and high throughput, resource-intensive error correction and consensus construction are required to obtain high-quality assembly results.

This research proposes a consensus generation model that combines deep learning and reinforcement learning methods. The model not only improves the results of multiple sequence alignment but also produces consensus sequences with higher accuracy. The work consists of the following three parts:

(1) A reinforcement-learning method for adjusting the alignment of genetic data, which uses the asynchronous advantage actor-critic algorithm to learn alignment strategies. Because current mainstream multiple sequence alignment methods still have many shortcomings, the aim is to improve alignment results through effective learned strategies.

(2) A mechanism called curiosity reward, which further adjusts the multiple sequence alignment so that it not only scores better on evaluation metrics but is also closer to biological meaning and more consistent with the structural features of gene sequences.

(3) Deep learning methods that extract structural features from multiple sequence alignment results and help generate more accurate consensus sequences by combining the characteristics of sequence data at different throughput levels. With this approach, the consensus keeps excellent accuracy while using less data, without requiring the quality values produced at sequencing time and without reading an ultra-long sequence in one pass, so small data blocks can be processed more flexibly.
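To make the link between multiple sequence alignment and consensus concrete, the following is a minimal sketch of the simplest baseline: a column-wise majority vote over an existing alignment. It is not the learning-based model proposed in the thesis. The function name `majority_consensus`, the gap symbol `-`, and the toy reads are assumptions chosen for illustration only.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol; the thesis does not specify one


def majority_consensus(alignment):
    """Return a consensus string by majority vote over each alignment column.

    `alignment` is a list of equal-length strings (rows of an MSA).
    Columns whose most frequent symbol is a gap are dropped from the output.
    Ties go to the symbol that appears first in the column.
    """
    if not alignment:
        return ""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")

    consensus = []
    for column in zip(*alignment):  # iterate over columns, not rows
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != GAP:
            consensus.append(symbol)
    return "".join(consensus)


# Toy alignment of five error-prone reads covering the same region.
reads = [
    "ACG-TAC",
    "ACGTTAC",
    "AC-TTAC",
    "ACGTTTC",
    "ACGTAAC",
]
print(majority_consensus(reads))
```

A vote like this treats every column independently and depends entirely on the quality of the alignment, which is why the abstract emphasizes improving the alignment itself before the consensus is called.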
Keywords/Search Tags:Gene Sequencing, Multiple Sequence Alignment, Consensus, Deep Learning, Reinforcement Learning