Methods for ctDNA Sequencing Data Analysis

Posted on: 2018-02-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S F Chen
Full Text: PDF
GTID: 1318330536487231
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Liquid biopsy, a technology developed in recent years, can diagnose diseases (e.g., cancer) by analyzing samples of blood, urine, or other body fluids. Because the sample is taken from the body's circulatory system, it can provide more comprehensive information than tissue biopsy and can better overcome the problem of tumor heterogeneity. Liquid biopsy focuses on three targets: circulating tumor cells, exosomes, and cell-free nucleic acids (DNA or RNA). Because cell-free DNA (cfDNA) is easy to extract and can be prepared directly as DNA libraries for sequencing, it has been the subject of thousands of studies and is widely adopted in clinical applications. Although the experimental techniques for handling such DNA are mature, the analysis of cfDNA sequencing data remains challenging. One reason is that circulating tumor DNA (ctDNA) accounts for only a very small fraction of the total cfDNA, typically less than 2% and sometimes as low as 0.1%. Another reason is that library preparation and sequencing both introduce errors, which result in many false-positive mutations, known as background noise. New algorithms and methods are therefore needed to eliminate this systematic noise while improving the sensitivity of detecting real mutations.

This thesis focuses on computational methods for analyzing ctDNA sequencing data and is the direct outcome of the author's experience with ctDNA sequencing data processing in recent years. The main content includes, but is not limited to: how to better preprocess data and apply error correction to obtain cleaner data; how to better detect gene fusions with low tumor mutation frequency; how to detect target mutations by scanning the raw FASTQ data without alignment or variant calling; how to interactively visualize mutations with web technologies; and how to simulate tumor sequencing data using a reference genome and configurable mutations. I also introduce my effort to apply machine learning to classify data from cfDNA and genomic DNA. Although the topic of this thesis is ctDNA sequencing data analysis, most of the presented methods, technologies, and tools can also be used to analyze sequencing data from tissue biopsy.

To better preprocess sequencing data, I developed two software tools, AfterQC and fastp, which are based on similar algorithms. AfterQC focuses on algorithm exploration, and fastp is a reimplementation of AfterQC with much higher performance and more useful functions. Both tools can automatically perform adapter cutting, global trimming, sliding window trimming, quality filtering, quality profiling, and quality control (QC) in a single pass over the FASTQ data, and output clean data with a QC report. For paired-end data, an algorithm called overlap analysis aligns the two reads of each pair, and the overlapping result is used for adapter detection and error correction. For single-end data, another algorithm detects the 3' adapter sequences by assembling the high-frequency k-mers counted in the last N (N=10) sequencing cycles. The algorithms and implementation details are given in section 3.

Conventional bioinformatics pipelines often include many steps and apply different filtering conditions. Since some reads or bases may be filtered out at each step, a longer pipeline may introduce more false negatives, which are not acceptable in clinical applications. To detect and visualize target mutations quickly, I developed MutScan, a tool that uses error-tolerant DNA sequence-searching algorithms based on rolling hashes and a Bloom filter. MutScan detects and visualizes target mutations directly from FASTQ data and generates interactive HTML pages based on read pile-ups. Since it is more than 20 times faster than conventional pipelines, it can be used for fast screening of target mutations. The algorithms and implementation details are provided in section 4.

Detecting gene fusions from tumor sequencing data is another difficult problem, especially for ctDNA sequencing data with very low MAF. Avoiding both false positives and false negatives is the key requirement, but it is not easy to achieve. To obtain both high sensitivity and high specificity in fusion detection, I developed two tools, FusionDirect and GeneFuse, which detect fusions among the genes in the COSMIC (Catalogue of Somatic Mutations in Cancer) gene fusion curation list. As with the preprocessing tools, FusionDirect is for algorithm exploration, and GeneFuse is more heavily engineered for use in production. Both tools are built on an index that maps k-mers to genome coordinates. For each read, the set of genome coordinates mapped by its k-mers is evaluated to find a consistent alignment to a pair of fused genes. The algorithms and implementation details are provided in section 5.

Tuning bioinformatics pipelines for tumor sequencing data analysis usually requires data with ground truth, which is difficult to obtain in clinical contexts. I developed a tool called SeqMaker to simulate NGS data from a reference genome and a list of configurable mutations, including single nucleotide variants, insertions/deletions, gene fusions, and copy number variants. In particular, this tool can simulate sequencing errors, amplification biases, and other sequencing artifacts to produce data that closely resemble real Illumina sequencing results. SeqMaker is presented in section 6.

In practice, plasma and white blood cells (WBC) must be separated from a tube of blood, and DNA is then extracted from both samples and prepared as separate sequencing libraries. To guard against sample mix-ups and cross-contamination between plasma DNA and WBC DNA, I developed a classifier for them based on machine learning techniques. This classifier achieves an accuracy of 99.87% in bootstrap cross-validation. The method is given in section 7.

Although this thesis covers different aspects of ctDNA sequencing data analysis, many other aspects of this domain are not well covered. Some of them will be part of my future work. A brief introduction to these topics is presented in section 8 to show which problems I will continue to work on.
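
The overlap analysis for paired-end data mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the AfterQC or fastp implementation: the function names, the `min_overlap` and `max_mismatch` thresholds, and the rule of trusting the higher-quality base are all assumptions made for illustration. The sketch only handles the case where the insert is at least as long as a read; the read-through case used for adapter detection is omitted.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_overlap(r1, r2, min_overlap=10, max_mismatch=1):
    """Return (offset, overlap_len) where the reverse complement of r2
    starts at position `offset` of r1, or None if no overlap is found.
    Thresholds are illustrative assumptions."""
    r2rc = reverse_complement(r2)
    for offset in range(0, len(r1) - min_overlap + 1):
        overlap_len = min(len(r1) - offset, len(r2rc))
        if overlap_len < min_overlap:
            continue
        mismatches = sum(
            1 for i in range(overlap_len) if r1[offset + i] != r2rc[i]
        )
        if mismatches <= max_mismatch:
            return offset, overlap_len
    return None

def correct_by_overlap(r1, q1, r2, q2, min_overlap=10, max_mismatch=1):
    """Within the overlap, replace each mismatched base with the base of the
    mate that has the higher quality score (assumed rule). q1 and q2 are
    lists of integer qualities in the orientation of r1 and r2."""
    hit = find_overlap(r1, r2, min_overlap, max_mismatch)
    if hit is None:
        return r1, r2
    offset, overlap_len = hit
    s1 = list(r1)
    s2 = list(reverse_complement(r2))
    q2r = q2[::-1]  # qualities reordered to match the reverse complement
    for i in range(overlap_len):
        a, b = s1[offset + i], s2[i]
        if a != b:
            if q1[offset + i] >= q2r[i]:
                s2[i] = a
            else:
                s1[offset + i] = b
    return "".join(s1), reverse_complement("".join(s2))
```

Because both reads of a pair cover the same molecule inside the overlap, a disagreement there must be an error in one of them, which is what makes this kind of correction possible without a reference genome.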
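
The idea of searching FASTQ reads for a target mutation with a rolling hash and a Bloom filter, as described for MutScan in the abstract, can be sketched as follows. This is a simplified illustration and not MutScan's actual algorithm: the 2-bit rolling encoding, the tiny `BloomFilter` class, the seed-and-verify strategy, and all names and parameters are assumptions. A real tool would additionally require the read to span the mutation site.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def rolling_kmer_hashes(seq, k):
    """Yield (position, hash) for every k-mer; each step updates the
    previous hash in O(1) by shifting in one 2-bit base code."""
    mask = (1 << (2 * k)) - 1
    h, valid = 0, 0
    for i, base in enumerate(seq):
        code = BASE_CODE.get(base)
        if code is None:  # N or other symbol: restart the window
            h, valid = 0, 0
            continue
        h = ((h << 2) | code) & mask
        valid += 1
        if valid >= k:
            yield i - k + 1, h

class BloomFilter:
    """Minimal Bloom filter with two derived hash functions (illustrative)."""
    def __init__(self, size_bits=1 << 16):
        self.size = size_bits
        self.bits = bytearray(size_bits // 8)

    def _positions(self, h):
        yield h % self.size
        yield (h * 2654435761 + 40503) % self.size

    def add(self, h):
        for p in self._positions(h):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, h):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(h))

def index_target(target, k):
    """Index the k-mers of the (hypothetical) mutated target sequence."""
    bloom, kmer_pos = BloomFilter(), {}
    for pos, h in rolling_kmer_hashes(target, k):
        bloom.add(h)
        kmer_pos.setdefault(h, []).append(pos)
    return bloom, kmer_pos

def read_supports_target(read, target, bloom, kmer_pos, k, max_mismatch=2):
    """Seed with an exact k-mer hit (Bloom filter as a cheap prefilter), then
    verify the implied alignment while tolerating a few mismatches."""
    for pos, h in rolling_kmer_hashes(read, k):
        if h not in bloom:
            continue
        for tpos in kmer_pos.get(h, ()):
            shift = tpos - pos  # read[0] aligns to target[shift]
            start_r, start_t = max(0, -shift), max(0, shift)
            n = min(len(read) - start_r, len(target) - start_t)
            mismatches = sum(
                read[start_r + i] != target[start_t + i] for i in range(n)
            )
            if n >= k and mismatches <= max_mismatch:
                return True
    return False
```

Most reads share no k-mer with the target, so the Bloom filter rejects them after a few bit tests, and only the rare candidates pay for the mismatch-tolerant verification.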
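
The fusion detection approach described in the abstract, an index mapping k-mers to genome coordinates followed by a search for a consistent alignment to two genes, can be sketched in Python. This is a toy illustration under stated assumptions, not the FusionDirect or GeneFuse implementation: gene sequences are plain strings, coordinates are positions within each gene, and `min_support` and the "diagonal" grouping are choices made here for clarity.

```python
from collections import defaultdict

def build_kmer_index(genes, k):
    """Map every k-mer to the list of (gene_name, position) where it occurs.
    `genes` is a dict of gene name -> sequence (hypothetical toy input)."""
    index = defaultdict(list)
    for name, seq in genes.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def detect_fusion(read, index, k, min_support=3):
    """Return (left_gene, last_left_pos, right_gene, first_right_pos) if the
    read aligns consistently to two different genes, else None."""
    # K-mers from one gapless alignment share the same diagonal,
    # defined as gene position minus read offset.
    support = defaultdict(list)  # (gene, diagonal) -> read offsets
    for offset in range(len(read) - k + 1):
        for gene, pos in index.get(read[offset:offset + k], ()):
            support[(gene, pos - offset)].append(offset)

    # Keep well-supported segments as (read_start, read_end, gene, diagonal).
    segments = sorted(
        (min(offs), max(offs) + k, gene, diag)
        for (gene, diag), offs in support.items()
        if len(offs) >= min_support
    )
    # Adjacent segments on different genes suggest a fusion breakpoint.
    for left, right in zip(segments, segments[1:]):
        if left[2] != right[2]:
            left_end_pos = left[3] + left[1] - 1    # last gene base on the left
            right_start_pos = right[3] + right[0]   # first gene base on the right
            return left[2], left_end_pos, right[2], right_start_pos
    return None
```

Requiring several k-mers on the same diagonal for each side is one simple way to trade sensitivity against specificity: a single stray k-mer match to another gene is not enough to call a fusion.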
Keywords/Search Tags: liquid biopsy, circulating tumor DNA, ctDNA, gene fusion, QC, mutation visualization, OpenGene