Design And Validation Of CleanSeq, A Genome Contamination Data Processing And Analysis Pipeline

Posted on:2023-12-06

Degree:Master

Type:Thesis

Country:China

Candidate:C Y Wang

Full Text:PDF

GTID:2530306782466834

Subject:Software engineering

Abstract/Summary:

Contamination frequently occurs in bacterial cultures and severely affects the reproducibility and reliability of whole-genome sequencing (WGS) results. It is therefore critical to develop and use improved bioinformatics tools that analyze potential contamination in sequencing data and extract valid target data from contaminated WGS datasets. Against this background, this thesis develops the CleanSeq pipeline, which automatically detects and removes contaminating reads from WGS data, analyzes potential genomic variants by local realignment, and performs appropriate validation. Simulated datasets, public datasets, and real experimental datasets are designed to validate the efficiency of CleanSeq. The results demonstrate the high applicability and reproducibility of CleanSeq. The main research contents and results of this thesis are as follows:

· To address WGS data contamination, this thesis implements the contamination detection and cleaning module of CleanSeq, which determines whether the raw sequencing data are contaminated and cleans the data when contamination is found. The raw data are compared against a bacterial genome database, species tags are assigned to the sequencing reads, and species contamination is determined from those tags. BLAST (Basic Local Alignment Search Tool) is then used to align the raw data to the reference genome of the target species and to the genome of the contaminating species, and the reads are labeled accordingly. Based on the labels, the reads unique to the target species are retained, which completes the data cleaning.

· For mutation analysis, this thesis develops the CleanSeq mutation calling and verification module, which calls mutations and verifies the reliability of the calls. First, a conventional mutation calling process is applied to the sequencing data to obtain the original call set in VCF (Variant Call Format). Mutated k-mers are then generated from the mutation information in the VCF, and BLAST is used to align them to the sequencing data. Based on the alignment results, reads with high homology to the mutated k-mers are extracted and visualized to verify the authenticity of each mutation.

· This thesis conducts a comprehensive performance validation of CleanSeq using multiple datasets. Using a variety of simulated, experimental, and public datasets, the contamination detection and cleaning module, the mutation calling and verification module, and CleanSeq as a whole are each verified. The results show that CleanSeq can effectively process contaminated sequencing data and output reliable analysis results.
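
The two core steps named in the abstract, keeping reads unique to the target species and generating a mutated k-mer from a VCF record, can be illustrated with a minimal Python sketch. This is not the CleanSeq implementation. The function names, the `(read_id, genome_label, bit_score)` hit tuple (for example parsed from BLAST tabular output), the `flank` parameter, and the reading of "unique to the target" as "has hits only against the target genome" are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical hit record: (read_id, genome_label, bit_score).
# The layout is an assumption, e.g. rows parsed from BLAST tabular output.
Hit = Tuple[str, str, float]


def reads_unique_to_target(hits: Iterable[Hit], target: str = "target") -> Set[str]:
    """Return IDs of reads whose hits are all against the target genome.

    Assumption: a read is "unique to the target" when it aligns to the
    target reference and to no contaminant genome.
    """
    labels: Dict[str, Set[str]] = {}
    for read_id, genome, _score in hits:
        labels.setdefault(read_id, set()).add(genome)
    return {read_id for read_id, genomes in labels.items() if genomes == {target}}


def mutated_kmer(reference: str, pos: int, ref: str, alt: str, flank: int = 15) -> str:
    """Build a k-mer that carries the ALT allele of one VCF record.

    `pos` is 1-based, as in the VCF POS column. `flank` (hypothetical
    parameter) is the number of reference bases kept on each side.
    """
    start = pos - 1  # convert to a 0-based string index
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    left = reference[max(0, start - flank):start]
    right = reference[start + len(ref):start + len(ref) + flank]
    return left + alt + right


if __name__ == "__main__":
    hits = [
        ("r1", "target", 200.0),
        ("r2", "target", 180.0),
        ("r2", "contaminant", 190.0),  # ambiguous read, dropped
        ("r3", "contaminant", 150.0),  # contaminating read, dropped
    ]
    print(sorted(reads_unique_to_target(hits)))
    print(mutated_kmer("ACGTACGTAC", pos=5, ref="A", alt="G", flank=3))
```

In a workflow like the one described, the resulting k-mers would then be aligned back to the reads with BLAST. A stricter or looser uniqueness rule (for example comparing bit scores between target and contaminant hits) is equally plausible, since the abstract does not specify the criterion.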

Keywords/Search Tags:

Whole genome sequencing, Bioinformatics, DNA contamination detection, DNA data cleaning, Mutation validation

Related items

1	Detection Of Genome Variants Based On High Throughput Sequencing Data
2	De Novo Mutation Detection Method Based On High-Throughput Sequencing Data
3	Study On Process Parameters Of Laser Cleaning Contamination Layer On Surface Of Ceramic Artifacts
4	Methodology Study On Detection Of Indels From Next-generation Sequencing Data
5	Research On Genomic Sequence Alignment Methods Based On High-throughput Sequencing Data
6	HTS Based Screening Of Novel Zoogenic Virus And Virus Whole Genome Sequencing And Analysis
7	Microbial Contamination Analysis Of Retail Pepper And Chili In Sichuan Province And Construction Of Whole-genome Database Of Highly Heat-resistant Strains
8	Genome Sequencing Of Phlebia Tremellosa And Bioinformatics Mining Of Genes Related To Lignin Degradation
9	Detection Of Copy Number Variants Based On Genome Sequencing Data
10	Complete Genome Sequence And Annotation Of Bacteriophage PaP2, With Biological Characterization Of This Phage