Design and Validation of CleanSeq, a Genome Contamination Data Processing and Analysis Pipeline

Posted on: 2023-12-06
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Wang
GTID: 2530306782466834
Subject: Software engineering
Abstract/Summary:
Contamination frequently occurs in bacterial cultures and severely affects the reproducibility and reliability of whole-genome sequencing (WGS) results. Improved bioinformatics tools are therefore needed to analyze potential contamination in sequencing data and to extract valid target data from contaminated WGS datasets. Against this background, this thesis develops the CleanSeq pipeline, which automatically detects and removes contaminating reads from WGS data, analyzes potential genomic variants by local realignment, and validates the results. Simulated datasets, public datasets, and real experimental datasets were designed to validate the efficiency of CleanSeq. The results demonstrate the high applicability and reproducibility of CleanSeq. The main research contents and results of this thesis are as follows:

· For the problem of WGS data contamination, this thesis implements the contamination detection and cleaning module of CleanSeq, which determines whether the raw sequencing data are contaminated and cleans the data once contamination is found. The method is as follows. The raw data are compared against a bacterial genome database, species tags are assigned to the sequencing reads, and the presence of contaminating species is determined from the species tags. BLAST (Basic Local Alignment Search Tool) is then used to align the raw data to the reference genome of the target species and to the genome of the contaminating species, and the reads are labeled accordingly. Based on the labels, the reads unique to the target species are retained, which completes the data cleaning.

· For mutation analysis, this thesis develops the CleanSeq mutation calling & verification module, which performs mutation calling and verifies the reliability of the calls. The method is as follows. First, a conventional mutation calling process is applied to the sequencing data to obtain the original call set in VCF (Variant Call Format). Mutated k-mers are then generated from the mutation information in the VCF, and BLAST is used to align the mutated k-mers to the sequencing data. Based on the alignment results, reads with high homology to the mutated k-mers are extracted and visualized to verify the authenticity of each mutation.

· This thesis conducts a comprehensive performance validation of CleanSeq using multiple datasets. Using a variety of simulated, experimental, and public datasets, the contamination detection & cleaning module, the mutation calling & verification module, and CleanSeq as a whole are each verified. The results show that CleanSeq can effectively process contaminated sequencing data and output reliable analysis results.
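
The two core steps described above, keeping reads unique to the target species and building a mutated k-mer from a VCF record, can be illustrated with a minimal Python sketch. It is not the CleanSeq implementation. The function names, the `(read_id, genome_label)` hit format, and the `flank` parameter are hypothetical, and the abstract does not state the actual labeling criteria (for example identity or coverage thresholds) or the k-mer length. Here a read counts as target-unique if it has at least one hit to the target genome and none to any contaminant genome.

```python
from collections import defaultdict


def target_unique_reads(hits, target, contaminants):
    """Return the ids of reads that hit the target genome and no contaminant.

    hits: iterable of (read_id, genome_label) pairs, e.g. parsed from
          BLAST tabular output (hypothetical input format).
    target: label of the target species genome.
    contaminants: set of labels of contaminating species genomes.
    """
    labels = defaultdict(set)
    for read_id, genome in hits:
        labels[read_id].add(genome)
    return {
        read_id
        for read_id, genomes in labels.items()
        if target in genomes and not (genomes & contaminants)
    }


def mutated_kmer(reference, pos, ref_allele, alt_allele, flank=15):
    """Build a k-mer carrying the ALT allele with up to `flank` bases per side.

    pos is 1-based, as in a VCF record; `flank` is an assumed parameter.
    """
    start = pos - 1  # convert to 0-based index
    end = start + len(ref_allele)
    if reference[start:end] != ref_allele:
        raise ValueError("REF allele does not match the reference at POS")
    left = reference[max(0, start - flank):start]
    right = reference[end:end + flank]
    return left + alt_allele + right


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    hits = [("r1", "target"), ("r2", "target"),
            ("r2", "contam"), ("r3", "contam")]
    print(target_unique_reads(hits, "target", {"contam"}))
    print(mutated_kmer("ACGTACGTAC", 5, "A", "G", flank=3))
```

In such a sketch, the resulting k-mers would be written to a FASTA file and aligned against the reads with BLAST. That step is omitted here because the abstract does not give the exact parameters.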
Keywords/Search Tags:Whole genome sequencing, Bioinformatics, DNA contamination detection, DNA data cleaning, Mutation validation