Campylobacter spp. are a group of zoonotic pathogens that can cause diarrhea in humans. Among the Campylobacter species identified so far, Campylobacter jejuni and Campylobacter coli are the two main pathogens; together they cause more than 90% of human diarrhea cases, accounting for 90% and 10% of them, respectively. Traditional biochemical methods for Campylobacter identification involve multiple steps, are time-consuming, and have low throughput. Polymerase chain reaction (PCR) based methods also have drawbacks, such as expensive reagents, multi-step experiments, and sample contamination that leads to false positives and false negatives. In recent years, whole genome sequencing technology has been applied to Campylobacter research. After processing and analysis, the sequencing data can be used to characterize different types of Campylobacter or to quickly identify genotypic characteristics of populations, such as virulence and drug resistance. In this thesis, a bioinformatics method capable of accurately detecting Campylobacter is constructed based on whole genome sequencing data of Campylobacter. The main work includes: (1) A computational pipeline for Campylobacter identification based on whole genome sequencing data was constructed, comprising quality control of sequencing data, genome sequence assembly, whole genome feature extraction, and Campylobacter identification based on a support vector machine (SVM) or a deep neural network (DNN). (2) Several quality control methods for sequencing data were studied and compared, and quality control tests were conducted on the whole genome sequencing data of a Campylobacter sample using FastQC; several genome sequence assembly methods were studied and compared, and the whole genome of the Campylobacter sample was assembled using SPAdes. (3) Significantly different features were extracted by analyzing the whole genome sequences of Campylobacter samples, including whole genome sequence analysis, gene annotation, drug resistance gene analysis, multilocus sequence typing (MLST), and CRISPR-Cas system analysis. The experimental results show that sequence length, GC content, coding sequence density, aspA allele number, glyA allele number, and the CRISPR repeat sequence NZCP0178591 can be used as significant features for distinguishing Campylobacter jejuni from Campylobacter coli; among these, the repeat sequence NZCP0178591 is highly discriminative. (4) Two Campylobacter identification models, based on SVM and DNN respectively, were constructed with a feature set comprising genomic sequence length, GC content, coding sequence density, aspA allele number, glyA allele number, and the CRISPR repeat sequence NZCP0178591. Experimental results show that both machine learning methods perform well for Campylobacter identification, and that the DNN-based method slightly outperforms the SVM-based one. In summary, the proposed computational method based on whole genome sequencing data can be used to accurately distinguish Campylobacter jejuni from Campylobacter coli, and the related bioinformatics methods and pipelines can be used for the analysis and study of genome-wide sequence types of Campylobacter and even other prokaryotes.
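
As an illustration of the feature extraction step, the following minimal sketch computes two of the features named above, total sequence length and GC content, from an assembled genome in FASTA format. It is not the thesis's implementation: the function names `parse_fasta` and `genome_features` and the toy contigs are hypothetical, and the sketch assumes GC content is taken over unambiguous bases (A, C, G, T) only.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {contig_name: sequence}."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else f"contig_{len(chunks) + 1}"
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line.upper())
    return {contig: "".join(seq) for contig, seq in chunks.items()}


def genome_features(contigs: Dict[str, str]) -> Dict[str, float]:
    """Return total length and GC content over all contigs.

    Assumption: GC content is computed over A/C/G/T only, so ambiguous
    bases such as N count toward length but not toward GC content.
    """
    total_length = sum(len(seq) for seq in contigs.values())
    gc = sum(seq.count("G") + seq.count("C") for seq in contigs.values())
    acgt = sum(seq.count(base) for seq in contigs.values() for base in "ACGT")
    return {
        "length": total_length,
        "gc_content": gc / acgt if acgt else 0.0,
    }


# Toy input (made up for illustration, not real Campylobacter data).
toy_fasta = [">c1 demo", "ATGC", "GGCC", ">c2", "ATATNN"]
features = genome_features(parse_fasta(toy_fasta))
print(features)
```

In practice the input lines would come from the contig file produced by the assembly step, for example `parse_fasta(open(path))` with a user-supplied `path`. The resulting values, together with the other features listed above, would form one row of the feature matrix passed to the SVM or DNN classifier.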