Objectives
High-throughput whole-genome sequencing was used to determine the molecular types of infected and colonized Streptococcus pneumoniae (S. pneumoniae, SP) isolates and to construct a phylogenetic tree, in order to explore differences in genetic background between the two groups. This study aimed to clarify the differences in resistance phenotypes, molecular characteristics (resistance genes and virulence genes), accessory genes, SNPs and k-mers between infected and colonized strains, revealing disease-associated markers of SP from multiple aspects and providing genetic evidence for identifying potentially infective strains and for disease tracing.

Methods
A cross-sectional study design was used. Nasal swabs from kindergarten children and clinical specimens (sputum, cerebrospinal fluid, throat swabs, alveolar lavage fluid) from infected children in several hospitals were collected and confirmed for SP. For all strains, antibiotic susceptibility testing, whole-genome sequencing and bioinformatic analysis were used to determine resistance phenotypes, MLST, serotypes, resistance genes, virulence genes, accessory genes, SNPs and k-mers. A phylogenetic tree was reconstructed from core SNPs. Differences between the two groups were compared using the Pearson χ² test and Fisher's exact test. Univariate tests (χ² test and linear mixed model) were combined with a machine learning algorithm (random forest) to identify disease-associated markers for SP isolated from children.

Results
1. Molecular typing of infected and colonized SP: Among 783 isolates, a total of 80 STs were identified. The predominant STs of infected isolates were ST271 (29.3%), ST320 (9.5%) and ST902 (7.2%). The predominant STs of colonized isolates were ST902 (15.9%), ST90 (13.8%) and ST271 (8.5%). A total of 31 serotypes were identified among the 783 isolates. For infected isolates, the predominant serotypes were 19F (43.0%), 6B (15.2%) and 23F (8.3%). For colonized isolates, the predominant serotypes were
6B (32.7%), 19F (13.1%), 15A (11.1%) and 23A (7.4%).
2. Comparison of resistance rates between infected and colonized isolates: The rates of resistance to penicillin, cefotaxime and azithromycin were higher in infected isolates than in colonized isolates (P<0.05), while the rates of resistance to chloramphenicol and levofloxacin were lower in infected isolates than in colonized isolates (P<0.05).
3. Comparison of resistance and virulence genes between infected and colonized isolates: For resistance genes, the carriage rates of erm(C), mef(A) and msr(D) were higher in infected isolates than in colonized isolates (P<0.05), while the carriage rate of cat(pC194) was lower in infected isolates than in colonized isolates (P<0.05). For virulence genes, the carriage rates of pce, pitA, pitB, sipA, rrgA, rrgB, rrgC, srtG1, srtG2, srtC1, srtC2, srtC3, nanA, nanB, cps4A, cps4B, cps4D and zmpC were higher in infected isolates than in colonized isolates (P<0.05), while the carriage rates of cbpG, pfbA, lytA and lytB were lower in infected isolates than in colonized isolates (P<0.05).
4. Phylogenetic analysis of infected and colonized isolates: SP isolates were distributed across every genetic lineage of the phylogenetic tree, implying that disease-associated isolates have multiple genetic backgrounds rather than representing a single pathogenic lineage.
5. Screening for disease-associated markers: Univariate analysis and random forest were combined to identify disease-associated phenotypic and genotypic markers: (1) For disease-associated resistance phenotypes, cefotaxime and azithromycin resistance were selected, and the accuracy and AUC of this model were 63.85% and 0.60, respectively; (2) For disease-associated resistance genes and virulence genes, msrD, pce, rrgA, lytA, lytB and zmpC were selected, and the accuracy and AUC of this model were 69.84% and 0.73, respectively. GWAS and random forest were combined to identify disease-associated genotypes at the genome-wide level: (1) At the pangenome
level, 12 disease-associated accessory genes were selected: rhaB, group_1633, group_2924, group_328, mutY, group_241, btuD_2, kpsT, livF, group_3422, group_2927 and group_3630. The accuracy and AUC of this model were 86.81% and 0.84, respectively. (2) At the SNP level, 35 disease-associated SNPs were selected, including ply 1833192G→A and 1833192G→A, cps4D 323549T→C, rrgC 444246C→A, rpoC 1864900C→A, gndA 354485T→C, glmU 932278C→T, phtA 1111880T→C, msrB 1280995G→A, dprA 1199326T→G, dltD 2091669T→C, arcA 1942621G→A and ftsA 1488264G→A. The accuracy and AUC of this model were 91.26% and 0.94, respectively. (3) At the k-mer level, 12 disease-associated k-mers were selected, and the corresponding genes were pbp1A, msrA, srtG1, srtC1, pitA, galE, zmpB and cpsC. The accuracy and AUC of this model were 90.30% and 0.92, respectively.

Conclusions
1. The predominant STs of infected isolates were ST271, ST320 and ST902, while the predominant STs of colonized strains were ST902, ST90 and ST271. Infected and colonized isolates were distributed across the phylogenetic tree rather than representing a specific lineage, implying both similarity and diversity in the genetic backgrounds of infected and colonized isolates.
2. This study identified disease-associated markers at different levels, including 12 disease-associated accessory genes, 35 disease-associated SNPs and 12 disease-associated k-mers. The good predictive performance of these models suggests that these markers have considerable potential for identifying highly pathogenic strains.
3. This study explored genomic variation using a three-level analysis strategy covering the pangenome, SNPs and k-mers. Disease-associated markers of SP were identified effectively by combining GWAS and random forest, providing genetic evidence for tracing the source of highly pathogenic infections and for precise targeted intervention.
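
The group comparisons named in the Methods (Pearson χ² test and Fisher's exact test on carriage or resistance rates in infected versus colonized isolates) can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' pipeline. The function names and the counts in the example are hypothetical and chosen purely for illustration; they are not data from this study.

```python
from math import comb, erfc, sqrt


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def chi2_p_value_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(stat / 2.0))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Hypothetical counts (NOT study data): rows are infected / colonized isolates,
# columns are gene present / gene absent.
infected_present, infected_absent = 30, 20
colonized_present, colonized_absent = 15, 35

stat = pearson_chi2_2x2(infected_present, infected_absent,
                        colonized_present, colonized_absent)
print("chi-square:", stat, "p:", chi2_p_value_1df(stat))
print("Fisher exact p:", fisher_exact_two_sided(infected_present, infected_absent,
                                                colonized_present, colonized_absent))
```

Fisher's exact test is usually preferred over the χ² test when expected cell counts are small, which is why the two are often reported together for gene carriage comparisons.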