| Dental caries is an infectious oral disease and one of the most prevalent human diseases. If not treated in time, it can easily cause pain and infection and seriously affect patients’ quality of life. Interactions within bacterial communities are closely related to oral health status [1,2]. It is therefore important to study the association between caries and the oral microflora and to explore microbiome-based methods for caries diagnosis. Many microflora studies have examined the correlation between caries and the oral microbiome, using healthy people as the reference. However, because of differences in experimental methods, target populations and sequencing methods, the results of such studies often cannot be compared on an equal footing. In addition, traditional microbiome-based caries diagnosis studies either require disease samples and corresponding healthy samples to build a model, or use genetic markers derived from disease samples; both approaches presuppose a clear disease diagnosis. Moreover, because of differences in individual genetic factors, such models may not be reproducible in populations from different regions. Therefore, to explore a universal model for caries diagnosis in large populations, we used a cross-study, big-data-driven, search-based strategy, with the aim of finding a universal and unbiased caries diagnosis method based on the oral flora at the level of big data. The main contents and results of this paper fall into the following two parts. The first part is a statistical analysis of oral microbiome study data. Objective: To investigate the distribution of high-throughput sequencing data of the oral microbiome by oral disease status, country, sampling location and sequencing method. Method: We searched several electronic literature databases for oral microbiome studies based on high-throughput sequencing technology published between
1997 and 2019. The literature was then screened according to the inclusion and exclusion criteria, and the sample data and related meta information of the included studies were downloaded. Results: (1) After preliminary investigation, 202 publications were selected from 6,431 candidate publications, and 38,998 sequencing samples with their metadata were downloaded. (2) The oral microbiome data were dominated by data from healthy subjects, which accounted for about 86%. Among the remaining disease data, periodontal disease and caries were the main types, accounting for 45% and 28%, respectively. By country, the data came mainly from the United States (61%), followed by Japan (13%) and China (7%). The most common sampling location was saliva (52%), followed by dental plaque (28%). 16S rRNA sequencing data accounted for 79% of the oral microbiome data; the remaining 21% consisted of metagenomic sequencing data (20%) and Internal Transcribed Spacer (ITS) sequencing data (1%). Conclusion: (1) 79% of the high-throughput sequencing data obtained in this study were 16S rRNA sequencing data. (2) Geographically, the data obtained in this study are concentrated in North America, mainly the United States, and the subjects are mainly orally healthy people. (3) Among oral disease studies, periodontal disease was the main disease, followed by caries, accounting for 45% and 28%, respectively. (4) Saliva was the sampling site for 52% of the data, followed by dental plaque with 28%.
The second part is the establishment of a caries diagnosis model based on microbiome big data. Objective: To explore a universally applicable new method for caries diagnosis and, at the same time, the conditions under which this method achieves the greatest diagnostic accuracy, in order to provide new ideas for clinical disease diagnosis and prognosis. Method: The collected 16S rRNA sequencing samples were divided into three groups: the Baseline group (hereinafter Baseline), whose subjects were all orally healthy; the Caries group (hereinafter Caries), containing the dental caries samples; and the healthy Control group (hereinafter Control), containing orally healthy samples from the same studies as the Caries group. The latter two groups were combined into the Caries Data Set. Parallel-META was used for bioinformatic analysis, and the search function of the search-based strategy (a community comparison algorithm based on evolutionary relationships that finds, among the known data, the matches whose community structure is highly similar to that of an unknown sample) was used to calculate the MNS of each sample's microbiome. The MNS is a novelty index that measures how unique a sample is relative to healthy samples; disease samples show a distinctive novelty. The α diversity of the three groups was then calculated, PERMANOVA was used for statistical analysis, the Random Forest (RF) method was used to train on and predict the data, and the accuracy of the caries diagnosis method was finally evaluated with the ROC curve. Results: (1) All reads were annotated into 10 phyla and 63 genera. Species richness and α diversity were significantly higher in the Baseline and Control groups than in the Caries group, indicating that the oral microbial diversity of healthy subjects was higher than that of caries patients. The RF results indicated that caries and healthy samples could be distinguished at the genus level. (2) After data processing, the three groups comprised 22,243 high-quality samples. MNS values were calculated and their distribution was examined by year; the MNS was normally distributed in each year (compared with a normal distribution curve). The number of early microbiome samples collected in this study was sufficient and showed no significant bias, and the data size decreased with increasing
year. (3) The MNS of the Baseline and Control groups was significantly lower than that of the Caries group, indicating that the caries samples had a more varied flora. (4) The ROC curve was used to evaluate the Caries Data Set, giving an AUC of 0.67. PERMANOVA was then used to compare the effects of different host factors on the MNS. Caries status, age, sampling location, country and DMFT index all had significant effects on the MNS, with caries status having the greatest effect. After controlling for host factors such as age, sampling location, country and DMFT index, the accuracy on the diagnostic data set increased, reaching a highest AUC of 0.88 when detailed meta information was available for the samples. These results indicate that the caries diagnosis model can achieve good diagnostic performance after host factors are controlled. Conclusion: (1) After controlling for host factors, diagnostic accuracy was higher for Chinese children with high caries (AUC of saliva samples = 0.88, AUC of dental plaque samples = 0.87). (2) Comparing the diagnostic performance on dental plaque samples from Chinese children with low, medium and high caries showed high diagnostic accuracy in early, middle and late caries, with AUCs of 0.74, 0.74 and 0.87, respectively. In summary, this study collected global human oral microbiome sequencing data and characterized the current distribution of oral microbiome data. Based on big data, the search-based strategy was then used to diagnose caries, with a highest AUC of 0.88. The results show that this method of caries diagnosis is feasible. This study enriches the application of microbiome-based caries diagnostic methods. Evaluating diagnostic performance according to the severity of caries has great application potential in clinical diagnosis, treatment and disease prognosis, and is of great significance for the early detection, early diagnosis and prognosis evaluation of caries. |
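
The ROC-based evaluation summarized above can be illustrated with a minimal sketch. It assumes only that each sample has a single numeric score (such as its MNS) and a binary caries label, and it computes the AUC as the probability that a randomly chosen caries sample scores higher than a randomly chosen healthy sample. The function name `auc_from_scores` and the score values are hypothetical and purely illustrative; they are not data or code from this study, and the study's actual pipeline (Parallel-META, search-based strategy) is not reproduced here.

```python
from typing import Sequence


def auc_from_scores(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: the fraction of (caries, healthy) pairs in which the
    caries sample has the higher score; tied pairs count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # caries samples
    neg = [s for s, y in zip(scores, labels) if y == 0]  # healthy samples
    if not pos or not neg:
        raise ValueError("need at least one caries and one healthy sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical MNS-like scores (illustrative only, not study data).
mns = [0.12, 0.30, 0.25, 0.41, 0.18, 0.35]
is_caries = [0, 1, 0, 1, 0, 0]
print(auc_from_scores(mns, is_caries))
```

An AUC of 0.5 corresponds to a score that does not separate the two groups at all, and 1.0 to perfect separation, which is the scale on which the reported values of 0.67 and 0.88 should be read.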