| With the development of information technology, the medical field has accumulated a huge amount of data, notable not only for its volume, variety and update speed but also for its latent value. Mining this latent information is of great significance for medical examination, cancer treatment and the allocation of medical resources. In this paper, data mining algorithms are used to analyze clinical data on cervical cancer; related factors such as pathogenic factors, examination methods and recommended treatments are explored, and corresponding classification decision models are established. The paper makes two main contributions. 1. The Cervical Cancer (Risk Factors) data set from the UCI database, which contains medical data from a hospital in Caracas, Venezuela, is pre-processed according to its characteristics. First, the data contain missing values, which are handled by combining direct deletion with constant imputation. Second, because the classes are unbalanced, upsampling is used to balance them. Finally, the continuous attributes are discretized with equal-width binning, and the quality of the discretization is measured by information value. 2. Data mining algorithms are used to evaluate risk factors in the cervical cancer clinical data, a task that can be recast as a binary classification problem. The paper takes decision tree (DT), random forest (RF) and support vector machine (SVM) as its main line, and the experiments are carried out in sequence. First, a decision tree classification model is built, the diagnosis rates for cases with and without the disease are calculated, and the model is then optimized in two ways. Optimization (1): optimization based on the minimum number of samples contained in the leaf nodes (MSSOLN-DT). Optimization (2): pruning optimization (PO-DT) for 
decision trees. The original decision tree is compared with the two optimized models. The results show that MSSOLN-DT attains the minimum resubstitution error of 0.0550 and a 10-fold cross-validation error of 0.1267, and that the optimized DT has a simpler structure than the classic one. Next, an SVM model with a linear kernel function is constructed, and the diagnosis rates for cases with and without the disease are calculated. Finally, a random forest model is constructed. The models created by the decision tree, the support vector machine and the random forest are then compared and analyzed. The comparison shows that the random forest model performs well in the classification and recognition of cervical cancer. When the class label is “Hinselmann”, the accuracy reaches 98.21%; when the class label is “Schiller”, the accuracy is the lowest among the four class labels but still reaches 91.94%. |
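
The pre-processing steps named in the abstract (upsampling, equal-width binning, and scoring the discretization by information value) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the toy ages and labels, and the smoothing constant `eps` are all hypothetical, upsampling is assumed to mean random duplication of minority-class rows, and information value is assumed to follow the usual weight-of-evidence definition, IV = Σ (p_pos − p_neg) · ln(p_pos / p_neg) summed over bins.

```python
import math
import random
from collections import Counter


def upsample_minority(rows, labels, seed=0):
    """Random oversampling (assumed variant): duplicate rows of the smaller
    classes until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using intervals of
    equal width between the minimum and maximum of the attribute."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def information_value(bins, labels, eps=0.5):
    """Information value of a binned attribute against a 0/1 label.
    eps is an additive smoothing count (an assumption) that avoids log(0)
    when a bin contains only one class."""
    n_bins = max(bins) + 1
    pos = [eps] * n_bins
    neg = [eps] * n_bins
    for b, y in zip(bins, labels):
        if y == 1:
            pos[b] += 1
        else:
            neg[b] += 1
    total_pos, total_neg = sum(pos), sum(neg)
    iv = 0.0
    for p, n in zip(pos, neg):
        p_rate, n_rate = p / total_pos, n / total_neg
        iv += (p_rate - n_rate) * math.log(p_rate / n_rate)
    return iv


# Toy data, invented for illustration only (not taken from the UCI data set).
ages = [15, 18, 22, 25, 31, 34, 40, 45, 52, 55]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

age_bins = equal_width_bins(ages, n_bins=4)
balanced_bins, balanced_labels = upsample_minority(age_bins, labels)
print("bins:", age_bins)
print("class counts after upsampling:", Counter(balanced_labels))
print("information value:", information_value(age_bins, labels))
```

In this sketch a larger information value indicates that the chosen bins separate the two classes more strongly, so it can be used to compare different numbers of bins for the same attribute; an attribute whose bins carry no class information scores zero.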