| With the rapid development of computer technology and the Internet, clinical medical data have accumulated on a large scale. This makes it possible for researchers to mine useful information from these data and to study auxiliary diagnosis models, and computer-aided diagnosis has become an important direction in medical research. Machine learning algorithms can not only build models from data but also help users make sound medical decisions. In recent years, a large number of machine learning algorithms have been applied successfully in medical research. However, a simple classification model can no longer meet the needs of auxiliary diagnosis: in medical applications of machine learning, greater emphasis is placed on whether the results an algorithm produces can be understood and interpreted. Supported by a National Natural Science Foundation project (61471124), this paper studies classification and interpretability in machine learning and proposes algorithms for data preprocessing, visual classification, and rule extraction. The main work and research results are as follows. First, this paper designs a random resampling preprocessing algorithm that combines missForest with the synthetic minority oversampling technique (SMOTE). Because medical data often contain missing values and are class-imbalanced, missForest is used to impute the missing values in the samples. After minority-class samples have been synthesized, a new class-balanced data set is constructed by randomly resampling the samples of each class. The effectiveness of the preprocessing algorithm is verified with the visual classification and rule extraction algorithms. Second, on the preprocessed data sets, a classification algorithm is designed that applies LargeVis, a method for high-quality visualization of large-scale data, to the random forest similarity matrix. In view of the problem that traditional classification results can
not provide an interpretation of the classification, and that medical data have a high-dimensional feature space and redundant features, LargeVis is used to visualize the internal model of the random forest. The similarity matrix of the random forest is visualized with LargeVis, and the resulting low-dimensional data are used to train a random forest model and predict the sample categories. The visual classification ensemble algorithm designed in this paper separates the different sample classes, achieves superior classification performance and running efficiency, and can be used to explain the causes of the classification results. Third, to further extract expressions of the feature relationships in the data, a random forest rule extraction and feature selection algorithm incorporating the elastic norm is designed. The algorithm builds on the feature importance scores of the random forest and uses a hybrid rule extraction and feature selection method in which the rule extraction results drive feature selection. From the generated rules, the features that appear in the rules are selected, and the important rules are extracted with elastic norm encoding. So that users can trust the rules, the rules then undergo a second, quantitative extraction based on their accuracy and coverage. The experimental results show that the proposed rule extraction algorithm reaches a classification accuracy of 93.81% in the comparison with existing algorithms, and that the extracted rules agree closely with the hospital's test indicators for hepatitis, which gives users a useful, interpretable aid to diagnosis. In conclusion, this work proposes an interpretable ensemble algorithm design built on conventional random forest classification. Applications to the fetal heart rate and hepatitis data sets show that the algorithm can, to a certain extent, make the classification results understandable and explainable while preserving classification accuracy, which
enriches the methods available for medical auxiliary diagnosis. |
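
The resampling part of the preprocessing step (synthesize minority-class samples, then randomly resample each class into a balanced set) can be sketched roughly as follows. This is a minimal standard-library illustration under stated assumptions, not the paper's implementation: the function names `smote_like` and `balance` and the parameters `n_per_class`, `k`, and `seed` are hypothetical, the interpolation follows the general SMOTE idea, and missing values are assumed to have been imputed already (the paper uses missForest for that, which is not reproduced here).

```python
import math
import random


def smote_like(minority, n_new, k, rng):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest neighbours of x among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbours)
        u = rng.random()  # position on the segment from x to nn
        synthetic.append([a + u * (b - a) for a, b in zip(x, nn)])
    return synthetic


def balance(X, y, n_per_class, k=5, seed=0):
    """Build a class-balanced data set with n_per_class samples per class.

    Classes with enough samples are randomly subsampled without replacement;
    smaller classes keep all samples and are topped up with synthetic ones.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(list(row))
    Xb, yb = [], []
    for label, rows in by_class.items():
        if len(rows) >= n_per_class:
            chosen = rng.sample(rows, n_per_class)
        else:
            chosen = rows + smote_like(rows, n_per_class - len(rows), k, rng)
        Xb.extend(chosen)
        yb.extend([label] * n_per_class)
    # Shuffle so that the classes are not stored in contiguous blocks
    order = list(range(len(Xb)))
    rng.shuffle(order)
    return [Xb[i] for i in order], [yb[i] for i in order]


# Toy data (made up for illustration): six majority and two minority samples
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
     [0.5, 0.5], [0.2, 0.8], [5.0, 5.0], [6.0, 5.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
Xb, yb = balance(X, y, n_per_class=4, k=1, seed=1)
print(yb.count(0), yb.count(1))  # 4 4
```

Interpolating between neighbouring minority samples, rather than duplicating them, keeps the synthetic points inside the region the minority class already occupies. Fixing a common `n_per_class` is one possible reading of "random resampling of different classes"; the paper may balance the classes differently.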