
Study Of Cascade GA-CatBoost In Predictive Diagnosis Of Gestational Diabetes Mellitus

Posted on: 2020-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: B Cui
GTID: 2404330596486228
Subject: Software engineering
Abstract/Summary:
With the continuous development of intelligent information technology and the integration of the Internet with traditional medicine, data mining and machine learning are used more and more often to predict the likelihood of disease. In medical diagnosis, the growing number of physiological indicators, disease types and biotechnologies has made it harder for doctors to diagnose diseases. Machine learning and data mining address this problem by extracting hidden, potentially valuable and novel information from medical data, improving diagnostic accuracy while reducing time and cost. On the one hand, they provide further validation of doctors' diagnoses; on the other hand, they give doctors an analytical tool for complex diseases.

CatBoost (Category Boosting) is a machine learning framework based on gradient boosted trees that supports categorical and string-type features. Gradient boosting is a powerful machine learning technique and one of the main methods for problems with heterogeneous features, noisy data and complex dependencies. This thesis takes gestational diabetes mellitus (GDM) as its research object. The Iterative Self-Organizing Data Analysis Technique Algorithm (ISODATA) is used to detect and eliminate outliers. CatBoost is used as the prediction model, and a genetic algorithm (GA) is used to optimize its parameters and select the optimal training parameters. Finally, CatBoost is cascaded with XGBoost and LightGBM to form the cascaded GA-CatBoost model.

The main research work is as follows:

(1) This thesis summarizes and analyzes related research at home and abroad, introduces the pathogenesis of diabetes mellitus and the characteristics of its medical diagnostic indicators, with a focus on GDM, and describes in detail the common current methods for predicting and diagnosing diabetes mellitus together with their advantages and disadvantages.

(2) To address the influence of outliers on prediction results, ISODATA and an error processing criterion are used to detect and eliminate outliers. Outliers have an obvious influence on classification, so the features of the clusters must be learned while the interference of outliers is prevented. The algorithm assigns each instance to exactly one set, aiming for high similarity within a set and very low similarity between sets. Using ISODATA to detect outliers reduces their interference and improves prediction accuracy.

(3) A variety of classifiers are built, and their performance is compared and analyzed. Because the GDM data set contains continuous attributes and missing values, the missing values are filled according to data type and the data are then one-hot encoded; information value (IV) is used for feature analysis, and combined features are constructed. The results show that the CatBoost classifier achieves the best classification performance.

(4) A genetic algorithm is used to search the parameter space from multiple points in order to seek the global optimal solution. CatBoost has many parameters, each with its own function, and prediction accuracy depends heavily on how they are set. Setting them by subjective judgment and heuristics involves a heavy workload and yields low accuracy. In this thesis, both a genetic algorithm (GA) and grid search (GS) are used to optimize the parameters of the CatBoost model. A comparison of AUC (Area Under the ROC Curve) values shows that the parameters obtained by GA are better. Finally, cascading GA-CatBoost with XGBoost (eXtreme Gradient Boosting) and LightGBM, which yields the cascaded GA-CatBoost model, improves the generalization ability of the model.
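As a rough illustration of the GA parameter search described in point (4), the following Python sketch evolves a small population of parameter sets. It is not the thesis implementation. The parameter names mirror common CatBoost settings, but the ranges, the GA operators (tournament selection, uniform crossover, random-reset mutation, elitism) and the fitness function `toy_auc` are all assumptions made for illustration. In practice the fitness would be the cross-validated AUC of a CatBoost model trained with the candidate parameters.

```python
import random

# Hypothetical search space: names mirror common CatBoost parameters,
# but the ranges are illustrative and not taken from the thesis.
SPACE = {
    "learning_rate": (0.01, 0.3),
    "depth": (2, 10),
    "l2_leaf_reg": (1.0, 10.0),
}
INT_PARAMS = {"depth"}


def toy_auc(params):
    """Stand-in fitness; a real run would return a cross-validated AUC."""
    # Smooth surrogate that peaks at learning_rate=0.1, depth=6, l2_leaf_reg=3.
    penalty = (
        ((params["learning_rate"] - 0.1) / 0.29) ** 2
        + ((params["depth"] - 6) / 8) ** 2
        + ((params["l2_leaf_reg"] - 3.0) / 9.0) ** 2
    )
    return 1.0 - penalty / 3.0


def sample_gene(name, rng):
    lo, hi = SPACE[name]
    return rng.randint(lo, hi) if name in INT_PARAMS else rng.uniform(lo, hi)


def tournament(pop, scores, rng, k=3):
    # Pick k random individuals and return the fittest of them.
    picks = rng.sample(range(len(pop)), k)
    return pop[max(picks, key=lambda i: scores[i])]


def crossover(a, b, rng):
    # Uniform crossover: each gene comes from one of the two parents.
    return {name: (a if rng.random() < 0.5 else b)[name] for name in SPACE}


def mutate(ind, rng, rate=0.2):
    # Random-reset mutation: resample a gene with probability `rate`.
    return {
        name: sample_gene(name, rng) if rng.random() < rate else value
        for name, value in ind.items()
    }


def ga_search(fitness, pop_size=20, generations=15, seed=0):
    rng = random.Random(seed)
    pop = [{name: sample_gene(name, rng) for name in SPACE} for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        best_idx = max(range(pop_size), key=lambda i: scores[i])
        history.append(scores[best_idx])
        # Elitism: carry the best individual over unchanged.
        next_pop = [pop[best_idx]]
        while len(next_pop) < pop_size:
            child = crossover(tournament(pop, scores, rng),
                              tournament(pop, scores, rng), rng)
            next_pop.append(mutate(child, rng))
        pop = next_pop
    scores = [fitness(ind) for ind in pop]
    best_idx = max(range(pop_size), key=lambda i: scores[i])
    history.append(scores[best_idx])
    return pop[best_idx], history


best_params, best_history = ga_search(toy_auc)
```

Because the best individual is carried over unchanged, the best fitness in `best_history` never decreases from one generation to the next. A grid search over the same space would evaluate a fixed lattice of points instead, which is the baseline the thesis compares GA against.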
Keywords/Search Tags:GDM, ISODATA, CatBoost, Genetic Algorithm, Combination of Features, Data Mining