Research On LightGBM Classification Algorithm And Its Application In Medical Data

Posted on: 2021-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: R Li
GTID: 2504306305966619
Subject: Master of Applied Statistics
Abstract/Summary:
With the rapid development of computing, the medical system has made tremendous progress in patient data management and infrastructure, and the informatization of the medical field has turned medical data into a "golden mountain" of data assets. How to make good use of such large data assets has become a focus of research. Chronic kidney disease refers to a gradual and irreversible decline in renal function over a period of several months or years, and its prevalence has increased year by year in China and abroad. Scholars in this field are already trying to make the most of these data assets. For example, some have conducted national censuses of chronic kidney disease patients to summarize the development trend among domestic patients, and some have used classical statistical methods to analyze the risk factors.

This thesis studies a prediction model for high-risk events in patients with chronic kidney disease and analyzes the factors that influence those events, with the aim of providing early warning to assist doctors in diagnosis and treatment. The data used in this thesis are real chronic kidney disease patient data from a hospital. Based on the admission data of all chronic kidney disease patients in the hospital's HIS (Hospital Information System), we predict whether a high-risk event will occur within two weeks of admission.

We first pre-process the raw data, which includes raw data extraction, integration of data from different sources, and data cleaning. Second, we use the Pearson linear correlation coefficient and the MIC to perform feature selection on the processed data set. This yields three data sets: the full data set, the data set excluding linear relations, and the data set excluding both linear and non-linear relations. We then model the simulated data and the real patient data with three classification models: LightGBM, XGBoost, and logistic regression. First, combining the results on 5,000 simulated data sets, we compare the three models in terms of AUC, F1, recall, and running time, and explain the advantages and disadvantages of each model. We then compare the three models on the different real data sets. Finally, we rank and analyze the factors that affect patients' high-risk events based on LightGBM, the best-performing model.
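As a rough illustration of two steps mentioned above, the following Python sketch shows one way a Pearson-based linear filter and the AUC, F1, and recall metrics could be computed. It is not the thesis's code. The feature names, the toy values, the 0.9 threshold, the 0.5 decision cutoff, and the rule of dropping a feature that is strongly correlated with an already kept feature are all assumptions made for the example. The MIC step and the three classifiers are omitted because they need third-party libraries.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant column has no defined correlation; treat it as 0 here.
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def drop_linearly_related(columns, threshold=0.9):
    """Keep a feature only if |r| with every already kept feature <= threshold.

    The threshold value is an assumption, not taken from the thesis.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def recall_and_f1(y_true, y_pred):
    """Recall and F1 for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, f1


def auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Requires at least one positive and one negative.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy features; "creatinine_x2" is an exact multiple of "creatinine".
features = {
    "creatinine": [1.0, 2.0, 3.0, 4.0, 5.0],
    "creatinine_x2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "age": [60.0, 45.0, 70.0, 52.0, 66.0],
}
kept_features = drop_linearly_related(features)

# Hypothetical labels and model scores for six patients.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff
recall, f1 = recall_and_f1(labels, predictions)
auc_value = auc(labels, scores)

print(kept_features, round(recall, 3), round(f1, 3), round(auc_value, 3))
```

The pairwise AUC used here is quadratic in the number of samples, which is fine for a small illustration but would normally be replaced by a rank-based computation on real admission data.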
Keywords/Search Tags: Chronic kidney disease, Maximum mutual information coefficient, Logistic regression, XGBoost, LightGBM, Influencing factor analysis