
A Study On Disease Prediction Model Based On Small Sample Medical Data And Its Privacy Preserving Technologies

Posted on: 2024-01-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S J Yang
Full Text: PDF
GTID: 1524306944966829
Subject: Cyberspace security
Abstract/Summary:
With the spread of electronic medical records and data mining technologies, data analysis techniques have been widely adopted in public health and medical services. Their core task is to organize and analyze the large volumes of data generated in healthcare scenarios and to combine data analysis with artificial intelligence to design effective algorithms and models. The results help healthcare practitioners accelerate diagnosis and decision-making, which improves the efficiency of diagnosis and treatment and optimizes resource allocation. However, healthcare data usually contain a large amount of patient identification information and health-related records, so machine learning and data mining on healthcare data are a key application area for privacy protection. Based on the current state of research in this field, this work consists of the following four parts:

(1) Early Detection of Diseases Based on Fisher’s Linear Discriminant Analysis and Healthcare Data
In the first part, we propose a disease prediction model based on De-Sparse Fisher Linear Discriminant Analysis (De-Sparse LDA) to address the ill-posed problem induced by High Dimension, Low Sample Size (HDLSS) datasets. Under HDLSS settings, the sample covariance matrix may not be invertible, and estimating its inverse with Graphical Lasso can over-penalize the estimate. The proposed model addresses both issues by using the De-Sparse Graphical Lasso algorithm to estimate the inverse covariance matrix. In experiments on real electronic medical record data, De-Sparse LDA achieves higher prediction accuracy on the test data than the baselines.

(2) Combinatorial Discriminant Analysis with Applications to Feature Selection from Healthcare Data
In the second part, we propose Combinatorial Fisher Linear Discriminant Analysis (CDA) to solve the combinatorial feature selection problem in disease prediction tasks. By leveraging the conditional Rayleigh flow and a sparsity-regularized ℓ0 norm, CDA iteratively selects k of the p feature variables to form an optimal feature subset, that is, a subset on which the LDA model achieves high accuracy in disease prediction. In feature selection experiments on synthetic data, CDA correctly picks the features required for prediction and outperforms baseline algorithms such as orthogonal matching pursuit (OMP) and truncated Rayleigh flow (TRifle) in F1 score on the feature subset selection task. In binary classification experiments on synthetic data, CDA shows better prediction accuracy. In addition, on real HDLSS datasets such as the Colon gene dataset and the Leukemia gene dataset, CDA achieves better prediction accuracy by selecting a feature subset that is strongly correlated with the prediction task.

(3) Online Gaussian Graphical Model and Streaming Data Fisher Discriminant Analysis
The third part proposes an efficient Online Gaussian Graphical Model (OGM) to address the training data storage and model update complexity problems of Fisher Linear Discriminant Analysis in the streaming data setting. First, OGM estimates the inverse covariance matrix of streaming data at low complexity: without storing historical data, it updates the estimate in a single iteration step using only the current sample and the estimate from the previous iteration. Second, OGM infers the structure of the Gaussian graphical model from the inverse covariance estimate through a significant-edge inference algorithm, that is, it finds the significant conditional dependencies between variables. We also plug the inverse covariance estimate into Fisher linear discriminant analysis to obtain a predictive model for streaming data. In simulation experiments on synthetic streaming data, OGM significantly reduces time overhead while keeping inverse covariance estimation error, graph structure inference accuracy, and prediction accuracy close to those of offline algorithms.

(4) Vertical Federated Learning based on Multi-Party Gradient Boosting
The fourth part proposes evGBM, a vertical federated gradient boosting algorithm based on partially homomorphic encryption and secure multi-party computation. It addresses data privacy in federated modeling over data distributed across different institutions, so that the participants can jointly train a model without sharing their raw data. evGBM uses a two-party computation protocol under the semi-honest model, together with private set intersection, to obtain the samples shared by the two parties. During training, evGBM uses Partially Homomorphic Encryption (PHE) to protect labels and features. To control the communication cost introduced by encryption, evGBM also adopts a mini-batch sampling strategy to balance model accuracy against communication overhead. In experiments on Santander transaction data and medical data, evGBM under PHE protection achieves accuracy similar to that of baselines whose training data are fully exposed to attackers. The security analysis shows that evGBM can resist a variety of attacks in the semi-honest setting, and the experiments indicate that adjusting the mini-batch sampling size during training trades off the model's predictive performance against the communication overhead between the two parties.

In summary, this thesis studies disease prediction models based on small-sample medical data and privacy-preserving technologies for training predictive models. The four parts fall into two groups: machine learning algorithms for medical data, and privacy and security protection for machine learning algorithms on medical data. The validity and effectiveness of the four contributions are verified through real datasets, synthetic datasets, and security analysis.
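
The one-step inverse covariance update described in part (3) can be illustrated with a small sketch. The abstract does not give the OGM update rule, so the code below is not the author's algorithm. It assumes a ridge-regularized scatter matrix S = ridge * I + sum of x x^T over centered samples, and applies the standard Sherman-Morrison identity to refresh the inverse of S from the current sample and the previous estimate alone. The function names and the `ridge` parameter are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def mat_vec(a: Matrix, v: Sequence[float]) -> List[float]:
    """Multiply a square matrix by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def rank_one_inverse_update(p: Matrix, x: Sequence[float]) -> Matrix:
    """Return (S + x x^T)^{-1} given p = S^{-1}, for symmetric S.

    Sherman-Morrison identity:
        (S + x x^T)^{-1} = P - (P x)(P x)^T / (1 + x^T P x)
    Only the previous inverse and the current sample are needed.
    """
    px = mat_vec(p, x)
    denom = 1.0 + sum(x_i * px_i for x_i, px_i in zip(x, px))
    d = len(x)
    return [[p[i][j] - px[i] * px[j] / denom for j in range(d)]
            for i in range(d)]


def streaming_inverse_scatter(samples: Sequence[Sequence[float]],
                              ridge: float = 1.0) -> Matrix:
    """Consume samples one at a time and track the inverse of
    ridge * I + sum(x x^T) without storing past samples.

    Assumption: samples are already centered; `ridge` > 0 keeps the
    starting matrix invertible even when there are fewer samples
    than dimensions.
    """
    d = len(samples[0])
    # Inverse of the starting matrix ridge * I.
    p = [[(1.0 / ridge if i == j else 0.0) for j in range(d)]
         for i in range(d)]
    for x in samples:
        p = rank_one_inverse_update(p, x)
    return p


if __name__ == "__main__":
    stream = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
    for row in streaming_inverse_scatter(stream, ridge=1.0):
        print(row)
```

Each update costs O(d^2) for d features, compared with O(d^3) for inverting the matrix from scratch. Multiplying the tracked inverse by the difference of the two class means would give an LDA-style discriminant direction. A sparse graphical-model estimate such as the one the thesis describes would need additional steps that this sketch does not attempt.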
Keywords/Search Tags: Machine Learning, Privacy-Preserving Machine Learning (PPML), Healthcare Data, Fisher’s Linear Discriminant Analysis (LDA), Gradient Boosting Machine (GBM), Federated Learning