| With the development of medical information technology, data sharing has become an important part of medical research and data utilization. Medical data contain a large amount of personal health information and therefore demand strong privacy protection, so achieving effective privacy protection during medical data sharing is an issue worthy of research. In recent years, governments and institutions have strengthened laws and regulations on privacy protection in medical data sharing, requiring data sharers to anonymize medical data at the data collection stage so that the anonymized data cannot be recovered, re-identified, or re-associated. However, excessive anonymization degrades the quality of medical data and affects the accuracy of medical research. Existing anonymization algorithms also struggle to balance data security and usability at high levels of privacy protection, and their anonymization processes result in greater information loss. There is therefore an urgent need for an anonymization algorithm that meets the high privacy protection requirements of medical data sharing while minimizing information loss, thereby accelerating medical research and improving its efficiency and quality. Compared with other types of anonymization algorithms, clustering-based anonymization algorithms can anonymize medical data at the cell level, which reduces the information loss caused by overgeneralization. However, because medical data contain multi-semantic sensitive attributes, are large in scale, and include many outliers, existing clustering-based anonymization algorithms leave considerable room for improvement in reducing information loss and privacy disclosure risk during the anonymization of medical data. Therefore, this study proposes a clustering-based anonymization algorithm that can meet the privacy 
protection requirements and improve the quality of anonymized medical data, namely the Multi-semantic Sensitive Attributes K-anonymity Algorithm (MSAK), in view of the higher privacy protection requirements of medical data sharing. The main work of this study comprises the following three parts. (1) This paper selects four basic medical dataset standards and public medical databases commonly used in medical research, namely clinical electronic medical records, infectious diseases, chronic diseases, and maternal and child health care, which are typical and widely used to guide practice, and analyzes the anonymization characteristics of medical datasets and the key factors affecting the performance of anonymization algorithms. The analysis finds that multi-semantic sensitive attributes, large data size, and numerous outliers are important factors affecting the performance of clustering-based anonymization algorithms, which lays the foundation for the subsequent algorithm research. (2) Based on this analysis of key factors, and with the goal of meeting privacy protection requirements while minimizing information loss, the MSAK anonymization algorithm is designed as an improvement on the traditional clustering-based anonymization process, and its overall research framework and implementation process are described. The MSAK anonymization algorithm addresses the following problems. 1) Existing clustering-based anonymization algorithms do not consider the multi-semantic characteristics of the disease diagnosis attributes of medical data, which leads to a high risk of similarity attacks; to address this, this paper constructs multi-semantic classification trees based on Medical Subject Headings (MeSH), etc., and calculates the minimum variance of 
multi-semantic sensitive attributes for the l-diversity model judgment, thus reducing the risk of similarity attacks. 2) To address the inefficient execution of clustering-based anonymization algorithms on large-scale data, a dataset partitioning method is proposed to control the size of the sub-datasets, so that the subsequent clustering can be computed in parallel and the anonymization performance of the algorithm improves. 3) To address the poor clustering results that clustering-based anonymization algorithms produce when there are many outliers, the clustering process is optimized using an outlier detection algorithm: outliers are separated before clustering and assigned to clusters afterwards, which reduces the information loss that outliers cause during anonymization. (3) To validate the proposed algorithm, the UCI Machine Learning Repository Adult dataset, which related studies use for algorithm validation, is combined with the Medical Information Mart for Intensive Care-Ⅳ (MIMIC-Ⅳ) database to construct a simulation dataset with sensitive attributes from the medical domain. The simulation experiments evaluate the MSAK anonymization algorithm along three dimensions: execution efficiency, information loss, and privacy disclosure risk. Three representative clustering-based anonymization algorithms (the kNN algorithm, the k-member algorithm, and OKA) and a high-performing global generalization algorithm (the FLASH algorithm) serve as comparison algorithms to keep the evaluation objective. The experimental results show that the MSAK anonymization algorithm runs more efficiently than the other clustering-based anonymization algorithms when anonymizing larger-scale medical data at higher privacy protection levels; it outperforms all other algorithms in suppression rate and overall information loss; and it also significantly reduces the risk of linkage attacks and similarity 
attacks, thereby better balancing data security and usability. Compared with other anonymization algorithms, the MSAK anonymization algorithm therefore lets medical data retain as many of their original features as possible after anonymization, so that the anonymized data remain usable for medical research and analysis. |
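
The similarity-attack problem named in point 1) can be illustrated with a minimal sketch. It is not the MSAK algorithm itself: the abstract does not specify the minimum-variance computation, so this example substitutes a simpler criterion, counting distinct top-level semantic categories per equivalence class. The `CATEGORY` mapping, the function names, and the sample records are all hypothetical; a real implementation would presumably derive categories from a MeSH-based classification tree.

```python
from collections import defaultdict

# Hypothetical toy mapping from a diagnosis to a top-level semantic category.
# This stands in for a classification tree built from a vocabulary such as MeSH.
CATEGORY = {
    "gastric ulcer": "digestive",
    "gastritis": "digestive",
    "stomach cancer": "digestive",
    "influenza": "respiratory",
    "fracture": "musculoskeletal",
}


def group_by_quasi_identifiers(records, qi_keys):
    """Group records into equivalence classes by their quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[key] for key in qi_keys)].append(rec)
    return groups


def satisfies_k_and_semantic_l(records, qi_keys, sensitive_key, k, l):
    """Return True if every equivalence class has at least k records and
    at least l distinct semantic categories of the sensitive attribute.

    Simplified stand-in criterion, not the minimum-variance measure of MSAK.
    Raises KeyError for a diagnosis missing from CATEGORY.
    """
    for group in group_by_quasi_identifiers(records, qi_keys).values():
        if len(group) < k:
            return False
        categories = {CATEGORY[rec[sensitive_key]] for rec in group}
        if len(categories) < l:
            return False
    return True


# Three distinct diagnoses, but all digestive: vulnerable to a similarity attack.
similar = [
    {"age": "30-39", "zip": "100**", "diagnosis": "gastric ulcer"},
    {"age": "30-39", "zip": "100**", "diagnosis": "gastritis"},
    {"age": "30-39", "zip": "100**", "diagnosis": "stomach cancer"},
]

# Three diagnoses from three different semantic categories.
diverse = [
    {"age": "30-39", "zip": "100**", "diagnosis": "gastric ulcer"},
    {"age": "30-39", "zip": "100**", "diagnosis": "influenza"},
    {"age": "30-39", "zip": "100**", "diagnosis": "fracture"},
]
```

Under these assumptions, `similar` would pass a plain distinct-value l-diversity test with l = 2, since its three diagnosis strings differ. It fails the semantic test because all three fall in one category, so an attacker still learns that the patient has a digestive disease.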