| The medical insurance industry currently holds a large number of medical insurance records. To ensure that medical insurance funds are used lawfully, it is necessary to optimize the methods for reviewing medical insurance fraud and to strengthen the supervision of these funds. Based on large-scale medical insurance data, this thesis studies and improves several clustering and classification algorithms, applies them to a medical insurance data set, and designs and implements an intelligent audit model for medical insurance. The main research contents are as follows: 1. To make more effective use of unlabeled medical insurance data, this thesis uses a clustering algorithm to cluster and analyze it. Because the traditional K-Means algorithm can become trapped in a local optimum, a clustering algorithm is designed that combines an improved ant lion optimizer with K-Means. First, the ant lion optimizer helps K-Means select the initial cluster centers. During iteration, the ant lion optimizer is then used to update the centers of the sample clusters, which weakens the sensitivity of K-Means to the initial cluster centers. Furthermore, an improved random walk strategy based on the Gaussian distribution is proposed; it searches the solution space more comprehensively and improves the search capability of the ant lion optimizer. Experiments show that the proposed algorithm improves partition purity and the clustering quality of unlabeled medical insurance samples on multiple indicators. It also effectively addresses both the low utilization of unlabeled medical insurance data and the tendency of K-Means results to fall into a local optimum. 2. To make more efficient use of both unlabeled and labeled samples in the medical insurance data and to improve the ability to identify medical insurance fraud, a combined algorithm called KM-LR is designed, which combines K-Means with logistic regression. First, the concept of a feature distance vector in the K-Means training process is proposed. Next, during training, the feature distance vector is mapped to the regression coefficients of the logistic regression. The model learned by logistic regression is used to partition the medical samples; the cluster centers are then obtained from this partition, and the next iteration is carried out. This interactive training closely links the clustering and classification processes, which effectively improves the utilization of medical insurance data. Experimental results show that the proposed KM-LR algorithm effectively improves the ability to discriminate among medical insurance records and greatly improves classification accuracy on various evaluation indicators, thereby achieving the goal of using unlabeled and labeled samples together. 3. To provide modern technical support for the intelligent audit and information-based supervision of medical insurance, an intelligent medical insurance audit system based on big data is constructed. The system can use medical insurance data to train a variety of models, including the KM-LR model proposed in this thesis. Users can upload samples of the medical data to be reviewed in order to audit them for medical insurance fraud. The system also supports comparing and analyzing medical insurance data through different statistical charts. Finally, a visualization interface presents the results of each function module to the user. |
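
The first contribution pairs K-Means with a stochastic search over cluster centers to reduce sensitivity to initialization. As a rough illustration only, the Python sketch below replaces the improved ant lion optimizer with a much simpler greedy search that perturbs the current best centers with Gaussian noise, then refines the result with ordinary K-Means iterations. It is not the algorithm of the thesis: the function names (`gaussian_search_init`, `lloyd`, `sse`, `assign`), the shrinking step size, and the toy data are all assumptions made for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(pt, c) for c in centers) for pt in points)

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(pt, centers[j]))
            for pt in points]

def lloyd(points, centers, iters=20):
    """Plain K-Means refinement (Lloyd's iterations) from the given centers."""
    centers = [list(c) for c in centers]
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def gaussian_search_init(points, k, rng, steps=50, sigma=1.0):
    """Hypothetical stand-in for optimizer-driven initialization.

    Starts from k sampled data points and greedily accepts Gaussian
    perturbations of the best centers whenever they lower the SSE.
    """
    best = [list(pt) for pt in rng.sample(points, k)]
    best_cost = sse(points, best)
    for t in range(steps):
        scale = sigma * (1 - t / steps)  # assumed schedule: shrink the step over time
        cand = [[c + rng.gauss(0, scale) for c in ctr] for ctr in best]
        cost = sse(points, cand)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best

# Toy data: two small, well-separated groups (illustrative only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
rng = random.Random(0)
init = gaussian_search_init(points, 2, rng)
refined = lloyd(points, init)
print(sse(points, init), sse(points, refined))
```

The search step only ever accepts candidates that lower the clustering cost, and the K-Means refinement cannot increase it, so the refined centers are never worse than the searched initialization under this cost. A population-based optimizer such as the one described in the abstract would explore the solution space more broadly than this single greedy walk.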