| The frequent occurrence of medical insurance fraud has seriously damaged the medical insurance system. Early fraud detection relied on manual solutions built on the advice of domain experts. With the development of machine learning, machine learning models came to be used for fraud detection, and the introduction of ensemble learning methods led to ensemble models being applied in this field. As medical insurance datasets grow, they contain large amounts of noise and severe class imbalance, both of which increasingly degrade a model's fraud detection performance. Current medical insurance fraud detection models struggle to achieve a breakthrough in detection performance on datasets with these characteristics. This paper addresses these problems by preprocessing the dataset and by proposing an ensemble model that combines multiple rounds of random undersampling (RUS) with diverse base classifiers. First, the dataset is preprocessed to remove noise that interferes with model training and to fill in some missing fields; a label mapping algorithm is then used to label the preprocessed dataset. Second, to address the overfitting and the serious loss of majority-class information caused by random oversampling and random undersampling schemes, as well as the insufficient performance of existing medical insurance fraud detection models, this paper proposes the multiple-RUS and diversified ensemble model. The model handles class imbalance internally: using the DRUS sampling scheme, the majority-class samples in the dataset are sampled multiple times. Base classifiers are generated by a classifier generation algorithm based on dual difference degree, which can provide multiple types of base classifiers and allocate the number of each type reasonably according to the performance differences among them. Finally, the experimental dataset is derived from real, publicly available medical insurance data from the U.S. Centers for Medicare & Medicaid Services (CMS). This dataset is used for training and for testing the generalization ability of all models in this paper. Experiments on this dataset show that the multiple-RUS and diversified ensemble model achieves significantly higher AUC and G-Mean values than the Extra Trees, CatBoost, LightGBM, and AdaBoost models. |
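
The general idea of sampling the majority class several times and scoring the result with G-Mean can be illustrated with a minimal sketch. It is not the paper's DRUS scheme or its dual difference degree algorithm, neither of which is specified here: it uses plain random undersampling, a toy one-dimensional threshold classifier, majority voting, and synthetic data. All function names (`multiple_rus`, `fit_threshold`, `ensemble_predict`, `g_mean`) and the label convention (1 = fraud, 0 = non-fraud) are assumptions made for illustration.

```python
import math
import random


def multiple_rus(labels, n_rounds, seed=0):
    """Return n_rounds balanced index lists: every minority index plus an
    equally sized random draw (without replacement) from the majority class."""
    rng = random.Random(seed)
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (ones, zeros) if len(ones) <= len(zeros) else (zeros, ones)
    return [minority + rng.sample(majority, len(minority)) for _ in range(n_rounds)]


def fit_threshold(xs, ys):
    """Toy base classifier (assumption): threshold halfway between class means."""
    mean_pos = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    mean_neg = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    return (mean_pos + mean_neg) / 2, mean_pos > mean_neg


def predict_threshold(model, x):
    threshold, positive_is_above = model
    return int((x > threshold) == positive_is_above)


def ensemble_predict(models, x):
    """Majority vote over the base classifiers (ties resolve to class 0)."""
    votes = sum(predict_threshold(m, x) for m in models)
    return int(votes * 2 > len(models))


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return math.sqrt((tp / n_pos) * (tn / n_neg))


# Synthetic imbalanced data (assumption): 200 non-fraud vs. 10 fraud samples.
data_rng = random.Random(42)
features = [data_rng.gauss(0.0, 1.0) for _ in range(200)] + \
           [data_rng.gauss(3.0, 1.0) for _ in range(10)]
labels = [0] * 200 + [1] * 10

# One base classifier per balanced subset, then majority voting.
models = []
for subset in multiple_rus(labels, n_rounds=7, seed=1):
    xs = [features[i] for i in subset]
    ys = [labels[i] for i in subset]
    models.append(fit_threshold(xs, ys))

predictions = [ensemble_predict(models, x) for x in features]
print("G-Mean on the toy data:", round(g_mean(labels, predictions), 3))
```

Because each round keeps every minority sample and draws a different majority subset, the ensemble as a whole sees far more majority-class data than any single undersampled training set, which is the motivation for sampling multiple times rather than once.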