With the continued growth of Internet finance, financial fraud remains persistently high. Fraudulent transactions make up only a small fraction of all transactions, and this imbalance in the data distribution has long been a major bottleneck limiting the effectiveness of fraud detection models. Synthetic data produced by an effective generation algorithm can mitigate the imbalance and improve the detection performance of the trained model. Synthetic data generation for Internet financial anti-fraud is therefore an active topic in the industry, with both theoretical significance and practical value, and further research is urgently needed. This thesis studies synthetic data generation algorithms for fraudulent transactions in depth, drawing on the transaction aggregation information and the transaction network information of fraudulent transaction data. The generated synthetic fraud data is added to the training of the fraud detection model to reduce the negative impact of data imbalance and improve the model's effectiveness. The algorithms are validated by classification performance on multiple transaction data sets with effective fraud detection models.

The main contents and innovations of this thesis include:

1. ADA-INCVAE, a synthetic data generation algorithm based on the variational autoencoder (VAE), is proposed. Posterior collapse in the VAE is first studied theoretically from the perspective of multi-task learning. Building on this analysis, a novel training method that increases the dimension of the data is proposed to avoid posterior collapse. An adaptive reconstruction loss weight is proposed to restrict the range of the synthetic data generated for different fraud samples. In the data generation stage, the proportion of synthetic data generated from each sample point is determined by the local information of the minority class. Experiments with the fraud detection model adopted by the partner bank show that the synthetic data generated by ADA-INCVAE effectively improves the model's accuracy on fraud samples and its F1-score.

2. CCR-GSVM, a synthetic data generation algorithm based on the granular support vector machine, is proposed. In fraud detection, missing class labels and insufficient feature engineering readily introduce noise among normal transaction samples, which limits the recognition performance of classifiers such as the support vector machine. To address this, the influence of noise samples on the decision boundary of the support vector machine is studied, and the boundary synthetic data generation algorithm CCR-GSVM is proposed. CCR-GSVM combines the granular support vector machine with boundary information to filter noise samples and generates boundary synthetic data iteratively. Adding the synthetic data to the training of the support vector machine effectively improves the recall and G-mean on fraud samples.

3. SG-CGAN, a synthetic data generation algorithm based on the conditional generative adversarial network, is proposed. Because GraphSAGE relies on transaction network information for fraud detection, the generation of transaction network data for fraud samples is studied. Synthetic hidden-layer data is first generated for the fraud samples in the hidden-layer space, and the conditional generative adversarial network then generates the transaction network data of the fraud samples conditioned on these results. The GraphSAGE fraud detection model retrained with data generated by SG-CGAN achieves a higher Matthews correlation coefficient (MCC) than with traditional hidden-layer data generation algorithms.

4. MO-FRST, a hybrid synthetic data generation algorithm based on fuzzy rough set theory, is proposed. To address noise in synthetic data, post-processing of synthetic data is studied. Using fuzzy rough set theory and the local information of the data, the membership degree in the fraud class is computed, in the upper and lower approximations, for the outputs of different synthetic data algorithms, and synthetic samples with low membership degree are filtered out using the corresponding thresholds. The samples that pass the filter are combined into a hybrid synthetic data set. The effectiveness of the hybrid synthetic data generation algorithm is verified on multiple transaction data sets and fraud detection models.
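
The threshold-based filtering described for MO-FRST can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes a Gaussian similarity as the fuzzy relation, a crisp fraud class among the real samples (with both classes present), and the mean of the lower and upper approximation memberships as the filtering score. The names `fuzzy_rough_memberships`, `filter_synthetic`, `gamma`, and `threshold` are hypothetical.

```python
import numpy as np


def fuzzy_rough_memberships(synthetic, real_X, real_is_fraud, gamma=1.0):
    """Lower/upper approximation membership of each synthetic point in the fraud class.

    Assumption: the fuzzy relation is a Gaussian similarity exp(-gamma * d^2).
    """
    # Pairwise squared Euclidean distances, shape (n_synthetic, n_real).
    diff = synthetic[:, None, :] - real_X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    sim = np.exp(-gamma * sq_dist)  # similarity values in (0, 1]

    fraud = np.asarray(real_is_fraud, dtype=bool)
    # Lower approximation for a crisp class: the point must be dissimilar
    # to every normal sample, so take the minimum of (1 - similarity).
    lower = np.min(1.0 - sim[:, ~fraud], axis=1)
    # Upper approximation: the point must be similar to at least one
    # fraud sample, so take the maximum similarity.
    upper = np.max(sim[:, fraud], axis=1)
    return lower, upper


def filter_synthetic(synthetic, real_X, real_is_fraud, threshold=0.5, gamma=1.0):
    """Keep synthetic points whose fraud-class membership reaches the threshold."""
    lower, upper = fuzzy_rough_memberships(synthetic, real_X, real_is_fraud, gamma)
    membership = 0.5 * (lower + upper)  # assumed way to combine both approximations
    keep = membership >= threshold
    return synthetic[keep], membership


# Toy data: two normal transactions near (0, 0), two fraud ones near (3, 3).
real_X = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 3.0], [3.2, 3.0]])
real_is_fraud = np.array([0, 0, 1, 1])

# Candidates from two hypothetical generators, pooled before filtering.
from_generator_a = np.array([[3.1, 3.0]])  # lies inside the fraud region
from_generator_b = np.array([[0.1, 0.0]])  # lies inside the normal region
candidates = np.vstack([from_generator_a, from_generator_b])

hybrid_set, membership = filter_synthetic(candidates, real_X, real_is_fraud)
print(hybrid_set)
print(membership)
```

In this toy setup the candidate near the fraud samples passes the filter and the one near the normal samples is discarded. In practice the similarity function, the neighborhood used as local information, and the thresholds would all need to be chosen per data set.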