With the spread of bank credit card services, credit card spending has become the most popular mode of consumption among young people. Some customers abuse this system by committing credit card fraud, which disrupts market order and causes large economic losses to individuals, banks, and countries. To address this problem, this thesis builds a credit card fraud detection system that uses cardholders' historical consumption information and default records as experimental data to identify abnormal cardholders and issue timely alerts, thereby reducing banks' losses. Credit card fraud detection faces several difficulties, however: the data are large in volume, high in dimension, and extremely imbalanced, which makes it hard to classify transactions accurately. To handle these difficulties, we use an autoencoder (AE) for feature extraction, and Ensemble Learning and Cost-Sensitive Learning to deal with the data imbalance.

The data come from Kaggle and consist of credit card transactions involving more than 800 merchants and 5,000 customers from January 2019 to June 2020. The data set covers both legitimate and fraudulent transactions and contains nearly 1.3 million instances. The ratio of legitimate to fraudulent transactions is 99.5:0.5, indicating a severe imbalance. Data processing begins with preprocessing, which includes checking for missing and duplicate values, handling outliers, converting data formats, and deriving new variables. An exploratory data analysis follows, with descriptive and visual analyses of the variables, first univariate and then bivariate, to explore the structural characteristics of each variable and the relationship between the independent variables and the target variable. For feature extraction, the AE compresses 9 numerical variables into 6 in order to eliminate correlation and reduce the dimension. Three Ensemble Learning models, Random Forest (RF), LightGBM, and AdaCost, are selected as prediction models, and recall, F2-score, and AUC are used as the evaluation criteria. Finally, the feature importance scores are analyzed on the optimal model.

The results show that the AE+LightGBM model has the best predictive performance, with all indicators higher than those of the AE+RF and AE+AdaCost models. Compared with the LightGBM model without the AE, the accuracy, precision, F2-score, and AUC of the AE+LightGBM model (98.70%, 30.45%, 0.6689, and 0.9962, respectively) improve slightly (by 0.05%, 0.24%, 0.0013, and 0.0003, respectively), while the recall (95.45%) decreases by 0.38%. The AE+LightGBM model also has the best generalization ability: the differences between the training and test sets in precision, recall, F2-score, and AUC (0.84%, 3.67%, 0.0225, and 0.0012, respectively) are smaller than those of the AE+RF model (1.54%, 9.75%, 0.0493, and 0.0031, respectively) and the AE+AdaCost model (0.98%, 4.24%, 0.0265, and 0.0068, respectively). Importance scores for all variables were obtained from the LightGBM model. Numerical features such as the total payment, the population of the residential area, and the cardholder's age have a significant impact on whether a transaction is fraudulent, and categorical features such as payment type and payment time period also play an important role in the prediction.

In conclusion, the AE+LightGBM model is the best in both predictive performance and generalization performance, and using the AE to reduce the dimensionality of the numerical variables does not degrade the model's predictive performance.
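
As a brief illustration of the F2-score used as an evaluation criterion above, the following minimal sketch computes the general F-beta score from precision and recall, assuming the standard definition in which beta = 2 weights recall more heavily than precision. The function name `f_beta` is chosen here for illustration and is not taken from the thesis. Applied to the precision and recall reported for the AE+LightGBM model, it reproduces the reported F2-score.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Return the F-beta score, a weighted harmonic mean of precision and recall.

    With beta > 1, recall counts more than precision, which suits fraud
    detection, where a missed fraud is usually costlier than a false alarm.
    """
    if precision == 0.0 and recall == 0.0:
        # Avoid division by zero when the model finds no true positives.
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision (30.45%) and recall (95.45%) reported for the AE+LightGBM model.
f2 = f_beta(0.3045, 0.9545, beta=2.0)
print(round(f2, 4))  # 0.6689
```

The low precision combined with a high F2-score shows why this metric was preferred over plain accuracy or F1 for such imbalanced data: it rewards the model mainly for catching fraudulent transactions.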