A Study On Fraud Detection Based On Machine Learning

Posted on: 2021-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: D Gao
Full Text: PDF
GTID: 2518306104989319
Subject: Management Science and Engineering
Abstract/Summary:
With the development of the Internet and communication technology, online payment has become one of the most commonly used payment methods in China, with the daily number of transactions and the daily transaction amount both reaching the hundreds of millions. As online payment has grown in popularity, fraudulent transactions have become a recurring problem, and detecting them is now one of the challenges faced by payment service providers. Fraudulent transaction data is typically imbalanced and high-dimensional, which makes it very difficult to apply traditional machine learning algorithms to online fraud detection. To address these problems, this thesis uses real transaction data to study how data sampling and dimensionality reduction techniques can be used to detect online fraudulent transactions. The data comes from a data mining competition hosted by the Kaggle community. The training set contains 206,689 records, of which fraudulent transactions account for 3.5%.

The thesis first introduces the relevant machine learning algorithms, including KNN, decision trees, and neural networks. It then analyzes the two main characteristics of the transaction data: extreme class imbalance and high dimensionality. Data imbalance means that fraudulent transactions are very few compared with non-fraudulent ones. The distribution of imbalanced data in the feature space is more complex, so traditional machine learning algorithms perform poorly on imbalanced problems. High dimensionality also makes it difficult to apply traditional machine learning algorithms to transaction data. In a high-dimensional space, nearest-neighbor search gives rise to the data hub phenomenon, in which some points are much more likely than others to appear as neighbors of other sample points. As a result, traditional algorithms that synthesize new data from distance-based nearest neighbors perform poorly.

To address these problems, the thesis combines data sampling and dimensionality reduction to improve algorithm performance. To reduce the impact of data imbalance, it proposes a down-sampling method based on bad hubs: hub points whose labels are inconsistent with those of their neighbor points are identified as bad hubs. Because bad hubs are likely to mislead the algorithm, they are removed from the data during modeling. For the high-dimensionality problem, the thesis combines down-sampling with recursive feature elimination (RFE) for dimensionality reduction and applies the combination to machine learning algorithms. The final experiments show that combining the two improves the performance of the random forest algorithm on fraudulent transaction data. The results can serve as a reference for researchers and practitioners in the field.
Keywords/Search Tags: fraud detection, machine learning, imbalanced data, high dimensionality
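The bad hub idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation, and the abstract gives no algorithmic details, so everything here is an assumption made for illustration: a hub is taken to be a point that appears in other points' k-nearest-neighbor lists at least `hub_factor` times the average, a hub is called bad when more than `bad_ratio` of those appearances come from points with a different label, and only majority-class bad hubs are dropped. The function names, the thresholds, and the toy data are all hypothetical.

```python
from collections import Counter
from math import dist


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: dist(points[i], points[j]))
    return others[:k]


def find_bad_hubs(points, labels, k=3, hub_factor=2.0, bad_ratio=0.5):
    """Return sorted indices of hubs that mostly neighbor differently labeled points.

    hub_factor and bad_ratio are illustrative thresholds, not values from the thesis.
    """
    occurrences = Counter()      # how often j appears in another point's k-NN list
    bad_occurrences = Counter()  # the same count, restricted to label mismatches
    for i in range(len(points)):
        for j in k_nearest(points, i, k):
            occurrences[j] += 1
            if labels[i] != labels[j]:
                bad_occurrences[j] += 1
    # Every point contributes k list entries, so the mean occurrence count is k.
    hub_threshold = hub_factor * k
    return sorted(
        j for j, n in occurrences.items()
        if n >= hub_threshold and bad_occurrences[j] / n > bad_ratio
    )


def drop_majority_bad_hubs(points, labels, majority_label=0, **kwargs):
    """Down-sample by removing bad hubs that belong to the majority class."""
    bad = {j for j in find_bad_hubs(points, labels, **kwargs)
           if labels[j] == majority_label}
    kept = [i for i in range(len(points)) if i not in bad]
    return [points[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: label 1 = fraud, label 0 = non-fraud.
# The non-fraud point at the origin is the nearest neighbor of all four fraud points.
toy_points = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (10, 10), (10, 11)]
toy_labels = [0, 1, 1, 1, 1, 0, 0]

bad_hubs = find_bad_hubs(toy_points, toy_labels, k=1)
clean_points, clean_labels = drop_majority_bad_hubs(toy_points, toy_labels, k=1)
```

The neighbor search here is brute force and quadratic in the number of points, which is fine for a toy example but would need an indexed nearest-neighbor search on a dataset of the size the abstract mentions.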