A Study On Fraud Detection Based On Machine Learning

Posted on:2021-10-30

Degree:Master

Type:Thesis

Country:China

Candidate:D Gao

GTID:2518306104989319

Subject:Management Science and Engineering

Abstract/Summary:

With the development of the Internet and communication technology, online payment has become one of the payment methods most commonly used by the Chinese people, with the number of transactions and transaction amounts occurring every day reaching hundreds of millions. As online payment has grown in popularity, fraudulent transactions have also become a recurring problem, and detecting them is now one of the challenges faced by payment service providers. Fraudulent transaction data is typically imbalanced and high-dimensional, which makes it difficult to apply traditional machine learning algorithms to online fraud detection. To address these problems, this thesis uses real transaction data to study how sampling and dimensionality reduction techniques can be applied to the detection of online fraudulent transactions. The data comes from a data mining competition held by the Kaggle community. There are 206,689 training records in total, of which fraudulent transactions account for 3.5%.

The thesis first introduces the relevant machine learning algorithms, including KNN, decision trees, and neural networks. It then analyzes the two main characteristics of the transaction data: extreme class imbalance and high dimensionality. Data imbalance means that the number of fraudulent transactions is very small compared with the number of non-fraudulent transactions. The distribution of imbalanced data in the feature space is more complicated, so traditional machine learning algorithms perform poorly on imbalanced problems. High dimensionality causes further difficulties. In a high-dimensional space, nearest-neighbor search gives rise to the hub phenomenon, in which some points (hubs) are much more likely than others to appear as neighbors of other sample points. As a result, traditional methods that synthesize new samples from distance-based nearest neighbors perform poorly on this data.

In response to these problems, the thesis combines data sampling and dimensionality reduction to improve algorithm performance. To reduce the impact of data imbalance, a down-sampling method based on bad hubs is proposed: hubs whose labels are inconsistent with the labels of their neighboring points are identified as bad hubs. Because bad hubs are likely to mislead the algorithm, they are removed from the data during modeling. To handle high dimensionality, the thesis combines down-sampling with recursive feature elimination (RFE) for dimensionality reduction and applies the combination to machine learning algorithms. The final experiments show that combining the two improves the performance of the random forest algorithm on fraudulent transaction data. The results of the thesis can serve as a reference for researchers and practitioners in the field.
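The abstract does not give the exact procedure for identifying bad hubs, so the following is only a minimal sketch of one plausible reading, not the thesis's implementation. It counts how often each point appears in the k-nearest-neighbor lists of other points (its k-occurrence), treats points with a high count as hubs, and flags a hub as bad when most of the points that list it as a neighbor carry a different label. The function names, the thresholds `hub_factor` and `bad_ratio`, the choice to remove only majority-class bad hubs, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of points[i] (Euclidean), excluding i."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    dists.sort()  # ties are broken by the smaller index
    return [j for _, j in dists[:k]]


def find_bad_hubs(points, labels, k=3, hub_factor=2.0, bad_ratio=0.5):
    """Return indices of hubs that mostly neighbor points with a different label.

    hub_factor and bad_ratio are illustrative thresholds (assumptions).
    """
    occurrences = Counter()      # times a point appears in another point's k-NN list
    bad_occurrences = Counter()  # same, counting only lists of differently labeled points
    for i in range(len(points)):
        for j in k_nearest(points, i, k):
            occurrences[j] += 1
            if labels[j] != labels[i]:
                bad_occurrences[j] += 1
    # The average k-occurrence is exactly k, so a hub is a point well above k.
    hub_threshold = hub_factor * k
    return [
        j for j in range(len(points))
        if occurrences[j] >= hub_threshold
        and bad_occurrences[j] / occurrences[j] > bad_ratio
    ]


def downsample_bad_hubs(points, labels, majority_label=0, **kwargs):
    """Drop bad hubs that belong to the majority class (an assumed design choice)."""
    bad = {j for j in find_bad_hubs(points, labels, **kwargs)
           if labels[j] == majority_label}
    keep = [j for j in range(len(points)) if j not in bad]
    return [points[j] for j in keep], [labels[j] for j in keep]


# Toy data: label 0 = non-fraud (majority), label 1 = fraud (minority).
# The non-fraud point at the origin sits in the middle of four fraud points.
points = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
          (10, 10), (10, 11), (11, 10), (20, 20), (20, 21)]
labels = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]

bad = find_bad_hubs(points, labels, k=1)
kept_points, kept_labels = downsample_bad_hubs(points, labels, k=1)
print(bad)               # [0]
print(len(kept_points))  # 9
```

In the toy data, the origin is the nearest neighbor of all four fraud points while carrying the non-fraud label, so it is flagged and removed. The point (10, 10) is also a hub, but its neighbors share its label, so it is kept. This brute-force neighbor search is quadratic in the number of samples; a dataset of the size described in the abstract would need an indexed nearest-neighbor search instead.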

Keywords/Search Tags:

fraud detection, machine learning, imbalanced data, high dimensionality

Related items

1	A Comparison Of Machine Learning Methods For Credit Card Fraud Detection
2	Design And Implementation Of Online Fraud Detection Algorithm
3	Research And Application Of Boundary Loss Function For Imbalanced Data Set
4	Identification Of Fraud Detection Based On High Dimensional And Unevenly Distributed Online Transaction Data
5	The Analysis And Detection Of Fraud Android APPs Based On Big Data
6	Credit Card Fraud Detection Based On Fuzzy 2-norm Quadratic Surface Support Vector Machine
7	Research On Machine Learning Method For Discriminating Fraud Customers By Credit Investigation Data
8	An Imbalanced Approach Towards Credit Card Fraud Detection Using Proximity Based Resampling And Classifier Ranking
9	Research On Anti-fraud Of Network Lending Based On Machine Learning Algorithm
10	Research On Weighted Extreme Learning Machine Algorithm Based On Imbalanced Data Distribution