
Research And Application Of Boundary Loss Function For Imbalanced Data Set

Posted on: 2022-12-29    Degree: Master    Type: Thesis
Country: China    Candidate: P Wu    Full Text: PDF
GTID: 2518306611957659    Subject: Electronic Science and Technology
Abstract/Summary:
The imbalanced classification problem is a pattern classification problem in which the numbers of training samples differ greatly across classes. In an imbalanced dataset, the class with many samples is called the majority class, and the class with relatively few samples is called the minority class. Machine learning models trained on imbalanced data tend to be biased toward the majority class.

There are two common families of methods for imbalanced classification. The first works at the data preprocessing stage, mainly through resampling methods such as oversampling and undersampling. The second works at the algorithm level, mainly through ensemble classifiers and cost-sensitive learning. Cost-sensitive learning assigns different weights to samples of different classes when training a classifier. In deep learning it is generally realized by modifying the loss function of the neural network: weighting terms are added to the commonly used cross-entropy loss (CEL) so that minority-class samples contribute more to the loss than majority-class samples, which alleviates the imbalance. Focal Loss (FL) is one of the better-performing cost-sensitive loss functions; it addresses class-imbalanced learning by down-weighting easily classified samples. However, FL has a limitation: it concentrates too heavily on hard-to-classify samples near the classification boundary. This thesis therefore designs a new loss function, Boundary Focal Loss (BFL), which improves learning performance mainly by reducing the weight of particularly difficult samples (noise).
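For reference, the following is a minimal PyTorch sketch of the standard binary Focal Loss, together with one hedged, hypothetical reading of the boundary idea described above. The exact form of BFL is defined in the thesis itself and is not reproduced in this abstract, so the `boundary` parameter and the clamping rule below are assumptions used only for illustration.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Standard binary Focal Loss (Lin et al., 2017).

    logits:  raw model outputs, shape (N,)
    targets: float 0/1 labels, shape (N,)
    """
    # Per-sample cross-entropy, computed from logits for numerical stability.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss of easy, well-classified samples.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def boundary_focal_loss(logits, targets, alpha=0.25, gamma=2.0, boundary=0.05):
    """HYPOTHETICAL reading of a boundary-limited focal loss: the focal weight of
    samples whose true-class probability falls below `boundary` (likely noise) is
    capped, so extremely hard samples are no longer the most heavily weighted ones.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Clamping p_t at `boundary` limits the weight assigned to near-impossible samples.
    w = (1 - torch.clamp(p_t, min=boundary)) ** gamma
    return (alpha_t * w * ce).mean()
```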
BFL is first evaluated in three domains: credit card fraud detection, enzyme function classification, and cancer classification. Credit card fraud detection is a common imbalanced classification problem in everyday life, and solving it properly is very important for financial institutions. Enzymes are essential proteins that act mainly as catalysts in biological reactions in vivo and play a vital role in regulating biological processes; there are seven kinds of enzyme functions, and the number of enzymes in each functional class is extremely uneven. Cancer is a fatal disease worldwide, and computer-aided cancer diagnosis has made steady progress in recent years. For these three imbalanced problems, this study extracts data features with appropriate methods, constructs suitable deep learning classifiers, and uses CEL, FL, and BFL in turn as the loss function of each classifier. The experimental results verify that BFL performs better than the other two loss functions.

Finally, this study applies BFL to the plant DNA promoter prediction problem and constructs a high-performance predictor, iPPT-BFL. A DNA promoter is a short DNA sequence close to the start codon that is responsible for initiating the transcription of a specific gene in the genome, and accurately identifying promoters helps to better understand transcriptional regulation. The predictor combines a long short-term memory (LSTM) network with a fully connected network to form a deep neural network model. Five-fold cross-validation shows that the model achieves its best performance when BFL is used as the loss function. iPPT-BFL also outperforms two benchmark machine learning models, random forest and support vector machine, and compared with the recently reported DNA promoter prediction tool iPPT-CNN, it achieves better classification performance on both the training set and the independent test set.

In summary, this study proposes a new loss function, BFL, for machine learning on imbalanced data. BFL achieves good performance on several typical problems and can also be applied to other imbalanced learning tasks, which is of positive significance for improving deep neural network models.
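As an illustration of the kind of architecture described above, the sketch below combines an LSTM with fully connected layers for binary promoter classification of one-hot encoded DNA sequences. The layer sizes, input encoding, and bidirectionality are assumptions made for illustration only; the actual iPPT-BFL architecture and hyperparameters are specified in the thesis.

```python
import torch
import torch.nn as nn

class PromoterNet(nn.Module):
    """Illustrative LSTM + fully connected classifier for one-hot DNA sequences."""
    def __init__(self, n_bases=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_bases, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, 64),
            nn.ReLU(),
            nn.Linear(64, 1),          # one logit: promoter vs. non-promoter
        )

    def forward(self, x):              # x: (batch, seq_len, n_bases), one-hot encoded
        out, _ = self.lstm(x)          # out: (batch, seq_len, 2 * hidden)
        return self.classifier(out[:, -1, :]).squeeze(-1)
```

Such a model would be trained with a loss such as the focal-style sketch above and evaluated with five-fold cross-validation, as described in the abstract.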
Keywords/Search Tags:Machine Learning with Imbalanced Data, Boundary Focal Loss, DNA Promoter, Credit Card Fraud, Enzyme Classification, Cancer Classification