
Research On Imbalanced Dataset Based On Neural Network

Posted on: 2022-07-18    Degree: Master    Type: Thesis
Country: China    Candidate: Y M Zhu    Full Text: PDF
GTID: 2518306785976309    Subject: Automation Technology
Abstract/Summary:
In machine learning and data mining, classification is a central problem. Although existing classification algorithms have achieved great success in many practical applications, deeper research has shown that when the number of samples differs greatly across classes, a class-imbalance problem arises: the decision surface learned by a traditional algorithm tends to shift toward the majority class, which greatly degrades model performance and, in severe cases, can cause the model to fail entirely. The main contributions of this thesis are as follows.

First, this thesis proposes a boosting-tree classification model based on Focal Loss. By analyzing the meaning of the hard-to-separate region in the sample distribution, and the significance of the samples in that region for the class-imbalance problem, it shows that focusing on those samples can indeed alleviate class imbalance. By increasing the share of the hard-to-separate samples' loss in the overall loss, the model gradually converges toward the optimal decision surface within that region. In addition, following the idea that a high-complexity model can substitute for a linear one, this thesis replaces the linear model in the generalized linear model with a boosting-tree model to fit the log-odds. To optimize the boosting tree with Focal Loss, the thesis derives the update formula applied to the trees at each iteration.

Second, this thesis proposes an ensemble classification model based on the EM algorithm. A Gaussian mixture model is fitted to the minority-class samples via EM, and a cluster-boundary division method for the mixture model is proposed to obtain more accurate anisotropic, non-spherical cluster boundaries, which can more accurately
exclude regions that are meaningless for the minority class. According to the imbalance ratio of each cluster produced by this division, different measures are taken to train a classifier on each cluster's data. Every cluster classifier then scores each sample, and the weighted combination of all results gives the sample's final class.

Finally, this thesis proposes an imbalanced sentiment-analysis model based on a modified loss function to address class imbalance in sentiment analysis. The model uses a Bi-GRU to extract semantic information and an attention mechanism to re-weight the hidden-layer outputs, producing a new sentence vector that reflects each word's weight within the sentence. Self-Attention then captures the influence of sentiment polarity between contextual sentences in the document, and a fully connected layer classifies the sentence vector. The thesis further refines the confidence output by the classification model: to screen out samples whose confidence is too low, the confidence interval is divided into a high-confidence region, a low-confidence region, and a suitable-confidence region. This limits the Focal Loss weight increase for low-confidence samples, reduces their proportion in the overall loss, and improves model performance.
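The Focal Loss underlying the first model can be sketched as follows. The binary form below is the standard one; the `gamma` and `alpha` values are the commonly used defaults, not necessarily the thesis's settings:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary Focal Loss.

    p: predicted probability of the positive (minority) class
    y: true label (0 or 1)
    gamma: focusing parameter -- larger values down-weight easy samples more
    alpha: class-balance weight for the positive class
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of well-classified (easy) samples,
    # so samples in the hard-to-separate region dominate the overall loss.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive sample (p = 0.9) contributes far less than a hard one (p = 0.6):
easy = focal_loss(0.9, 1)
hard = focal_loss(0.6, 1)
```

With `gamma = 0` and `alpha = 1` the expression reduces to ordinary cross-entropy, which is why Focal Loss is usually described as a re-weighted cross-entropy.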
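The EM step behind the second model can be illustrated with a minimal one-dimensional, two-component mixture. The thesis fits a full multivariate Gaussian mixture to the minority class and then divides cluster boundaries, so this is only a sketch of the EM algorithm itself, not the proposed method:

```python
import math

def em_gmm_1d(xs, iters=50):
    """Tiny EM fit of a two-component 1-D Gaussian mixture (illustrative only)."""
    mu = [min(xs), max(xs)]          # crude initialization at the extremes
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            w = [pi[k] / math.sqrt(2.0 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2.0 * var[k])) for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate mixing weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk + 1e-6
    return mu, var, pi

# Two well-separated groups: EM should recover means near 0 and 10.
means, variances, weights = em_gmm_1d([-0.5, 0.0, 0.5, 9.5, 10.0, 10.5])
```

In the ensemble model, each recovered cluster would then get its own classifier, trained with measures chosen according to that cluster's imbalance ratio.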
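One plausible reading of the confidence-region modification in the third model is to cap the focal up-weighting below a low-confidence threshold, so that samples with implausibly low confidence (likely noise) cannot dominate the loss. The threshold value and the capping scheme here are assumptions for illustration, not the thesis's exact formulation:

```python
import math

def region_focal_loss(p_t, gamma=2.0, low=0.3):
    """Focal Loss with a capped weight in the low-confidence region.

    p_t: model confidence in the true class.
    Below `low`, the focal weight (1 - p_t)^gamma is frozen at its value
    at the threshold, limiting the weight increase for low-confidence
    samples and reducing their share of the overall loss.
    """
    w = (1.0 - max(p_t, low)) ** gamma   # capped focal weight
    return -w * math.log(p_t)
```

In the suitable-confidence region the loss coincides with plain Focal Loss; only below the threshold does the cap take effect.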
Keywords/Search Tags: class imbalance, hard samples, loss function, mixture model, sentiment analysis