Research On The Method Of Solving Imbalanced Classification Problems Based On Random Forest Algorithm

Posted on:2019-08-08

Degree:Master

Type:Thesis

Country:China

Candidate:M Y Li

Full Text:PDF

GTID:2428330605976158

Subject:Applied Statistics

Abstract/Summary:

Imbalanced data are data sets in which one category contains far more samples than the others, so that sample sizes differ significantly across categories. By convention, the minority samples are called positive samples and the majority samples are called negative samples. When the gap between the two is large, and particularly when there are very few positive samples, the information carried by the positive samples cannot be fully expressed. Traditional classification algorithms therefore often give poor results on imbalanced data: the classifier tends to assign positive samples to the negative class, and so it recognizes positive samples poorly.

This thesis studies the imbalanced classification problem in depth, identifies a breakthrough point for solving it, and uses the random forest algorithm as the classifier model. SMOTE (Synthetic Minority Over-sampling Technique) is a classic data-level method for handling imbalanced data sets. However, it is prone to blindness and marginalization when synthesizing new samples, and these problems often degrade the performance of classification algorithms on such data. First, to address these deficiencies of SMOTE, a novel data-balancing method is proposed, called CT-SMOTE (Central sample-Twice interpolation SMOTE). Second, to combine the advantages of over-sampling and under-sampling on imbalanced data, a CT-SMOTE+TL2 hybrid algorithm is proposed. The hybrid algorithm both avoids blindness in sample synthesis and addresses the problem of marginalization. Finally, a random forest classification model is built on the improved algorithm, providing a complete framework for solving imbalanced data classification problems.

Experimental results show that the proposed algorithm has advantages in dealing with imbalanced data and that it also improves the classification performance of the random forest algorithm. As a result, the classifier recognizes positive samples well and achieves the desired classification result. The improved algorithm also shows good stability. This work has academic significance and application value, and the algorithm can be applied in other fields that face imbalanced data, such as medical diagnosis and anomaly detection.
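For readers unfamiliar with the baseline, the following is a minimal sketch of classic SMOTE interpolation only. The abstract does not give the details of CT-SMOTE or TL2, so neither is reproduced here. The function name `smote`, its parameters, and the toy data are illustrative assumptions, not taken from the thesis. Note that the base sample and its neighbour are chosen at random with no regard to where they sit in the class, which is one plausible reading of the "blindness" the abstract mentions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic minority points made by classic SMOTE interpolation.

    Hypothetical helper for illustration; this is the baseline, not CT-SMOTE.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        # Pick a base sample uniformly at random from the minority class.
        base = rng.choice(minority)
        # Find its k nearest minority neighbours (Euclidean), excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        # Interpolate at a random position on the segment from base to mate.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Toy minority class with two features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2)
```

Every synthetic point lies on a line segment between two existing minority samples, so the new points stay inside the region the minority class already covers. In a full pipeline, the balanced training set would then be passed to a random forest classifier.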

Keywords/Search Tags:

imbalanced data, positive samples, random forest algorithm, SMOTE algorithm

Related items

1	Research On Random Forest Similarity Algorithm
2	Research On The Expansion And Classification Of Several Imbalanced Data Sets Based On C-SMOTE Algorithm
3	Classification Learning Of Imbalanced Data Sets Based On Sampling Processing
4	Research For Imbalanced Big Data Classification Algorithm On Random Forest
5	Research On Imbalanced Data Classification Method Based On Random Forest Algorithm
6	Research On Rotation Forest Algorithm For Imbalanced Data Classification Problem
7	Research On Imbalanced Data Classification Algorithm Based On Random Forest And Its Parallelization
8	Research On Parallel Random Forest And Fuzzy C-Means Algorithm For Imbalanced Data
9	Research And Application Of Classification Technology For Unbalanced Data
10	Research On Talent Turnover Prediction Model Based On Optimized Random Forest Algorithm