Research On The Method Of Solving Imbalanced Classification Problems Based On Random Forest Algorithm

Posted on: 2019-08-08
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Li
Full Text: PDF
GTID: 2428330605976158
Subject: Applied Statistics
Abstract/Summary:
Imbalanced data are datasets in which one category has far more samples than the others, so that sample sizes differ significantly across categories. By convention, the minority samples are called positive samples and the majority samples are called negative samples. Because of the large gap between the two, and particularly when the positive sample size is very small, the information carried by the positive samples cannot be fully expressed. Traditional classification algorithms therefore often give poor results on imbalanced data: the classifier tends to assign positive samples to the negative class, so it has little ability to recognize positive samples. This thesis studies the imbalanced classification problem in depth, identifies a breakthrough point for solving it, and adopts the random forest algorithm as the classifier model. The SMOTE (Synthetic Minority Over-sampling Technique) algorithm is a classic data-level method for handling imbalanced datasets. However, it is prone to blindness and marginalization when synthesizing new samples, and these problems often make classification algorithms perform worse on such data. First, to address the deficiencies of SMOTE, a new data-balancing method called CT-SMOTE (Central sample-Twice interpolation SMOTE) is proposed. Then, combining the advantages of over-sampling and under-sampling on imbalanced data, a CT-SMOTE+TL2 hybrid algorithm is proposed. The hybrid algorithm both avoids blindness in sample synthesis and addresses the problem of marginalization. Finally, a random forest classification model is built on the improved algorithm, providing a complete framework for solving imbalanced data classification problems. Experimental results show that the proposed algorithm has advantages in handling imbalanced data and that it improves the classification performance of the random forest algorithm. As a result, the classifier recognizes positive samples well and achieves the desired classification result. The research has academic significance and application value. Moreover, the improved algorithm has good stability and can be applied to other fields with imbalance problems, such as medical diagnosis and anomaly detection.
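For orientation, the classic SMOTE step that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of standard SMOTE interpolation only, not the thesis's CT-SMOTE or CT-SMOTE+TL2, whose details the abstract does not give. The function name `smote`, its parameters, and the toy data are assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by classic SMOTE interpolation.

    Hypothetical helper for illustration; not the thesis's CT-SMOTE.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a base minority sample uniformly at random.
        base = rng.choice(minority)
        # Find its k nearest minority neighbours (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Interpolate: x_new = x_base + lam * (x_neighbour - x_base).
        lam = rng.random()
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class (assumed data, two features per sample).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2)
```

Each synthetic point lies on the line segment between a minority sample and one of its minority neighbours. Because every minority sample is equally likely to serve as the base, including noisy ones and ones near the class boundary, this plain form may be one way to read the blindness and marginalization that the abstract attributes to SMOTE.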
Keywords/Search Tags:imbalanced data, positive samples, random forest algorithm, SMOTE algorithm