
Study On Comprehensive Improvement Of Random Forests Algorithm

Posted on: 2020-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: X W Liu
Full Text: PDF
GTID: 2428330578456711
Subject: Statistics

Abstract/Summary:
The random forest (RF) algorithm is widely used across many fields thanks to its advantages: good prediction accuracy, strong noise resistance, few tunable parameters, broad adaptability, and resistance to overfitting. As the algorithm's use has spread, however, some of its disadvantages have become apparent, chiefly poor classification performance when the data set is imbalanced or very large. To address these disadvantages, this paper proposes several improved algorithms, whose core ideas are as follows.

Improved balancing of imbalanced data sets. The SMOTE algorithm does not consider the distribution information of the original data set, which may cause the balanced data set to lose its practical meaning. This paper therefore presents the HD_SMOTE algorithm, which both preserves the distribution information of the original data set and reduces its class imbalance. After nine imbalanced data sets from the UCI repository were balanced with HD_SMOTE, a random forest classifier was applied to each. The results show that the algorithm effectively improves the classification performance of the random forest algorithm on imbalanced data sets.

Improved random forest construction. To address the poor classification performance of the random forest algorithm, this paper improves four stages of its construction process:
1) Sampling. The Bagging sampling used by the original random forest is highly random, so the drawn sample may fail to reflect the information in the original sample set, reducing the validity of the trained classifiers. This paper therefore proposes the C_Bootstrap sampling method, based on the idea of grouped sampling. In classification problems, this method ensures the drawn samples are evenly distributed across the classes, preserving the data structure of the original data set as far as possible.
2) Feature-attribute selection. To address the classifier performance degradation caused by the completely random selection of the feature-attribute set in the random forest algorithm, this paper proposes a grouped feature-selection method combined with factor analysis, which effectively reduces attribute redundancy and improves the classification performance of the algorithm.
3) Node splitting. The original random forest algorithm splits nodes using the Gini coefficient, but the Gini coefficient as applied there handles only binary classification, whereas the gain ratio (GainRatio) extends to multi-class problems; both are rooted in information theory. This paper therefore combines the two indices into a hybrid node-splitting algorithm to improve the classification performance of the random forest algorithm.
4) Classification voting. This paper introduces a weighted, integrated voting rule for the final decision, taking the result with the maximum confidence as the output.

Based on the above improvements, a comprehensively improved random forest algorithm (the CIRF algorithm) is proposed, and its performance is verified on five UCI data sets including Blood. The results show that the CIRF algorithm performs much better than the original RF algorithm. Finally, the paper combines the data-balancing technique described above with the CIRF algorithm and applies them to financial-risk classification in China. The results show that the algorithm has practical application value.
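The abstract's HD_SMOTE is contrasted with baseline SMOTE, whose interpolation step it refines. As a point of reference, the sketch below shows only the standard SMOTE mechanism — synthesizing a minority point on the segment between a sample and one of its k nearest minority neighbours; HD_SMOTE's distribution-preserving modification is not detailed in the abstract and is not reproduced here.

```python
import numpy as np

def smote_sample(minority, k=5, n_new=100, rng=None):
    """Baseline SMOTE oversampling (not HD_SMOTE itself): each synthetic
    point lies on the line segment between a random minority sample and
    one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # distances from x to every minority sample; skip x itself at rank 0
        d = np.linalg.norm(minority - x, axis=1)
        nn = np.argsort(d)[1:k + 1]
        neighbour = minority[rng.choice(nn)]
        # interpolate: x + u * (neighbour - x), u drawn uniformly from [0, 1)
        new.append(x + rng.random() * (neighbour - x))
    return np.array(new)
```

Because every synthetic point is a convex combination of two minority samples, the oversampled set stays inside the minority class's bounding region — the property HD_SMOTE reportedly strengthens by also accounting for the original distribution.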
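The C_Bootstrap sampling method is described only as "grouped sampling" that keeps the drawn sample evenly distributed across classes. A minimal sketch, assuming this means a per-class (stratified) bootstrap — the exact procedure is not specified in the abstract:

```python
import numpy as np

def c_bootstrap(X, y, rng=None):
    """Class-grouped bootstrap sketch: resample with replacement *within*
    each class, so every class keeps its original sample count and the
    class proportions of the original data set are preserved exactly."""
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X), np.asarray(y)
    idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        # draw len(members) samples from this class only
        idx.extend(rng.choice(members, size=len(members), replace=True))
    idx = np.array(idx)
    return X[idx], y[idx]
```

Unlike a plain bootstrap, which can over- or under-represent a class by chance, this variant makes the class distribution of every tree's training sample identical to the original data set's.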
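The hybrid node-splitting criterion combines the Gini index with the gain ratio, but the abstract does not say how the two are combined. The sketch below assumes a simple weighted sum of Gini-impurity reduction and GainRatio (the weight `alpha` is an illustrative assumption, not the thesis's formula):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Shannon entropy (bits) of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(y, groups):
    """C4.5-style gain ratio of a candidate split into `groups`."""
    n = len(y)
    info_gain = entropy(y) - sum(len(g) / n * entropy(g) for g in groups)
    split_info = -sum(len(g) / n * np.log2(len(g) / n)
                      for g in groups if len(g))
    return info_gain / split_info if split_info > 0 else 0.0

def hybrid_score(y, groups, alpha=0.5):
    """Assumed hybrid criterion: alpha * Gini reduction
    + (1 - alpha) * gain ratio. Higher is a better split."""
    n = len(y)
    gini_gain = gini(y) - sum(len(g) / n * gini(g) for g in groups)
    return alpha * gini_gain + (1 - alpha) * gain_ratio(y, groups)
```

For a perfectly separating binary split of `[0, 0, 1, 1]`, the Gini reduction is 0.5 and the gain ratio is 1.0, so with `alpha=0.5` the hybrid score is 0.75.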
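The weighted voting rule is likewise only named. A minimal sketch of the idea — each tree's vote counts in proportion to a per-tree weight (for example its out-of-bag accuracy, which is an assumption here; the abstract says only "weighted and integrated voting"), and the class with the largest total confidence is the output:

```python
def weighted_vote(predictions, weights):
    """Weighted majority vote over per-tree class predictions.
    `weights[i]` scales the vote of tree i; the class accumulating
    the greatest total weight (confidence) is returned."""
    score = {}
    for pred, w in zip(predictions, weights):
        score[pred] = score.get(pred, 0.0) + w
    return max(score, key=score.get)
```

With weights of 0.9, 0.3, 0.3 for votes 'a', 'b', 'b', the single strong tree outweighs the two weak ones, so 'a' wins even though 'b' has the simple majority — the behaviour that distinguishes weighted voting from the original RF's unweighted vote.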
Keywords/Search Tags:RF Algorithm, HD_SMOTE, CIRF Algorithm