
Research On Application And Optimization Method Of Random Forests Algorithm

Posted on: 2022-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: W J Wu
GTID: 2518306527484764
Subject: Applied Mathematics
Abstract/Summary:
The Random Forests (RF) algorithm is a classical ensemble learning algorithm whose model is easy to understand, highly adaptable, and resistant to overfitting, and it is widely used in many fields. However, with the development of the information society and extensive research on RF, its problems have become increasingly prominent. These mainly include: processing efficiency falls as the number of features in the data set grows; the algorithm adapts poorly to dynamic, changing data and classifies dynamic data streams poorly; and higher demands are placed on the performance of the algorithm itself. To address these problems, this thesis proposes several improvement schemes and conducts extensive analytical experiments. The specific research contents are as follows:

(1) Feature selection based on the traditional random forest measure of feature importance becomes computationally expensive as the number of features in the data set grows. To address this, a feature selection algorithm based on multi-feature permutation by random forests (MFPRF) is proposed. The algorithm first clusters the original feature set. The importance score of each feature cluster is then computed by permuting the features of that cluster together, which yields a ranking among clusters. Within each cluster, features are ranked by their correlation with the class information, and a correlation threshold is used to select the important features. The remaining features are ranked first between clusters and then within clusters. Experimental results show that the new algorithm still achieves higher prediction accuracy when fewer features are selected, and that it is more time-efficient than the feature selection algorithm based on the traditional method.

(2) The RF algorithm performs poorly at detecting new classes when classifying a dynamic data stream in which new classes appear. To address this, a completely randomized forest algorithm based on k-nearest neighbors (KCRForest) is proposed. The algorithm builds a completely randomized forest from the known-class samples in the dynamic data stream and divides the sample space into normal and abnormal regions according to the average path length of the samples. When a sample falls into an abnormal region, its outlier score is computed from its k nearest neighbors. If the outlier score exceeds a set threshold, the sample is judged to belong to a new class; otherwise it is judged to belong to a known class. For a known-class sample that falls into an abnormal region, the class distribution is obtained from its k nearest neighbors; otherwise the class distribution obtained during training is used. The label of a known-class sample is then determined by voting. When the number of detected new-class samples reaches a certain threshold, the model is updated with the information from those samples so that it can detect further new classes. Experimental results show that the new algorithm achieves better new-class detection performance and prediction accuracy.

(3) To improve the performance of the RF algorithm itself, an RF algorithm based on ant colony optimization (ACO-RF) is proposed. The algorithm builds on the idea of increasing the strength and diversity of the decision trees in the ensemble forest, and applies the ant colony algorithm to select individual decision trees in the ensemble. Experimental results show that the new algorithm reduces the size of the forest and improves prediction accuracy while preserving the diversity of the decision trees in the ensemble, thereby optimizing the RF algorithm.
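As a rough illustration of the cluster-level permutation idea behind MFPRF in item (1), the sketch below scores each feature cluster by the drop in accuracy after the columns of that cluster are permuted together. It is not the thesis's implementation: the function names, the `clusters` mapping (assumed to come from a prior clustering step), the number of repeats, and the toy data are all assumptions, and a simple rule (`toy_predict`) stands in for a trained random forest so that the example has no dependencies.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def cluster_permutation_importance(predict, X, y, clusters, n_repeats=5, seed=0):
    """Score each feature cluster by the accuracy drop after jointly permuting it.

    predict  : callable mapping one feature row to a label (stand-in for a fitted forest)
    clusters : dict mapping a cluster name to a list of column indices (assumed given)
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    scores = {}
    for name, cols in clusters.items():
        drops = []
        for _ in range(n_repeats):
            # One shared row permutation for the whole cluster, so features
            # inside the cluster keep their relationship to each other.
            order = list(range(len(X)))
            rng.shuffle(order)
            X_perm = [list(row) for row in X]
            for i, src in enumerate(order):
                for c in cols:
                    X_perm[i][c] = X[src][c]
            drops.append(baseline - accuracy(predict, X_perm, y))
        scores[name] = sum(drops) / n_repeats
    return scores

# Toy data (assumption): columns 0 and 1 carry the label, column 2 is noise.
data_rng = random.Random(1)
X = [[data_rng.random(), 0.0, data_rng.random()] for _ in range(200)]
for row in X:
    row[1] = row[0] + 0.01 * data_rng.random()  # column 1 nearly duplicates column 0
y = [int(row[0] > 0.5) for row in X]

def toy_predict(row):
    # Hypothetical stand-in for a trained random forest's prediction.
    return int((row[0] + row[1]) / 2 > 0.5)

scores = cluster_permutation_importance(
    toy_predict, X, y, clusters={"signal": [0, 1], "noise": [2]}
)
# Rank clusters from most to least important.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Permuting a whole cluster at once needs one evaluation pass per cluster rather than one per feature, which is one plausible reading of where the reported time savings come from when the feature count is large.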
Keywords/Search Tags:random forests, feature selection, dynamic data stream, ant colony optimization algorithm