
Research On Random Forest Algorithm Based On Feature Selection And Diversity

Posted on: 2021-03-20  Degree: Master  Type: Thesis
Country: China  Candidate: S F Luo  Full Text: PDF
GTID: 2428330614958324  Subject: Electronic and communication engineering
Abstract/Summary:
With the advent of the Internet of Everything, development across all walks of life has become inseparable from the Internet, and the fields it touches are flooded with large volumes of complex data. The resulting "information overload" makes it difficult for users and enterprise platforms to mine the key information they need from massive data. Ensemble classification models trained with machine learning can handle such large-scale data-processing tasks effectively, but they are limited by difficulties in fitting data features and by the generalization error of the ensemble. This thesis therefore takes the random forest ensemble algorithm as its basis and studies the ensemble model by improving both the base classifier and the ensemble strategy. The main work and improvements of this thesis are as follows:

1. In the data-preprocessing stage of the classification model, to address the difficulty of partitioning feature attributes and of fitting the data set during data processing and feature selection, this design takes features as the primary basis and comprehensively analyzes the correlations among data, features, and categories. A high-efficiency feature subset is filtered out through a feature-importance measure and a p-value test, and the random forest model is then used to study classification accuracy. Simulation on experimental data shows that such a feature subset effectively solves the feature-attribute partitioning problem, thereby improving the precision and recall of the random forest ensemble model.

2. In the large-scale data-classification stage, to address the generalization error caused by redundant base classifiers and insufficient diversity within the random forest ensemble algorithm, this thesis designs an extreme random forest ensemble algorithm that combines feature information and diversity. The algorithm first uses the efficient feature subsets screened by the p-value test and adopts the random tree as the base classification model to introduce more randomness. It then performs structural-redundancy analysis on the random trees to avoid duplicated features at the nodes. Finally, weighted majority voting is used to construct an extreme random forest model with high classification accuracy and diversity. Simulations on several experimental data sets show that the proposed algorithm effectively reduces the model's generalization error and improves the fault tolerance and data-fitting ability of the ensemble algorithm.
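The two stages described above can be sketched as follows. This is a minimal illustration under assumed details, not the thesis's actual implementation: the p-value test is realized here as a univariate ANOVA F-test, feature importance comes from a preliminary random forest, and the per-tree voting weights are accuracies on the rows each bootstrap left out. Thresholds, ensemble sizes, and the synthetic data set are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import ExtraTreeClassifier

X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Stage 1: filter a feature subset by importance measure and p-value test.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
_, pvals = f_classif(X_tr, y_tr)                      # univariate p-values
keep = (rf.feature_importances_ > np.median(rf.feature_importances_)) \
       & (pvals < 0.05)
X_tr_s, X_te_s = X_tr[:, keep], X_te[:, keep]

# Stage 2: extremely randomized trees as base classifiers, combined by
# weighted majority voting; each tree is weighted by its accuracy on the
# rows its bootstrap sample did not draw.
rng = np.random.RandomState(0)
trees, weights = [], []
for _ in range(50):
    idx = rng.randint(0, len(X_tr_s), len(X_tr_s))    # bootstrap sample
    oob = np.setdiff1d(np.arange(len(X_tr_s)), idx)   # held-out rows
    t = ExtraTreeClassifier(random_state=rng).fit(X_tr_s[idx], y_tr[idx])
    trees.append(t)
    weights.append(t.score(X_tr_s[oob], y_tr[oob]) if len(oob) else 1.0)

def weighted_vote(X):
    """Accumulate each tree's weight on its predicted class, take argmax."""
    votes = np.zeros((len(X), 2))
    for t, w in zip(trees, weights):
        votes[np.arange(len(X)), t.predict(X)] += w
    return votes.argmax(axis=1)

acc = (weighted_vote(X_te_s) == y_te).mean()
print(f"weighted-vote ensemble accuracy: {acc:.2f}")
```

The weighting step is where this sketch departs from a plain random forest: a uniform majority vote would treat an overfit, redundant tree the same as a well-generalizing one, whereas accuracy-based weights down-weight weak members of the ensemble.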
Keywords/Search Tags: ensemble classification model, random forest, feature selection, p-value test, extreme random forest