The advent of the Internet and the Internet of Things has led to explosive growth in data volume. Faced with data that are massive, heterogeneous, and of low value density, the ability to process and analyze them efficiently through machine learning has become key to a data owner's core competitiveness. Because raw data at this scale are often inaccurate, incomplete, high-dimensional, and drawn from very large samples, rough set theory shows unique research value, and its application in ensemble learning advances the engineering of parallel computing. This paper studies several open problems of rough sets: weak integration with ensemble models, difficulty of discretization, high time complexity, and poor generalization.

First, the paper combines neighborhood rough set feature selection with the random forest, improving the classification ability of the base classifiers by taking the feature subspace as the starting point. Building on the original subset search, a randomized procedure lets the neighborhood rough set generate a large number of distinct candidate feature subsets. Experimental results show that introducing multiple feature subsets improves overall system performance and is more interpretable than a fixed-size random feature subspace chosen by rule of thumb. In a comparison involving seven machine learning methods on five datasets, performance improved by up to 11.2% and remained more stable as the feature dimensionality increased.

Second, after analyzing the difficulty of manually choosing the neighborhood radius in practical applications of the neighborhood rough set, and building on related work, a clustering algorithm is proposed and combined with the rough set to partition continuous numerical attributes. It also offers guidance on the number of neighborhoods or cluster centers, reducing manual tuning and the cost of feature evaluation when computing attribute importance. In addition, by decoupling the neighborhood rough set from any specific algorithm through ensemble feature selection, lasso regression is added during the subset search to reduce the correlation among core feature subsets. Combined with seven commonly used classifiers, the approach shows better stability and feature compression.

The primary contributions and innovations of the paper are as follows. The proposed neighborhood rough random forest improves the feature-subspace generation process of the basic random forest and yields a significant performance improvement on high-dimensional data. Ensemble feature selection combined with clustering reduces the repetitive work of selecting the hyperparameters of the neighborhood rough set. The application scenario, meanwhile, changes from a wrapper-embedded method coupled to the random forest to a wrapper-filter method applicable to more classification algorithms, which promotes clustering-based neighborhood rough set metrics and ensemble feature selection. The proposed method reduces reliance on empirical rules.
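As a minimal illustration of the first idea, the Python sketch below computes the neighborhood dependency degree of a feature subset and runs one randomized greedy forward search over attributes; all names, the radius value, and the stopping threshold are assumptions for illustration rather than details taken from this work, and features are assumed to be scaled to [0, 1] so that a single radius is meaningful.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def neighborhood_dependency(X, y, subset, delta=0.15):
        # Dependency degree gamma_B(D): the fraction of samples whose
        # delta-neighborhood in the subspace B contains only samples of
        # the same class (the size of the positive region over |U|).
        sub = X[:, subset]
        nn = NearestNeighbors(radius=delta).fit(sub)
        neighbors = nn.radius_neighbors(sub, return_distance=False)
        consistent = sum(1 for i, idx in enumerate(neighbors)
                         if np.all(y[idx] == y[i]))
        return consistent / len(y)

    def random_candidate_subset(X, y, delta=0.15, eps=1e-3, seed=None):
        # One randomized greedy forward search: start from a randomly chosen
        # attribute and keep adding the attribute that most increases the
        # dependency degree, stopping once the gain falls below eps.
        rng = np.random.default_rng(seed)
        remaining = list(range(X.shape[1]))
        subset = [remaining.pop(rng.integers(len(remaining)))]
        gamma = neighborhood_dependency(X, y, subset, delta)
        while remaining:
            gains = [neighborhood_dependency(X, y, subset + [f], delta)
                     for f in remaining]
            best = int(np.argmax(gains))
            if gains[best] - gamma < eps:
                break
            gamma = gains[best]
            subset.append(remaining.pop(best))
        return subset

    # Different seeds yield different candidate subsets; each tree of the
    # ensemble can then be grown on one of these subsets instead of a
    # fixed-size random feature subspace.
    # candidates = [random_candidate_subset(X, y, seed=s) for s in range(100)]

Repeating the search with different random seeds produces the pool of distinct candidate feature subsets that, in this scheme, would replace the fixed-size random subspaces of the basic random forest.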