| In the era of big data, with the explosive growth of information and the rapid development of the Internet, classification problems have attracted wide attention in data mining, pattern recognition, and other research fields. Feature selection, an indispensable step in data preprocessing, improves classifier efficiency, increases classification accuracy, and reduces the number of features by eliminating redundant and irrelevant ones. The ReliefF algorithm is a filter-type feature selection algorithm that runs efficiently and can handle multi-class datasets. In practical applications, however, ReliefF considers only feature weights, so it removes redundant features poorly and performs poorly on imbalanced datasets. Rough set theory, a mathematical tool for dealing with fuzzy and uncertain knowledge, requires no a priori knowledge and defines knowledge formally from the data itself using equivalence relations. It performs well in attribute reduction of decision tables and can effectively remove redundant attributes from them. In this thesis, feature selection methods based on the ReliefF algorithm, rough sets, and mutual information are studied, and the classification performance of the selected feature subsets is evaluated with a naive Bayes classifier and a support vector machine. The main research findings are as follows: (1) A stratified sampling ReliefF algorithm is proposed to address the poor performance of ReliefF on imbalanced datasets. The algorithm replaces ReliefF's random sampling with stratified (per-class) sampling, so that the same number of samples is drawn from each class. As a result, the computed feature weights are not skewed toward classes with more samples, which increases the stability of the algorithm. The experimental results show that the subset of 
features selected by the stratified sampling ReliefF algorithm has higher classification accuracy than that selected by the ReliefF algorithm on imbalanced datasets. (2) A feature selection method based on the stratified sampling ReliefF algorithm and rough sets is proposed to address the inability of the stratified sampling ReliefF algorithm to remove redundant features. The method adds the attribute reduction algorithm of classical rough set theory to the stratified sampling ReliefF algorithm, using the stratified sampling ReliefF algorithm as the search strategy and rough set attribute reduction as the evaluation criterion for removing redundant features. The experiments show that this method selects feature subsets with fewer features and can improve classification accuracy, while keeping accuracy no lower than that of the stratified sampling ReliefF algorithm. (3) Threshold selection is a difficult problem in the feature selection method based on the stratified sampling ReliefF algorithm and rough sets, and the choice of threshold directly affects the performance of the method. Choosing the threshold empirically can degrade the classification performance of the feature subsets, while choosing it according to classifier accuracy increases the time cost. To avoid this, a feature selection method based on the stratified sampling ReliefF algorithm and mutual information is proposed, which introduces the mutual information between pairs of features. The method first adds the features with larger feature weights to the feature subset, then adds the features that have the smallest mutual information with them, and finally uses rough sets as the evaluation criterion to remove redundant features. The method does not require a threshold to be set during the calculation. The experimental results show that the feature subsets 
selected by this method have higher classification accuracy and better stability on the datasets. |
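
The sampling change described in finding (1) can be illustrated with a minimal sketch. It assumes only what the abstract states, namely that the same number of samples is drawn from every class; the function name `stratified_sample_indices`, the `per_class` parameter, and the toy labels are hypothetical and are not taken from the thesis.

```python
import random
from collections import defaultdict


def stratified_sample_indices(labels, per_class, seed=0):
    """Draw the same number of sample indices from every class.

    Hypothetical helper: plain ReliefF would instead draw indices
    uniformly at random from the whole dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        # Never ask for more samples than the class contains.
        k = min(per_class, len(members))
        chosen.extend(rng.sample(members, k))
    return chosen


# Toy imbalanced dataset: 90 samples of class "a", 10 of class "b".
labels = ["a"] * 90 + ["b"] * 10
sampled = stratified_sample_indices(labels, per_class=5)
```

With uniform random sampling, roughly nine of every ten drawn samples would come from class "a"; here each class contributes five, which is the balancing effect the abstract attributes to the stratified variant.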
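
For finding (2), one possible reading of "ReliefF as the search strategy, rough set attribute reduction as the evaluation criterion" is sketched below. It uses the standard rough-set dependency degree (the fraction of objects in the positive region) and tries to drop features in order of increasing weight. The helper names, the drop order, and the toy decision table are assumptions made for illustration; the thesis may define the procedure differently.

```python
from collections import defaultdict


def dependency(rows, decisions, attrs):
    """Rough-set dependency degree of the decision on `attrs`.

    An object is in the positive region when every object sharing its
    values on `attrs` has the same decision.
    """
    decisions_seen = defaultdict(set)
    sizes = defaultdict(int)
    for row, d in zip(rows, decisions):
        key = tuple(row[a] for a in attrs)
        decisions_seen[key].add(d)
        sizes[key] += 1
    positive = sum(sizes[k] for k, ds in decisions_seen.items() if len(ds) == 1)
    return positive / len(rows)


def drop_redundant(rows, decisions, ranked_attrs):
    """Remove attributes whose removal leaves the dependency unchanged.

    `ranked_attrs` is assumed to be ordered from highest to lowest
    feature weight, so the lowest-weighted attributes are tried first.
    """
    full = dependency(rows, decisions, ranked_attrs)
    kept = list(ranked_attrs)
    for a in reversed(ranked_attrs):
        trial = [x for x in kept if x != a]
        if trial and dependency(rows, decisions, trial) == full:
            kept = trial
    return kept


# Toy decision table: attribute 0 determines the decision, attribute 2
# mirrors attribute 0 (redundant), attribute 1 is irrelevant.
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
decisions = [0, 0, 1, 1]
reduced = drop_redundant(rows, decisions, ranked_attrs=[0, 2, 1])
```

In this toy table the reduction keeps only attribute 0, since neither the irrelevant attribute nor the redundant copy changes the dependency degree.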
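
The mutual-information step in finding (3) can be sketched for discrete features as follows. The estimator is the usual plug-in formula for mutual information in bits; the function names, the dictionary of feature columns, and the toy data are hypothetical, and the sketch covers only the "smallest mutual information" choice, not the full method.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi


def least_redundant(selected, candidates, columns):
    """Pick the candidate sharing the least information with `selected`."""
    return min(
        candidates,
        key=lambda name: mutual_information(columns[selected], columns[name]),
    )


# Toy features: f2 duplicates f1, f3 is independent of f1.
columns = {
    "f1": [0, 0, 1, 1],
    "f2": [0, 0, 1, 1],
    "f3": [0, 1, 0, 1],
}
partner = least_redundant("f1", ["f2", "f3"], columns)
```

Because the partner feature is chosen by a minimum rather than by comparison against a cutoff, no threshold has to be supplied, which is consistent with the abstract's claim about this step.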