With the rapid development of information technology and the wide application of high-tech products, the data acquired in many industries are high-dimensional and highly complex in both quantity and content. These data contain rich and useful features, but also a large number of irrelevant and redundant features and noise, which degrades the performance of data analysis models, steadily weakens their generalization ability, and puts great pressure on data analysis and visualization. It is therefore necessary to eliminate irrelevant and redundant features from the data. Feature selection is a basic preprocessing step performed before the actual learning task: it reduces the dimensionality of the data by eliminating irrelevant or redundant features, thereby shortening model training time and improving model accuracy. On the other hand, most data obtained in practice are unlabeled, and manually labeling large amounts of high-dimensional data is expensive and impractical. When samples lack category labels, supervised feature selection methods are no longer applicable, so it is necessary to study unsupervised feature selection methods for unlabeled data. In recent years, owing to its effectiveness and low computational complexity, the Hilbert-Schmidt independence criterion (HSIC) has been shown to be applicable to feature selection problems. However, most HSIC-based feature selection methods suffer from the following limitations. First, these methods are usually applicable only to labeled data, which is undesirable because most data in real applications are unlabeled. Second, existing HSIC-based unsupervised feature selection methods only capture the general correlation between the selected features and an output value expressing the underlying clustering structure, while ignoring the redundancy between
different features. To address these two problems, the specific work of this paper is as follows:

1. The relevant basic knowledge of feature selection and HSIC is introduced in detail. The typical HSIC-based feature selection methods are systematically reviewed, and their advantages and disadvantages are analyzed.

2. A novel unsupervised feature selection algorithm named UFSHSIC is proposed. The algorithm first treats the input data themselves as labels, so as to better explore the correlation between a feature subset and the overall sample structure. HSIC is then used as the correlation criterion to measure both the correlation between features and the overall sample structure and the redundancy between features. Finally, a backward elimination strategy is used to select features. Experiments on seven UCI data sets verify the effectiveness of the proposed UFSHSIC algorithm.

3. An unsupervised feature selection algorithm based on HSIC Lasso, named UHSIC-Lasso, is proposed. The algorithm introduces a nonlinear Lasso regression model, HSIC Lasso, which transforms the feature selection problem into a continuous optimization problem. First, the algorithm uses a set of kernel functions to select the non-redundant features carrying the most information. Second, it exploits the robustness and sparsity brought by the l1-norm constraint to improve the algorithm. Finally, it efficiently computes the globally optimal solution by solving a Lasso optimization problem. In addition, the algorithm has a clear statistical interpretation: according to HSIC, it finds the minimally redundant features that have the greatest dependence on the output value. The effectiveness of the UHSIC-Lasso algorithm is verified by simulation experiments on UCI data sets.
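To make the UFSHSIC idea concrete, the following is a minimal sketch of an HSIC-based backward elimination, assuming Gaussian kernels and the biased empirical HSIC estimator tr(KHLH)/(n-1)^2. The function names, the kernel bandwidth, and the greedy scoring rule are illustrative assumptions for exposition, not the thesis's exact procedure; in particular, the whole data matrix X plays the role of the "label", as the abstract describes.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix over the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC estimator: tr(K H L H) / (n - 1)^2."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    K = gaussian_kernel(X, sigma)
    L = gaussian_kernel(Y, sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def select_features_backward(X, k, sigma=1.0):
    """Illustrative backward elimination: use the full data X as the
    pseudo-label and repeatedly drop the feature whose removal keeps
    HSIC(remaining subset, X) highest, until k features remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > k:
        scores = [hsic(X[:, [g for g in remaining if g != f]], X, sigma)
                  for f in remaining]
        remaining.pop(int(np.argmax(scores)))   # best score = least loss
    return sorted(remaining)
```

With a duplicated (fully redundant) feature, this greedy criterion prefers to drop one of the duplicates, since removing it barely changes the dependence between the subset and the overall sample structure.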
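The continuous-optimization view behind UHSIC-Lasso can be sketched as follows, under stated assumptions: Gaussian kernels, the centered Gram matrix of the whole data as the unsupervised pseudo-label, and a simple non-negative coordinate-descent solver for the l1-penalized least-squares problem. All names and the solver choice here are illustrative, not the thesis's implementation.

```python
import numpy as np

def centered_gram(v, sigma=1.0):
    """Centered Gaussian Gram matrix of one feature vector, normalised to
    unit Frobenius norm (as in the HSIC Lasso formulation)."""
    d2 = (v[:, None] - v[None, :]) ** 2
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    n = len(v)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    return Kc / np.linalg.norm(Kc)

def uhsic_lasso(X, lam=0.05, sigma=1.0, n_iter=200):
    """Sketch of unsupervised HSIC Lasso: regress the centered Gram matrix
    of the whole data (the pseudo-label) on per-feature Gram matrices with
    a non-negative l1 penalty, via coordinate descent. Returns one weight
    per feature; nonzero weights mark selected features."""
    n, d = X.shape
    # design matrix: each column is a vectorised per-feature Gram matrix
    A = np.stack([centered_gram(X[:, j], sigma).ravel() for j in range(d)],
                 axis=1)
    # pseudo-label: Gram matrix of all features, centered and normalised
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    L = np.exp(-d2 / (2.0 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n
    Lc = H @ L @ H
    y = (Lc / np.linalg.norm(Lc)).ravel()
    alpha = np.zeros(d)
    col_sq = np.sum(A ** 2, axis=0)
    for _ in range(n_iter):
        for k in range(d):
            # residual with feature k's contribution removed
            r = y - A @ alpha + A[:, k] * alpha[k]
            # soft-threshold, clipped at zero (non-negativity constraint)
            alpha[k] = max(0.0, (A[:, k] @ r - lam) / col_sq[k])
    return alpha
```

Because the objective is a convex Lasso problem, any solver reaching its optimum attains the global solution, which is the statistical-interpretation point the abstract makes: the nonzero weights pick minimally redundant features with maximal HSIC dependence on the output.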