
Research On Feature Selection And Feature Representation Algorithm For High-dimensional Data

Posted on: 2022-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: N Zhang
Full Text: PDF
GTID: 2518306542463164
Subject: Computer Science and Technology
Abstract/Summary:
In today's big-data context, the development of information technology and the remarkable growth of physical storage capacity have produced a constant stream of massive data sources. These data often have a very limited sample size yet contain a large number of features, leading to data redundancy and the curse of dimensionality. In machine learning tasks such as regression and clustering, only the most relevant features improve model performance, while irrelevant features such as noise, outliers, and redundant data degrade it severely. It is therefore of great significance and practical value to identify the most relevant and valuable features among thousands of candidates.

Feature selection and feature representation are powerful tools for handling high-dimensional data in machine learning. The core of feature selection is to remove the large number of irrelevant features in high-dimensional data and select the most representative and discriminative feature subset, yielding a group of features of low dimensionality but high quality. The core of feature representation is to learn a new reconstruction of the original high-dimensional data through representation learning, so that the represented data has the most appropriate internal structure and the most valuable information. In regression applications, researchers usually perform feature selection by imposing an L2,1-norm regularization constraint on the projection matrix and then use the selected features for regression. In clustering applications, researchers usually apply feature representation to the high-dimensional data in advance and then cluster the represented data.

In both applications, processing the high-dimensional data and performing the practical task are separated into two independent processes, and few existing works carry out the two simultaneously and effectively. In response to this problem, this thesis studies how to handle high-dimensional data and perform regression or clustering within a single process. The specific work includes the following two aspects:

(1) A feature selection algorithm for high-dimensional data based on an L2,0 sparse constraint. Building on previous work, the projection matrix commonly used in feature selection algorithms is decomposed into two matrices, each fulfilling its own role: feature selection is realized by a simple and effective ranking method, and class-label regression reconstruction is performed on the selected feature subset, thereby integrating feature selection and class-label regression reconstruction into a single model under the L2,0 sparse constraint. The alternating direction method of multipliers (ADMM) is used to optimize the variables of the proposed model. Compared experimentally with seven other relevant feature selection algorithms, the proposed algorithm maintains high regression accuracy even when few features are selected.

(2) A k-means clustering algorithm based on adaptive robust feature representation. Rather than processing the high-dimensional data in advance and then clustering the processed data, this thesis combines the principle of the classical k-means algorithm with the robustness of the L2,1 norm to design a k-means clustering model based on adaptive robust feature representation. The model simultaneously learns a feature representation of the high-dimensional data and performs k-means clustering on the represented data. The augmented Lagrange multiplier method is used to optimize the variables of the proposed model. Compared experimentally with other relevant clustering methods on ten datasets from different domains, the proposed model is verified to perform adaptive robust feature representation and k-means clustering jointly when facing high-dimensional data.

In summary, starting from feature selection and feature representation and combining them with two practical machine learning applications, regression and k-means clustering, this thesis studies and implements a feature selection algorithm for high-dimensional data based on the L2,0 sparse constraint and a k-means clustering algorithm based on adaptive robust feature representation. These works have practical reference value in the field of machine learning.
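To make the L2,0 idea concrete: an L2,0 constraint requires the projection matrix to have at most k nonzero rows, so each retained row corresponds to one selected feature. The sketch below is a minimal, illustrative stand-in, not the thesis's ADMM-based algorithm: it fits a ridge-regularized least-squares projection and then hard-thresholds by row norm, which is the simple ranking heuristic the constraint suggests. The function name and parameters are hypothetical.

```python
import numpy as np

def l20_feature_selection(X, Y, k, ridge=1e-3):
    """Select k features by hard-thresholding the row norms of a
    regression projection matrix (illustrative L2,0-style selection).

    X: (n, d) data matrix, Y: (n, c) targets, k: number of features.
    """
    d = X.shape[1]
    # Ridge-regularized least squares: W minimizes ||X W - Y||^2 + ridge ||W||^2
    W = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Y)
    # Each row of W corresponds to one feature; rank rows by L2 norm
    row_norms = np.linalg.norm(W, axis=1)
    # Keep the k rows with largest norm (the L2,0 hard constraint)
    selected = np.argsort(row_norms)[::-1][:k]
    return np.sort(selected)
```

On synthetic data where the targets depend on only a few features, the large-norm rows of W pick out exactly those features; the thesis's full model instead learns the selection jointly with the label-regression reconstruction.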
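The robustness mechanism of the second contribution can also be sketched. Replacing k-means' squared distances with unsquared ones (an L2,1-style loss) downweights outliers, and the resulting centroid update is a reweighted mean with weights 1/(2·||x − c||). The code below shows only that reweighting idea; the joint feature-representation learning and the augmented Lagrangian optimization of the thesis are omitted, and the function name and initialization scheme are assumptions for illustration.

```python
import numpy as np

def robust_kmeans(X, k, n_iter=50, eps=1e-8, seed=0):
    """k-means under an L2,1-style loss (sum of unsquared distances),
    optimized by iterative reweighting. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialization: spreads initial centers apart
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        dist = np.linalg.norm(
            X[:, None, :] - np.array(centers)[None, :, :], axis=2
        ).min(axis=1)
        centers.append(X[dist.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Assignment step: nearest center under Euclidean distance
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: weight 1/(2*||x - c||) follows from minimizing the
        # sum of unsquared distances, so far-away outliers count less
        for j in range(k):
            pts = X[labels == j]
            if len(pts) == 0:
                continue
            w = 1.0 / (2.0 * np.linalg.norm(pts - centers[j], axis=1) + eps)
            centers[j] = (w[:, None] * pts).sum(axis=0) / w.sum()
    return labels, centers
```

Because the weights shrink with distance, a single far-off outlier pulls a centroid much less than it would under the classical squared-loss update, which is the robustness property the L2,1 norm provides in the thesis's model.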
Keywords/Search Tags: high-dimensional data, feature selection, feature representation, regression, k-means clustering