
Research On Feature Selection Algorithm Based On Similarity

Posted on: 2022-08-03   Degree: Master   Type: Thesis
Country: China   Candidate: G B Wang   Full Text: PDF
GTID: 2518306509969759   Subject: Statistics
Abstract/Summary:
With the continuous development of the Internet era, the scale of the data we face has grown exponentially: sample sizes keep increasing, the number of features contained in each sample keeps increasing, and the number of categories a sample may belong to keeps increasing, giving rise to the "curse of dimensionality". With its great potential for data dimensionality reduction, feature selection has gradually become a research hotspot in machine learning and data mining. Similarity-based feature selection has already achieved considerable results, but some shortcomings remain, such as ignoring attribute weights, relying on a single evaluation criterion, and being unable to select the optimal feature subset according to user needs. In response to these issues, this thesis carries out the following two parts of research.

(1) For similarity-based unsupervised feature selection, attribute weights are assigned to the k nearest neighbors of a feature f, a new k-nearest-neighbor density and an average redundancy are defined, and a weighted unsupervised feature selection algorithm based on similarity-driven feature clustering is proposed. The basic idea is to select cluster centers according to the newly proposed k-nearest-neighbor density and average redundancy, to assign each remaining feature in the feature space to the cluster of a selected center following the principle of maximum between-cluster distance and minimum within-cluster distance, and finally to perform k-means clustering on the selected features. Experiments on 7 UCI data sets show that the algorithm yields better feature selection results.

(2) Aiming at classification algorithms based on feature selection and clustering and at the traditional MRMR algorithm, a new maximum-relevance minimum-redundancy feature selection algorithm is proposed. This algorithm introduces two different evaluation criteria for measuring the redundancy between two features and four different evaluation criteria for measuring the correlation between a feature and the category, from which 8 different feature selection algorithms are derived, broadening the scope of application of the algorithm. In addition, the traditional MRMR feature selection algorithm cannot perform feature selection according to the data dimension that a user actually needs, so an indicator vector is introduced to describe the user's required data dimension and a new objective function is proposed to solve for the optimal feature subset. Support vector machines are used to run experiments on the feature subsets selected from 4 UCI data sets, and the effectiveness of the algorithm is finally verified through classification accuracy and a paired one-sided t-test.
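The following is a minimal, hedged Python sketch of the part (1) idea described above: features are compared by similarity, cluster-center features are chosen by a k-nearest-neighbor density, and candidates that are redundant with already chosen centers are penalized. The similarity measure (absolute Pearson correlation), the uniform neighbor weights, and the helper names (feature_similarity, knn_density, select_features) are illustrative assumptions, not the thesis's exact definitions.

import numpy as np

def feature_similarity(X):
    # Absolute Pearson correlation between every pair of features
    # (with rowvar=False, rows/columns of the result index features).
    return np.abs(np.corrcoef(X, rowvar=False))

def knn_density(sim, k, weights=None):
    # Weighted k-nearest-neighbor density of each feature: the weighted
    # mean similarity to its k most similar other features.
    n = sim.shape[0]
    if weights is None:
        weights = np.ones(k)                      # uniform attribute weights by default
    dens = np.empty(n)
    for i in range(n):
        nn = np.sort(np.delete(sim[i], i))[-k:]   # k largest similarities, self excluded
        dens[i] = np.average(nn, weights=weights)
    return dens

def select_features(X, n_select, k=5):
    # Greedily pick cluster-center features: high k-NN density and low
    # average redundancy (similarity) to the centers already chosen.
    sim = feature_similarity(X)
    dens = knn_density(sim, k)
    selected = [int(np.argmax(dens))]             # densest feature first
    while len(selected) < n_select:
        redundancy = sim[:, selected].mean(axis=1)
        score = dens - redundancy
        score[selected] = -np.inf                 # never pick a center twice
        selected.append(int(np.argmax(score)))
    return selected

The selected columns X[:, selected] can then be fed to k-means (e.g. sklearn.cluster.KMeans) for the kind of clustering evaluation mentioned above.

Part (2) can be sketched in the same spirit as a greedy maximum-relevance minimum-redundancy loop that returns exactly the number of features d the user asks for, which is the role the abstract's indicator vector plays. The criterion pair used here, mutual information for relevance and absolute correlation for redundancy, is only one assumed combination out of the several the thesis describes, and the commented evaluation call mirrors the SVM experiments; X, y and d are assumed to be given.

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def mrmr(X, y, d):
    # Greedily pick exactly d features that maximize relevance to the class
    # labels minus mean redundancy with the features already chosen.
    relevance = mutual_info_classif(X, y)          # feature-class relevance
    corr = np.abs(np.corrcoef(X, rowvar=False))    # feature-feature redundancy
    selected = [int(np.argmax(relevance))]
    while len(selected) < d:
        redundancy = corr[:, selected].mean(axis=1)
        score = relevance - redundancy
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return selected

# Illustrative evaluation with an SVM on a user-requested dimension d:
# subset = mrmr(X, y, d=10)
# accuracy = cross_val_score(SVC(), X[:, subset], y, cv=5).mean()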
Keywords/Search Tags: unsupervised feature selection, data dimension, attribute weights, redundancy, correlation, clustering, support vector machine, feature subset