
Research On Feature Selection Method Based On Similarity

Posted on: 2018-12-01    Degree: Master    Type: Thesis
Country: China    Candidate: L Qi    Full Text: PDF
GTID: 2428330596454805    Subject: Software engineering

Abstract/Summary:
With the rapid development of information technology and the spread of the Internet, high-dimensional data now pervades daily life and has become an important part of it. Because the dimensionality of such data is so high, extracting the information we need precisely and quickly is increasingly important. Since feature selection is of great value for reducing the dimensionality of high-dimensional data, it is receiving more and more attention, and research on it is well worth pursuing. This thesis studies feature selection methods based on similarity.

First, we introduce the basic concepts of feature selection and, according to whether the data is labeled, classify feature selection methods into three categories: supervised, unsupervised, and semi-supervised. Then, by examining five traditional similarity-based feature selection methods, we observe a common phenomenon in the selection process: highly correlated features are repeatedly selected. It is well known that redundant features can harm the performance of classification and clustering, so they should be removed during feature selection to improve learning performance.

Motivated by this problem, we introduce the similarity preserving feature selection framework SPFS. Through theoretical analysis, we show the relationship between the proposed framework and existing feature selection algorithms related to similarity preservation. On the basis of the SPFS framework, three optimization algorithms, SPFS-SFS, SPFS-NES, and SPFS-LAR, are proposed to overcome the existing methods' shortcomings in handling redundant features. Finally, the strengths and weaknesses of the different algorithms in dealing with redundant features are compared in both supervised and unsupervised learning environments. The experimental results show that the algorithms based on the SPFS framework can improve classification accuracy by 7
percentage points and reduce the redundancy rate by 16 percentage points, demonstrating that these algorithms perform well across a variety of learning environments.

Finally, since most data carries only a few labels, we propose a semi-supervised feature selection algorithm based on attribute dependency to make better use of those few labels during data preprocessing. We combine the Laplacian score, an unsupervised feature selection criterion, with the Constraint score computed from pairwise constraints, and introduce an attribute dependency matrix into the data reconstruction process. Using the pairwise constraints, we calculate the average mutual information between each feature and its influence on the information carried by the other features, and derive an objective function that scores each feature. To optimize this objective function, we also take into account the effect of each sample point's k-nearest-neighbor locality preserving ability on feature selection. At last, we score each feature according to the objective function and select the most useful and relevant features. The experimental results show that the proposed semi-supervised feature selection method achieves high accuracy; moreover, it has lower computational complexity and better performance than supervised and unsupervised feature selection methods.
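To make the similarity-preserving idea concrete, the following is a minimal sketch of a greedy sequential forward selection in the spirit of SPFS-SFS: it picks features whose induced linear similarity matrix best aligns with a target similarity matrix. The function name, the Frobenius-inner-product objective, and the normalization step are illustrative assumptions, not the thesis's exact algorithm.

```python
import numpy as np

def similarity_preserving_sfs(X, K, k):
    """Greedily select k features (columns of X) whose induced similarity
    matrix X_S X_S^T best aligns with a target similarity matrix K,
    measured by the normalized Frobenius inner product <S, K> / ||S||.
    An illustrative sketch, not the thesis's exact SPFS-SFS."""
    n, d = X.shape
    selected = []
    remaining = list(range(d))
    for _ in range(k):
        best_j, best_score = None, -np.inf
        for j in remaining:
            Xs = X[:, selected + [j]]
            S = Xs @ Xs.T                        # similarity induced by the subset
            score = np.sum(S * K)                # Frobenius inner product <S, K>
            score /= np.linalg.norm(S) + 1e-12   # normalize; discourages redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

The normalization matters for redundancy: adding an exact duplicate of an already-selected feature scales S without improving the normalized alignment, so the greedy step prefers complementary features.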
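The semi-supervised scoring described above can be illustrated with a small sketch that combines the standard Laplacian score with a pairwise-constraint score. The weighted-sum combination, the parameter alpha, and the function names are assumptions for illustration; the thesis's attribute dependency matrix and exact objective function are not reproduced here.

```python
import numpy as np

def laplacian_score(X, W):
    """Laplacian score per feature: smaller means the feature better
    preserves the local neighborhood structure encoded by the graph W."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    d = np.diag(D)
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f = f - (f @ d) / d.sum()      # remove the degree-weighted mean
        scores.append((f @ L @ f) / (f @ D @ f + 1e-12))
    return np.array(scores)

def constraint_score(X, must_link, cannot_link):
    """Constraint score per feature: features that keep must-link pairs
    close and cannot-link pairs far apart receive lower scores."""
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        ml = sum((f[i] - f[j]) ** 2 for i, j in must_link)
        cl = sum((f[i] - f[j]) ** 2 for i, j in cannot_link) + 1e-12
        scores.append(ml / cl)
    return np.array(scores)

def semi_supervised_score(X, W, must_link, cannot_link, alpha=0.5):
    """Combine both criteria (lower is better). The weighted-sum form
    and alpha are illustrative, not the thesis's exact objective."""
    return alpha * laplacian_score(X, W) + \
        (1 - alpha) * constraint_score(X, must_link, cannot_link)
```

On data where one feature separates two clusters and another is noise, the informative feature receives the lower combined score, which is the behavior the semi-supervised criterion is designed to reward.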
Keywords/Search Tags: feature selection, similarity preserving, semi-supervised, attribute dependency