
Research on Feature Selection Algorithms of High-Dimensional Samples Based on Data Characteristics

Posted on: 2022-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Yong
Full Text: PDF
GTID: 2518306485950169
Subject: Computer application technology
Abstract/Summary:
With the rapid development of emerging technologies such as the Internet, artificial intelligence and cloud computing, data in these fields are generally high-dimensional. Such data often suffer from a mismatch between feature dimensionality and sample size, and from imbalanced class distributions. To mine valuable information from these massive data, feature selection, as a data preprocessing technique, plays an increasingly important role in machine learning. For high-dimensional sample data, many feature selection algorithms can select features that are highly correlated with the labels and have low redundancy with other features. However, redundancy removal is complex, and it is easy to omit valuable features or retain redundant ones. It is also a serious problem that the important features of small classes are easily ignored. In addition, feature selection on multi-label datasets is more complicated because the label space is multi-dimensional. Taking high-dimensional sample data as the research object, this thesis studies feature selection algorithms for such data under supervised learning. The main research contents are as follows:

(1) A multi-level feature selection algorithm based on mutual information is proposed to address the problem caused by the mismatch between the high feature dimensionality and the sample size. First, according to their correlation with the label, features are divided into strongly correlated features, sub-strongly correlated features and other features. Second, after the strongly correlated features are selected, features with low redundancy are selected from the sub-strongly correlated features. Finally, features that enhance the correlation between the selected feature set and the label are selected. Experimental results show the effectiveness of the proposed algorithm.

(2) To address the problem caused by imbalanced class distribution, an ensemble learning feature selection algorithm based on small classes is proposed. First, the important features of small classes are selected from three different perspectives. Second, redundancy is removed from the three feature subsets according to the correlation between features. Finally, the three feature subsets are fused to obtain the final feature subset. Experimental results show that this algorithm can significantly improve the prediction accuracy for small classes.

(3) Considering the similarity and importance of labels, a multi-label feature selection algorithm based on similar-label clustering is proposed. First, K-means is used to cluster similar labels, yielding multiple label clusters. Second, the most important label in each label cluster is selected according to the correlation between labels. Finally, according to the correlation between the features and each label cluster, a feature subset is selected for each group, and these subsets are fused and de-redundant to obtain the final feature subset. Experimental results show that this algorithm has good classification performance.
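The three levels of the mutual-information algorithm in item (1) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract gives no formulas or thresholds, so the function names (`mutual_information`, `joint_relevance`, `multilevel_select`), the threshold values, the use of pairwise mutual information as the redundancy measure, and the toy data are all assumptions made for illustration. Features are assumed to be discrete.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X;Y), in bits, between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def joint_relevance(names, features, labels):
    """MI between the joint value of several features and the label."""
    if not names:
        return 0.0
    joint = list(zip(*(features[name] for name in names)))
    return mutual_information(joint, labels)


def multilevel_select(features, labels, strong=0.5, sub_strong=0.1,
                      max_redundancy=0.2, min_gain=1e-9):
    """Three-level selection; every threshold here is a hypothetical value."""
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    ranked = sorted(features, key=relevance.get, reverse=True)

    # Level 1: keep every strongly correlated feature.
    selected = [f for f in ranked if relevance[f] >= strong]

    # Level 2: from the sub-strong features, keep those with low redundancy
    # (assumed here: pairwise MI) against the features already selected.
    for f in ranked:
        if sub_strong <= relevance[f] < strong:
            redundancy = max(
                (mutual_information(features[f], features[s])
                 for s in selected),
                default=0.0)
            if redundancy <= max_redundancy:
                selected.append(f)

    # Level 3: from the remaining features, keep any that raises the MI
    # between the whole selected set and the label.
    for f in ranked:
        if relevance[f] < sub_strong:
            gain = (joint_relevance(selected + [f], features, labels)
                    - joint_relevance(selected, features, labels))
            if gain > min_gain:
                selected.append(f)
    return selected


# Toy data (invented for this sketch): 8 samples, binary label.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f_strong":     [0, 0, 0, 1, 1, 1, 1, 1],
    "f_redundant":  [0, 0, 1, 1, 1, 1, 1, 1],
    "f_substrong":  [0, 0, 1, 0, 1, 1, 0, 1],
    "f_complement": [0, 1, 0, 1, 0, 1, 0, 1],
    "f_const":      [0, 0, 0, 0, 0, 0, 0, 0],
}
selected = multilevel_select(features, labels)
print(selected)
```

On this toy data, `f_redundant` is rejected at the second level because it overlaps heavily with `f_strong`, while `f_complement`, which carries no information about the label on its own, is accepted at the third level because it resolves the samples that the first two selected features leave ambiguous. With so few samples, a joint mutual-information estimate of this kind overfits easily, so a practical version would likely need a stricter gain threshold or a different estimator.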
Keywords/Search Tags: Feature selection, Mutual information, High-dimensional samples, Class imbalance, Multi-label