
Research On Feature Selection Algorithm Based On Multi-label

Posted on: 2022-01-26    Degree: Doctor    Type: Dissertation
Country: China    Candidate: P Zhang    Full Text: PDF
GTID: 1488306728982419    Subject: Computer application technology
Abstract/Summary:
Traditional supervised learning deals with single-label data, where each instance is associated with only one class label. In many learning tasks, however, considering only the single-label data structure is neither comprehensive nor applicable, because real-life objects may carry multiple semantics and meanings simultaneously. With the diversity of data collected in modern applications, a large amount of multi-label data has become available, such as multi-topic text classification data and multi-semantic image annotation data. The main characteristic of multi-label data is that one instance is related to multiple class labels simultaneously. Multi-label learning predicts the proper label sets for unseen instances by training a classification model on multi-label data, and its classification performance is closely related to the quality of the input data. Faced with high-dimensional multi-label data, multi-label learning inevitably suffers from the curse of dimensionality. High-dimensional multi-label data sets often contain many redundant and irrelevant features, which increase the computational burden and make models prone to over-fitting, resulting in poor classification performance. To address these problems, research on multi-label feature selection has attracted increasing attention and has become a frontier topic and research hotspot. The task of multi-label feature selection is to eliminate redundant and irrelevant features while retaining useful features that provide more classification information for learning. Multi-label feature selection chooses the subset of features carrying the most classification information, which provides high-quality input data for the multi-label learning model. An effective multi-label feature selection algorithm can reduce the computational cost of the multi-label learning task and improve classification performance.

Existing information-theoretic multi-label feature selection algorithms propose many effective feature evaluation criteria, but they still have several issues when evaluating feature relevance:
1. Existing algorithms use the accumulated mutual information between a candidate feature and each label to measure feature relevance, which ignores the impact of label redundancy on feature relevance evaluation (contrasted in the sketch after this list).
2. In the feature relevance measurement of existing algorithms, the different effects of label relationships on features, and the dynamic changes of label relationships when measuring different candidate features, are not distinguished.
3. In the process of feature evaluation, the maximum contribution of labels with a supplementary relationship is not considered, and the effect of the key label that provides the maximum supplementary information is ignored.
4. Existing information-theoretic multi-label feature selection algorithms approximate high-order feature relevance with low-order mutual information, but they do not establish a theoretical underpinning and guarantee for this low-order approximation.
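To make the first issue concrete, the following minimal Python sketch contrasts the accumulated-mutual-information relevance used by many existing criteria with a conditional-mutual-information variant that discounts label redundancy. It is an illustrative sketch only, assuming discretized features and 0/1 label indicators; the function names are hypothetical and are not the dissertation's notation.

```python
# Illustrative sketch (assumed helper names, not the dissertation's code):
# contrast "accumulated MI" relevance, which ignores label redundancy, with a
# conditional-MI relevance that conditions each label on the other labels.
import numpy as np
from sklearn.metrics import mutual_info_score  # I(X; Y) for discrete variables

def conditional_mutual_info(x, y, z):
    """I(x; y | z) for discrete 1-D arrays: average I(x; y) within each value of z."""
    cmi = 0.0
    for value in np.unique(z):
        mask = (z == value)
        cmi += mask.mean() * mutual_info_score(x[mask], y[mask])
    return cmi

def accumulated_relevance(x, Y):
    """Sum of I(x; y_j) over labels: the criterion that ignores label redundancy."""
    return sum(mutual_info_score(x, Y[:, j]) for j in range(Y.shape[1]))

def redundancy_aware_relevance(x, Y):
    """Sum of I(x; y_j | y_k) over ordered label pairs: discounts information
    about y_j that another label y_k already carries."""
    n_labels = Y.shape[1]
    return sum(
        conditional_mutual_info(x, Y[:, j], Y[:, k])
        for j in range(n_labels)
        for k in range(n_labels)
        if j != k
    )
```

In a greedy forward search, the second measure would replace the first as the relevance term, typically combined with a standard redundancy penalty between the candidate feature and the already selected features.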
This paper studies the above four issues, which are ignored by existing feature selection algorithms for multi-label data. Focusing on the selection of high-quality feature subsets for multi-label learning tasks, four feature selection algorithms are proposed. The main contributions and innovations are as follows:

1. A feature selection algorithm distinguishing two types of labels (LRFS) is proposed. First, two label relationships, label independence and label dependence, are analyzed. Second, a feature relevance measure based on label redundancy is proposed, which considers the impact of the two label relationships and uses conditional mutual information to evaluate the importance of candidate features. Finally, a new feature evaluation criterion is designed to select a feature subset that is highly related to the label set.

2. A multi-label feature selection algorithm based on label supplementation (LSMFS) is proposed. First, feature-based additional information is defined to calculate all the additional information provided, for a feature and each label, by all other labels with a supplementation relationship. Second, a new feature relevance measure based on this additional information is proposed, which combines the information provided by the feature alone with the additional information captured from all other labels. Finally, a feature selection evaluation function based on label supplementation is proposed.

3. A multi-label feature selection algorithm considering maximum label supplementation (MLSMFS) is proposed, which improves the LSMFS algorithm. First, conditional mutual information and a maximum operation are used to capture the maximum additional information provided by the key label. Then, a feature relevance measure based on this maximum additional information is proposed. Finally, a reasonable feature evaluation criterion is designed to measure the importance of each feature.

4. A multi-label feature selection algorithm considering joint mutual information and interaction weight (MFSJMI) is proposed. First, two underlying assumptions about the high-order label distribution are identified: the Label Independence Assumption (LIA) and the Paired-label Independence Assumption (PIA). Second, by analyzing the strengths and weaknesses of the two assumptions, joint mutual information is introduced to accommodate a more realistic label distribution. Furthermore, by decomposing the joint mutual information, an interaction weight is proposed to account for multiple label correlations. Finally, a new algorithm considering joint mutual information and the interaction weight is proposed (a sketch of the pairwise joint-information idea appears below).

The four proposed feature selection algorithms are evaluated on many real-world multi-label data sets. The experimental results show that the proposed algorithms achieve excellent classification performance on multiple evaluation metrics. The theory behind these algorithms enriches research on feature selection and promotes the development of feature selection technology, so these studies have important theoretical significance. In addition, the algorithms can be used directly in the preprocessing stage of multi-label learning tasks to process collected high-dimensional data, providing high-quality input for the subsequent model learning stage. Therefore, these studies also have considerable practical value.
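As a hedged illustration of the fourth contribution's starting point, the sketch below computes a pairwise joint-mutual-information relevance I(f; y_j, y_k) by encoding each binary label pair as a single discrete variable. This is an assumed reading of the idea, not the MFSJMI criterion itself; in particular, the interaction weight obtained by decomposing the joint mutual information is omitted, and all names are illustrative.

```python
# Hedged sketch of pairwise joint mutual information I(f; y_j, y_k) for a
# candidate feature f with 0/1 label indicators; names are illustrative and
# this is not the MFSJMI evaluation criterion itself.
from itertools import combinations
from sklearn.metrics import mutual_info_score

def joint_mutual_info(x, yj, yk):
    """I(x; yj, yk): encode the binary label pair (yj, yk) as one discrete variable."""
    joint_label = 2 * yj + yk  # maps (0,0)->0, (0,1)->1, (1,0)->2, (1,1)->3
    return mutual_info_score(x, joint_label)

def pairwise_joint_relevance(x, Y):
    """Average I(x; y_j, y_k) over all label pairs: a relevance measure that keeps
    paired-label interactions instead of assuming label independence."""
    pairs = list(combinations(range(Y.shape[1]), 2))
    return sum(joint_mutual_info(x, Y[:, j], Y[:, k]) for j, k in pairs) / len(pairs)
```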
Keywords/Search Tags: Supervised learning, Multi-label learning, Multi-label feature selection, Information theory, Feature relevance, Label relationships