
Causality-based Feature Selection Research

Posted on: 2024-02-26 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: X Y Wu | Full Text: PDF
GTID: 1528306932957629 | Subject: Computer application technology
Abstract/Summary:
With the continuous growth of data dimensionality, high-dimensional data poses many challenges to machine learning and data mining tasks. To reduce dimensionality while preserving the physical meaning of the original feature space, feature selection has gradually attracted the attention of researchers: by selecting a small, predictive feature subset, it can reduce computational cost and improve learning performance. However, traditional feature selection algorithms choose optimal feature subsets by evaluating the predictive efficacy of candidate subsets or by identifying correlations between variables, ignoring the underlying mechanism in the data, namely the causal relationships between features and labels. Causal feature selection was proposed to improve the interpretability of the selected features and the robustness of the prediction model; it identifies the Markov Boundary (MB) of the class attribute as the selected feature subset. In a faithful distribution, the MB of a target consists of its direct causes, its direct effects, and the other direct causes of its direct effects, so the MB reveals the local causal structure around the target variable. Under the faithfulness assumption, the MB is provably the optimal solution to the feature selection problem, and causal feature selection algorithms are therefore also called MB discovery algorithms.

In recent years, causal feature selection has received extensive attention. After years of development, numerous algorithm families have been proposed to meet different data or performance requirements, and these algorithms have been successfully applied in many real-world scenarios. Nevertheless, existing algorithms still perform poorly in some settings. On single-label data, existing algorithms make strict assumptions about the data distribution, the variable types, and the correctness of the evaluation criteria, and thus cannot handle the nonlinear mixed data found in the real world; even
on standard datasets that satisfy multiple assumptions, existing algorithms usually achieve good precision but imperfect recall, so some critical features are ignored. On multi-label data, the practicability of existing algorithms is further weakened, as they cannot account for the impact of causal relationships between labels, or of multiple MBs, on the algorithm; consequently, they cannot identify the common causal features of multiple labels or the label-specific causal features of each individual label. This dissertation tackles the above challenges, and its specific contributions are as follows.

1. Through extensive empirical studies, this dissertation discovers a type of incorrect conditional independence test, termed the PCMasking phenomenon, which is proven to cause the degradation in recall. It analyzes the underlying mechanism of the PCMasking phenomenon and proves that the phenomenon can break the symmetry between a pair of cause-effect variables. To improve the accuracy of the discovered MB, a novel causal feature selection algorithm is proposed to counter the PCMasking phenomenon. Experiments on standard Bayesian network data show that the proposed algorithm significantly improves causal feature selection performance and identifies more critical features. To further improve time efficiency, this dissertation proposes a data structure, called the Pipeline Machine, that organizes the results of conditional independence tests and thereby accelerates constraint-based MB discovery. By embedding the Pipeline Machine into the MB discovery algorithm, the proposed method improves both accuracy and efficiency.

2. This dissertation further promotes the practical application of MB learning algorithms on nonlinear mixed data, motivated by the needs of real-world applications. To identify nonlinear relationships, it associates the MB with the conditional covariance operator and proves the equivalence between the MB and the feature subset
minimizing the conditional covariance operator. Based on this theoretical result, the MB discovery process can be transformed into an optimization process that directly minimizes the conditional covariance operator. To address the estimation error of the conditional covariance operator, this dissertation proposes a more practical MB learning strategy that identifies causal features by evaluating the predictability of the mapped features in kernel space. This strategy remains feasible and effective on real-world mixed data, where variables can be continuous or discrete and the relationships between them can be linear or nonlinear, pairwise or multivariate. Extensive experiments further verify the effectiveness of these contributions.

3. This dissertation is the first to focus on MB learning for multiple targets, aiming to identify the MB of a target set and further distinguish the common MB variables shared by multiple targets from the target-specific MB variables of each target. It first studies the impact of multiple MBs on the common MB variables under unfaithful distributions, carrying out theoretical analyses both with and without considering direct relationships between targets. Using the concept of equivalent information, it characterizes the properties of the common MB variables and the target-specific MB variables. Based on these theoretical analyses, a multi-target MB discovery algorithm is proposed to identify and distinguish the two types of MB variables, and the algorithm is then extended to a multi-label causal feature selection algorithm. It is theoretically proven that the causal feature subset selected by the proposed algorithm possesses maximum relevance and minimum redundancy, and extensive experiments verify that the algorithm achieves better performance while remaining interpretable.
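The MB definition used throughout (direct causes, direct effects, and the other direct causes of those effects) is easy to make concrete when the DAG is known. A minimal Python sketch, using a hypothetical toy network rather than any data from the dissertation:

```python
def markov_boundary(parents, target):
    """Markov boundary of `target` in a DAG given as {node: set of its parents}."""
    pa = set(parents.get(target, set()))                    # direct causes
    ch = {v for v, ps in parents.items() if target in ps}   # direct effects
    # spouses: other direct causes of the direct effects
    sp = set().union(*(parents[c] for c in ch)) - {target} if ch else set()
    return pa | ch | sp

# Hypothetical toy DAG: A -> T, T -> C, B -> C, D isolated
dag = {"A": set(), "B": set(), "D": set(), "T": {"A"}, "C": {"T", "B"}}
print(sorted(markov_boundary(dag, "T")))  # ['A', 'B', 'C']
print(markov_boundary(dag, "D"))          # set(): D has no MB variables
```

Causal feature selection works in the opposite direction: the DAG is unknown, and the MB must be recovered from data, which is where conditional independence testing enters.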
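Constraint-based MB discovery issues large numbers of conditional independence tests, and the same query can recur many times; that repeated cost is what a structure organizing test results, such as the Pipeline Machine, targets. The sketch below is only a generic illustration of the idea, assuming a Fisher-z partial-correlation test and a plain memoization cache with a symmetry-normalized key; it is not the dissertation's Pipeline Machine.

```python
import math
from functools import lru_cache

import numpy as np

# Hypothetical synthetic data: Z -> X and Z -> Y, so X and Y are
# marginally dependent but independent given Z.
rng = np.random.default_rng(0)
n = 2000
Zv = rng.normal(size=n)
data = {"Z": Zv,
        "X": Zv + 0.5 * rng.normal(size=n),
        "Y": Zv + 0.5 * rng.normal(size=n)}

def fisher_z(x, y, cond):
    """Two-sided p-value of a Fisher-z partial-correlation test for x ⟂ y | cond."""
    cols = [data[x], data[y]] + [data[c] for c in cond]
    prec = np.linalg.inv(np.corrcoef(np.vstack(cols)))    # precision matrix
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

@lru_cache(maxsize=None)
def _ci(x, y, cond):
    return fisher_z(x, y, cond)

def ci_test(x, y, cond=()):
    # Normalize the key: x ⟂ y | S iff y ⟂ x | S, so both orderings
    # (and any ordering of the conditioning set) share one cache entry.
    return _ci(min(x, y), max(x, y), tuple(sorted(cond)))

print(ci_test("X", "Y"))          # very small p: marginally dependent
print(ci_test("X", "Y", ("Z",)))  # larger p: near-zero partial correlation
```

Repeating either query, in any argument order, is then a cache lookup rather than a fresh matrix inversion.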
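For contribution 3, the distinction between common and target-specific MB variables can be illustrated, naively, as set operations over already-known MBs. The dissertation's actual characterization, via equivalent information under unfaithful distributions, is substantially subtler; the labels and sets below are hypothetical.

```python
def split_mb_variables(mbs):
    """mbs: {target: set of its MB variables}. Returns (common, specific)."""
    common = set.intersection(*mbs.values())   # shared by every target's MB
    specific = {t: mb - set().union(*(m for s, m in mbs.items() if s != t))
                for t, mb in mbs.items()}      # in no other target's MB
    return common, specific

# Hypothetical MBs for three labels
mbs = {"L1": {"A", "B"}, "L2": {"A", "C"}, "L3": {"A", "B", "D"}}
common, specific = split_mb_variables(mbs)
print(sorted(common))  # ['A']
print(specific)        # L1 -> set(), L2 -> {'C'}, L3 -> {'D'}
```

Note that B, which belongs to two of the three MBs, counts as neither common nor target-specific under these naive definitions; handling such partially shared variables is part of what the theoretical analysis has to resolve.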
Keywords/Search Tags: Causality, Feature Selection, Markov Blanket, Causal Learning, Multi-label Causal Feature Selection, Bayesian Network, Markov Boundary