Font Size: a A A

The Research On Causal Feature Selection Algorithm Based On AD-tree

Posted on:2022-08-28Degree:MasterType:Thesis
Country:ChinaCandidate:C F LiuFull Text:PDF
GTID:2518306560455124Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Since the 1960 s,feature selection,as a dimensionality reduction technology that reduces the number of features,removes redundant features and noisy data,has been widely replaced in various fields of machine learning and data mining.The traditional feature selection method selects features by calculating the correlation between features and the category variables,while ignoring the causal relationship between them.The causal feature selection method discovers the local causality of categorical variables by learning the local Bayesian network(BN)structure of categorical variables,that is,Markov boundary(MB),it can make the algorithm more interpretable and robust.At the same time,under certain assumptions,the MB(i.e.feature subset)of the categorical variables is unique in the Bayesian network,and it has been proven to be the theoretical optimal solution for feature selection,which is well suited for classification tasks.Therefore,the causal feature selection method has attracted more and more scholars' attention and research.In this dissertation,causal feature selection is taken as the core research content,and the main research work is as follows.(1)In consideration of the problems of large memory overhead and high computational cost of conditional independence(CI)testing,existing causal feature selection algorithms cannot perform MB learning on a huge number of data samples,a static AD-tree based causal feature selection algorithm(BDMB)is proposed.First,the algorithm uses the AD-tree structure to completely save the data in the form of statistical information under a relatively fixed memory overhead and only needs to scan the data once.Then,by connecting the AD-tree structure with the contingency table,the algorithm makes the CI test in the MB learning process more convenient and efficient,and we present the Make CT algorithm for obtaining the contingency table from the tree structure.This dissertation analyzes the complexity and performs algorithm tracking,and verifies the effectiveness of BDMB on the benchmark data sets and real data sets.(2)In view of the existing causal feature selection algorithms are concentrated on static data,ignoring the streaming data,this dissertation proposes a dynamic AD-tree based causal feature selection algorithm(SDMB).When each data block arrives,the algorithm aggregates the new data information by updating the existing AD-tree structure,so that the increasing data are saved in a certain memory;and then perform MB learning on this basis.In the learning process,the original MB set is used as the initial set,and then each new feature is added to it,the old features are removed once until the feature subset is selected.At the same time,the combination of tree structure and contingency table can also accelerate the calculation of CI test and improve the efficiency of the algorithm.This dissertation analyzes the complexity of the algorithm and performs algorithm tracking.Finally,on benchmark datasets and real datasets,we verify the effectiveness and accuracy of MB learning by SDMB on streaming data.
Keywords/Search Tags:Feature selection, Causal feature selection, Markov boundary, Bayesian network, Streaming data
PDF Full Text Request
Related items