Study On Feature Subset Selection Algorithm And Its Recommendation Method

Posted on: 2018-04-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G T Wang
GTID: 1318330533451681
Subject: Computer Science and Technology
Abstract/Summary:
In the field of data mining, there is a famous principle: "Garbage In, Garbage Out". The reliability of mined results therefore depends on the quality of the collected data, and one critical factor affecting data quality is the set of features used to describe the data. Feature subset selection (or feature selection), which selects a set of features closely related to the learning target, can be used to improve data quality. This dissertation proposes novel feature subset selection algorithms and methods for recommending feature selection algorithms for a given problem.

Feature interaction is a non-negligible issue in feature selection. However, most existing feature selection algorithms deal only with irrelevant and redundant features. This dissertation first proposes two novel algorithms that also take feature interaction into account.

For data sets whose dimensionality is not too high, a novel feature selection algorithm, FEAST (Feature subset sElection Algorithm based on aSsociaTion rule mining), is proposed. The algorithm first mines two kinds of constrained association rules, classification association rules and atomic association rules. It then uses the classification association rules to detect relevant and interactive features and the atomic association rules to remove redundant features. Because the support and confidence thresholds are two important parameters in association rule mining and play a vital role in FEAST, a partial least squares regression (PLSR) based threshold prediction model is presented as well. The results on synthetic data sets show that FEAST can effectively identify irrelevant and redundant features while taking feature interaction into account. The results on 35 real-world data sets show that FEAST outperforms other commonly used feature selection algorithms in terms of the average classification accuracy of four well-known classifiers: Naïve Bayes, C4.5, PART and IB1. The results also show that the PLSR-based threshold prediction model works well in recommending proper thresholds for FEAST.

Because the time complexity of FEAST is high, it is not suitable for high-dimensional data. A propositional FOIL (First Order Inductive Learner) rule based feature selection algorithm, FRFS (FOIL Rule based Feature subset Selection algorithm), is therefore proposed. The algorithm introduces a constraint into the process of FOIL rule mining to exclude redundant features while retaining feature interaction. It also identifies and removes irrelevant features by evaluating features with a newly proposed metric, Cover Ratio. The effectiveness of FRFS is tested extensively on both synthetic and high-dimensional data sets. The results on synthetic data sets show that FRFS can effectively identify irrelevant and redundant features while preserving the interactive ones. The results on 35 high-dimensional data sets demonstrate that, compared with representative feature selection algorithms, FRFS not only significantly improves the performance of the four classifiers (Naïve Bayes, C4.5, PART and IB1) in terms of average classification accuracy, but is also efficient on high-dimensional data, with a speedup factor that reaches at least 10 times over the existing algorithms.

For a given feature selection problem, different feature selection algorithms may perform differently. According to the No Free Lunch theorem, no single algorithm performs well on all problems, so it is necessary to determine how to pick the proper algorithms for a given problem. In data mining, meta-learning explores the relationship between the characteristics of problems and the performance of candidate algorithms, and uses this relationship to choose proper algorithms for a given problem. This dissertation therefore first proposes a meta-learning based method for automatically recommending feature selection algorithms. The method first identifies the data sets nearest to the given data set, then ranks the candidate algorithms by their performance on these nearest data sets, and finally recommends the top-ranked algorithms as the appropriate ones for the given data set. The method evaluates the performance of candidate algorithms with a user-oriented multi-criteria metric that takes into account not only the classification accuracy over the selected features, but also the runtime of feature selection and the number of selected features. The method is tested extensively on 115 real-world data sets and 22 feature selection algorithms. The results show that the recommendation method is effective, with a recommendation hit ratio that reaches at least 90%.

Meta-learning based recommendation methods can usually be distinguished along two dimensions: the meta-features, a set of measures used to characterize a data set, and the meta-target, which represents the relative performance of the candidate algorithms. Existing recommendation methods usually treat the meta-target as a single label or as a ranked list of the candidate algorithms, and a ranking-based method usually cannot tell us which of the ranked algorithms should actually be recommended. However, both theoretical analysis and experimental results demonstrate that it is natural to treat the meta-target as multi-label, since several algorithms may be appropriate for a given problem and the number of appropriate algorithms varies across data sets. A novel multi-label learning based algorithm recommendation method is therefore proposed. The multi-label learning based method is compared with the existing single-label and ranking-based methods on the 115 data sets, five different kinds of data set characteristics and 22 feature selection algorithms. The results demonstrate that the multi-label learning based method is more effective in terms of average recommendation hit ratio across the different kinds of data set characteristics.
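
The nearest-data-set recommendation procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the function names, the use of Euclidean distance over meta-features, the averaging of scores, and the single `performance` score per algorithm (standing in for the multi-criteria metric) are all assumptions made for illustration.

```python
import math


def nearest_datasets(target, meta_features, k):
    """Return the names of the k data sets closest to `target`.

    Assumption: closeness is Euclidean distance between meta-feature vectors.
    """
    by_distance = sorted(
        (math.dist(target, feats), name) for name, feats in meta_features.items()
    )
    return [name for _, name in by_distance[:k]]


def recommend(target, meta_features, performance, k=2, top_n=1):
    """Rank candidate algorithms by mean score on the k nearest data sets.

    `performance[dataset][algorithm]` is a hypothetical scalar score
    (higher is better) standing in for a multi-criteria metric.
    """
    neighbours = nearest_datasets(target, meta_features, k)
    algorithms = sorted(next(iter(performance.values())))
    mean_score = {
        alg: sum(performance[d][alg] for d in neighbours) / len(neighbours)
        for alg in algorithms
    }
    # Sort by descending mean score; ties are broken by algorithm name.
    ranked = sorted(algorithms, key=lambda alg: (-mean_score[alg], alg))
    return ranked[:top_n]
```

For example, given meta-feature vectors for a few historical data sets and a table of scores per algorithm, `recommend(new_vector, meta_features, performance, k=2, top_n=1)` returns the single algorithm with the best average score on the two most similar data sets.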
Keywords/Search Tags: Feature Subset Selection, Algorithm Recommendation, Meta-learning, Multi-label Learning