Selection Of Splitting Variable In CART

Posted on: 2020-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: S L Zhang
Full Text: PDF
GTID: 2427330623458276
Subject: Statistics
Abstract/Summary:
The classification and regression tree (CART) model is favored in many fields of scientific research for its readability and high classification efficiency. On real datasets with complex feature distributions, however, the classical CART algorithm suffers from low efficiency and poor classification accuracy when selecting features to build the tree, which motivates further research on the selection of CART splitting variables. This paper first reviews the development and theory of decision tree algorithms, and uses the characteristics of high-dimensional data to illustrate the limitations of the CART algorithm and the necessity of feature selection. After defining three feature selection methods, it studies each in detail: a simple filter method based on statistical characteristics, a filter method based on analysis of variance, and a bagging method based on random forests. As an empirical study, the methods were tested on an acute lymphoblastic leukemia gene microarray dataset. Each of the three feature selection methods was used to select the 30 top-ranked genes by importance from the original feature set of 12625 genes as the final feature set.

To select CART splitting variables for high-dimensional problems, this paper combines repeated cross-validation with nested cross-validation to propose an improved hierarchical cross-validation CART algorithm. Experimental validation was performed on the gene microarray dataset. The improved CART reaches a classification accuracy of 0.85 on high-dimensional data with 3937 features, higher than the 0.82 achieved by a CART built on 30 selected features. The improved CART is therefore suitable for high-dimensional data and improves classification accuracy.

To select CART splitting variables for conventional datasets, this paper combines the distance metric D(xi) between a feature and the class labels with the Gini coefficient to obtain an improved index GD(S, xi = ximi) for selecting splitting variables, and uses grid search to determine the optimal weights. On this basis it proposes an improved CART model based on grid search and compares it with nine commonly used classification algorithms on the heart disease dataset from the UCI database. The improved CART classification model reaches an accuracy of 0.94 on the heart disease dataset, higher than that of the other nine classification algorithms, which include CART, the multilayer perceptron, the Bernoulli Bayesian algorithm, logistic regression, and the support vector machine. In summary, this paper innovates on two aspects of the CART algorithm, dataset processing and the feature selection index, and proposes two improved algorithms that are of practical value for improving CART in real classification settings.
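For orientation, the classical criterion that the improved index builds on can be sketched as follows. This is a minimal illustration of standard Gini-based selection of a splitting variable, written in plain Python. It is not the thesis's GD index, whose exact form and weights are not given in this abstract. The function names (`gini`, `best_threshold`, `select_split_variable`) and the toy data are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Best binary split `x <= t` on one numeric feature.

    Returns (weighted_gini, threshold). The threshold is None when no
    candidate split lowers the impurity of the unsplit node.
    """
    n = len(values)
    best = (gini(labels), None)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[0]:
            best = (score, t)
    return best


def select_split_variable(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best split."""
    candidates = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        score, t = best_threshold(column, labels)
        candidates.append((score, j, t))
    score, j, t = min(candidates, key=lambda c: c[0])
    return j, t, score


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
print(select_split_variable(rows, labels))  # (0, 2.5, 0.0)
```

An index like the one described above would replace the weighted Gini score with a weighted combination of Gini and a distance term. Its weights could then be tuned by looping over a grid of candidate values and keeping the setting with the best validation accuracy.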
Keywords/Search Tags:Decision Tree Classification, Feature Selection, Nested Repeated Cross-validation, Distance correlation