Selection Of Splitting Variable In CART

Posted on:2020-07-24

Degree:Master

Type:Thesis

Country:China

Candidate:S L Zhang

Full Text:PDF

GTID:2427330623458276

Subject:Statistics

Abstract/Summary:

The classification and regression tree (CART) model is favored in many fields of scientific research for its readability and classification efficiency. However, on real datasets with complex feature distributions, the classical CART algorithm suffers from low efficiency and poor classification accuracy when selecting features for the tree model, which motivates further research on the selection of CART splitting variables. This paper first reviews the development and theory of decision tree algorithms and, drawing on the characteristics of high-dimensional data, illustrates the limitations of the CART algorithm and the need for feature selection.

After defining three feature selection methods, the paper studies each in detail: a simple filter method based on statistical characteristics, a filter method based on analysis of variance, and a bagging method based on random forests. As an empirical study, the methods were tested on an acute lymphoblastic leukemia dataset of gene microarray data. The three feature selection methods were used to select the 30 top-ranked genes by importance from the original feature set of 12625 genes as the final feature set.

To select CART splitting variables for high-dimensional problems, the paper combines repeated cross-validation with nested cross-validation to propose an improved hierarchical cross-validation CART algorithm. Experimental validation was performed on the gene microarray dataset. The improved CART reaches a classification accuracy of 0.85 on high-dimensional data with 3937 features, higher than the 0.82 of a CART built on 30 features. The improved CART is therefore suitable for high-dimensional data and improves classification accuracy.

To select CART splitting variables for conventional datasets, the paper combines a distance metric D(xi) between features and classification categories with the Gini coefficient to obtain an improved index GD(S, xi = ximi) for selecting splitting variables, and uses grid search to determine the optimal weights. It thereby proposes an improved CART model based on grid search and compares it with nine commonly used classification algorithms on the heart disease dataset from the UCI database. The improved CART classification model reaches an accuracy of 0.94 on the heart disease dataset, higher than that of the nine other classification algorithms, which include CART, the multilayer perceptron, the Bernoulli Bayesian algorithm, logistic regression, and the support vector machine. The paper thus innovates in two respects, dataset processing and the feature selection index of the CART algorithm, and proposes two improved algorithms that are useful for improving CART in practical classification settings.
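The idea of blending a Gini-based criterion with a feature-to-class distance term, with the blend weight chosen by grid search, can be sketched roughly as follows. The abstract gives neither the formula for GD nor the definition of D(xi), so the linear blend `w * gini_gain + (1 - w) * distance`, the `class_mean_distance` stand-in, the single-split tree, the toy data, and all function names here are illustrative assumptions and not the thesis's actual method.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(xs, ys):
    """Return (gini_gain, threshold) of the best binary split x <= t."""
    parent = gini(ys)
    best_gain, best_t = 0.0, None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between adjacent values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if parent - child > best_gain:
            best_gain, best_t = parent - child, t
    return best_gain, best_t

def class_mean_distance(xs, ys):
    """ASSUMED stand-in for D(xi): largest gap between per-class means,
    scaled by the feature range so that it lies in [0, 1]."""
    means = [sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
             for c in set(ys)]
    spread = max(xs) - min(xs)
    return 0.0 if spread == 0 else (max(means) - min(means)) / spread

def fit_stump(columns, ys, w):
    """Pick the feature maximizing the ASSUMED blended score and return
    a one-split tree as (feature_index, threshold, left_label, right_label)."""
    scored = []
    for j, xs in enumerate(columns):
        gain, t = best_threshold(xs, ys)
        if t is not None:
            score = w * gain + (1 - w) * class_mean_distance(xs, ys)
            scored.append((score, j, t))
    if not scored:
        raise ValueError("no feature yields a split with positive Gini gain")
    _, j, t = max(scored)
    left = [y for x, y in zip(columns[j], ys) if x <= t]
    right = [y for x, y in zip(columns[j], ys) if x > t]
    return (j, t, Counter(left).most_common(1)[0][0],
            Counter(right).most_common(1)[0][0])

def accuracy(stump, columns, ys):
    """Fraction of samples the one-split tree classifies correctly."""
    j, t, left_label, right_label = stump
    preds = [left_label if x <= t else right_label for x in columns[j]]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

def grid_search_weight(train_cols, train_y, val_cols, val_y, grid):
    """Return (weight, validation accuracy) for the best weight in grid;
    ties keep the earliest weight."""
    best_w, best_acc = None, -1.0
    for w in grid:
        acc = accuracy(fit_stump(train_cols, train_y, w), val_cols, val_y)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Toy data (made up for illustration): each inner list is one feature column.
train_cols = [[1, 2, 3, 7, 8, 9], [1, 9, 2, 8, 3, 7]]
train_y = [0, 0, 0, 1, 1, 1]
val_cols = [[2.5, 6.5], [5, 5]]
val_y = [0, 1]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
best_w, best_acc = grid_search_weight(train_cols, train_y, val_cols, val_y, grid)
print(best_w, best_acc)
```

In this sketch the weight is tuned against accuracy on a held-out validation set. A full implementation would apply the blended score recursively at every node of the tree and would likely tune the weight with cross-validation.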

Keywords/Search Tags:

Decision Tree Classification, Feature Selection, Nested Repeated Cross-validation, Distance Correlation

Related items

1	Research On The Cause Analysis And Classification Of Repeated Vagrants
2	Feature Weighting Method For Binary Classification In Machine Learning
3	Research On The Application Of Decision Tree In The Employment Guidance For College Students
4	The Applied Research Of Ordinal Decision Tree In The College Students Comprehensive Quality Evaluation
5	Feature Selection And Bias-reduced Consistent Inference For Several High Dimensional Models
6	A Study Of Copula-Based Decision Tree With Applications
7	Feature Selection Based On Original Data Correlation
8	The Prediction Of Time Spend On Automobile Test System Using Regression Analysis
9	Teaching Quality Evaluation System Based On Student Performance Classification Model:the Decision Tree Method
10	The Method Of Selecting Local Feature Words And Its Application In Text Classification