Research On Variable Selection In Data Mining

Posted on:2019-04-22

Degree:Master

Type:Thesis

Country:China

Candidate:B H Chen

GTID:2428330563993056

Subject:Applied Statistics

Abstract/Summary:

In an era of information explosion, extracting the most valuable information from massive data quickly has become a widely studied problem, and variable selection is an indispensable part of that process. Variable selection optimizes the original sample set: it is an effective way to reduce the dimensionality of the sample and improve the performance of machine learning algorithms. Because the success of variable selection largely determines how well a model fits, variable selection techniques have attracted growing attention. In classification and regression problems, variable selection can eliminate redundant features to a certain extent and improve prediction accuracy. This thesis focuses on regression-based GDP forecasting for major cities in China. It applies two methods to screen the 17 variables in the case study and compares the resulting GDP forecasts. (1) The Lasso method, a form of L1 regularization, is used to select variables from the original sample; a support vector regression (SVR) model is built on the selected variables and compared with an SVR model built on the original sample. The model built after variable selection predicts more accurately than the model built on the original sample. (2) A stepwise variable selection method based on random forest is used to rank the 17 variables in the sample by importance; the least important variables are deleted step by step to build random forest regression models, and the best prediction model is finally selected. The comparison shows that the model built after stepwise variable selection predicts more accurately than the random forest model built on the original sample. (3) A comparative analysis is conducted. In this case the two methods select different variables, but the random forest model built with the stepwise iteration method predicts better than the Lasso-SVR model. Verification analysis shows that the variables selected through the random forest model are better, and that most of the variables selected by the two methods are strongly correlated with the dependent variable, although some variables that are only weakly correlated with it are also selected. The probable reason is that only linear correlation is measured in this thesis, and that among several variables strongly correlated with the dependent variable, the algorithm may choose only a few of them to avoid multicollinearity.
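
The Lasso screening step in (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not name the software used, and the 17-variable GDP data is not reproduced here. The synthetic data, the penalty value `alpha`, and the helper names `soft_threshold` and `lasso_coordinate_descent` are all assumptions made for illustration. The sketch fits a Lasso regression by cyclic coordinate descent and keeps only the variables whose coefficients are not driven to exactly zero by the L1 penalty.

```python
import random


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 without an intercept.

    Hypothetical helper: a plain cyclic coordinate descent, assuming the
    columns of X and the target y are already roughly centred.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that ignores w[j].
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Synthetic stand-in data (assumption): 5 candidate variables, of which only
# variables 0 and 2 actually drive the target.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
y = [3.0 * row[0] - 2.0 * row[2] + rng.gauss(0.0, 0.1) for row in X]

weights = lasso_coordinate_descent(X, y, alpha=0.2)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
print("coefficients:", [round(w_j, 3) for w_j in weights])
print("selected variable indices:", selected)
```

In a workflow like the one the abstract describes, the columns listed in `selected` would then be passed to a separate regression model (an SVR in the thesis), and its prediction error compared with a model trained on all variables.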

Keywords/Search Tags:

Variable selection, Lasso, Random forest, Regression analysis

Related items

1	Analysis And Forecast Of Mobile Phone Sales Based On Data Mining
2	Research On Adaptive Feature Selection And Parameter Optimization Algorithm For Random Forest
3	Research On Feature Screening And Regression Prediction Of Ligand Bioactivities Via Deep Learning
4	Research On Random Forest Algorithm Based On Feature Selection And Diversity
5	Study On The Theory And Application Of The Fused Lasso Penalty Model
6	Research On Regression Analysis Of Clinical Variable Based On Functional Brain Images
7	Research On Feature Selection And Classification Method Based On Random Forest For Medical Datasets
8	Application Research Of Group Lasso Penalty Regression Model And Algorithm
9	High-dimensional regression with random design, including sparse superposition codes
10	Novel Random Forest and Variable Importance Methods for Clustered Dat