Research On Variable Selection In Data Mining

Posted on: 2019-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: B H Chen
Full Text: PDF
GTID: 2428330563993056
Subject: Applied Statistics

Abstract/Summary:
In this era of information explosion, how to quickly extract the most valuable information from massive data has become a topic of wide concern, and variable selection is an indispensable part of that process. Variable selection optimizes the original sample set: it is an effective means of reducing the dimensionality of the sample and improving the performance of machine learning algorithms. Because the success of variable selection largely determines how well a model fits, variable selection techniques have attracted more and more attention. In classification and regression problems, variable selection can eliminate redundant features to a certain extent and improve the accuracy of model prediction. This thesis focuses on regression-based GDP forecasting for major cities in China. Two methods are used to screen the 17 variables in the case study, and the GDP forecasts obtained with the two methods are compared.

(1) The Lasso method, an L1 regularization technique, is used to select variables from the original sample. A support vector regression (SVR) model is built on the selected variables and compared with the SVR model built on the original sample. The model built after variable selection is found to predict more accurately than the model built on the original sample.

(2) A stepwise variable selection method based on random forest is used to rank the 17 variables in the sample by importance. The least important variables are deleted step by step, a random forest regression model is built at each step, and the best prediction model is selected at the end. The comparison shows that the model built after stepwise variable selection predicts more accurately than the random forest model built on the original sample.

(3) A comparative analysis is conducted. In this case, the two methods select different variables, and the random forest model built with the stepwise iteration method predicts better than the Lasso-SVR model. Verification analysis shows that the variables selected through the random forest model are the better set. Most of the variables selected by the two methods are strongly correlated with the dependent variable, but some variables only weakly correlated with it are also selected. The probable reasons are that only linear correlation is measured in this thesis, and that among several variables each strongly correlated with the dependent variable, the algorithm may keep only a few of them in order to avoid multicollinearity.
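
The Lasso step in (1) works because the L1 penalty can shrink some coefficients exactly to zero, so the variables left with non-zero coefficients form the selected subset. The sketch below illustrates that mechanism with a plain-Python cyclic coordinate descent. It is only an illustration under stated assumptions: the data are synthetic (not the thesis's GDP data), the function names `soft_threshold` and `lasso_coordinate_descent` are hypothetical, and the penalty value `lam=0.5` is chosen arbitrarily. The abstract does not say which software or penalty the author used.

```python
import random


def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; anything in [-lam, lam] becomes exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows and y a list of targets. Columns of X and y are
    centred internally, so no explicit intercept is fitted.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    # Mean squared value of each centred column (the curvature of each coordinate).
    z = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = [yi - y_mean for yi in y]  # residual when all coefficients are 0
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0.0:
                continue  # a constant column carries no information
            # Covariance of column j with the partial residual
            # (the residual with column j's own contribution added back).
            rho = sum(Xc[i][j] * (resid[i] + b[j] * Xc[i][j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / z[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                b[j] = new_bj
    return b


# Synthetic example (an assumption for illustration): only columns 0 and 1
# influence y; columns 2 and 3 are pure noise.
rng = random.Random(0)
n_samples = 200
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.1) for row in X]

coefs = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, c in enumerate(coefs) if c != 0.0]
print("coefficients:", [round(c, 3) for c in coefs])
print("selected columns:", selected)
```

In a workflow like the one described in (1), the columns in `selected` would then be passed to a support vector regression model. In practice the penalty is usually chosen by cross-validation with a library implementation rather than fixed by hand as it is here.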
Keywords/Search Tags: Variable selection, Lasso, Random forest, Regression analysis