Variable Selection Algorithm Based On Variable Selection Deviation

Posted on: 2017-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: S B Wang
GTID: 2308330485986007
Subject: Computer software and theory
Abstract/Summary:
With the development of big data, data redundancy keeps increasing and data are expanding into more dimensions. Extracting valuable information from highly redundant data is therefore extremely difficult, and variable selection is necessary before modeling the data. When the model is assumed to be a linear model, many variable selection algorithms are available, such as Lasso, MCP, and SCAD. The model selected by Lasso generally contains many redundant variables, and the model selected by MCP may lack some important variables. The distance between the model selected by SCAD and the potential true model (or the true model) is too large. These three variable selection algorithms are therefore barely satisfactory in some fields.

In this thesis, we introduce the concept of variable selection deviation (VSD), which measures the distance between a model and the potential true model and can be used to delete redundant variables while preserving important ones. Based on it, we introduce two algorithms: Variable Selection based on Variable Selection Deviation (VS-Based-On-VSD) and Variable Ranking based on Variable Selection Deviation (VR-Based-On-VSD). The best variable subset selected by VS-Based-On-VSD has the minimum VSD, and its symmetric difference with the potential true model is the smallest. This subset contains as little redundant information as possible while retaining as much as possible of the useful structural information hidden in the data. We also provide a method for finding the variable subset with the smallest VSD value and prove mathematically that it is globally optimal. VR-Based-On-VSD weights each variable by variable selection deviation, and the variables in its best variable subset are those whose weight is larger than a threshold value.
The selected variable subset depends on the threshold value. When the threshold value equals 0.5, the best variable subset selected by VR-Based-On-VSD is the same as the one selected by VS-Based-On-VSD. If the threshold value is less than 0.5, the variable subset selected by VR-Based-On-VSD will include more useful information that can contribute to prediction and classification of unknown samples.

A comparative analysis is carried out between the two novel algorithms and three traditional variable selection algorithms (Lasso, MCP, SCAD). When the noise level is not high, the prediction ability of VS-Based-On-VSD equals that of Lasso and is higher than that of MCP and SCAD, while its selected variable subset contains fewer redundant variables than the one selected by Lasso. The best variable subset selected by VS-Based-On-VSD is therefore closer to the potential true model than the one selected by Lasso, and VR-Based-On-VSD can effectively describe the data set.
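The relationship between the two algorithms and the 0.5 threshold can be illustrated with a small sketch. The abstract does not state the formula for VSD, so the code below assumes one plausible reading: VSD is the weighted average size of the symmetric difference between a subset and a set of weighted candidate models. The function names, the candidate models, and the weights are all hypothetical and are not taken from the thesis. Under this assumption, each variable contributes to VSD independently, so keeping exactly the variables whose weighted inclusion frequency exceeds 0.5 gives the minimum, and a lower threshold only adds variables.

```python
from itertools import combinations


def inclusion_weights(candidates, weights):
    """Weighted inclusion frequency of each variable across candidate models."""
    total = sum(weights)
    freq = {}
    for model, w in zip(candidates, weights):
        for var in model:
            freq[var] = freq.get(var, 0.0) + w / total
    return freq


def vsd(model, candidates, weights):
    """ASSUMED definition: weighted average size of the symmetric
    difference between `model` and each candidate model."""
    total = sum(weights)
    model = set(model)
    return sum(w * len(model ^ set(c)) for c, w in zip(candidates, weights)) / total


def select_by_threshold(candidates, weights, threshold=0.5):
    """Ranking-style selection: keep variables whose weight exceeds `threshold`."""
    freq = inclusion_weights(candidates, weights)
    return {var for var, p in freq.items() if p > threshold}


def brute_force_min_vsd(candidates, weights):
    """Exhaustive search over all subsets; only feasible for a few variables.
    Used here to check the threshold rule against the true minimizer."""
    variables = sorted(set().union(*candidates))
    best_score, best_subset = None, None
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            score = vsd(subset, candidates, weights)
            if best_score is None or score < best_score:
                best_score, best_subset = score, set(subset)
    return best_score, best_subset


if __name__ == "__main__":
    # Hypothetical candidate models (e.g. from different selectors) and weights.
    candidates = [{"x1", "x2", "x3"}, {"x1", "x2"}, {"x1", "x4"}]
    weights = [0.4, 0.35, 0.25]
    print(select_by_threshold(candidates, weights, 0.5))
    print(brute_force_min_vsd(candidates, weights))
    print(select_by_threshold(candidates, weights, 0.3))
```

In this toy setting the subset chosen at threshold 0.5 coincides with the exhaustive minimizer, and lowering the threshold to 0.3 yields a superset of it, which mirrors the behavior the abstract describes for VR-Based-On-VSD.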
Keywords: Variable Selection Deviation, Variable Selection, Symmetric Difference, Variable Ranking