Research On Feature Selection Method Without Model Constraints Under Ultra High Dimensional Data

Posted on: 2020-04-16; Degree: Master; Type: Thesis
Country: China; Candidate: Y T Liu; Full Text: PDF
GTID: 2370330572991618; Subject: Probability theory and mathematical statistics
Abstract/Summary:
With continuing advances in computer technology and data mining, the data used in scientific research has grown explosively in both volume and complexity. When applied to ultra-high-dimensional data, existing data analysis and mining methods face challenges in statistical accuracy and computational cost. To process such data with existing models and methods, some form of dimensionality reduction must first be carried out. The result of variable selection directly affects the quality of statistical modeling and, in turn, the accuracy and interpretability of the model. Variable selection is therefore a crucial step in data analysis and mining in the ultra-high-dimensional setting. Sure independence screening (SIS) and its improved variants are widely used for variable selection in ultra-high-dimensional data. However, these methods perform unsatisfactorily when the relationships among variables are more complicated.

Against this background, this thesis first briefly introduces the main ideas of traditional variable selection and model selection methods, and points out the problems they encounter with ultra-high-dimensional data, such as weaker theoretical properties or exponential growth in computational cost. Second, it introduces the SIS method, which is widely used for ultra-high-dimensional data, and the maximal information coefficient (MIC), which can detect more complicated relationships between variables. We explain the main ideas and principles of SIS and analyze its most critical component, the correlation measure. The correlation measures used in several improved SIS methods are also briefly reviewed and systematically compared with MIC through examples.

The comparison shows that the correlation measures used by SIS and its improved variants have clear limitations. Even the improved methods that relax the model requirements perform well only in linear models, generalized linear models, or other models with monotonic functional relationships, whereas MIC maintains good performance across a wide variety of functional relationships. We conclude that MIC measures the dependence between variables more comprehensively than the other correlation coefficients considered.

On this basis, a MIC-based, model-free variable selection method for ultra-high-dimensional data, MIC-SIS, is proposed. In theory, it can select variables under any functional relationship and even under non-functional relationships. It is therefore a strong complement to the SIS family of methods, filling the gap left by SIS and its improved variants, which can hardly select the correct variables in models with complex functional relationships. Compared with SIS and its improved variants, MIC-SIS can handle a wider, more flexible, and more complex range of model relationships. This advantage is illustrated by the numerical simulation study and the real-data analysis in Chapter 3. The last chapter summarizes the MIC-SIS method and the main results of this thesis, points out some remaining problems, and suggests directions for future research.
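
The screening idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: `screen` ranks predictors by a marginal dependence score and keeps the top `d`, `pearson_abs` is the absolute Pearson correlation used by classical SIS, and `grid_mi_score` is a hypothetical, crude stand-in for MIC (normalized mutual information maximized over a few equal-frequency grids). The real MIC searches over grid partitions far more thoroughly. All function names, the simulated data, and the choices of `n`, `p`, `d`, and `max_bins` are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def pearson_abs(x, y):
    """Absolute Pearson correlation, the marginal score of classical SIS."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)

def equal_freq_bins(v, k):
    """Assign each value to one of k equal-frequency bins by rank."""
    order = sorted(range(len(v)), key=v.__getitem__)
    bins = [0] * len(v)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(v)
    return bins

def grid_mi_score(x, y, max_bins=4):
    """Crude MIC-like score (an assumption, NOT the real MIC algorithm):
    maximum over small equal-frequency grids of MI / log(min(kx, ky))."""
    n = len(x)
    best = 0.0
    for kx in range(2, max_bins + 1):
        bx = equal_freq_bins(x, kx)
        px = Counter(bx)
        for ky in range(2, max_bins + 1):
            by = equal_freq_bins(y, ky)
            py = Counter(by)
            joint = Counter(zip(bx, by))
            mi = sum(c / n * math.log(c * n / (px[a] * py[b]))
                     for (a, b), c in joint.items())
            best = max(best, mi / math.log(min(kx, ky)))
    return best

def screen(columns, y, score, d):
    """Rank predictors by marginal score against y; keep the top d indices."""
    scores = [score(col, y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:d]

# Simulated example (assumed setup): only predictor 0 matters, and its
# effect is non-monotonic (quadratic), so its linear correlation with y
# is close to zero in expectation.
random.seed(0)
n, p, d = 200, 50, 5
X = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(p)]
y = [X[0][i] ** 2 + 0.02 * random.gauss(0, 1) for i in range(n)]

top_pearson = screen(X, y, pearson_abs, d)
top_grid = screen(X, y, grid_mi_score, d)
print("Pearson screening keeps:", top_pearson)
print("Grid-MI screening keeps:", top_grid)
```

Keeping the dependence score as a pluggable argument mirrors the structure the abstract describes: the screening step stays the same, and only the correlation measure changes between SIS, its variants, and a MIC-based method.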
Keywords/Search Tags: MIC, SIS, Variable Screening, Ultra-high dimensional data