| With the development of automatic data acquisition technology, the concept of big data has been widely accepted, and traditional statistical methods face tough challenges. For example, in medical research the number of gene expression profiles may reach tens of thousands, while the number of samples may be only several hundred. Big data contain much noise and are mixed with many explanatory variables that are independent of the response variables. Variable selection methods and dimensionality reduction techniques have therefore been developed. Variable selection methods based on penalty functions are commonly used to deal with high-dimensional data. There are three types of such methods: single variable selection, group variable selection, and bi-level variable selection. In this paper, we study the performance of variable selection for different data structures and different linear models. The research is divided into two parts as follows: 1. The variable selection methods based on penalty functions are extended to generalized linear regression. (1) Variable selection methods based on penalty functions are studied for the Logistic regression model. The methods are verified by computational simulation on six different types of data and on the Arrhythmia data set from the UCI repository. The results show that Group Bridge (a bi-level variable selection method) performs better on the Logistic regression model: it achieves higher prediction accuracy and more stable and accurate variable selection. Therefore, penalty-function variable selection methods may improve the accuracy of disease diagnosis and may assist doctors in diagnosing disease and predicting the risk status of patients. (2) Variable selection methods based on penalty functions are studied for the Cox proportional hazards regression model. The methods are verified by computational simulation on seven different data structures. With five different censoring ratios, the breast cancer data set is used as an
example. On the Cox proportional hazards regression model, Composite MCP (a bi-level variable selection method based on a penalty function) performs very well: it selects variables better than the other methods when the censoring proportion is low. 2. The penalty-function variable selection methods are applied to the analysis of high-dimensional complex genetic data, namely Quantitative Trait Loci (QTL) mapping and Genome-Wide Association Studies (GWAS). The effectiveness and feasibility of the methods are verified by computational simulation and examples. The results of Composite MCP, a bi-level variable selection method, are compared with those of the random forest method and the single variable selection method based on a penalty function. The comparison indicates that Composite MCP is a viable method for QTL mapping and genome-wide association studies: it is able to accurately locate genetic loci that are significantly associated with traits and genetic disorders. In summary, the bi-level variable selection methods based on penalty functions retain excellent properties for the Logistic regression model, the Cox proportional hazards regression model, QTL mapping, and genome-wide association studies, giving more stable and accurate variable selection and higher prediction accuracy. |
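
The abstract does not write out the Composite MCP penalty, so the following is only a minimal sketch under stated assumptions. It assumes the commonly cited formulation in which an outer MCP is applied to the sum of inner MCP penalties within each group, with the outer tuning parameter set to K_j·a·λ/2 for a group of size K_j. The function names `mcp` and `composite_mcp`, the argument names, and the default `a=3.0` are hypothetical and are not taken from the paper.

```python
import numpy as np


def mcp(theta, lam, gamma):
    """Minimax concave penalty (MCP) evaluated at theta >= 0.

    Assumed form: lam*theta - theta^2/(2*gamma) while theta <= gamma*lam,
    and the constant gamma*lam^2/2 beyond that point.
    """
    theta = np.asarray(theta, dtype=float)
    return np.where(
        theta <= gamma * lam,
        lam * theta - theta ** 2 / (2.0 * gamma),
        gamma * lam ** 2 / 2.0,
    )


def composite_mcp(beta, groups, lam, a=3.0):
    """Composite MCP penalty value (illustrative, not the authors' code).

    beta   : coefficient vector
    groups : group label for each coefficient
    lam, a : assumed tuning parameters of the inner MCP
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    total = 0.0
    for g in np.unique(groups):
        beta_g = beta[groups == g]
        # Inner level: MCP applied to each coefficient in the group.
        inner = mcp(np.abs(beta_g), lam, a).sum()
        # Outer level: MCP applied to the group total, with the outer
        # parameter assumed to be K_j * a * lam / 2.
        b = beta_g.size * a * lam / 2.0
        total += float(mcp(inner, lam, b))
    return total
```

Because both levels are concave and flatten out, a group whose coefficients are all large is penalized no more than a fixed cap, while individual coefficients inside a selected group can still be shrunk to zero. This is the sense in which such a penalty selects variables at both the group and the individual level.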