
Clustering Knockoff Method Controlling The FDR For High-dimensional Selective Inference

Posted on: 2020-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Yu
GTID: 2428330572980656
Subject: Statistics
Abstract/Summary:
As society develops, the dimension of collected data keeps increasing, and more attention is being paid to extracting useful information efficiently from big data. In biostatistics and genetic research in particular, the dimension of the data is usually much higher than the number of samples. At present, most statistical methods and many machine learning algorithms apply only to low-dimensional data, while research on high-dimensional and ultra-high-dimensional settings is relatively scarce. In ultra-high-dimensional research, the dimension of the data is usually first reduced to a range that can be processed, and the next step of the calculation is then carried out. How to improve the accuracy of variable selection has therefore become an urgent problem. Multiple testing is one approach to this problem, but in recent years there has been little research in this field. The most common approach in multiple testing is to control the Type I error by controlling the family-wise error rate (FWER) or the false discovery rate (FDR), that is, to select variables such that the FWER or FDR does not exceed a given threshold.

Barber and Candès (2014) [1] first proposed the knockoff method for controlling the FDR. They found that the knockoff method is more effective than the classical BH method, and it brought some breakthroughs in multiple testing. However, this method applies only to low-dimensional data, i.e., it requires n ≥ p (at least as many samples n as variables p), which limits its usefulness in biostatistical applications.

In this paper, we propose a clustering-based knockoff method, which extends the knockoff method to ultra-high-dimensional data through clustering. First, all variables are clustered into m groups. Next, the corresponding knockoff variables are computed for each group. The computed knockoff variables are then combined and entered into a LASSO model, and W statistics are constructed from the parameters at which the original variables and their knockoff variables are selected into the model. Finally, a procedure similar to that of the original knockoff method is used to obtain the final variable selection under FDR control. We find that this method effectively controls the FDR in the ultra-high-dimensional setting and performs better than the two-stage method of Barber and Candès (2016). Finally, the method is applied to variable selection on gene microarray data.
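The last two steps described above (building W statistics and thresholding them) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so this assumes the standard choices from the knockoff literature: the signed-max statistic W_j = max(Z_j, Z~_j) · sign(Z_j − Z~_j), where Z_j and Z~_j are the LASSO penalty values at which variable j and its knockoff enter the model, and the knockoff+ threshold at target FDR level q. The function names and the input values are hypothetical and chosen only for illustration; the clustering and knockoff-construction steps are not shown.

```python
def signed_max_statistics(z_orig, z_knock):
    """Assumed form: W_j = max(Z_j, Z~_j) * sign(Z_j - Z~_j).

    z_orig[j]  : penalty value at which original variable j enters the LASSO path
    z_knock[j] : penalty value at which its knockoff copy enters
    A large positive W_j suggests the original variable entered well before its knockoff.
    """
    w = []
    for z, zk in zip(z_orig, z_knock):
        sign = (z > zk) - (z < zk)  # +1, 0, or -1
        w.append(max(z, zk) * sign)
    return w


def knockoff_plus_threshold(w, q):
    """Smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns infinity if no such t exists (nothing is selected).
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        if (1 + negatives) / max(positives, 1) <= q:
            return t
    return float("inf")


def select_variables(w, q):
    """Indices j with W_j at or above the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical entry values for 8 variables and their knockoffs.
z_orig = [5.0, 4.0, 3.5, 3.0, 2.5, 2.0, 0.5, 0.2]
z_knock = [0.1, 0.2, 0.0, 0.3, 0.1, 0.0, 1.0, 0.2]

w = signed_max_statistics(z_orig, z_knock)
selected = select_variables(w, q=0.2)
```

With these made-up inputs, the first six variables have large positive W and are selected at q = 0.2, while a stricter level such as q = 0.1 selects nothing because the "+1" in the numerator makes the estimated ratio too large for so few variables.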
Keywords/Search Tags:Multiple test, FDR, knockoff