
Implications Of P Value Variability And Other Factors For Statistical Feature Selection

Posted on: 2019-07-28    Degree: Master    Type: Thesis
Country: China    Candidate: Wei Wang    Full Text: PDF
GTID: 2370330596967133    Subject: Pharmacy
Abstract/Summary:
Statistical feature-selection techniques are widely used in comparative genomics and proteomics. Feature selection refers to the process of identifying meaningful variables (e.g. genes or proteins) in high-dimensional omics data, with the expectation that the selected features support biomarker identification, drug-target prediction, phenotype characterization and the assessment of treatment potential. For commonly used statistical feature-selection methods (e.g. Student's t-test (t-test), the Wilcoxon rank-sum test (U test) and Limma), the P value is routinely taken as the indicator of significance. However, recent studies indicate that Student's t-test P values are highly variable.

To explore this further, we examined the variability of the P value and the effect size for the t-test and the U test on simulated data covering a variety of distributions (normal, Poisson, exponential and mixed) and a range of sample sizes. For both tests, we find that the estimated effect size converges to the true effect size as the sample size increases, whereas the variability of the P value does not shrink, even in high-power scenarios (large effect size and/or large sample size). Because the P value is extremely unstable and predicts the effect size poorly, selecting features (i.e. genes or proteins) by P value ranking can be a poor strategy. We verified this by applying the t-test to a real dataset with two classes (12 normal and 12 renal-cancer samples) and selecting the top 500 features by P value ranking; the resulting top-500 list was highly unstable. Moreover, P value variability has implications for multiple-testing analysis: it may, for example, yield irreproducible features without diagnostic power, undermining the reproducibility of scientific findings.

P value instability is inevitable, because variability is an intrinsic property of the P value; how to manage that variability is therefore the more important and useful question. We applied the t-test to numerous simulated scenarios with varying effect sizes and sample sizes to observe P value variability. Our results indicate that, as the sample size increases, the power rises and the inferred effect size approaches the true effect size; interestingly, the P value variability does not decrease, and it even grows with increasing effect size. This suggests that the effect size is sometimes more helpful for deciding whether a gene or protein is truly differential, and that other metrics, such as confidence intervals and cross-validation accuracies, should be considered alongside P values to reach an accurate decision. On the other hand, P value instability can be reduced by increasing power. Larger sample sizes and higher measurement accuracy both increase power, but in practice they are often unattainable; other avenues, such as Signal Boosting Transformation techniques and Network-based Statistical Testing, can also raise power substantially.
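The simulation design described above can be made concrete with a minimal sketch in Python (scipy); the normal-distribution scenario, the chosen true effect size of 0.8 and the Cohen's d estimator are illustrative assumptions, not the exact settings used in the thesis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def cohens_d(x, y):
    """Pooled-standard-deviation effect size estimate (Cohen's d)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (y.mean() - x.mean()) / np.sqrt(pooled_var)

def simulate(n_per_group, true_d, n_rep=1000):
    """Repeatedly draw two normal samples separated by true_d and record
    t-test P values, U-test P values and the estimated effect size."""
    t_p, u_p, d_hat = [], [], []
    for _ in range(n_rep):
        x = rng.normal(0.0, 1.0, n_per_group)
        y = rng.normal(true_d, 1.0, n_per_group)
        t_p.append(stats.ttest_ind(x, y).pvalue)
        u_p.append(stats.mannwhitneyu(x, y, alternative="two-sided").pvalue)
        d_hat.append(cohens_d(x, y))
    return np.array(t_p), np.array(u_p), np.array(d_hat)

for n in (10, 50, 200):
    t_p, u_p, d_hat = simulate(n, true_d=0.8)
    print(f"n={n:4d}  d_hat = {d_hat.mean():.2f} +/- {d_hat.std():.2f}  "
          f"sd[-log10 P(t)] = {np.std(-np.log10(t_p)):.2f}  "
          f"sd[-log10 P(U)] = {np.std(-np.log10(u_p)):.2f}")
```

In runs of this kind, the spread of the estimated d shrinks steadily as the sample size grows, while the spread of -log10(P) remains wide, which is the pattern summarized above.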
Finally, several easily overlooked factors affect the performance of statistical feature selection in high-throughput data: the upstream processing (normalization methods), the downstream processing (multiple testing corrections) and the varying heterogeneity across datasets. To illustrate this point, we compared every combination of five univariate statistical feature-selection methods (t-test, U test, Limma, Rank Product and the Kolmogorov-Smirnov test) and various datasets (simulated and real) under different normalization techniques and multiple testing corrections. We further investigated how normalization methods and multiple testing corrections affect the performance of the five methods on data with varying degrees of heterogeneity. Our findings demonstrate that normalization methods strongly influence high-throughput data analysis, that multiple testing corrections reduce sensitivity, and that the performance of statistical feature-selection methods is strongly confounded by data heterogeneity. These factors should therefore be taken into account when benchmarking statistical feature-selection methods.
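As a hypothetical illustration of one such comparison, the sketch below runs a per-feature t-test on a simulated two-class expression matrix with a known set of truly shifted features and contrasts the sensitivity of raw P value thresholding with Benjamini-Hochberg (BH) correction; the matrix dimensions, the shift of 1.0 and the use of statsmodels' multipletests are assumptions chosen for illustration, not the thesis settings.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)

# Simulated expression matrix: 2000 features, 12 samples per class;
# only the first 100 features carry a true shift between the classes.
n_feat, n_per_class, n_true = 2000, 12, 100
a = rng.normal(0.0, 1.0, (n_feat, n_per_class))
b = rng.normal(0.0, 1.0, (n_feat, n_per_class))
b[:n_true] += 1.0

# Per-feature univariate t-test, vectorised over the feature axis.
pvals = stats.ttest_ind(a, b, axis=1).pvalue

# Sensitivity (recall on the truly shifted features) before and after
# Benjamini-Hochberg multiple testing correction.
raw_hits = pvals[:n_true] < 0.05
bh_reject = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]
print("sensitivity, raw P < 0.05:", raw_hits.mean())
print("sensitivity, BH-adjusted :", bh_reject[:n_true].mean())
```

Swapping the normalization applied to the matrix, the correction method, or the univariate test itself in this template is one way the combinations described above can be benchmarked.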
Keywords/Search Tags:Statistical feature-selection method, P value, Variability, Reproducibility, Normalization methods, Multiple testing correction, Heterogeneity