Application Of Multiple Hypothesis Testing Technology In Big Data

Posted on: 2022-11-20
Degree: Master
Type: Thesis
Country: China
Candidate: H Du
Full Text: PDF
GTID: 2480306761463754
Subject: Computer Software and Application of Computer
Abstract/Summary:
In big data analysis, hypothesis testing is used more and more widely. When many research objects are tested, the traditional approach does not treat the tests as a whole but tests each hypothesis separately. As a result, when the tests are considered together, the probability of making at least one Type I error is greater than the chosen significance level. The rapid development of multiple hypothesis testing provides a solution to this problem. This thesis continues the study of a gene selection problem proposed in earlier work. Because the original algorithm for that problem applies no multiple testing correction to its results, the selected genes are more likely to be Type I errors; that is, the false positive rate is high. This thesis improves the original algorithm by applying multiple testing to the selected data under different error metrics. The results show that controlling the family-wise error rate is too strict and therefore yields power that is too low, while controlling the false discovery rate or the positive false discovery rate is better suited to this problem. Compared with the other multiple testing procedures, the q-value method takes prior information about the null hypothesis into account, so it both controls the false positive rate and achieves high power.

The thesis is divided into four parts. The first part introduces the research background and significance. The second part describes the basic theory of hypothesis testing and multiple hypothesis testing, along with three error metrics for multiple testing and the methods used to control them. The third part applies multiple testing to big data. Building on the gene selection problem from earlier work, the improved ORIOGEN and ORIOGEN-Hetero algorithms are used to select genes under homoscedasticity and heteroscedasticity, respectively. First, two groups of experimental data are simulated in MATLAB, and the p-value of each simulated gene is calculated with the improved algorithm. Second, R is used to apply each multiple testing procedure to the p-values of the simulated genes. Then the false positive rate and power of the different procedures are calculated at the chosen significance level, and the results are analyzed. Finally, the control methods for each error metric are summarized and compared with the original algorithm. The fourth part summarizes the work and discusses directions for future research.
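
The contrast drawn above between strict family-wise error rate (FWER) control and false discovery rate (FDR) control can be illustrated with a minimal sketch. This is not the thesis's implementation, which uses MATLAB and R. It rests on several assumptions: the p-values are invented for illustration, the Bonferroni correction stands in for FWER control, the Benjamini-Hochberg step-up procedure stands in for FDR control, and the q-value function is a simplified form of Storey's method with a fixed tuning parameter `lam` rather than a data-driven one. All function names are hypothetical.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive when m independent
    true null hypotheses are each tested at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_reject(pvalues, alpha=0.05):
    """FWER control (assumed stand-in): reject when p <= alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """FDR control (assumed stand-in): step-up rule that rejects the
    k smallest p-values, where k is the largest rank with
    p_(k) <= k * alpha / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            cutoff = rank
    reject = [False] * m
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject


def storey_qvalues(pvalues, lam=0.5):
    """Simplified q-values: estimate the proportion of true nulls
    (pi0) from p-values above lam, then scale the BH-style ratio.
    A fixed lam is an assumption; if no p-value exceeds lam, pi0
    is estimated as 0 and every q-value collapses to 0."""
    m = len(pvalues)
    pi0 = min(1.0, sum(p > lam for p in pvalues) / ((1.0 - lam) * m))
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[idx] / rank)
        qvalues[idx] = running_min
    return qvalues


# Hypothetical p-values for ten simulated genes (illustration only).
example_pvalues = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.7, 0.9, 0.95]

if __name__ == "__main__":
    print("FWER for 20 uncorrected tests at 0.05:",
          round(familywise_error_rate(0.05, 20), 4))
    print("Bonferroni rejections:", sum(bonferroni_reject(example_pvalues)))
    print("Benjamini-Hochberg rejections:",
          sum(benjamini_hochberg_reject(example_pvalues)))
    print("q-value <= 0.05 rejections:",
          sum(q <= 0.05 for q in storey_qvalues(example_pvalues)))
```

On these made-up p-values the Bonferroni rule rejects the fewest hypotheses and the q-value rule the most, which matches the qualitative ordering the abstract reports. It says nothing about the actual false positive rates or power measured in the thesis.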
Keywords/Search Tags:big data, multiple hypothesis test, family-wise error rate, false discovery rate, q-value