Application Of Multiple Hypothesis Testing Technology In Big Data

Posted on:2022-11-20

Degree:Master

Type:Thesis

Country:China

Candidate:H Du

GTID:2480306761463754

Subject:Computer Software and Application of Computer

Abstract/Summary:

In big data analysis, hypothesis testing is increasingly widely used. When many hypotheses are tested at once, the traditional approach does not treat the tests as a whole but tests each hypothesis separately. As a result, the probability of making at least one Type I error across the whole family of tests exceeds the chosen significance level. The rapid development of multiple hypothesis testing provides a solution to this problem. This thesis continues the study of a gene selection problem posed in earlier work. Because the original algorithm for that problem applies no multiple testing correction to its results, the selected genes are more likely to be Type I errors; that is, the false positive rate is high. This thesis improves the original algorithm by applying multiple testing procedures to the selected data under different error metrics. The results show that controlling the family-wise error rate is too strict and yields low power, whereas controlling the false discovery rate or the positive false discovery rate is better suited to this problem. Among the multiple testing algorithms compared, the q-value method takes prior information about the null hypothesis into account, so it both controls the false positive rate and achieves high power.

The thesis is divided into four parts. The first part introduces the research background and significance. The second part describes the basic theory of hypothesis testing and multiple hypothesis testing, together with three error metrics and the methods for controlling them. The third part applies multiple testing to big data. Building on the earlier gene selection problem, the improved ORIOGEN and ORIOGEN-Hetero algorithms are used to select genes under homoscedasticity and heteroscedasticity, respectively. First, two groups of experimental data are simulated in MATLAB, and the p-value of each simulated gene is calculated with the improved algorithm. Second, R is used to run the multiple testing procedures on the simulated data, with the p-value of each simulated gene as the input to each procedure. Third, the false positive rate and power of each procedure are calculated at the chosen significance level, and the results are analyzed. Finally, the control methods for each error metric are summarized and compared with the original algorithm. The fourth part summarizes the work and discusses open problems.
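The three kinds of error control named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, which used MATLAB and R and is not shown here. The function names and the example p-values are hypothetical. The q-value function estimates the proportion of true null hypotheses with a single fixed tuning value (`lam=0.5`), which is a simplifying assumption rather than the thesis's procedure.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the family-wise error rate)."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest p-value with i * alpha / m.
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank meeting its threshold
        reject[order[:k + 1]] = True    # reject everything up to that rank
    return reject

def storey_qvalues(pvals, lam=0.5):
    """q-values using a fixed-lambda estimate of pi0 (assumed simplification)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    # Estimated proportion of true nulls: share of p-values above lam, rescaled.
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))
    order = np.argsort(p)
    ranked = pi0 * m * p[order] / np.arange(1, m + 1)
    # Enforce monotonicity: q_(i) is the minimum of ranked_(j) over j >= i.
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

# Hypothetical p-values, for illustration only.
p_demo = [0.001, 0.008, 0.039, 0.041, 0.045, 0.3, 0.6, 0.7, 0.8, 0.9]
print("Bonferroni rejections:", int(bonferroni(p_demo).sum()))
print("BH rejections:        ", int(benjamini_hochberg(p_demo).sum()))
print("q-values:             ", np.round(storey_qvalues(p_demo), 3))
```

When the estimated null proportion is below 1, each q-value is no larger than the corresponding Benjamini-Hochberg adjusted p-value. This is one way to read the abstract's remark that the q-value method gains power by using information about the null hypothesis.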

Keywords/Search Tags:

big data, multiple hypothesis test, family-wise error rate, false discovery rate, q-value

Related items

1	On Three Error Measurements In Multiple Hypothesis Testing
2	Multiple Hypothesis Testing Method And Applications
3	Some Studies On Multiple Comparisons
4	Research Of False Discovery Rate Control Based On Improved SDA Method
5	The Non-parametric Estimation Of False Discovery Rate And Its Application
6	Construction And Analysis Of Knockoff Based Variable Selection Algorithm
7	Multiple Hypothesis Test Error Rate Control Analysis Of The Process
8	Applications Of Change Point Selection Process Via FDR Multiple Test Method Based On The Semi-parametric Model
9	The Study Of The False Discovery Rate For The Generalized Linear Model
10	An Optimal FDR Control Method Under The Three-group Model