Genome-Wide Association Studies Based On Random Forest

Posted on:2017-02-25

Degree:Master

Type:Thesis

Country:China

Candidate:K Cui

GTID:2180330482996148

Subject:Software engineering

Abstract/Summary:

A single nucleotide polymorphism (SNP) is caused by a mutation of a single DNA nucleotide, and SNPs can help locate genes associated with disease. With the rapid development of high-throughput genotyping techniques, SNP markers now cover the entire genome densely, making it possible to detect disease-associated SNP loci across the whole genome. A genome-wide association study (GWAS) analyses SNP loci in order to find factors associated with complex diseases. However, GWAS data are extremely high dimensional with small sample sizes, which makes identifying disease-related SNPs highly challenging. Random forests (RF) is an advanced machine learning method that has been applied to genome-wide association studies of several complex diseases. RF achieves good prediction precision on small data sets, but it is hard to build an accurate RF prediction model on large data sets. This thesis proposes TSRF (Two-Step SNP subset classification method based on Random Forest), a two-step subset classification method that can be used to build accurate prediction models for GWAS data sets. In the first step, an importance score (IS) and a p-value from the Wilcoxon rank test are calculated for each SNP. SNPs whose p-values are above a threshold are removed, and the remaining SNPs form the related SNP set. In the second step, the chi-square (χ²) test is used to calculate the statistical significance of each SNP in the related SNP set, and a second threshold is set according to the test results. The related SNPs are then partitioned into two subsets: a highly related subset and a weakly related subset. When each decision tree in the RF is built, the feature subset sampled for a node split is drawn only from these two subsets, in proportion. The final predictions therefore depend only on relevant SNPs.
TSRF effectively reduces the dimensionality of GWAS data and produces more accurate random forest models with lower generalization error while avoiding overfitting. We tested TSRF on real data for Parkinson's disease and Alzheimer's disease. Compared with the traditional RF method and the more recent GRRF and WSRF methods, the results show that when the number of case-control subjects is much smaller than the number of SNPs, TSRF performs best, with higher prediction precision and lower generalization error. The SNPs most significantly associated with Parkinson's disease that TSRF identified may provide guidance for subsequent biological validation experiments.
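
The two-step partition and the proportional feature sampling described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation. It assumes that the per-SNP p-values from both tests are already available, that both thresholds are applied to p-values, and that "in proportion" means in proportion to the sizes of the two subsets. The function names, the parameter names (`alpha1`, `alpha2`, `mtry`) and the threshold values are hypothetical.

```python
import random


def partition_snps(p_step1, p_step2, alpha1=0.05, alpha2=0.01):
    """Split SNP indices into (high, low) relevance subsets.

    p_step1: per-SNP p-values from the first-step (Wilcoxon) test.
    p_step2: per-SNP p-values from the second-step chi-square test.
    alpha1, alpha2: illustrative thresholds, not values from the thesis.
    """
    # Step 1: drop SNPs whose first-step p-value is above the threshold.
    related = [j for j, p in enumerate(p_step1) if p <= alpha1]
    # Step 2: split the related set by chi-square significance.
    high = [j for j in related if p_step2[j] <= alpha2]
    low = [j for j in related if p_step2[j] > alpha2]
    return high, low


def sample_node_features(high, low, mtry, rng):
    """Draw up to mtry candidate SNPs for one node split.

    Assumption: each subset contributes in proportion to its size.
    SNPs removed in step 1 can never be drawn.
    """
    total = len(high) + len(low)
    if total == 0:
        return []
    mtry = min(mtry, total)
    n_high = min(round(mtry * len(high) / total), len(high))
    n_low = min(mtry - n_high, len(low))
    # Sample without replacement inside each subset.
    return rng.sample(high, n_high) + rng.sample(low, n_low)


# Toy p-values for six SNPs (made up for illustration only).
p_wilcoxon = [0.001, 0.20, 0.03, 0.04, 0.60, 0.01]
p_chi2 = [0.0005, 0.0001, 0.2, 0.008, 0.5, 0.3]

high, low = partition_snps(p_wilcoxon, p_chi2)
picked = sample_node_features(high, low, mtry=2, rng=random.Random(0))
print("high:", high, "low:", low, "picked:", picked)
```

In this toy input, SNPs 1 and 4 fail the first-step threshold, so they are never candidates for a split, regardless of their chi-square p-values. A full implementation would call the sampling function at every node of every tree in place of the uniform feature sampling used by a standard random forest.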

Keywords/Search Tags:

Genome-wide association study, complex diseases, Random forests, SNP

Related items

1	Construction And Analysis Of Complex Networks For Genome-wide Association Data
2	Application Of Complex Network Analytical Method To Genome-Wide Association Studies
3	The Research On Epistasis Detection Algorithm In Genome-wide Association Study
4	Application Of Tag SNP-set Analytical Method In Genome Wide Association Study
5	The Research Of Rare Variants Based On The Genome-wide Association Study
6	An Application Of Tabu Table Based Negative Feedback Ant Colony Optimization Algorithm In Genome-wide Association Analysis
7	Genome-Wide Interaction Study Of Single Nucleotide Polymorphisms
8	MAX Precision Test Of The Genome-Wide Association Study Under Two-Stage Design
9	Association Studies On Missing Heritability Of Complex Phenotypes
10	Hierarchical Mixed Model For Genome-wide Association Analysis Of Animal Growth Trajectories