Genome-Wide Association Studies Based On Random Forest

Posted on: 2017-02-25
Degree: Master
Type: Thesis
Country: China
Candidate: K Cui
Full Text: PDF
GTID: 2180330482996148
Subject: Software engineering
Abstract/Summary:
A single nucleotide polymorphism (SNP) is caused by a mutation of a single DNA nucleotide, and SNPs can help locate genes associated with disease. With rapidly developing high-throughput genotyping techniques, SNPs can now be genotyped densely across the entire genome, which makes it possible to detect disease-associated SNP loci genome-wide. A genome-wide association study (GWAS) analyses SNP loci in order to find factors associated with complex diseases. However, GWAS data are extremely high-dimensional with small sample sizes, so identifying disease-related SNPs is highly challenging.

Random forest (RF) is an advanced machine learning method that has been applied to genome-wide association studies of several complex diseases. RF achieves good prediction precision on small data sets, but it is hard to build an accurate prediction model on large data sets. This thesis proposes TSRF (Two-Step SNP subset classification method based on Random Forest), a two-step subset classification method that can be used to build accurate prediction models for GWAS data sets. In the first step, an importance score (IS) and a p-value from the Wilcoxon rank test are calculated for each SNP. SNPs whose p-values are above a chosen threshold are removed, and the remaining SNPs are regarded as the related SNP set. In the second step, the chi-square (χ²) test is used to calculate the statistical significance of each SNP in the related SNP set, and a second threshold is set according to the test results. The related SNPs are then partitioned into two subsets: a highly related SNP subset and a weakly related SNP subset. When each decision tree in the RF is built, the feature subset sampled for a node split is drawn only from these two subsets, in proportion. The final prediction results therefore depend only on relevant SNPs.
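The two filtering steps and the per-node feature sampling described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names, the threshold values, and the choice to draw from the two subsets in proportion to their sizes are all assumptions, and the step-1 and step-2 p-values are taken as precomputed inputs because the abstract does not specify how they are obtained.

```python
import random


def partition_snps(step1_p, chi2_p, filter_alpha=0.05, split_alpha=0.01):
    """Split SNP indices into (high, low) related subsets.

    step1_p: per-SNP p-values from the first step (assumed precomputed).
    chi2_p:  per-SNP chi-square p-values from the second step (assumed precomputed).
    filter_alpha, split_alpha: hypothetical thresholds, not values from the thesis.
    """
    # Step 1: drop SNPs whose p-value is above the first threshold.
    related = [i for i, p in enumerate(step1_p) if p <= filter_alpha]
    # Step 2: partition the related set with the second threshold.
    high = [i for i in related if chi2_p[i] <= split_alpha]
    low = [i for i in related if chi2_p[i] > split_alpha]
    return high, low


def sample_node_features(high, low, mtry, rng):
    """Draw up to mtry candidate SNPs for one node split.

    Assumption: the draw is proportional to the sizes of the two subsets.
    SNPs removed in step 1 can never be selected.
    """
    total = len(high) + len(low)
    if total == 0:
        return []
    mtry = min(mtry, total)
    n_high = min(len(high), round(mtry * len(high) / total))
    n_low = min(len(low), mtry - n_high)
    return rng.sample(high, n_high) + rng.sample(low, n_low)


# Toy usage with made-up p-values for six SNPs.
step1_p = [0.001, 0.20, 0.03, 0.04, 0.50, 0.01]
chi2_p = [0.0001, 0.50, 0.20, 0.005, 0.90, 0.30]
high, low = partition_snps(step1_p, chi2_p)
candidates = sample_node_features(high, low, mtry=2, rng=random.Random(0))
print(high, low, candidates)
```

In a full random forest, `sample_node_features` would replace the usual uniform draw of candidate features at every node, so SNPs filtered out in the first step never enter a split.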
TSRF effectively reduces the dimensionality of GWAS data and generates more accurate random forest models with lower generalization error while avoiding overfitting. We tested TSRF on real data for Parkinson's disease and Alzheimer's disease. Compared with traditional RF and with the more recent GRRF and WSRF methods, the results show that when the number of case-control subjects is much smaller than the number of SNPs, TSRF is the best method, with higher prediction precision and lower generalization error. The most significantly associated SNPs identified by TSRF in the GWAS of Parkinson's disease may provide guidance for subsequent biological validation experiments.
Keywords/Search Tags: Genome-wide association study, complex diseases, Random forests, SNP