
Study On Genome-wide Association Analysis And Comparable Performance Using Decision-tree-based Methods

Posted on: 2022-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: D F Shen
Full Text: PDF
GTID: 2530307133986779
Subject: Applied Mathematics
Abstract/Summary:
Genome-wide association studies (GWAS) have been widely used for the genetic dissection and gene mining of quantitative traits in human, animal, and plant genetics. Complex traits are usually controlled by quantitative trait nucleotides (QTNs) that are large in number, small in effect, and susceptible to environmental influences, so mining these QTNs plays a vital role in analyzing the complex traits of animals and plants. Currently, the most popular GWAS method is the mixed linear model (MLM). However, when little is known about the genetic structure of a quantitative trait, mixed linear models tend to report false-positive QTNs and may fail to detect QTNs that have nonlinear associations with the trait.

Decision-tree-based methods are machine learning algorithms that are widely used in various fields. They can express the complex relationship between quantitative traits and SNPs in binary-tree structures without making assumptions about the genetic structure of the traits. Decision-tree-based methods are therefore expected to perform better in detecting important QTNs. In this thesis, we investigated the accuracy and efficiency of four state-of-the-art decision-tree-based methods (XGBoost, LightGBM, CatBoost, and RandomForest) in detecting QTNs. First, we divided the QTNs into two categories: QTNs that have linear relationships with traits (linear QTNs) and QTNs that have nonlinear relationships with traits (nonlinear QTNs). Then, by analyzing five simulation experiments (1000 replications) and seven real flowering-related datasets from an Arabidopsis natural population, we evaluated the accuracy and efficiency of the four decision-tree-based methods in detecting linear and nonlinear QTNs. The main results are as follows:

1) The results of the five simulation experiments showed that LightGBM, CatBoost, and RandomForest have higher importance scores, lower FPR, higher AUC, and better medium ranks than XGBoost, and that all four methods can significantly distinguish QTNs from non-associated SNPs. In the real Arabidopsis data, CatBoost detected the largest number of genes (103), followed by XGBoost (100), LightGBM (82), RandomForest (81), and Lasso (35). In addition, we compared the genes detected by the four decision-tree-based methods with those detected by MLM and verified that the four methods can detect genes that MLM cannot. Notably, LightGBM showed higher detection capability and faster computation in both the simulations and the real datasets.

2) In general, correcting for population structure and polygenic background, or for random effects, can improve the accuracy and efficiency of some GWAS methods. However, the results from the simulated and real datasets showed that high accuracy and efficiency can also be maintained by simply using decision-tree-based methods.

3) The decision-tree-based methods most widely used in GWAS so far are XGBoost and RandomForest. This study found that LightGBM can further improve the accuracy and computation speed of QTN detection, which provides new ideas for genome-wide association analysis and linkage analysis.
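The core idea above, ranking SNPs by a tree-based importance score and checking how well the ranking separates QTNs from non-associated SNPs, can be illustrated with a minimal sketch. It is not the thesis pipeline: it grows a single small regression tree in pure Python rather than using XGBoost, LightGBM, CatBoost, or RandomForest. The simulated genotypes, the effect sizes, and every function name (`simulate`, `snp_importance`, `auc`) are assumptions made for illustration only.

```python
import random

def sse(ys):
    """Node impurity: sum of squared deviations from the node mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def best_split(X, y, rows):
    """Best (gain, snp, threshold) over all SNPs for the samples in rows."""
    parent = sse([y[i] for i in rows])
    best = (0.0, None, None)
    for j in range(len(X[0])):
        for t in (0.5, 1.5):  # the only useful cut points for 0/1/2 genotypes
            left = [y[i] for i in rows if X[i][j] <= t]
            right = [y[i] for i in rows if X[i][j] > t]
            if not left or not right:
                continue
            gain = parent - sse(left) - sse(right)
            if gain > best[0]:
                best = (gain, j, t)
    return best

def grow(X, y, rows, depth, importance):
    """Greedily grow a regression tree, crediting each split's gain to its SNP."""
    if depth == 0 or len(rows) < 10:
        return
    gain, j, t = best_split(X, y, rows)
    if j is None:
        return
    importance[j] += gain
    grow(X, y, [i for i in rows if X[i][j] <= t], depth - 1, importance)
    grow(X, y, [i for i in rows if X[i][j] > t], depth - 1, importance)

def simulate(n=400, p=20, seed=0):
    """Hypothetical data: SNP 0 is a linear QTN, SNP 1 a nonlinear QTN."""
    rng = random.Random(seed)
    X = [[rng.choice((0, 1, 2)) for _ in range(p)] for _ in range(n)]
    # SNP 0 acts additively; SNP 1 only matters in heterozygotes (genotype 1).
    y = [2.0 * g[0] + 3.0 * (g[1] == 1) + rng.gauss(0, 0.5) for g in X]
    return X, y

def snp_importance(X, y, depth=4):
    """Normalized impurity-based importance score for each SNP."""
    importance = [0.0] * len(X[0])
    grow(X, y, list(range(len(X))), depth, importance)
    total = sum(importance) or 1.0
    return [v / total for v in importance]

def auc(scores, qtns):
    """Share of (QTN, non-QTN) pairs in which the QTN scores higher."""
    pos = [scores[j] for j in qtns]
    neg = [s for j, s in enumerate(scores) if j not in qtns]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

X, y = simulate()
scores = snp_importance(X, y)
ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
print("top-ranked SNPs:", ranking[:5])
print("AUC for the simulated QTNs:", auc(scores, {0, 1}))
```

Because the tree splits on genotype thresholds instead of fitting an additive slope, the heterozygote-only effect of SNP 1 can be picked up without being specified in advance, which is the property the abstract attributes to decision-tree-based methods. An ensemble library would replace `snp_importance` with its own importance scores, but the ranking and evaluation steps would follow the same pattern.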
Keywords/Search Tags:genome-wide association study, LightGBM, XGBoost, CatBoost, random forest, nonlinear