
Improved Genome-wide Association Analysis Strategies Based On Machine Learning

Posted on: 2024-08-16
Degree: Master
Type: Thesis
Country: China
Candidate: Q Wang
Full Text: PDF
GTID: 2530307172468144
Subject: Agricultural Information Engineering
Abstract/Summary:
With the development of sequencing technology, genome-wide association analysis has become a mainstream method for studying the genetic mechanisms that regulate traits. Wheat is one of the most important food crops in the world; it is widely grown and is a main source of energy for humans. To mine more genetic information from wheat, improve the efficiency of wheat genome-wide association analysis, and promote wheat breeding, this thesis explores two new genome-wide association analysis methods based on machine learning. The methods combine statistics with data mining: K-Means serves as the subgroup grouping algorithm, and Fisher’s exact test is combined with a random forest model and an XGBoost model, respectively, giving two methods named "Fisher_RF" and "Fisher_XGB".

The innovation lies in two aspects: (1) In data preprocessing, outliers in the phenotypic data were screened with box plots. Subgroup grouping was not limited to complex population structure analysis; instead, the phenotypic data were clustered with the simple K-Means algorithm, guided by the silhouette coefficient and the within-cluster sum of squared errors. (2) Dimensionality reduction was performed in two stages, with Fisher’s exact test followed by re-screening with a nonlinear machine learning model, to reduce the number of associated single nucleotide polymorphism (SNP) variants ultimately retained.

To verify whether the two new methods can be applied to genome-wide association analysis of wheat, their results for wheat plant height were compared with those of a traditional genome-wide association analysis method based on the generalized linear model. The results show that: (1) the continuity of the results from the traditional control method and from Fisher_RF is better than that from Fisher_XGB; (2) the traditional control method screened out 2,967 plant-height-related variants, of which 140 were selected by Fisher_RF and 256 by Fisher_XGB; (3) before initiation, the proportions of phenotypic variance explained by Fisher_RF and Fisher_XGB were 48.27% and 40.11%, respectively, versus 65.18% for the traditional control method; (4) nevertheless, Fisher_RF gave the best results for sites known to be significantly associated with plant height, because the sites it selected ranked much higher overall than those of the control method. I therefore conclude that Fisher_RF is valid for genome-wide association analysis of wheat, whereas Fisher_XGB performed poorly in all aspects, and whether Fisher_XGB is valid for genome-wide association analysis of wheat remains open to question.
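The Fisher’s exact test screening stage described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than a detail taken from the thesis: the function names `fisher_exact_2x2` and `screen_snps`, the 0/1 genotype coding, the use of exactly two phenotype subgroups (so each SNP yields a 2x2 table), the `alpha` threshold of 0.05, and the toy data. The subsequent random forest or XGBoost re-screening is not shown.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def screen_snps(genotypes, groups, alpha=0.05):
    """Keep SNPs whose allele counts differ between the two subgroups.

    genotypes: dict of SNP id -> list of 0/1 calls per sample (assumed coding)
    groups:    list of 0/1 subgroup labels per sample (e.g. from clustering)
    """
    kept = {}
    for snp_id, calls in genotypes.items():
        pairs = list(zip(calls, groups))
        a = pairs.count((1, 0))  # allele present, subgroup 0
        b = pairs.count((1, 1))  # allele present, subgroup 1
        c = pairs.count((0, 0))  # allele absent, subgroup 0
        d = pairs.count((0, 1))  # allele absent, subgroup 1
        p = fisher_exact_2x2(a, b, c, d)
        if p < alpha:
            kept[snp_id] = p
    return kept


# Made-up toy data: 12 samples in two phenotype subgroups (hypothetical).
groups = [0] * 6 + [1] * 6
genotypes = {
    "snp_a": [1] * 6 + [0] * 6,                        # tracks the subgroups
    "snp_b": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],     # unrelated to them
}
kept = screen_snps(genotypes, groups)
print(sorted(kept))
```

In this toy case only the SNP whose calls follow the subgroup labels passes the filter. In a workflow like the one the abstract describes, the retained SNPs would then presumably be passed to the nonlinear model for re-screening; a real analysis would also need a multiple-testing correction, which this sketch omits.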
Keywords/Search Tags: Genome-Wide Association Analysis, Machine Learning, Fisher’s exact test