Modeling Of A New Model For Genome Wide Association Study And Its Application In Genomic Prediction

Posted on: 2018-12-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Zhou
Full Text: PDF
GTID: 1360330542492214
Subject: Botany
Abstract/Summary:
Genome-wide association analysis (GWAS) has been widely used for nearly 10 years to dissect the regulatory mechanisms of complex traits such as human diseases and the agronomic traits of plants and animals. The first genomic selection (GS) model was proposed about 17 years ago. GS has played a significant role in animal breeding, especially in dairy cattle breeding. As genotyping prices fall, GS will also occupy an important position in plant breeding. Genetic transformation, genome editing and GS will be the three major technologies for future breeding, and GS will be a necessary technical basis for precision agriculture.

GWAS and GS have also encountered problems in application. As the understanding of complex traits has deepened, the limitations of current association models have become clear: complex traits are controlled by polygenes, but most widely used models are single-locus models; most models fail to detect rare alleles and usually ignore them outright; genetic effects usually include additive effects, dominance effects and interactions, but most models are additive only, and few can detect interactions because of the time complexity; population structure and other unknown factors cause false positives; polygenic effects and linkage disequilibrium inflate p values; and heritability remains missing.

Linear models, Bayesian models and machine learning models are the three most important classes of GS models. The most widely used linear mixed model is genomic best linear unbiased prediction (gBLUP), and most other linear models are refinements of it that improve performance under certain conditions. Bayesian models are as accurate as, or more accurate than, linear models, but their running time grows with the number of markers, and with a million markers the analysis takes too long to be of practical use. Machine learning methods have the same problem. A variety of R packages and Linux-based command-line programs are currently available, but breeders still need easy-to-use graphical user interface (GUI) software.

The aims of this research are: 1) to develop a multi-locus association model that improves on the power of the additive model and reduces the false positive rate. Through algorithm optimization, the additive effect and interaction effect analyses can be finished in a reasonable time. The model is programmed in R, and the R package is publicly available; 2) to evaluate the accuracy of GS. There are two different understandings of GS accuracy, and the bias of the accuracy is poorly understood. We redefine the concept and investigate the bias of the accuracy in detail; 3) to develop GUI software for GS. The software is developed in Java, and an R package is provided at the same time.

In this study, a yeast F2 population was used to test the new GWAS model, and the performance of the model was analyzed systematically. We also analyzed the flowering-time-related traits of the maize NAM population composed of 36 recombinant inbred lines (RILs). We used data from Arabidopsis thaliana, maize, mice and pine to study the accuracy of GS.

1. A newly developed multi-locus model for High Dimensional GENEtic analysis (HDGENE). The model first detects associations with a single-locus model using stepwise regression. The significant loci are then analyzed by a multi-locus mixed model, EM-Bayesian LASSO, which controls false positives. The significant loci detected by EM-Bayesian LASSO are added iteratively as covariates to the stepwise regression model, which improves power. The HDGENE model can therefore improve power and reduce the false positive rate. To improve the model's ability to handle big data, we first select a subset of markers based on linkage disequilibrium across the genome. Second, we optimized the stepwise regression and reduced the running time. The optimized model can be used to analyze pair-wise interactions across the whole genome.

2. The power of the EM-Bayesian LASSO model. The EM-Bayesian LASSO model has an average power of 80.6% across different simulation scenarios, and nearly 100% power to detect loci that explain more than 5% of the phenotypic variance. The lower the phenotypic variance explained, the lower the power. We also noticed that the marker effects estimated by the EM-Bayesian LASSO model are biased.

3. The HDGENE additive model has high power. In the phenotypic analysis of the yeast F2 population, the average power of HDGENE was 71.9%, lower than the theoretical maximum of 80.6%. Like the EM-Bayesian LASSO model, it performed well for high-effect loci; the lower the effect of the locus, the lower the power. At the same time, the false discovery rate (FDR) of the HDGENE additive model was low, only 7.0%, and most of the false positive loci explained less than 1% of the phenotypic variance.

4. Comparison with the QTCAT model. The power of the QTCAT model is 52.2%, significantly lower than that of the HDGENE model, and its FDR is 8.8%, slightly higher than that of the HDGENE model.

5. Results of the HDGENE pair-wise interaction model. Simulation analysis showed that the power of the HDGENE pair-wise interaction model was 87.8%, much higher than the 75.7% of the interaction model implemented in the R/QTL package. However, the FDR was 13.9% for HDGENE, higher than the 3.2% of R/QTL.

6. Epistatic effects contribute to the flowering time of maize. Analysis of flowering data from the maize NAM population showed that, although the heritability of interactions estimated by the mixed model was almost zero, the HDGENE model could find some interactions with minor effects. We observed 11 large-effect epistatic interaction loci that could explain more than 10% of the phenotypic variance.

7. The accuracy of genomic selection. Based on the characteristics of cross-validation, we redefined two types of accuracy, namely Hold accuracy and Instant accuracy. We found that both Hold and Instant accuracy can give biased estimates under certain conditions, but a corrected Instant accuracy gives an unbiased estimate.

8. iGS software development. A Java GUI program was developed, and three GS models (gBLUP, EM-Bayesian LASSO and a polygenic model) were implemented in it. We also programmed an R package.

The newly developed association model and the iGS software will help to improve animal and plant breeding.
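
The power and FDR figures quoted in the results can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the dissertation: it assumes that power means the share of simulated causal loci that are reported, and reads FDR as the false discovery rate, that is, the share of reported loci that are not causal. The function name `power_and_fdr` and the marker identifiers are hypothetical, and exact identifier matching stands in for whatever matching rule the author actually used.

```python
def power_and_fdr(true_loci, detected_loci):
    """Return (power, fdr) for one simulation replicate.

    Assumed definitions (not confirmed by the source):
      power = detected true loci / all true loci
      fdr   = detected loci that are not true / all detected loci
    """
    true_set = set(true_loci)
    detected_set = set(detected_loci)

    # Loci that are both simulated as causal and reported by the model.
    true_positives = true_set & detected_set
    # Reported loci that were not simulated as causal.
    false_discoveries = detected_set - true_set

    # Guard against empty sets so the ratios are always defined.
    power = len(true_positives) / len(true_set) if true_set else 0.0
    fdr = len(false_discoveries) / len(detected_set) if detected_set else 0.0
    return power, fdr


# Hypothetical marker identifiers for a single replicate.
simulated_qtl = ["m10", "m25", "m40", "m77", "m90"]
reported = ["m10", "m25", "m40", "m63"]

power, fdr = power_and_fdr(simulated_qtl, reported)
print(f"power={power:.2f}, FDR={fdr:.2f}")  # prints "power=0.60, FDR=0.25"
```

In a simulation study these two ratios would typically be averaged over many replicates and scenarios. A real analysis would also usually count a detection as correct when it falls within some window around the causal locus, which this sketch does not attempt.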
Keywords/Search Tags: association analysis, mixed model, epistatic interaction, genomic selection, software