Methodological Establishment And Software Development Of Multi-locus Genome-wide Association Study With Polygenic Background Control And Kruskal-Wallis Test

Posted on: 2017-04-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W L Ren
GTID: 1480306011986879
Subject: Crop Genetics and Breeding
Abstract/Summary:
Most important traits in animals and plants are quantitative traits, controlled by several major loci plus numerous undetectable loci with small effects. Genetic dissection is essential to improve and utilize these traits in animal and plant breeding, and the genome-wide association study (GWAS) is an important method for it. However, the statistical power of GWAS methods to detect quantitative trait nucleotides (QTNs) is influenced by the distribution of the quantitative trait phenotypes, the allele frequency and the significance threshold for single-marker hypothesis testing. To improve the power of QTN detection, nonparametric methods have received increasing attention. Although many nonparametric GWAS methods are available, none of them implements polygenic background control, which results in a higher false positive rate.

To overcome this problem, a new matrix transformation was applied to the mixed linear model that includes the polygenic background effect, so that the new model contained only QTN effects and normal residual error. In the new genetic model, the Kruskal-Wallis (KW) test was performed to screen all markers for potential association with the quantitative trait. A small number of markers were then selected and placed into a multi-locus genetic model. The empirical Bayes method was used to estimate their effects, and the non-zero effects were tested by the likelihood ratio test to identify markers associated with the quantitative trait.

To validate the new method, five data sets, each with 1000 replicates, were simulated: 1) 6 QTNs and normal residual error; 2) 6 QTNs, additive polygenic background effect and normal residual error; 3) 6 QTNs, 3 pairs of epistatic effects and normal residual error; 4) 6 QTNs and lognormal residual error; 5) 6 QTNs and logistic residual error. To explore the performance of the new method, each simulated experiment was analyzed by four methods: the Kruskal-Wallis test (KW), the integration of the KW test with empirical Bayes (KWeB), KWeB with polygenic background control (KWeBP) and efficient mixed model association (EMMA). Ten flowering-related traits of Arabidopsis thaliana were used to further confirm the effectiveness of the new method. After the new method had been validated by computer simulation and real data analysis, an R software package with an interactive interface was developed. The main results are as follows:

1. The genetic model of GWAS included population structure, QTN effect, polygenic background effect and normal residual error. If an effect of population structure existed, it could be eliminated by regressing the phenotypic observations of the quantitative trait on population structure. Spectral decomposition was then applied to the matrix $B = \lambda_g Z K Z^{T} + I_n$, giving $B = (Q_1 \Lambda_r^{1/2} Q_1^{T})(Q_1 \Lambda_r^{1/2} Q_1^{T})$. The genetic model excluding population structure was multiplied on the left by $C = Q_1 \Lambda_r^{-1/2} Q_1^{T}$, so that the new model contained only the population mean, QTN effect and normal residual error, indicating that the polygenic background effect was eliminated.

2. In the new genetic model with the polygenic effect removed, the coefficients of the QTN effect were no longer binary but continuous. For the KW nonparametric test to proceed normally, the continuous coefficients were converted to binary variables: coefficients larger than the criterion were set to 1 and smaller ones to -1. The mean and the median of the coefficients were compared as the conversion criterion. The power of QTN detection was higher and the parameter estimation error was smaller with the mean, so the mean was adopted as the conversion criterion.

3. In the multi-locus genetic model, the number of effects was also an important parameter. Markers placed into the multi-locus model should be those potentially associated with the quantitative trait, that is, those with the lowest P-values in the single-marker whole-genome scan. In the computer simulation studies and the real data analysis, it was suggested to place the 100 and 1000 potentially associated markers with the lowest P-values into the multi-locus genetic model. The AIC could also be used to select the number of effects in the multi-locus model.

4. Monte Carlo simulation studies showed that the average power for the six simulated QTNs obtained from the new method KWeBP was 8.2%, 10.9% and 22.9% higher than that of KW, KWeB and EMMA, respectively, in simulation 1); 8.4%, 13.3% and 24.8% higher, respectively, under the polygenic background; 5%, 13.3% and 20.8% higher, respectively, under the epistatic background; and 7.1%, 11.3% and 23.9% higher, respectively, under the logistic error distribution. Under the lognormal error distribution it was 12.9% and 22.8% higher than that of KWeB and EMMA, respectively, and only 3.3% lower than that of KW. The accuracy of parameter estimation was measured by the mean squared error (MSE): the smaller the MSE, the higher the accuracy. The MSEs of the six simulated QTN effects from KWeBP were less than 0.1, those from KWeB were slightly higher but still less than 0.1, and those from EMMA were more than 0.4. To control the high false positive rate of association analysis, a stringent significance criterion is often used in the single-marker whole-genome scan, such as 0.05 divided by the number of markers (the Bonferroni correction) for EMMA. With the false positive rate expressed in units of 0.1‰, although the significance level of KWeBP was 1e-4, its false positive rates in all the simulations were less than 2.0, those of EMMA were less than 5.0, and those of KW were greater than 45.0. This indicated that the new method was effective in controlling the false positive rate.

5. Ten flowering-related traits of Arabidopsis thaliana were reanalyzed by KW, KWeB, KWeBP and EMMA. The KWeBP method detected 179 significant SNPs, 59 and 141 more than the KWeB and EMMA methods, respectively, and 268 fewer than the KW method. When multiple regression analysis was performed on these significantly associated markers and the corresponding traits, the BIC (Bayesian information criterion) value of each model could be calculated. Among these, the BIC value of the new method KWeBP was the lowest, indicating that its model fitted best. Across the flowering-related traits of Arabidopsis thaliana, the number of genes in the proximity of the significantly associated SNPs detected by KWeBP was 57, which was 14, 17 and 51 more than that of the KW test, KWeB and EMMA, respectively. This showed that the new method had the strongest ability to detect genes. In addition, the new method uncovered a number of new genes that were not found by the other methods, such as ARF6 and UFO on chromosome one and ARP5 and FLK on chromosome three.

6. In the R environment, the new method KWeBP was developed into an R software package with an interactive interface, based on the additional packages Gtk2 and gWidgetsRGtk2 and with the aid of the GTK+ graphics tools. The package, known as the KWeBP package, could run on Windows, Linux and Mac operating systems, showing good platform adaptability. The KWeBP package could also visualize the analysis results and had powerful drawing capabilities, such as drawing the Manhattan plots and QQ plots widely used in GWAS. The KWeBP package provides a friendly graphical user interface (GUI) for interactive operation, which greatly facilitates its use by researchers in genetics and breeding.
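
The binarization by the mean and the single-marker Kruskal-Wallis scan described in result 2 can be illustrated with a minimal Python sketch. It is not the code of the KWeBP package: the function names are hypothetical, the rule that a coefficient exactly equal to the mean is coded -1 is an assumption, and the polygenic background transformation and the empirical Bayes step are left out. It relies only on the textbook facts that the KW statistic is corrected for ties and that, with two groups, it is compared with a chi-square distribution with one degree of freedom.

```python
import math
from collections import defaultdict


def binarize_by_mean(coefficients):
    """Code continuous marker coefficients as +1 / -1 around their mean.

    Assumption: a value exactly equal to the mean is coded -1.
    """
    mean = sum(coefficients) / len(coefficients)
    return [1 if c > mean else -1 for c in coefficients]


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(phenotypes, groups):
    """Tie-corrected KW statistic H and its P-value for exactly two groups."""
    n = len(phenotypes)
    ranks = average_ranks(phenotypes)
    rank_sums = defaultdict(float)
    sizes = defaultdict(int)
    for r, g in zip(ranks, groups):
        rank_sums[g] += r
        sizes[g] += 1
    if len(sizes) != 2:
        raise ValueError("expected exactly two groups")

    h = 12.0 / (n * (n + 1)) * sum(
        rank_sums[g] ** 2 / sizes[g] for g in sizes
    ) - 3 * (n + 1)

    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    tie_counts = defaultdict(int)
    for v in phenotypes:
        tie_counts[v] += 1
    correction = 1 - sum(t ** 3 - t for t in tie_counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all phenotype values are identical")
    h = max(h / correction, 0.0)  # guard against tiny negative rounding error

    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(h / 2))
    return h, p_value


# Toy data for one marker: six individuals (values are made up).
codes = binarize_by_mean([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
h_stat, p_value = kruskal_wallis_two_groups([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], codes)
```

In a whole-genome scan this test would be repeated for every marker, and the markers with the lowest P-values would be carried into the multi-locus model as described in result 3.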
Keywords/Search Tags: genome-wide association study, Kruskal-Wallis test, multi-locus model, empirical Bayes, R package