
Assessment Of Imputation Methods For Whole Genome Sequence Data And Longitudinal Genome-wide Association Studies For Milk Production Traits In Holstein

Posted on: 2022-11-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Teng
GTID: 1483306749997659
Subject: Biology
Abstract/Summary:
Whole-genome sequence data capture all variants in the genome, which helps to identify causal variants for quantitative traits and diseases and to strengthen livestock and poultry breeding. However, high-coverage sequencing of large cohorts is still too expensive for large-scale genomic analysis, especially in livestock. One alternative strategy is to genotype the target population with a SNP chip and then impute the chip data to the sequence level using a reference panel sequenced at high coverage. This strategy has been widely used in previous studies. Low-coverage sequencing has also been proposed as a cost-effective approach for obtaining genotypes of whole-genome variants, and imputation performance is essential to its effectiveness. Several imputation methods have been proposed and successfully applied in genomic studies in humans and other species, but there are few reports on their performance in livestock. In this study, we first evaluated the performance of different genotype imputation methods for obtaining sequence data. In addition, to identify the single nucleotide polymorphisms (SNPs) influencing milk production traits and to improve the power to detect trait-related variants, we performed longitudinal genome-wide association studies (GWAS) based on a random regression model using imputed whole-genome sequence data. The specific results are as follows:

(1) SNP chip data of different medium densities from Holstein cattle were imputed directly to sequence data, and the performance of three imputation methods (Beagle v5.1, IMPUTE5 v1.1.3, and Minimac4 v1.0.2) was evaluated. Our results indicated that Beagle5 would be an optimal strategy for imputing SNP chip data to whole-genome sequence data. For the 50Kv1 and 50Kv2 chips, the imputation accuracy was over 0.82; for the 80K, 100K, and 150K chips, it was over 0.94.

(2) Six imputation methods (Beagle v4.1, GeneImp v1.3, GLIMPSE v1.1.0, QUILT v1.0.0, Reveel, and STITCH v1.6.5) were evaluated with varying sequencing depth, sample size, reference panel size, and minor allele frequency, using low-coverage sequencing data (1× or less) of Holstein cattle. Our results indicated that Reveel was not suitable for our data because of its very low imputation accuracy. On the whole, Beagle was not competitive with GeneImp, GLIMPSE, QUILT, and STITCH, although its imputation accuracy was acceptable (over 0.90) in most cases. GeneImp, GLIMPSE, QUILT, and STITCH each had advantages in particular situations. When a large reference panel was available, GeneImp and QUILT were very robust to sequencing depth and sample size, producing imputation accuracies near (GeneImp) or higher than (QUILT) 0.95 even for very low-coverage (0.1×) sequencing data and very small sample sizes (100). GLIMPSE generally performed very well when the sequencing depth was higher than 0.1×. STITCH, with or without a reference panel, produced the highest accuracy when the sequencing depth was higher than 0.4× and the sample size was larger than 400. Beagle was the slowest method and took much more time than the others, followed by QUILT, which was about 20-30% faster than Beagle. GeneImp was the fastest and took only about 1/5 of the time of Beagle. GLIMPSE took nearly twice as long as GeneImp. The running times of STITCH and STITCH_REF were between those of GLIMPSE and GeneImp. Overall, considering imputation accuracy, number of SNPs produced, and computing time, STITCH followed by Beagle would be an optimal strategy in the absence of a reference panel, while QUILT would be the method of choice when a reference panel is available.

(3) We performed longitudinal GWAS for milk production traits (milk yield, fat percentage, and protein percentage) using imputed sequence data in Chinese Holstein cattle. First, the existing SNP chip data of 6,470 cattle were imputed to sequence data using Beagle5. After imputation and filtering, we obtained 11,153,375 SNPs. Using these SNPs, we performed GWAS for milk yield, fat percentage, and protein percentage based on a random regression model. The longitudinal GWAS revealed 26, 39, and 75 QTL regions associated with milk yield, fat percentage, and protein percentage, respectively. We focused on 49 QTL regions and estimated their 95% confidence intervals (CIs) using the log P drop method. In total, we identified 581 genes located in these CIs, including 39 for milk yield, 65 for fat percentage, and 495 for protein percentage. Among them, two genes (DGAT1 and HSF1) were common to all three traits, and five additional genes (ADCK5, SLC52A2, FBXL6, TMEM249, and SCRT1) were common to milk yield and protein percentage. Further, we focused on the CIs that covered or overlapped with only one gene and on the CIs that contained an extremely significant top SNP, and identified 28 candidate genes in these CIs. Most of them have been reported in the literature to be associated with milk production traits, such as DGAT1, HSF1, MGST1, GHR, ABCG2, ADCK5, and CSN1S1. Among the previously unreported genes, some also showed good potential as candidate genes, such as CCSER1, CUX2, SNTB1, RGS7, OSR2, and STK3, and are worth further investigation. Our study provides not only new insights into the candidate genes for milk production traits, but also a general framework for longitudinal GWAS based on a random regression model using sequence data.
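
The log P drop idea mentioned above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function name `logp_drop_interval`, the `drop` threshold, and the rule of taking the outermost SNPs that stay above the cutoff are all assumptions made for illustration. The abstract does not state which drop value or boundary rule was used to obtain the 95% CIs.

```python
import math


def logp_drop_interval(positions, p_values, drop=2.0):
    """Sketch of a log P drop interval for one QTL region.

    positions: SNP base-pair positions on one chromosome.
    p_values:  GWAS p-values for the same SNPs (each must be > 0).
    drop:      hypothetical threshold in -log10(P) units; an assumption,
               not a value taken from the thesis.

    Returns (start, end, top_position).
    """
    if not positions or len(positions) != len(p_values):
        raise ValueError("positions and p_values must be non-empty and equal in length")

    # Convert p-values to the -log10 scale used in Manhattan plots.
    scores = [-math.log10(p) for p in p_values]

    # The top SNP is the one with the largest -log10(P).
    top = max(range(len(scores)), key=scores.__getitem__)
    cutoff = scores[top] - drop

    # Assumed boundary rule: the interval spans the outermost SNPs whose
    # score is still within `drop` units of the top SNP.
    inside = [pos for pos, s in zip(positions, scores) if s >= cutoff]
    return min(inside), max(inside), positions[top]


# Toy data, invented purely for illustration.
region = logp_drop_interval(
    [100, 200, 300, 400, 500],
    [1e-3, 1e-7, 1e-9, 1e-8, 1e-2],
    drop=2.5,
)
print(region)  # (200, 400, 300)
```

Genes whose coordinates overlap the returned interval would then be the positional candidates for that QTL region. A smaller `drop` gives a narrower interval.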
Keywords/Search Tags: Holstein, Whole genome sequence data, Genotype imputation, Longitudinal genome-wide association studies, Random regression model