| Objective: Converging evidence suggests that common complex diseases with the same or similar clinical manifestations could have different underlying genetic etiologies. While current research interests have shifted toward uncovering rare variants and structural variations predisposing to human diseases, the impact of heterogeneity in genetic studies of complex diseases has been largely overlooked. Most existing statistical methods assume that the disease under investigation has a homogeneous genetic effect and could, therefore, have low power if the disease undergoes heterogeneous pathophysiological and etiological processes. In this study, we propose a heterogeneity weighted U (HWU) method for association analyses considering genetic heterogeneity. HWU can be applied to various types of phenotypes (e.g., binary and continuous) and is computationally efficient for high-dimensional genetic data. Through simulations, we showed the advantage of HWU when the underlying genetic etiology of a disease was heterogeneous, as well as the robustness of HWU against different model assumptions. Methods: To verify the performance of the heterogeneity weighted U statistic, three simulations were conducted, and a nicotine dependence study based on SAGE was carried out to verify that the heterogeneity weighted U method has higher power than the method that does not consider heterogeneity. In Simulation I and Simulation II, we simulated various cases of genetic heterogeneity and compared the proposed heterogeneity weighted U test with two other tests: the non-heterogeneity weighted U test and the likelihood ratio test using the conventional generalized linear model (GLM). In Simulation III, we investigated the robustness of the heterogeneity weighted U test to non-normal distributions and mis-specified weight functions. In all sets of simulations, unless otherwise specified, we used the Euclidean-distance-based k_{i,j}, setting R = I, and the cross-product-based f(g_i, g_j) to form the weight function. For each simulation setting, we 
simulated 1000 replicate datasets, each with a sample size of 1000. Power and type I error of the methods were first calculated as the proportion of p-values in the 1000 replicates smaller than or equal to 0.05. Samples from SAGE were selected, and we used the Fagerström Test for Nicotine Dependence (FTND) item 4, the number of cigarettes smoked per day (CPD), which has been commonly used in genetic studies of nicotine use and dependence. SAGE genotyping was performed using the Illumina Human 1M DNA Analysis BeadChip. Genotype imputation was performed using the BEAGLE software. The minimum posterior probability required to call a genotype was 0.9. Prior to the statistical association analysis, we assessed the quality of the genotype data. As a first step in quality assessment, we examined the proportion of genotype calls for each marker (across all individuals) and for each individual (across all markers). Markers with less than 90% successful calls were removed. Similarly, individuals with 10% missing genotypes were excluded from the analysis. For the remaining missing genetic data, we used the average number of minor alleles of the marker to impute the missing values. Markers showing excessive deviations from Hardy-Weinberg equilibrium in the controls were flagged, and individuals with unexpected relationships were removed. In this paper, we use the cross-product kernel to measure the genetic similarity f(G_i, G_j) and use gender to infer the latent population structure by defining k_{i,j} = 1 - |x_i - x_j|, where x_i = 0 for males and x_i = 1 for females. By applying the method to each of 26 candidate genes, we evaluated their association with nicotine dependence (ND) while allowing for possible heterogeneous effects by gender. To account for potential confounding effects, we adjusted the analysis for gender, race, study site and the top four principal components calculated from the genome-wide data. Results: 1) In the first simulation, two population subgroups were set up. Four genetic heterogeneity models 
were simulated according to the size and direction of the effects. Comparing the type I error and power of the three methods (heterogeneity weighted U, non-heterogeneity weighted U and GLM) showed that the heterogeneity weighted U test had higher power and lower type I error. With respect to the underlying phenotypic distribution, simulation results showed that the heterogeneity weighted U test was more robust than the parametric GLM when the underlying distribution was non-normal. To verify the performance of the new method when the genetic model was unclear, seven different genetic models were set up in the two subpopulations. The results showed that heterogeneity weighted U performed better than the other two methods. When there was obvious genetic heterogeneity between the subpopulations, the power of the new method was even higher. 2) In the second simulation, a more complex latent population structure was set up: the number of subgroups was increased to 20, and 25 covariates were simulated to better approximate real data. The results showed that the type I errors of the three methods were less than 0.05. The three methods were applied to simulated binary and continuous phenotypes, respectively. The results showed that heterogeneity weighted U was significantly better than non-heterogeneity weighted U and GLM for both phenotype types. When genetic heterogeneity was negligible, the performance of heterogeneity weighted U was similar to that of the other two methods. However, when noise parameters were included in the model, the power of heterogeneity weighted U was reduced, as was that of the other two methods. We also simulated settings in which genetic effects differed across individuals. The results showed that heterogeneity weighted U had higher power, and the greater the genetic heterogeneity, the better the performance of heterogeneity weighted U. 3) In the third simulation, the robustness of heterogeneity weighted U and the variance component score test (VCscore) was tested. Considering the confounding 
effects, we simulated three types of non-normal distributions: (i) a t distribution with df = 2, (ii) a Cauchy distribution, and (iii) a mixture of normal and chi-squared distributions. The results showed that no inflated type I error was found for heterogeneity weighted U, with or without confounding effects. When VCscore was used for the Cauchy distribution with confounding effects, type I errors were inflated. When the weight function was mis-specified, the type I error of heterogeneity weighted U was well controlled, but the power was low. When covariates were missing or noise covariates were added, the power of heterogeneity weighted U was also low. Because the p-value cut-off is more stringent than 0.05 in large-scale data analysis, a simulation study with 1 million simulated replicates was carried out. The results showed that when the p-value cut-off was 5 × 10^-5, the type I error of heterogeneity weighted U was 4.0 × 10^-5. In the simulation study for the multi-locus model, the results showed that, compared with VCscore, heterogeneity weighted U controlled type I errors well and had higher power. 4) In the real-data analysis considering genetic heterogeneity by gender, of the 26 nicotine dependence candidate genes, 17 were associated with nicotine dependence by heterogeneity weighted U, while only one was detected by non-heterogeneity weighted U. In the association analysis of the CHRNA5-CHRNA3-CHRNB4 gene cluster and the CHRNB3-CHRNA6 gene cluster, a significant association between the gene clusters and nicotine dependence was obtained by both methods. The results for the CHRNA6 and CHRNB3 genes showed that CHRNA6 was significantly associated with nicotine dependence in women but not in men; CHRNB3 showed the opposite result. The results for the CYP genes showed that nicotine dependence was significantly associated with CYP2B6. When genetic heterogeneity was present, the heterogeneity weighted U method performed better than the method that does not consider genetic heterogeneity. Traditional statistical methods 
assume that the effects of genetic variants are the same across sub-populations, even when the distribution of those variants is heterogeneous, whereas the heterogeneity weighted U method allows the effects of genetic variants to differ. Although heterogeneity weighted U is an additive model based on genetic heterogeneity and is mainly used for single-locus tests, it can easily be extended to multi-locus models or other genetic models by adjusting the weight function, and the method is flexible in how the latent structure is constructed. ND measurements, such as CPD, do not have a known underlying distribution. This complexity, however, has not been carefully considered in existing analytic methods. In our analysis, we adopted a new non-parametric method, HWU, which makes no assumptions about the phenotype distribution and therefore provides robust and powerful performance for association analysis. Conclusion: The heterogeneity weighted U statistic makes it possible to handle phenotypes with unknown distributions. At the same time, the method can be flexibly applied to a variety of genetic effect models and to a variety of distributions. The results of the three simulations show that heterogeneity weighted U controls type I errors well and that its power is higher than that of the other methods. Even for more complex genetic effects and latent structures, it still shows superior performance. However, when the weight function is mis-specified or covariates are missing, the power of heterogeneity weighted U is lower. The results of the nicotine dependence study are consistent with those of existing biological studies and previously reported association analyses. The results support that the heterogeneity weighted U test is superior to the non-heterogeneity weighted U test. While this study reveals potential heterogeneous effects by gender for several ND-associated genes, this is an initial effort to study genetic heterogeneity in ND. Future studies are required to replicate the findings from our analysis and 
further investigate genetic heterogeneity in ND. |
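The weight construction described in the Methods (a gender-based k_{i,j} = 1 - |x_i - x_j| multiplied by a cross-product genetic similarity) can be illustrated with a minimal sketch. This is not the HWU implementation from the study: the function names are hypothetical, the genotype centring and phenotype standardisation are assumptions, and the p-value is obtained by permutation as a stand-in for whatever reference distribution the original method uses.

```python
import numpy as np

def gender_weight(x):
    """k[i, j] = 1 - |x_i - x_j| for a 0/1 covariate: 1 within a group, 0 across groups."""
    x = np.asarray(x, dtype=float)
    return 1.0 - np.abs(x[:, None] - x[None, :])

def cross_product_kernel(g):
    """f(g_i, g_j) = g_i * g_j for a single locus (centring is an assumption of this sketch)."""
    g = np.asarray(g, dtype=float)
    g = g - g.mean()
    return np.outer(g, g)

def weighted_u(y, w):
    """Sum over i != j of w[i, j] * s_i * s_j, scaled by n(n - 1); s is the standardised phenotype."""
    y = np.asarray(y, dtype=float)
    s = (y - y.mean()) / y.std()
    w = np.array(w, dtype=float)   # copy so the caller's matrix is untouched
    np.fill_diagonal(w, 0.0)       # exclude the i == j terms
    n = len(y)
    return s @ w @ s / (n * (n - 1))

def permutation_pvalue(y, w, n_perm=999, seed=0):
    """One-sided permutation p-value (assumption: permutation replaces the original null distribution)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = weighted_u(y, w)
    exceed = sum(weighted_u(rng.permutation(y), w) >= observed for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)
```

A heterogeneity-aware weight would then be formed as `gender_weight(sex) * cross_product_kernel(g)`, whereas using `cross_product_kernel(g)` alone corresponds to ignoring the latent structure. Note that permuting the phenotype does not adjust for covariates, so this sketch omits the confounder adjustment described in the text.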
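The genotype quality-control steps in the Methods (dropping markers with a call rate below 90%, excluding individuals with 10% missing genotypes, then imputing remaining gaps with the marker's average minor-allele count) can be sketched as follows. The function name, the NaN encoding of missing calls, the order of the two filters, and the reading of "10% missing" as "10% or more" are all assumptions made for illustration.

```python
import numpy as np

def filter_and_impute(genotypes, min_call_rate=0.9, max_missing=0.1):
    """Rows are individuals, columns are markers, NaN marks a missing call (assumed encoding)."""
    g = np.asarray(genotypes, dtype=float)

    # Step 1: keep markers whose call rate across individuals is at least min_call_rate.
    marker_call_rate = 1.0 - np.isnan(g).mean(axis=0)
    keep_markers = marker_call_rate >= min_call_rate
    g = g[:, keep_markers]

    # Step 2: keep individuals whose missing fraction is below max_missing.
    sample_missing = np.isnan(g).mean(axis=1)
    keep_samples = sample_missing < max_missing
    g = g[keep_samples]

    # Step 3: replace remaining gaps with the marker's mean minor-allele count.
    marker_means = np.nanmean(g, axis=0)
    g = np.where(np.isnan(g), marker_means, g)
    return g, keep_markers, keep_samples
```

The returned boolean masks make it possible to apply the same filtering to phenotype and covariate arrays. Hardy-Weinberg and relatedness checks mentioned in the text are not covered by this sketch.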