-
PDF
- Split View
-
Views
-
Cite
Cite
Julian Hecker, Dmitry Prokopenko, Christoph Lange, Heide Loehlein Fier, PolyGEE: a generalized estimating equation approach to the efficient and robust estimation of polygenic effects in large-scale association studies, Biostatistics, Volume 19, Issue 3, July 2018, Pages 295–306, https://doi.org/10.1093/biostatistics/kxx040
- Share Icon Share
SUMMARY
To quantify polygenic effects, i.e. undetected genetic effects, in large-scale association studies, we propose a generalized estimating equation (GEE) based estimation framework. We develop a marginal model for single-variant association test statistics of complex diseases that generalizes existing approaches such as LD Score regression and that is applicable to population-based designs, to family-based designs or to arbitrary combinations of both. We extend the standard GEE approach so that the parameters of the proposed marginal model can be estimated based on working-correlation/linkage-disequilibrium (LD) matrices from external reference panels. Our method achieves substantial efficiency gains over standard approaches, while it is robust against misspecification of the LD structure, i.e. the LD structure of the reference panel can differ substantially from the true LD structure in the study population. In simulation studies and in applications to population-based and family-based studies, we illustrate the features of the proposed GEE framework. Our results suggest that our approach can be up to 100% more efficient than existing methodology.
1. Introduction
While genome-wide association studies (GWAS) led to the discovery of many genetic risk loci for complex diseases and traits, the overall magnitude of the combined genetic effects on the disease phenotype of interest that is explained by GWAS findings is much smaller than classical heritability studies suggested (Manolio and others, 2008). This motivated large-scale efforts, e.g. Psychiatric Genomics Consortium (PGC), to combine as many available GWAS as possible for any particular disease and trait, and perform meta-analyses across all available studies. The large-scale efforts identified additional disease loci, but the overall genetic effect that is attributable to all significant GWAS signals still remains substantially smaller than the estimated heritability.
At the same time, in many of the meta-analyses and large-scale GWAS, a genomic control factor (Devlin and others, 2001) (global inflation factor of the association test statistics) that is clearly greater than one was observed, raising concerns about the validity of the statistical analysis. The inflation of the association test statistics can be explained by differences in terms of ancestry and/or phenotypic characteristics between the study populations of the meta-analysis. Such differences would lead to study heterogeneity which is difficult to account for in such analysis. Current efforts to maximize sample size by including as many studies as possible in the meta-analysis could amplify this effect. An alternative explanation for the observed inflation of association test statistics in such large-scale analyses is that the inflation of the genomic control factor is caused by large numbers of true positive association signals that do not formally reach the level of genome-wide significance (polygenic effects). While initial simulation studies suggest that polygenic effects can be a plausible explanation for such inflations of the genomic control factor (Devlin and others, 2001), statistical methodology is required to address this important research question.
For population-based studies of unrelated individuals, LD Score regression (Bulik-Sullivan and others, 2015b) proposed a marginal mean model for the single-variant association test statistics and applied a weighted linear regression technique to estimate the corresponding model parameters, quantifying the amount of population stratification and polygenic effects. Standard errors for the estimates are obtained through a jackknife procedure, as classical regression assumptions are violated due to the complex correlation structure between association test statistics.
Here, we derive a general marginal model for single-variant association test statistics of complex diseases and traits in large-scale studies that is valid for arbitrary study designs, e.g. population-based studies, family-based studies or any combination of those. For the special case of population-based association test statistics, our general mean model is equivalent to the mean model of LD Score regression. For the general mean model, we propose an efficient and robust framework to estimate the parameters of the marginal model. Our approach extends the idea of GEEs and utilizes LD matrices estimated from external reference panels/populations. If the LD matrices from the reference panels approximate the true correlation structure of test statistics reasonably well, our framework achieves a substantially increased efficiency over LD Score regression estimation. In contrast to existing methods as VEGAS (Liu and others, 2010) or ImpG (Pasaniuc and others, 2014), our framework does not require that external LD matrices exactly match the correlation structure of the study and is therefore robust against LD misspecification due to population differences or small sample sizes of reference panels. Furthermore, our GEE based approach provides asymptotic valid standard errors, making simulation or bootstrap methods redundant. In simulation studies, we examine the efficiency of our approach and its robustness against misspecification of the LD structure. In application to the population-based studies of the PGC, we observe an efficiency increase compared to LD Score regression. We also illustrate the approach by an application to a family-based association study for autism.
2. Methods
In this section, we derive a flexible marginal model for association test statistics in large-scale association studies. Based on that, we define the PolyGEE framework for the estimation of the polygenic effects.
2.1. General marginal model for association test statistics of complex diseases
To identify and quantify the different sources of inflation, we include two parameters in equation (2.1). The first parameter |$ \beta_1$| measures the amount of global population stratification in the study that causes the inflation of the test association statistics at a genome-wide level. The second parameter |$ \beta_2$| measures the overall polygenic effects. We will see that that the second parameter can be linked to the heritability of the disease/trait. The specification of the constant |$ C_{\rm study}$| depends only on the design of the association study by a priori-known study parameters such as sample size, disease prevalence etc. Below, we describe that the already known mean models for case-control and quantitative trait studies of unrelated individuals are special cases of equation (2.1) and derive the study parameter |$ C_{\rm study}$| for these scenarios. Furthermore, for family-based association studies with dichotomous traits, we derive the corresponding special case of equation (2.1) and, based on these findings, we can derive the parameter |$ C_{\rm study}$| for the more general case of association studies with mixed designs, i.e. combinations of population-based and family-based designs.
2.1.1. Link between relative risk and explained variance.
2.1.2. Population-based association studies.
Equations (2.4) and (2.5) are linear approximations of the mean of association statistics. These formulas can be derived using a Taylor approximation of the non-centrality parameter (NCP) (Yang and others, 2011b) with respect to the relative risk |$ \lambda$| (utilizing the small effect size assumption of a polygenic architecture), applying equation (2.3), using the definition of the LD Scores |$ l_l$| and assuming constant variances of causal variants that are uniformly distributed along the genome.
2.1.3. Family-based case-control studies.
We consider a large-scale family-based association study with |$ N$| independent nuclear families, all families are assumed to be of the same type, e.g. the same number of affected or unaffected offspring (|$ n_a$| and |$ n_u$|, respectively), and with parental genotype information available. We relax this assumption later. We assume that transmissions to offspring are independent given the parental genotypes. We assume that the variant |$ l$| is tested by the general association test approach which is implemented in FBAT (Lake and others, 2001; Lange and Laird, 2002). We denote the offset parameter by |$ z$|. Based on a second order Taylor approximation in |$ \lambda$| around |$ \lambda=1$|, we find the following expression for the expectation value of the squared test statistic |$ \chi^{2}_l$|.
The extension of the mean model to multiply family types is straightforward (Appendix A.2 of the supplementary materials available at Biostatistics online). Equation (2.6) characterizes the approximate power of family-based studies for all combinations of affected and unaffected offspring with reference to small effect sizes (small relative risks). Our derivation thereby does not rely on an explicit analytic formula for the exact NCP. The latter is only known for some special cases as trios and sibling pairs (Deng and Chen, 2001; Knapp, 1999). Our results are in line with corresponding second order Taylor approximations of these NCPs in |$ \lambda$| around |$ \lambda=1$|.
2.1.4. Equation (2.2) and |$ C_{\rm study}.$|
Obviously, equations (2.4), (2.5), and (2.7) are covered by model equation (2.1) with an appropriate choice for |$ C_{\rm study}$| and identifying |$ \beta_1$| with |$ 1+Na$| and |$ \beta_2$| with |$ h^{2}$|. For a meta-analysis of population- and family-based designs, |$ C_{\rm study}$| is constructed by the corresponding weighted sum.
If we utilize the assumption of normally distributed genetic effects for the moment, we can conclude that the equation (2.2) is valid with |$ \phi=2$|. The assumption of normally distributed effects was used in both the derivation of the linear mixed model of GCTA (Yang and others, 2011a) and the motivation for the heteroscedasticity weights of the LD Score regression (Bulik-Sullivan and others, 2015b). We relax assumption equation (2.2) by assuming an arbitrary nuisance parameter |$ \phi$|.
2.2. PolyGEE framework
Under the assumption of sparsely correlated clusters and reasonable technical conditions (Appendix B of the supplementary materials available at Biostatistics online), we can reestablish the following results for the asymptotic existence of the estimator |$ \hat{\beta}_n$|, which are oriented at previous work (Wang, 2011). The proofs are sketched in Appendix B of the supplementary materials available at Biostatistics online. The two key points in the proofs of Propositions 3 and 5 are the utilization of a central limit theorem for weakly dependent data and the modification of the sandwich-variance estimator to account for the specified dependence structure of the clusters.
There exists a root |$ \hat{\beta}_n$| of |$ g_n ( \beta_n )=0$| satisfying |$ ||\hat{\beta}_n-\beta_{n0} ||=O_p \bigg(\frac{\sqrt{p_n}}{\sqrt{n}}\bigg)$|.
Since the clusters are only sparsely correlated, we still observe the asymptotic normality of |$ g_n ( \beta_{n0}) $|.
For |$ \alpha_n \in \mathrm{R}^{p_n}$| with |$ ||\alpha_n ||=1$|, we have |$ \alpha^{T}_n M^{-\frac{1}{2}}_n (\beta_{n0} ) g_n (\beta_{n0} ) \to N(0,1)$|, as |$ n \to \infty$|, in distribution, where |$ M_n(\beta_{n0})=\text{Cov}(g_n(\beta_{n0}))$|.
With these results, we can establish the asymptotic distribution of the estimator |$ \hat{\beta}_n$|.
|$ \alpha^{T}_n M^{-\frac{1}{2}}_n (\beta_{n0} ) H_n (\beta_{n0} )(\hat{\beta}_n-\beta_{n0}) \to N(0,1)$|, as |$ n \to \infty$|, in distribution.
Finally, we analyze the features of the estimators variance and the corresponding hypothesis testing. Therefore, define the covariance matrix |$ \Sigma_n=H^{-1}_n (\beta_{n0} ) M_n (\beta_{n0} ) H^{-1}_n (\beta_{n0} )$|.
For an |$ l \times p_n$| matrix |$ C_n$| such that |$ G=C_n C^{T}_n$| with |$ G$| positive definite, we have |$ C_n \hat{\Sigma}_n C^{T}_n-C_n \Sigma_n C^{T}_n= o_p (n^{-1} )$|, where |$ \hat{\Sigma}_n =H^{-1}_n (\hat{\beta}_n) \hat{M}_n (\hat{\beta}_n) H^{-1}_n (\hat{\beta}_n)$| and |$ \hat{M}_n (\beta)=\sum\limits_{i=1}^{n} g_{ni} (\beta) g^{T}_{ni} (\beta) + \sum\limits_{i \ne j \text{ correlated}} g_{ni} (\beta) g^{T}_{nj} (\beta)$|. |$ \hat{\Sigma}_n$| describes the sandwich-variance estimator.
If we are interested to test hypotheses of the form |$ H_0: \beta_{n0b}=0$| vs. |$ H_1: \beta_{n0b} \ne 0$| for |$ b=1,\dots,p_n$|, we can use the immediate consequence (Wang, 2011):
Under |$ H_0$|, |$ W_{nb}=\frac{\hat{\beta}^{2}_{nb}}{\hat{\Sigma}_{nbb}} \to \chi^{2}(1)$|, as |$ n \to \infty$|, in distribution, where |$ \chi^{2}(1)$| denotes the chi-squared distribution with one degree of freedom.
To solve the GEEs, we implemented a C/C++ tool to extract the working-correlation matrices from the 1000 Genomes data and solve the GEEs via a Newton-Raphson algorithm. For the covariates, we used the pre-computed LD Scores for the European sample of 1000 Genomes (1000 Genomes Project Consortium and others, 2012).
3. Results
In the following section, we present the results of a simulation study and the analysis of real data examples.
3.1. Simulation study
Empirical standard errors (emp. s.e.) over 1000 replications and the estimated standard errors (est. s.e.) for PolyGEE are listed for the full and shrinked information scenario. Standard errors for the first parameter in |$ 10^{-3}$| and |$ 10^{-5}$| for the second
PolyGEE standard error estimation . | ||||||||
---|---|---|---|---|---|---|---|---|
. | 98% . | 64% . | ||||||
. | emp. s.e. . | est. s.e. . | emp. s.e. . | est. s.e. . | ||||
|$ (\beta_1,\beta_2)$| . | |$ \beta_1$| . | |$ \beta_2$| . | |$ \beta_1$| . | |$ \beta_2$| . | |$ \beta_1$| . | |$ \beta_2$| . | |$ \beta_1$| . | |$ \beta_2$| . |
|$ (1.00,0.0000)$| | 5.11 | 4.11 | 5.08 | 4.06 | 5.67 | 5.25 | 5.62 | 5.21 |
|$ (1.00,0.0001)$| | 4.98 | 4.22 | 5.14 | 4.23 | 5.50 | 5.33 | 5.68 | 5.36 |
|$ (1.00,0.0005)$| | 5.37 | 4.95 | 5.35 | 4.83 | 5.95 | 6.06 | 5.89 | 5.88 |
|$ (1.00,0.0010)$| | 5.67 | 5.58 | 5.60 | 5.50 | 6.20 | 6.54 | 6.12 | 6.48 |
|$ (1.02,0.0000)$| | 5.20 | 4.07 | 5.18 | 4.16 | 5.76 | 5.25 | 5.74 | 5.33 |
|$ (1.02,0.0001)$| | 5.25 | 4.31 | 5.23 | 4.31 | 5.77 | 5.38 | 5.80 | 5.46 |
|$ (1.02,0.0005)$| | 5.50 | 4.82 | 5.45 | 4.91 | 6.02 | 5.87 | 6.00 | 5.99 |
|$ (1.02,0.0010)$| | 5.53 | 5.59 | 5.60 | 5.50 | 6.13 | 6.61 | 6.13 | 6.48 |
PolyGEE standard error estimation . | ||||||||
---|---|---|---|---|---|---|---|---|
. | 98% . | 64% . | ||||||
. | emp. s.e. . | est. s.e. . | emp. s.e. . | est. s.e. . | ||||
|$ (\beta_1,\beta_2)$| . | |$ \beta_1$| . | |$ \beta_2$| . | |$ \beta_1$| . | |$ \beta_2$| . | |$ \beta_1$| . | |$ \beta_2$| . | |$ \beta_1$| . | |$ \beta_2$| . |
|$ (1.00,0.0000)$| | 5.11 | 4.11 | 5.08 | 4.06 | 5.67 | 5.25 | 5.62 | 5.21 |
|$ (1.00,0.0001)$| | 4.98 | 4.22 | 5.14 | 4.23 | 5.50 | 5.33 | 5.68 | 5.36 |
|$ (1.00,0.0005)$| | 5.37 | 4.95 | 5.35 | 4.83 | 5.95 | 6.06 | 5.89 | 5.88 |
|$ (1.00,0.0010)$| | 5.67 | 5.58 | 5.60 | 5.50 | 6.20 | 6.54 | 6.12 | 6.48 |
|$ (1.02,0.0000)$| | 5.20 | 4.07 | 5.18 | 4.16 | 5.76 | 5.25 | 5.74 | 5.33 |
|$ (1.02,0.0001)$| | 5.25 | 4.31 | 5.23 | 4.31 | 5.77 | 5.38 | 5.80 | 5.46 |
|$ (1.02,0.0005)$| | 5.50 | 4.82 | 5.45 | 4.91 | 6.02 | 5.87 | 6.00 | 5.99 |
|$ (1.02,0.0010)$| | 5.53 | 5.59 | 5.60 | 5.50 | 6.13 | 6.61 | 6.13 | 6.48 |
Empirical standard errors (emp. s.e.) over 1000 replications and the estimated standard errors (est. s.e.) for PolyGEE are listed for the full and shrinked information scenario. Standard errors for the first parameter in |$ 10^{-3}$| and |$ 10^{-5}$| for the second
PolyGEE standard error estimation . | ||||||||
---|---|---|---|---|---|---|---|---|
. | 98% . | 64% . | ||||||
. | emp. s.e. . | est. s.e. . | emp. s.e. . | est. s.e. . | ||||
|$ (\beta_1,\beta_2)$| . | |$ \beta_1$| . | |$ \beta_2$| . | |$ \beta_1$| . | |$ \beta_2$| . | |$ \beta_1$| . | |$ \beta_2$| . | |$ \beta_1$| . | |$ \beta_2$| . |
|$ (1.00,0.0000)$| | 5.11 | 4.11 | 5.08 | 4.06 | 5.67 | 5.25 | 5.62 | 5.21 |
|$ (1.00,0.0001)$| | 4.98 | 4.22 | 5.14 | 4.23 | 5.50 | 5.33 | 5.68 | 5.36 |
|$ (1.00,0.0005)$| | 5.37 | 4.95 | 5.35 | 4.83 | 5.95 | 6.06 | 5.89 | 5.88 |
|$ (1.00,0.0010)$| | 5.67 | 5.58 | 5.60 | 5.50 | 6.20 | 6.54 | 6.12 | 6.48 |
|$ (1.02,0.0000)$| | 5.20 | 4.07 | 5.18 | 4.16 | 5.76 | 5.25 | 5.74 | 5.33 |
|$ (1.02,0.0001)$| | 5.25 | 4.31 | 5.23 | 4.31 | 5.77 | 5.38 | 5.80 | 5.46 |
|$ (1.02,0.0005)$| | 5.50 | 4.82 | 5.45 | 4.91 | 6.02 | 5.87 | 6.00 | 5.99 |
|$ (1.02,0.0010)$| | 5.53 | 5.59 | 5.60 | 5.50 | 6.13 | 6.61 | 6.13 | 6.48 |
PolyGEE standard error estimation . | ||||||||
---|---|---|---|---|---|---|---|---|
. | 98% . | 64% . | ||||||
. | emp. s.e. . | est. s.e. . | emp. s.e. . | est. s.e. . | ||||
|$ (\beta_1,\beta_2)$| . | |$ \beta_1$| . | |$ \beta_2$| . | |$ \beta_1$| . | |$ \beta_2$| . | |$ \beta_1$| . | |$ \beta_2$| . | |$ \beta_1$| . | |$ \beta_2$| . |
|$ (1.00,0.0000)$| | 5.11 | 4.11 | 5.08 | 4.06 | 5.67 | 5.25 | 5.62 | 5.21 |
|$ (1.00,0.0001)$| | 4.98 | 4.22 | 5.14 | 4.23 | 5.50 | 5.33 | 5.68 | 5.36 |
|$ (1.00,0.0005)$| | 5.37 | 4.95 | 5.35 | 4.83 | 5.95 | 6.06 | 5.89 | 5.88 |
|$ (1.00,0.0010)$| | 5.67 | 5.58 | 5.60 | 5.50 | 6.20 | 6.54 | 6.12 | 6.48 |
|$ (1.02,0.0000)$| | 5.20 | 4.07 | 5.18 | 4.16 | 5.76 | 5.25 | 5.74 | 5.33 |
|$ (1.02,0.0001)$| | 5.25 | 4.31 | 5.23 | 4.31 | 5.77 | 5.38 | 5.80 | 5.46 |
|$ (1.02,0.0005)$| | 5.50 | 4.82 | 5.45 | 4.91 | 6.02 | 5.87 | 6.00 | 5.99 |
|$ (1.02,0.0010)$| | 5.53 | 5.59 | 5.60 | 5.50 | 6.13 | 6.61 | 6.13 | 6.48 |
Relative efficiency of the LD Score regression estimator in comparison with PolyGEE estimator, based on 1000 replications. The efficiencies are listed for the full and shrinked information scenario
. | 98% . | 64% . | ||
---|---|---|---|---|
|$ (\beta_1,\beta_2)$| . | |$ \beta_1 (\%)$| . | |$ \beta_2 (\%)$| . | |$ \beta_1 (\%)$| . | |$ \beta_2(\%)$| . |
(1.00,0.0000) | 61.2 | 49.0 | 75.4 | 80.0 |
(1.00,0.0001) | 61.9 | 53.1 | 75.5 | 84.7 |
(1.00,0.0005) | 63.3 | 55.7 | 77.7 | 83.5 |
(1.00,0.0010) | 66.6 | 61.6 | 79.6 | 84.6 |
(1.02,0.0000) | 59.9 | 47.1 | 73.5 | 78.4 |
(1.02,0.0001) | 63.9 | 52.6 | 77.4 | 81.8 |
(1.02,0.0005) | 63.2 | 55.5 | 75.7 | 82.4 |
(1.02,0.0010) | 61.5 | 61.1 | 75.6 | 85.5 |
. | 98% . | 64% . | ||
---|---|---|---|---|
|$ (\beta_1,\beta_2)$| . | |$ \beta_1 (\%)$| . | |$ \beta_2 (\%)$| . | |$ \beta_1 (\%)$| . | |$ \beta_2(\%)$| . |
(1.00,0.0000) | 61.2 | 49.0 | 75.4 | 80.0 |
(1.00,0.0001) | 61.9 | 53.1 | 75.5 | 84.7 |
(1.00,0.0005) | 63.3 | 55.7 | 77.7 | 83.5 |
(1.00,0.0010) | 66.6 | 61.6 | 79.6 | 84.6 |
(1.02,0.0000) | 59.9 | 47.1 | 73.5 | 78.4 |
(1.02,0.0001) | 63.9 | 52.6 | 77.4 | 81.8 |
(1.02,0.0005) | 63.2 | 55.5 | 75.7 | 82.4 |
(1.02,0.0010) | 61.5 | 61.1 | 75.6 | 85.5 |
Relative efficiency of the LD Score regression estimator in comparison with PolyGEE estimator, based on 1000 replications. The efficiencies are listed for the full and shrinked information scenario
. | 98% . | 64% . | ||
---|---|---|---|---|
|$ (\beta_1,\beta_2)$| . | |$ \beta_1 (\%)$| . | |$ \beta_2 (\%)$| . | |$ \beta_1 (\%)$| . | |$ \beta_2(\%)$| . |
(1.00,0.0000) | 61.2 | 49.0 | 75.4 | 80.0 |
(1.00,0.0001) | 61.9 | 53.1 | 75.5 | 84.7 |
(1.00,0.0005) | 63.3 | 55.7 | 77.7 | 83.5 |
(1.00,0.0010) | 66.6 | 61.6 | 79.6 | 84.6 |
(1.02,0.0000) | 59.9 | 47.1 | 73.5 | 78.4 |
(1.02,0.0001) | 63.9 | 52.6 | 77.4 | 81.8 |
(1.02,0.0005) | 63.2 | 55.5 | 75.7 | 82.4 |
(1.02,0.0010) | 61.5 | 61.1 | 75.6 | 85.5 |
. | 98% . | 64% . | ||
---|---|---|---|---|
|$ (\beta_1,\beta_2)$| . | |$ \beta_1 (\%)$| . | |$ \beta_2 (\%)$| . | |$ \beta_1 (\%)$| . | |$ \beta_2(\%)$| . |
(1.00,0.0000) | 61.2 | 49.0 | 75.4 | 80.0 |
(1.00,0.0001) | 61.9 | 53.1 | 75.5 | 84.7 |
(1.00,0.0005) | 63.3 | 55.7 | 77.7 | 83.5 |
(1.00,0.0010) | 66.6 | 61.6 | 79.6 | 84.6 |
(1.02,0.0000) | 59.9 | 47.1 | 73.5 | 78.4 |
(1.02,0.0001) | 63.9 | 52.6 | 77.4 | 81.8 |
(1.02,0.0005) | 63.2 | 55.5 | 75.7 | 82.4 |
(1.02,0.0010) | 61.5 | 61.1 | 75.6 | 85.5 |
3.2. Application to real data
In this section, we compare the performance of PolyGEE with LD Score regression based on the association results for public available summary statistics.
3.2.1. Population-based association data.
We applied both methods, PolyGEE and LD Score regression, to the summary association statistics of three traits from the PGC. The PGC provides the results of large-scale meta-analyses for bipolar disorder (BIP), schizophrenia (SCZ), major depressive disorder (MDD), attention deficit disorder (ADHD), and autism spectrum disorder (ASD). We analyzed the data from the PGC-Cross-Disorder Group for SCZ (Cross-Disorder Group of the Psychiatric Genomics Consortium and others, 2013) and the meta-analysis results for BIP (Psychiatric GWAS Consortium Bipolar Disorder Working Group and others, 2011) and MDD (Ripke and others, 2013). The samples for these diseases are unrelated case-control cohorts.
We filtered the summary statistics with the LDSC software (Bulik-Sullivan and others, 2015b) using the default parameters. This procedure includes removing ambiguous SNPs and variants with an imputation info-score |$ <$| 0.9, if the info-score is available. If no imputation-info score is available it is suggested to restrict the input to the set of HapMap3 (International HapMap 3 Consortium and others, 2010) SNPs, since these SNPs are usually well-imputed. We used both filter options for the PGC studies. LD Score regression was performed with LDSC, based on the pre-calculated LD Scores for the European Sample of the 1000 Genomes project. For PolyGEE, we took the same input data set as for LD Score regression, but additionally excluded SNPs in perfect LD and variants with very low p-values (|$ <$| e-16) (if exist). This excludes model outliers and avoids the construction of degenerated correlation matrices. Table 3 provides the estimated parameters and the corresponding number of variants included in this analysis. The polygenic term corresponds to an estimate of the heritability on the observed scale, using the number of SNPs |$ M$| for common SNPs (see above) provided by LDSC and the reported sample size from the corresponding publication. Therefore, the polygenic term quantifies |$ h^2 \frac{c^2 n_{\rm cases} n_{\rm controls}}{N^2 (1-K)^2 }$|. Since we compare the performance of estimators in terms of accuracy, we renounce to transform the heritability to the liability scale, since this is a constant factor transformation of the estimate. The estimated standard errors of the LD Score regression estimates were calculated by LDSC using the default settings. For PolyGEE, the estimated standard errors were calculated by our sandwich-variance estimator (Proposition 5). We observe that the PolyGEE estimates are more accurate and that the variance of the parameter estimates for the LD Score regression is around 50% higher.
Results for the population-based meta-analyses for BIP, SCZ, and MDD of the PGC. We provide the estimates for LD Score regression and PolyGEE. Standard errors depicted in brackets
Disorder . | LDSC . | PolyGEE . | ||||
---|---|---|---|---|---|---|
. | #SNPs . | |$ 1+Na$| . | Polygenic term . | #SNPs . | |$ 1+Na$| . | Polygenic term . |
BIP | 803 530 | 1.0495 (0.0105) | 0.3046 (0.0379) | 667 734 | 1.0448 (0.0083) | 0.2953 (0.0312) |
SCZ | 840 450 | 1.0418 (0.0117) | 0.4190 (0.0455) | 761 854 | 1,0428 (0.0091) | 0.3709 (0.0380) |
MDD | 891 172 | 1.0241 (0.0079) | 0.1130 (0.0265) | 812 873 | 1,0181 (0.0065) | 0.1093 (0.0223) |
Disorder . | LDSC . | PolyGEE . | ||||
---|---|---|---|---|---|---|
. | #SNPs . | |$ 1+Na$| . | Polygenic term . | #SNPs . | |$ 1+Na$| . | Polygenic term . |
BIP | 803 530 | 1.0495 (0.0105) | 0.3046 (0.0379) | 667 734 | 1.0448 (0.0083) | 0.2953 (0.0312) |
SCZ | 840 450 | 1.0418 (0.0117) | 0.4190 (0.0455) | 761 854 | 1,0428 (0.0091) | 0.3709 (0.0380) |
MDD | 891 172 | 1.0241 (0.0079) | 0.1130 (0.0265) | 812 873 | 1,0181 (0.0065) | 0.1093 (0.0223) |
Results for the population-based meta-analyses for BIP, SCZ, and MDD of the PGC. We provide the estimates for LD Score regression and PolyGEE. Standard errors depicted in brackets
Disorder . | LDSC . | PolyGEE . | ||||
---|---|---|---|---|---|---|
. | #SNPs . | |$ 1+Na$| . | Polygenic term . | #SNPs . | |$ 1+Na$| . | Polygenic term . |
BIP | 803 530 | 1.0495 (0.0105) | 0.3046 (0.0379) | 667 734 | 1.0448 (0.0083) | 0.2953 (0.0312) |
SCZ | 840 450 | 1.0418 (0.0117) | 0.4190 (0.0455) | 761 854 | 1,0428 (0.0091) | 0.3709 (0.0380) |
MDD | 891 172 | 1.0241 (0.0079) | 0.1130 (0.0265) | 812 873 | 1,0181 (0.0065) | 0.1093 (0.0223) |
Disorder . | LDSC . | PolyGEE . | ||||
---|---|---|---|---|---|---|
. | #SNPs . | |$ 1+Na$| . | Polygenic term . | #SNPs . | |$ 1+Na$| . | Polygenic term . |
BIP | 803 530 | 1.0495 (0.0105) | 0.3046 (0.0379) | 667 734 | 1.0448 (0.0083) | 0.2953 (0.0312) |
SCZ | 840 450 | 1.0418 (0.0117) | 0.4190 (0.0455) | 761 854 | 1,0428 (0.0091) | 0.3709 (0.0380) |
MDD | 891 172 | 1.0241 (0.0079) | 0.1130 (0.0265) | 812 873 | 1,0181 (0.0065) | 0.1093 (0.0223) |
3.2.2. Family-based association data.
We also applied PolyGEE and LD Score regression to the summary statistics from the Autism Genome Project (AGP) Consortium association study for autism spectrum disorder (ASD) (Anney and others, 2012). The analysis was based on 2,359 affected offspring of European ancestry. Quality control of the input data was performed as for the population-based association studies. It is important to note again that, despite that LD Score regression was developed for population-based association statistics, the derivation of equation (2.7) implies the possibility to also analyze family-based association by the underlying estimation framework of LD Score regression. Since the output options of the LDSC software assume population-based association data, we had to modify the results and the output according to equation (2.7). Table 4 contains the results of the analysis with PolyGEE and LDSC. The polygenic term describes |$ hc^2$| (|$ c^2$| as defined in equation (2.3)), assuming 2,359 trios as an accurate approximation of the family structure in the sample (Anney and others, 2012). The findings in terms of efficiency gains are substantial and of similar magnitude as for the population-based studies.
Results for the family-based association study for ASD. We provide the estimates for LD Score regression and PolyGEE. Standard errors depicted in brackets
Disorder . | LDSC . | PolyGEE . | ||||
---|---|---|---|---|---|---|
. | #SNPs . | |$ 1+Na$| . | Polygenic term . | #SNPs . | |$ 1+Na$| . | Polygenic term . |
ASD | 806 205 | 1.0396 (0.0073) | 0.5754 (0.4052) | 701 164 | 1.0387 (0.0056) | 0.5750 (0.3251) |
Disorder . | LDSC . | PolyGEE . | ||||
---|---|---|---|---|---|---|
. | #SNPs . | |$ 1+Na$| . | Polygenic term . | #SNPs . | |$ 1+Na$| . | Polygenic term . |
ASD | 806 205 | 1.0396 (0.0073) | 0.5754 (0.4052) | 701 164 | 1.0387 (0.0056) | 0.5750 (0.3251) |
Results for the family-based association study for ASD. We provide the estimates for LD Score regression and PolyGEE. Standard errors depicted in brackets
Disorder . | LDSC . | PolyGEE . | ||||
---|---|---|---|---|---|---|
. | #SNPs . | |$ 1+Na$| . | Polygenic term . | #SNPs . | |$ 1+Na$| . | Polygenic term . |
ASD | 806 205 | 1.0396 (0.0073) | 0.5754 (0.4052) | 701 164 | 1.0387 (0.0056) | 0.5750 (0.3251) |
Disorder . | LDSC . | PolyGEE . | ||||
---|---|---|---|---|---|---|
. | #SNPs . | |$ 1+Na$| . | Polygenic term . | #SNPs . | |$ 1+Na$| . | Polygenic term . |
ASD | 806 205 | 1.0396 (0.0073) | 0.5754 (0.4052) | 701 164 | 1.0387 (0.0056) | 0.5750 (0.3251) |
4. Discussion
At a genome-wide level, we derived a general marginal model for single-variant association test statistics of complex diseases. We proposed the PolyGEE methodology to estimate the amount of polygenic effects and confounding biases from the association test statistics of large-scale association studies using this model. We showed via simulation studies and the application to real data that our approach is substantially more efficient than the existing LD Score regression framework, i.e. the estimates are more precise. The increased efficiency of our approach is achieved by incorporating detailed local LD information from the external reference panels, e.g. 1000 Genomes project, into the estimation step. However, our GEE approach does not require that the reference panel and the study data are exact matches in terms of the LD structure. The approach is robust against deviations of the sample LD structure from the reference panel and can compute asymptotic valid standard errors. Our theoretical derivations and assumptions lead to general valid results for the estimation framework. For further research, this makes it possible to extend the mean model to incorporate more components or estimate the genetic correlation between two traits.
Supplementary material
Supplementary material is available at http://biostatistics.oxfordjournals.org.
Acknowledgments
Conflict of Interest: None declared.
Funding
Cure Alzheimers Fund; National Institute of Mental Health [R01MH08162, R01MH087590]; National Genome Institute [R01HG008976]; National Heart, Lung, and Blood Institute [U01HL089856, U01HL089897, PO1HL120839]. National Research Foundation of Korea Grant funded by the Korean Government [NRF-2014S1A2A2028559] to C.L.