Optimizing genomic control in mixed model associations with binary diseases

Song, Yuxin; Yang, Li’ang; Jiang, Li; Hao, Zhiyu; Yang, Runqing; Xu, Pao

doi:10.1093/bib/bbab426

Abstract

Complex computation and approximate solution hinder the application of generalized linear mixed models (GLMM) into genome-wide association studies. We extended GRAMMAR to handle binary diseases by considering genomic breeding values (GBVs) estimated in advance as a known predictor in genomic logit regression, and then reduced polygenic effects by regulating downward genomic heritability to control false negative errors produced in the association tests. Using simulations and case analyses, we showed in optimizing GRAMMAR, polygenic effects and genomic controls could be evaluated using the fewer sampling markers, which extremely simplified GLMM-based association analysis in large-scale data. Further, joint association analysis for quantitative trait nucleotide (QTN) candidates chosen by multiple testing offered significant improved statistical power to detect QTNs over existing methods.

binary disease, genomic control, generalized linear mixed model, joint association analysis, computational efficiency

Introduction

Although usually expressed as binary phenotypes, many disease traits in plants and animals are thought to be controlled by a number of loci each having a small effect [1, 2]. Thus, random polygenic effects excluding the tested markers should be considered in genome-wide association study (GWAS) for disease traits to correct the population stratification and cryptic relatedness, as linear mixed model (LMM) does for quantitative traits [3, 4]. Logistic regression, as a kind of generalized linear model (GLM) [5, 6], has been used earlier for genome-wide association analysis with the disease traits [7]. Despite correction for fixed-effect covariates [8–10], logistic regression still produces inflation of association test statistics. Therefore, it is necessary to introduce a generalized linear mixed model (GLMM) [11] that considers random polygenic effects to increase the power to map quantitative trait nucleotides (QTNs) for disease traits.

However, genome-wide GLMM association consumes much more computing time than mixed model association with either restricted maximum likelihood estimation (REML) [12] or Markov Chain Monte Carlo (MCMC) iteration [13]. If using the maximum likelihood in estimation and approximations to avoid numerical integration [14], the GLMM yields a problem of serious bias induced by the approximations [15], especially solutions tending toward the positive/negative infinity. In the GWAS for case–control studies, a non-random sampling of cases from the population results in biased estimation for genomic heritability. Analyzed by the LMM for binary phenotypes, genomic heritability estimate is transformed to a liability scale by adjusting both for scale and for ascertainment of the case samples [16]. The genomic heritability of liability is estimated in a biased manner for disease traits, even though it is done by GLMM via MCMC iteration. Based on the calibrated genomic hereditability for case–control ascertainment, a Chi-squared score statistic for GWAS of disease traits is computed from posterior mean liabilities (PMLs) under the liability-threshold model [17]. The individual PMLs are estimated with a multivariate Gibbs sampler, which increases the computational demand. For simplification to GLMM-based association analysis, the GMMAT [18] and SAIGE [19] separately extend the EMMAX [20] and BOLT-LMM [21], respectively, for normally distributed traits to binary phenotypes.

Owing to the computational intensity and approximate solutions obtained, GLMM can hardly be employed in GWAS for disease traits. Moreover, genomic heritability cannot be accurately estimated for complex diseases, especially in ascertained case–control studies. For simplification to genome-wide mixed model association analysis for quantitative traits normally distributed, GRAMMAR [22] replaced polygenic effects excluding the tested marker with the genomic breeding values (GBVs) estimated under null mixed model, and then associated phenotype residuals removing GBVs with genetic markers by least square method in the lowest computational complexity. However, removing partial marker effects from phenotype residuals made GRAMMAR deflating association test statistics and decreasing statistical power to detect QTNs. In this study, we extended GRAMMAR to handle binary traits by considering GBVs estimated in advance as a known predictor in genomic logit regression. Further, we optimized the genomic control for GRAMMAR for binary traits by regulating the downward genomic heritability to estimate the residual phenotypes. The complicated GLMM does not need to be directly solved by the Optim-GRAMMAR for binary traits, and it only repeatedly estimates GBVs with genomic best linear unbiased prediction (GBLUP) [23] for GLMM, achieving genome-wide GLMM association analysis rapidly. Finally, we jointly analyzed the QTN candidates chosen by multiple testing to improve the statistical power to detect QTNs.

Methods

Genomic logit regression

Complex disease traits, as binary ones, usually follow binomial or Poisson distributions, so the generalized linear model (GLM) [5, 6] is used to map QTLs controlling the traits. Assume that n individuals are recorded for phenotypic values and genotyped for m genetic markers. Distinguishing these markers from major and common alleles in a magnitude of effects of the markers on quantitative traits, we describe the relationship between all markers (predictors) and the mean of the exponential distribution family in the following logit regression:

$$\begin{equation} \ln \left(\frac{\mu }{1-\mu}\right)=\mathbf{Xb}+{\mathbf{Z}}_{\mathbf{1}}{\mathbf{a}}_{\mathbf{1}}+{\mathbf{Z}}_{\mathbf{2}}{\mathbf{a}}_{\mathbf{2}} \end{equation}$$

(1)

where |$\mu $| denotes the expectations of phenotypic distribution, b is the systematic environment effect; the population structure (stratification), which results in phenotypic differences among subpopulations is always considered here, except for sex, age and some initial experimental conditions. a₁ is the large genetic effect of q markers on phenotype, a₂ is the minor or zero effect of the m-q markers on phenotype, and X, Z₁ and Z₂ are the corresponding design matrices of b, a₁ and a₂, respectively.

GRAMMAR for binary diseases

We define the GBVs as

$$\begin{equation} \mathbf{g}={\mathbf{Z}}_{\mathbf{1}}{\mathbf{a}}_{\mathbf{1}}+{\mathbf{Z}}_{\mathbf{2}}{\mathbf{a}}_{\mathbf{2}} \end{equation}$$

(2)

Then, model (1) becomes

$$\begin{equation} \ln \left(\frac{\mu }{1-\mu}\right)=\mathbf{Xb}+\mathbf{g} \end{equation}$$

(3)

where a is the GLMM [11]. Under the assumption that |$({\mathbf{a}}_{\mathbf{1}},{\mathbf{a}}_{\mathbf{2}})\sim{N}_m(\mathbf{0},\mathbf{I}{\sigma}_a^2)$| with minor |${\sigma}_a^2$| for each marker, the GBVs are turned into random effects and |$\mathbf{g}\sim{N}_n(\mathbf{0},\mathbf{K}{\sigma}_g^2)$| with genomic variance of traits |${\sigma}_g^2=m{\sigma}_a^2$| and the genomic relationship matrix (GRM) K [24]. Without need to invert GRM, the GBVs can be estimated with the following GBLUP equations [4]

$$\begin{equation} \left[\begin{array}{@{}cc@{}}{\mathbf{X}}^{\mathrm{T}}\mathbf{W}\mathbf{X}& {\mathbf{X}}^{\mathrm{T}}\mathbf{W}\\{}\mathbf{KW}\mathbf{X}& \mathbf{KW}+\frac{1-{h}^2}{h^2}\end{array}\right]\left[\begin{array}{@{}c@{}}\hat{\mathbf{b}}\\{}\hat{\mathbf{g}}\end{array}\right]=\left[\begin{array}{@{}c@{}}{\mathbf{X}}^{\mathrm{T}}\mathbf{W}{\mathbf{y}}^{\ast}\\{}\mathbf{KW}{\mathbf{y}}^{\ast}\end{array}\right] \end{equation}$$

(4)

with

$$ \mathbf{W}=\mu \left(\mathbf{1}-\mu \right)\;\mathrm{and}\;{\mathbf{y}}^{\ast}=\ln \left(\frac{\mu }{1-\mu}\right)+\frac{\mathbf{y}-\mu }{\boldsymbol{\mu} \left(\mathbf{1}-\mu \right)}. $$

where y is a binary phenotype, and |${h}^2=\frac{\sigma_g^2}{\sigma_g^2+1}$| is the unknown genomic heritability of liability with the residual variance of 1 to be assumed in GLMM, which can be estimated in advance using the REML for GLMM [11, 18].

Unlike the normally distributed quantitative traits, the residuals for binary traits cannot be directly obtained due to the difference in scale between the predictors and response variables. Within the framework of GRAMMAR, thus, we eliminated polygenic effects on the binary phenotype by regarding the estimated GBVs |$\hat{\mathbf{g}}$| as a known predictor in the following GLM:

$$\begin{equation} \ln \left(\frac{\mu }{\mathbf{1}-\mu}\right)={\mathbf{z}}_{\mathrm{SNP}}{a}_{\mathrm{SNP}}+\hat{\mathbf{g}} \end{equation}$$

(5)

with a regression item |${\mathbf{z}}_{\mathrm{SNP}}{a}_{\mathrm{SNP}}$| of the SNP tested.

With the iteratively re-weighted least square method [5], we obtained the maximum likelihood estimate for the SNP effect as:

$$\begin{equation} {\hat{a}}_{\mathrm{SNP}}={\left({\mathbf{z}}_{\mathrm{SNP}}^{\mathrm{T}}\mathbf{W}{\mathbf{z}}_{\mathrm{SNP}}\right)}^{-1}{\mathbf{z}}_{\mathrm{SNP}}^{\mathrm{T}}\mathbf{W}\left({\mathbf{y}}^{\ast}-\hat{\mathbf{g}}\right) \end{equation}$$

(6)

where W is calculated based on model (5).

The test statistic to infer the association of the SNP with binary traits is generally formulated by

$$\begin{equation} {\chi}^2=\frac{{\hat{a}}_{\mathrm{SNP}}^2}{{\mathbf{z}}_{\mathrm{SNP}}^{\mathrm{T}}\mathbf{W}{\mathbf{z}}_{\mathrm{SNP}}} \end{equation}$$

(7)

which is subject to the chi-squared distribution with the 1 degree of freedom.

Optimization of genomic control

In GRAMMAR for binary traits, replacement of polygenic effects excluding QTNs with GBVs deflates the test association statistics, which yields a high false-negative rate. By regulating the downward genomic heritability, we can more accurately estimate the polygenic effects with the GBLUP Equation (4). The polygenic heritability less than genomic heritability is determined by optimizing genomic control for association tests. Such an optimization for GRAMMAR can be summarized in the following steps:

(1) set the searching open interval of h² to (0, 1);
(2) estimate the GBVs |$\hat{\mathbf{g}}$| using Equation (4);
(3) statistically infer the genetic effect for each SNP by the chi-squared statistic (7);
(4) calculate the genome-wide chi-squared mean or statistical probability for each SNP;
(5) plot the quantile-quantile (Q-Q) profile for genome-wide statistical probabilities;
(6) update h² with Brent’s method [25];
(7) repeat step (2)–(6) until the genome-wide chi-squared mean reaches 1.0 plus or yields a satisfactory Q-Q plot, as proposed in [26].

Joint association analysis

After optimizing genomic control for GRAMMAR, we jointly analyzed multiple QTN candidates to improve the statistical power to detect QTNs for binary disease traits. Multiple QTN candidates can be chosen within the interval of significance level 0.05 to the Bonferroni corrected criterion [27], but the number of QTN candidates should be limited to be no greater than the population size. Taking the polygenic effects estimated above as a known predictor, we adopted backward regression to select the QTNs from the following multiple GLM:

$$\begin{equation} \ln \left(\frac{\mu }{\mathbf{1}-\mu}\right)=\mathbf{Xb}+{\mathbf{Z}}_{\mathbf{1}}{\mathbf{a}}_{\mathbf{1}}+\hat{\mathbf{g}} \end{equation}$$

(8)

Given the Bonferroni corrected significance level, non-significant QTN effects would be eliminated stepwisely from the model (8) according to the corrected statistic (7).

Optimizing genomic control in mixed model associations with binary diseases was written in C language on R platform. A user-friendly Optim-GRAMMAR software with the options of a test at once and joint association analysis was developed, which is freely available at https://github.com/RunKingProgram/Optim-GRAMMAR.

Simulations

Two genomic datasets of human [28] and maize [11] samples were used to simulate the adaptability of GRAMMAR for binary traits to population structure. The maize population has a more complex structure than the human population. Then, 300 000 SNPs for both 12 000 people and 2640 maize were extracted through higher quality control. In whole simulations, control and case samples were constrained to 1:1 for the maize population, and 3000 cases were selected from the human population with low incidence rate of 5% simulated in advance. QTNs were distributed randomly over the entire SNPs, whose additive effects were sampled from a gamma distribution with shape = 1.66 and scale = 0.4. Given the genomic heritability of liability, phenotypes of control (0) and case (1) can be generated from the genomic logit model (1).

In addition to population structure, the number of QTNs, genome heritability and sampling number of SNPs were considered as experimental factors in the simulations. Under the optimized genomic control infinitely close to 1.0, the ROC profiles can be plotted by statistical powers to detect the QTNs relative to a given series of Type I errors. Statistical powers are defined as the percentage of identified QTNs that have the maximum test statistic among their 20 closest neighbors over the total number of simulated QTNs. Simulations were repeated 50 times, and in each simulation, the positions and effects of QTNs simulated were varied and the average results were recorded.

Results

Statistical properties of Optim-GRAMMAR for binary traits

Based on the two genomic datasets, we simulated phenotypes controlled by 40, 200 and 1000 QTNs at the low (0.2), moderate (0.5) and high (0.8) genomic heritability, respectively. The statistical properties of Optim-GRAMMAR using a test at once for binary traits were investigated by comparing it with GRAMMAR, GMMAT, LTMLM and SAIGE. The Q-Q and ROC profiles are displayed selectively in Figures 1 and 2, respectively, and in Supplementary Figures 1S and 2S, respectively, in detail. The genomic controls are estimated in Supplementary Table 1S. Making genomic control infinitely close to 1.0, Optim-GRAMMAR achieved almost the same statistical power to detect QTNs as the GMMAT, which approximates the exact GLMM, irrespective of how many QTNs and heritabilities are simulated. Among Optim-GRAMMAR and the four competing methods, GRAMMAR had the lowest genomic controls and statistical power, and for GRAMMAR, the population structure was more complex and the false negative rate was larger. Although LTMLM achieved the highest statistical power to detect QTNs for all simulated phenotypes for the maize dataset, and SAIGE demonstrated a higher statistical power for the dataset controlled by 1000 QTNs at the genomic heritability of 0.2. A strong false positive error rate was observed for SAIGE. In the human dataset, there were no distinct differences in the statistical properties between GRAMMAR-lambda and the four competing methods, although GRAMMAR provided some false negative errors.

Figure 1

Comparison in the Q-Q profiles between Optim-GRAMMAR and the four competing methods. The simulated phenotypes are controlled by 200 QTNs with the low, moderate and high heritabilities in human and maize. The Q-Q profiles for all simulated phenotypes are reported in Supplementary Figure 1S.

Open in new tab Download slide

Figure 2

Comparison in the ROC profiles between Optim-GRAMMAR and the four competing methods. The ROC profiles are plotted using the statistical powers to detect QTNs relative to the given series of Type I errors. Here, the simulated phenotypes are controlled by 200 QTNs with the low, moderate and high heritabilities in human and maize. The ROC profiles for all simulated phenotypes are reported in Supplementary Figure 2S.

Open in new tab Download slide

After optimization for genomic control, Optim-GRAMMAR jointly analyzed multiple QTN candidates chosen from a test at once at a significance level of 0.05. For convenience for comparison, we analyzed the statistical powers obtained with one test at a time and joint analyses together. By backward regression analysis, Optim-GRAMMAR evidently exhibited improved statistical power. In contrast, LTMLM was inferior to joint analysis of Optim-GRAMMAR in the terms of statistical power, even with the highest false positive rates.

Calculation of GRMs and GCs with the sampling markers

To investigate the effects of sampling markers on Optim-GRAMMAR, we randomly drawn 3 K, 5 K, 10 K, 20 K and 25 K SNPs from the entire genomic markers to calculate GRM. Changes in the genomic control at the varied sampling levels of SNPs are depicted in Figure 3 for Optim-GRAMMAR, GRAMMAR, GMMAT and SAIGE. Because LTMLM cannot sample SNPs, it was not included in the comparison. No competing method stably controlled the positive/negative false errors using less than 25 K sampling SNPs, besides SAIGE for human phenotypes. Specifically, GMMAT gradually controlled the positive false errors as the sampling markers increased; GRAMMAR controlled the negative false rate by sampling less markers, while SAIGE produced serious false negative errors in the complex maize population. In comparison, Optim-GRAMMAR still retained a high statistical power to detect QTNs through almost perfect genomic control, even using less than 3000 sampling markers (see Supplementary Figure 3S and Figure 4). In optimizing genomic control (GC), additionally, to avoid genome-wide scan, we also evaluated the GC each round by sampling 10 K, 30 K, 50 K and 100 K SNPs randomly. It was concluded that at those optimal GCs calculated from sampling markers, genome-wide GCs were all closed to or slightly above 1.0 and corresponding Q-Q profiles did not change (results not shown).

Figure 3

Changes in genomic controls with the number of sampling SNPs for Optim-GRAMMAR and the three competing methods. Genomic control is calculated by averaging genome-wide test statistics. The simulated phenotypes are controlled by 40, 200 and 1000 QTNs with the moderate heritability in human and maize.

Open in new tab Download slide

Figure 4

Changes in ROC profiles with the number of sampling SNPs for Optim-GRAMMAR. The simulated phenotypes are controlled by 40, 200, 1000 QTNs with the moderate heritability in human and maize.

Open in new tab Download slide

Application of Optim-GRAMMAR to WTCCC study

We were authorized to re-analyze the Wellcome trust case–control consortium (WTCCC) study 1 [29]. There were the 11 985 cases from six common diseases and 3004 shared controls, genotyped at a total of 490 032 SNPs. For each dataset, a standard quality control (QC) procedure was performed: SNPs with MAFs <0.01 and HWE > 0.05 were excluded, and individuals with missing rates >0.01 were also excluded. After the QC process, the number of samples and SNPs used for generalized mixed model association analyses were 5002 individuals (1998 cases and 3004 controls) and 409 642 SNPs for bipolar disorder (BD), 4992 individuals (1988 cases and 3004 controls) and 409 516 SNPs for coronary artery disease (CAD), 5003 individuals (1999 cases and 3004 controls) and 409,924 SNPs for rheumatoid arthritis (RA), 5005 individuals (2001 cases and 3004 controls) and 409 742 SNPs for hypertension (HT), 5004 individuals (2000 cases and 3004 controls) and 40 9674 SNPs for type I diabetes (T1D), and 5003 individuals (1999 cases and 3004 controls) and 409 805 SNPs for type II diabetes (T2D). All data analyses were performed in a CentOS Linux sever with 2.60 GHz Intel(R) Xeon(R) 40 CPUs E5–2660 v3 and 512 GB memory.

For the six common diseases, we implemented Optim-GRAMMAR using entire genomic markers and 5000 sampling SNPs, respectively, to estimate the GRM. The Q-Q and Manhattan profiles for the six common diseases are depicted in Supplementary Figures 4S and 5S obtained with the Optim-GRAMMAR using a test at once and the four competing methods used in simulations, while in Supplementary Figure 5S with the Optim-GRAMMAR using joint association analyses. The association analyses illustrated that (1) under perfect genomic control, Optim-GRAMMAR found the QTNs for each disease on each chromosome, and detected slightly more the QTNs than all the competing methods, except for GMMAT, of the detected QTNs, most were overlapped among the competing models; and (2) in Optim-GRAMMAR, joint association analyses detected more QTNs than a test at once, and several new genes were found to be associated with the analyzed diseases by gene annotation. As compared to Optim-GRAMMAR, GRAMMAR detected less QTNs with the lowest genomic control among all the methods, while GMMAT yielded more SNPs whose -log(p) exceeded the Bonferroni corrected thresholds for CAD, T1D, T2D and HT, but it produced the highest false positive errors (genomic controls). Additionally, LTMLM estimated the abnormal genomic heritabilities for CAD, BD, T2D and HT, with unstable genomic controls.

Further, we conducted strict QC for each dataset, as done in [16] for estimating genomic heritability. Despite this, the missing heritabilities could not be normally estimated for BD and HT. As shown in Supplementary Figures 6S and 7S, all methods exhibited clear and comparable association results, except for GRAMMAR. Interestingly, both LTMLM and GMMAT seriously underestimated the genomic heritability for each disease after strict QC. In summary, Optim-GRAMMAR could efficiently and robustly map QTNs for binary diseases and did not depend on estimation of genomic heritability and QC for genomic datasets.

Computational costs (time complexities) were summarized in Table 1 for the five competing association methods. Theoretically, Optim-GRAMMAR and SEIGE had distinct advantage over other GLMM-based association methods in computational complexity. For each dataset with standard QC, we recorded the running times from input of genotypes and phenotype to output of mapping QTNs for all the methods. Table 2 shows that Optim-GRAMMAR reduced the computing time by dozens of times with the lowest memory footprint.

Table 1

Open in new tab

Computational costs of the five competing association methods

Method	Calculating GRM	Estimating heritability	Estimating GBVs	Association test
GMMAT	O(mn²)	O(in³)		O(m²n)
LTMLM	O(mn²)	O(in³)	O(smn)	O(m²n)
SAIGE		O(il²n)	O(iln)	O(imn)
GRAMMAR	O(mn²)	O(in³)	O(mn)	O(imn)
Optim-GRAMMAR	O(k²n)		O(ikn)	O(imn)

Method	Calculating GRM	Estimating heritability	Estimating GBVs	Association test
GMMAT	O(mn²)	O(in³)		O(m²n)
LTMLM	O(mn²)	O(in³)	O(smn)	O(m²n)
SAIGE		O(il²n)	O(iln)	O(imn)
GRAMMAR	O(mn²)	O(in³)	O(mn)	O(imn)
Optim-GRAMMAR	O(k²n)		O(ikn)	O(imn)

k and l are numbers of sampling markers for Optim-GRAMMAR and SAIGE, respectively; s and i are rounds of Bayesian sampling for LTMLM and iterative solving for GLM or GLMM.

Table 1

Open in new tab

Computational costs of the five competing association methods

Method	Calculating GRM	Estimating heritability	Estimating GBVs	Association test
GMMAT	O(mn²)	O(in³)		O(m²n)
LTMLM	O(mn²)	O(in³)	O(smn)	O(m²n)
SAIGE		O(il²n)	O(iln)	O(imn)
GRAMMAR	O(mn²)	O(in³)	O(mn)	O(imn)
Optim-GRAMMAR	O(k²n)		O(ikn)	O(imn)

Method	Calculating GRM	Estimating heritability	Estimating GBVs	Association test
GMMAT	O(mn²)	O(in³)		O(m²n)
LTMLM	O(mn²)	O(in³)	O(smn)	O(m²n)
SAIGE		O(il²n)	O(iln)	O(imn)
GRAMMAR	O(mn²)	O(in³)	O(mn)	O(imn)
Optim-GRAMMAR	O(k²n)		O(ikn)	O(imn)

k and l are numbers of sampling markers for Optim-GRAMMAR and SAIGE, respectively; s and i are rounds of Bayesian sampling for LTMLM and iterative solving for GLM or GLMM.

Table 2

Open in new tab

Computing time/memory footprint (Min/Gb) of the four competing methods for the six diseases

Disease	LTMLM	SAIGE	GMMAT	Optim-GRAMMAR^a	Optim-GRAMMAR^b
CAD	26.14/33.1	15.60/1.2	6.78/2.1	2.47/0.4	0.91/0.6
RA	27.46/33.1	15.19/1.2	6.82/2.1	2.56/0.4	0.96/0.6
T1D	27.31/33.1	14.16/1.2	6.78/2.1	2.38/0.4	0.93/0.6
T2D	25.83/33.1	15.22/1.2	6.54/2.1	2.41/0.4	0.94/0.6
HT	28.61/33.1	14.67/1.2	6.57/2.1	3.02/0.4	0.89/0.6
BD	26.88/33.1	14.85/1.2	6.74/2.1	2.04/0.4	0.91/0.6

Disease	LTMLM	SAIGE	GMMAT	Optim-GRAMMAR^a	Optim-GRAMMAR^b
CAD	26.14/33.1	15.60/1.2	6.78/2.1	2.47/0.4	0.91/0.6
RA	27.46/33.1	15.19/1.2	6.82/2.1	2.56/0.4	0.96/0.6
T1D	27.31/33.1	14.16/1.2	6.78/2.1	2.38/0.4	0.93/0.6
T2D	25.83/33.1	15.22/1.2	6.54/2.1	2.41/0.4	0.94/0.6
HT	28.61/33.1	14.67/1.2	6.57/2.1	3.02/0.4	0.89/0.6
BD	26.88/33.1	14.85/1.2	6.74/2.1	2.04/0.4	0.91/0.6

^aRepresents using entire markers; ^bSampling markers for Optim-GRAMMAR. All computations were performed on a computing platform with a 3.20 GHz Intel(R) Core(R) 6-core CPU i7 8700 and 64 GB memory.

Table 2

Open in new tab

Computing time/memory footprint (Min/Gb) of the four competing methods for the six diseases

Disease	LTMLM	SAIGE	GMMAT	Optim-GRAMMAR^a	Optim-GRAMMAR^b
CAD	26.14/33.1	15.60/1.2	6.78/2.1	2.47/0.4	0.91/0.6
RA	27.46/33.1	15.19/1.2	6.82/2.1	2.56/0.4	0.96/0.6
T1D	27.31/33.1	14.16/1.2	6.78/2.1	2.38/0.4	0.93/0.6
T2D	25.83/33.1	15.22/1.2	6.54/2.1	2.41/0.4	0.94/0.6
HT	28.61/33.1	14.67/1.2	6.57/2.1	3.02/0.4	0.89/0.6
BD	26.88/33.1	14.85/1.2	6.74/2.1	2.04/0.4	0.91/0.6

Disease	LTMLM	SAIGE	GMMAT	Optim-GRAMMAR^a	Optim-GRAMMAR^b
CAD	26.14/33.1	15.60/1.2	6.78/2.1	2.47/0.4	0.91/0.6
RA	27.46/33.1	15.19/1.2	6.82/2.1	2.56/0.4	0.96/0.6
T1D	27.31/33.1	14.16/1.2	6.78/2.1	2.38/0.4	0.93/0.6
T2D	25.83/33.1	15.22/1.2	6.54/2.1	2.41/0.4	0.94/0.6
HT	28.61/33.1	14.67/1.2	6.57/2.1	3.02/0.4	0.89/0.6
BD	26.88/33.1	14.85/1.2	6.74/2.1	2.04/0.4	0.91/0.6

^aRepresents using entire markers; ^bSampling markers for Optim-GRAMMAR. All computations were performed on a computing platform with a 3.20 GHz Intel(R) Core(R) 6-core CPU i7 8700 and 64 GB memory.

Discussion

Development of the GRAMMAR for GLMM association was an essential prerequisite for extending the Optim-GRAMMAR to rapidly optimize mixed model association analysis for binary traits. For genomic GLMM, however, no binary residuals could be produced because of the scale difference between binary phenotype and predictors. Thus, we considered the GBVs estimated in advance as a known predictor in genomic logit regression and then executed association tests for candidate markers. This ensured that GRAMMAR had the lowest computing complexity for association tests for binary traits among the existing GLMM-based association methods [17–19]. Because GRAMMAR for binary diseases produces high false negative rates for quantitative traits distributed normally, we optimized the genomic control for GRAMMAR by regulating downward genomic heritability to underestimate the GBVs with GBLUP equations for GLMM. Thus, Optim-GRAMMAR solved the GBLUP equations and performed association tests with simple logit regression only for several iterations, thus improving the computational efficiency for genome-wide GLMM association analysis.

Several GLMM-based association methods such LTMLM, GMMAT and SAIGE have simplified genome-wide mixed model association analysis for binary traits to a certain extent, but they are more appropriate to handle the less complex populations such as human datasets [18, 19]. Moreover, the heritability for binary diseases could not be robustly and precisely estimated using genomic markers [14–16], which also limited the efficient application of these association methods. In contrast, because Optim-GRAMMAR does not need to directly estimate genomic heritability, it can powerfully and robustly map QTNs for binary traits in complexly structured populations. Within the framework of Optim-GRAMMAR, further, joint analysis for the candidate quantitative trait nucleotides chosen by multiple testing significantly improved statistical power to detect QTNs with almost perfect genomic control.

The Optim-GRAMMAR extremely simplified genome-wide GLMM association analysis for binary traits in large-scale population. For a genomic dataset containing m SNPs genotyped on n individuals, Optim-GRAMMAR for binary traits took only the computing complexity of O(mn²) to build the GRM and O(imn) for association tests with i rounds to optimize genomic controls. Especially, the inverse of GRM could not be calculated in solving the mixed model equations [4]. When number of sampling markers (k) used to calculate the GRM was far less than n, we could solve genetic effects of the sampled markers using ridge regression [30]:

$$ \left[\begin{array}{@{}cc@{}}{\mathbf{X}}^{\mathrm{T}}\mathbf{W}\mathbf{X}& {\mathbf{X}}^{\mathrm{T}}\mathbf{W}\mathbf{Z}\\{}{\mathbf{Z}}^{\mathrm{T}}\mathbf{W}\mathbf{X}& {\mathbf{Z}}^{\mathrm{T}}\mathbf{W}\mathbf{Z}+\frac{k-{h}^2}{h^2}\end{array}\right]\left[\begin{array}{@{}c@{}}\hat{\mathbf{b}}\\{}\hat{\mathbf{a}}\end{array}\right]=\left[\begin{array}{@{}c@{}}{\mathbf{X}}^{\mathrm{T}}\mathbf{W}{\mathbf{y}}^{\ast}\\{}{\mathbf{Z}}^{\mathrm{T}}\mathbf{W}{\mathbf{y}}^{\ast}\end{array}\right], $$

and then estimated GBVs by |${\mathbf{Z}}^{\mathrm{T}}\hat{\mathbf{a}}$|⁠. As a result, the computing complexity to build the information matrix would reduce to O(k²n), as in FaST-LMM-Select [31]. At the same time, if we evaluated genomic control using m₁ sampled markers at each iteration, then the computing complexity for association tests would reduce to O(im₁n). For the simulated 8 million SNPs on 400 000 individuals based on genomic data in UK Biobank [32], Optim-GRAMMAR required only 5.1 h to analyze single binary phenotype by sampling 5000 SNPs to calculate GRM, 0.9 h of which to build the information matrix and optimize genomic controls, while SAIGE did about 534 h ¹⁵, and GMMAT did not work for such a large dataset.

Data availability

The datasets can be available at https://www.panzea.org/%21#genotypes/cctl for maize, and http://www.wtccc.org.uk/ with the authorization for human genomic data.

Key Points

GRAMMAR is extended to handle binary diseases by considering genomic breeding values as a known predictor in logit regression.
The Optim-GRAMMAR can effectively correct confounders with polygenic effects in association tests, controlling false negative errors.
Within the framework of Optim-GRAMMAR, joint association analysis greatly improves statistical power to detect QTNs.
The Optim-GRAMMAR can handle large-scale data by using sampling markers to evaluate polygenic effects and genomic controls.

Funding

National Key R&D Program of China (2018YFD0900201); National Natural Science Foundations of China (32072726); Special Scientific Research Funds for Central Non-profit Institutes, Chinese Academy of Fishery Sciences (2019ZY09).

Yuxin Song is currently a PhD student at Wuxi Fisheries College, Nanjing Agricultural University.

Li'ang Yang is a teaching assistant at College of Life Science, Northeast Agricultural University. Her research focuses on animal breeding and genetics.

Li Jiang is an associate professor at Research Centre for Aquatic biotechnology, Chinese Academy of Fishery Sciences. He is interested in breeding and genetics in aquaculture.

Zhiyu Hao is a research assistant at Institute of Animal Husbandry, Heilongjiang Academy of Agricultural Sciences. His research focuses on genome-wide association study.

Runqing Yang is a professor at Research Centre for Aquatic biotechnology, Chinese Academy of Fishery Sciences. He is interested in developing statistical methodology of genome-wide association study.

Pao Xu is a professor at Wuxi Fisheries College, Nanjing Agricultural University. His research focuses on breeding and genetics in aquaculture.

References

1.

Bulmer

MG

.

The effect of selection on genetic variability

.

Am Nat

1971

;

105

:

201

–

11

.

Google Scholar

Crossref

WorldCat

2.

Falconer

DS

.

Introduction to Quantitative Genetics

, 2nd edn.

London

:

Longman

,

1981

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

3.

Yu

JM

,

Pressoir

G

,

Briggs

WH

, et al.

A unified mixed-model method for association mapping that accounts for multiple levels of relatedness

.

Nat Genet

2006

;

38

:

203

–

8

.

4.

Henderson

CR

.

Applications of Linear Models in Animal Breeding

.

Guelph

:

University of Guelph

,

1984

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

5.

Wedderburn

RWM

.

Quasi-likelihood functions, generalized linear models, and the gauss-newton method

.

Biometrika

1974

;

61

:

439

–

47

.

Google Scholar

OpenURL Placeholder Text

WorldCat

6.

McCullagh

P

,

Nelder

JA

.

Generalized Linear Models

, 2nd edn.

New York

:

Chapman and Hall

,

1989

.

7.

Chang

CC

,

Chow

CC

,

Tellier

LC

, et al.

Second-generation PLINK: rising to the challenge of larger and richer datasets

.

Gigascience

2015

;

4

:

7

.

8.

Mefford

J

,

Witte

JS

.

The Covariate's dilemma

.

PLoS Genet

2012

;

8

:

e1003096

.

9.

Zaitlen

N

,

Lindstrom

S

,

Pasaniuc

B

, et al.

Informed conditioning on clinical covariates increases power in case-control association studies

.

PLoS Genet

2012

;

8

:

e1003032

.

10.

Zaitlen

N

,

Pasaniuc

B

,

Patterson

N

, et al.

Analysis of case-control association studies with known risk variants

.

Bioinformatics

2012

;

28

:

1729

–

37

.

11.

Breslow

NE

,

Clayton

DG

.

Approximate inference in generalized linear mixed models

.

J Am Stat Assoc

1993

;

88

:

9

–

25

.

Google Scholar

Crossref

WorldCat

12.

Patterson

HD

,

Thompson

R

.

Recovery of inter-block information when block sizes are unequal

.

Biometrika

1971

;

58

:

545

–

54

.

Google Scholar

Crossref

WorldCat

13.

Sorenrsen

D

,

Gianola

D

.

Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics

.

New York

:

Springer

,

2002

.

14.

Schall

R

.

Estimation in generalized linear models with random effects

.

Biometrika

1991

;

78

:

719

–

27

.

Google Scholar

Crossref

WorldCat

15.

Gilmour

AR

,

Anderson

RD

,

Rae

AL

.

The analysis of binomial data by a generalized linear mixed model

.

Biometrika

1985

;

72

:

593

–

9

.

Google Scholar

Crossref

WorldCat

16.

Lee

SH

,

Wray

NR

,

Goddard

ME

, et al.

Estimating missing heritability for disease from genome-wide association studies

.

Am J Hum Genet

2011

;

88

:

294

–

305

.

17.

Hayeck

TJ

,

Zaitlen

NA

,

Loh

PR

, et al.

Mixed model with correction for case-control ascertainment increases association power

.

Am J Hum Genet

2015

;

96

:

720

–

30

.

18.

Chen

H

,

Wang

C

,

Conomos

MP

, et al.

Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models

.

Am J Hum Genet

2016

;

98

:

653

–

66

.

19.

Zhou

W

,

Nielsen

JB

,

Fritsche

LG

, et al.

Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies

.

Nat Genet

2018

;

50

:

1335

–

41

.

20.

Kang

HM

,

Sul

JH

,

Service

SK

, et al.

Variance component model to account for sample structure in genome-wide association studies

.

Nat Genet

2010

;

42

:

348

–

54

.

21.

Loh

PR

,

Tucker

G

,

Bulik-Sullivan

BK

, et al.

Efficient Bayesian mixed-model analysis increases association power in large cohorts

.

Nat Genet

2015

;

47

:

284

–

90

.

22.

Aulchenko

YS

,

de

Koning

DJ

,

Haley

C

.

Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis

.

Genetics

2007

;

177

:

577

–

85

.

23.

Vanraden

PM

,

Tassell

CP

,

Van Wiggans

GR

, et al.

Invited review: reliability of genomic predictions for north American Holstein bulls

.

J Dairy Sci

2009

;

92

:

16

–

24

.

24.

Vanraden

PM

.

Efficient methods to compute genomic predictions

.

J Dairy Sci

2008

;

91

:

4414

–

23

.

25.

Brent

RP

.

Algorithms for Minimization Without Derivatives

.

New Jersey

:

Prentice-Hall

,

1973

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

26.

Price

AL

,

Zaitlen

NA

,

David

R

, et al.

New approaches to population stratification in genome-wide association studies

.

Nat Rev Genet

2010

;

11

:

459

–

63

.

27.

Hochberg

Y

,

Tamhane

AC

.

Multiple Comparison Procedures

.

New York

:

John Wiley & Sons, Inc.

,

1987

.

28.

Romay

MC

,

Millard

MJ

,

Glaubitz

JC

, et al.

Comprehensive genotyping of the USA national maize inbred seed bank

.

Genome Biol

2013

;

14

:

R55

.

29.

Consortium

WTCC

.

Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls

.

Nature

2007

;

447

:

661

–

78

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

30.

Hoerl

AE

,

Kennard

RW

.

Ridge regression: biased estimation for nonorthogonal problems

.

Dent Tech

1970

;

12

:

55

–

67

.

Google Scholar

OpenURL Placeholder Text

WorldCat

31.

Jennifer

L

,

Christoph

L

,

Kadie

CM

, et al.

Improved linear mixed models for genome-wide association studies

.

Nat Methods

2012

;

9

:

525

–

6

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

32.

Jiang

L

,

Zheng

Z

,

Qi

T

, et al.

A resource-efficient tool for mixed model association analysis of large-scale data

.

Nat Genet

2019

;

51

:

1749

–

55

.

Author notes

Yuxin Song and Li’ang Yang authors have contributed equally to this study.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Download all slides

Month:	Total Views:
October 2021	59
November 2021	38
December 2021	14
January 2022	13
February 2022	16
March 2022	19
April 2022	5
May 2022	9
June 2022	12
July 2022	10
August 2022	6
September 2022	4
October 2022	11
November 2022	6
December 2022	9
January 2023	4
February 2023	6
March 2023	18
April 2023	5
May 2023	7
June 2023	7
July 2023	12
August 2023	3
October 2023	7
December 2023	6
January 2024	39
February 2024	15
March 2024	10
April 2024	18
May 2024	26
June 2024	13
July 2024	20
August 2024	22
September 2024	19
October 2024	16
November 2024	16
December 2024	24
January 2025	7
February 2025	21
March 2025	28
April 2025	13
May 2025	4

Article Contents

Optimizing genomic control in mixed model associations with binary diseases

Abstract

Introduction

Methods

Genomic logit regression

GRAMMAR for binary diseases

Optimization of genomic control

Joint association analysis

Simulations

Results

Statistical properties of Optim-GRAMMAR for binary traits

Calculation of GRMs and GCs with the sampling markers

Application of Optim-GRAMMAR to WTCCC study

Discussion

Data availability

Funding

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

Optimizing genomic control in mixed model associations with binary diseases Free

Abstract

Introduction

Methods

Genomic logit regression

GRAMMAR for binary diseases

Optimization of genomic control

Joint association analysis

Simulations

Results

Statistical properties of Optim-GRAMMAR for binary traits

Calculation of GRMs and GCs with the sampling markers

Application of Optim-GRAMMAR to WTCCC study

Discussion

Data availability

Funding

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only

Optimizing genomic control in mixed model associations with binary diseases