Abstract

Complex computation and approximate solution hinder the application of generalized linear mixed models (GLMM) into genome-wide association studies. We extended GRAMMAR to handle binary diseases by considering genomic breeding values (GBVs) estimated in advance as a known predictor in genomic logit regression, and then reduced polygenic effects by regulating downward genomic heritability to control false negative errors produced in the association tests. Using simulations and case analyses, we showed in optimizing GRAMMAR, polygenic effects and genomic controls could be evaluated using the fewer sampling markers, which extremely simplified GLMM-based association analysis in large-scale data. Further, joint association analysis for quantitative trait nucleotide (QTN) candidates chosen by multiple testing offered significant improved statistical power to detect QTNs over existing methods.

Introduction

Although usually expressed as binary phenotypes, many disease traits in plants and animals are thought to be controlled by a number of loci each having a small effect [1, 2]. Thus, random polygenic effects excluding the tested markers should be considered in genome-wide association study (GWAS) for disease traits to correct the population stratification and cryptic relatedness, as linear mixed model (LMM) does for quantitative traits [3, 4]. Logistic regression, as a kind of generalized linear model (GLM) [5, 6], has been used earlier for genome-wide association analysis with the disease traits [7]. Despite correction for fixed-effect covariates [8–10], logistic regression still produces inflation of association test statistics. Therefore, it is necessary to introduce a generalized linear mixed model (GLMM) [11] that considers random polygenic effects to increase the power to map quantitative trait nucleotides (QTNs) for disease traits.

However, genome-wide GLMM association consumes much more computing time than mixed model association with either restricted maximum likelihood estimation (REML) [12] or Markov Chain Monte Carlo (MCMC) iteration [13]. If using the maximum likelihood in estimation and approximations to avoid numerical integration [14], the GLMM yields a problem of serious bias induced by the approximations [15], especially solutions tending toward the positive/negative infinity. In the GWAS for case–control studies, a non-random sampling of cases from the population results in biased estimation for genomic heritability. Analyzed by the LMM for binary phenotypes, genomic heritability estimate is transformed to a liability scale by adjusting both for scale and for ascertainment of the case samples [16]. The genomic heritability of liability is estimated in a biased manner for disease traits, even though it is done by GLMM via MCMC iteration. Based on the calibrated genomic hereditability for case–control ascertainment, a Chi-squared score statistic for GWAS of disease traits is computed from posterior mean liabilities (PMLs) under the liability-threshold model [17]. The individual PMLs are estimated with a multivariate Gibbs sampler, which increases the computational demand. For simplification to GLMM-based association analysis, the GMMAT [18] and SAIGE [19] separately extend the EMMAX [20] and BOLT-LMM [21], respectively, for normally distributed traits to binary phenotypes.

Owing to the computational intensity and approximate solutions obtained, GLMM can hardly be employed in GWAS for disease traits. Moreover, genomic heritability cannot be accurately estimated for complex diseases, especially in ascertained case–control studies. For simplification to genome-wide mixed model association analysis for quantitative traits normally distributed, GRAMMAR [22] replaced polygenic effects excluding the tested marker with the genomic breeding values (GBVs) estimated under null mixed model, and then associated phenotype residuals removing GBVs with genetic markers by least square method in the lowest computational complexity. However, removing partial marker effects from phenotype residuals made GRAMMAR deflating association test statistics and decreasing statistical power to detect QTNs. In this study, we extended GRAMMAR to handle binary traits by considering GBVs estimated in advance as a known predictor in genomic logit regression. Further, we optimized the genomic control for GRAMMAR for binary traits by regulating the downward genomic heritability to estimate the residual phenotypes. The complicated GLMM does not need to be directly solved by the Optim-GRAMMAR for binary traits, and it only repeatedly estimates GBVs with genomic best linear unbiased prediction (GBLUP) [23] for GLMM, achieving genome-wide GLMM association analysis rapidly. Finally, we jointly analyzed the QTN candidates chosen by multiple testing to improve the statistical power to detect QTNs.

Methods

Genomic logit regression

Complex disease traits, as binary ones, usually follow binomial or Poisson distributions, so the generalized linear model (GLM) [5, 6] is used to map QTLs controlling the traits. Assume that n individuals are recorded for phenotypic values and genotyped for m genetic markers. Distinguishing these markers from major and common alleles in a magnitude of effects of the markers on quantitative traits, we describe the relationship between all markers (predictors) and the mean of the exponential distribution family in the following logit regression:
(1)
where |$\mu $| denotes the expectations of phenotypic distribution, b is the systematic environment effect; the population structure (stratification), which results in phenotypic differences among subpopulations is always considered here, except for sex, age and some initial experimental conditions. a1 is the large genetic effect of q markers on phenotype, a2 is the minor or zero effect of the m-q markers on phenotype, and X, Z1 and Z2 are the corresponding design matrices of b, a1 and a2, respectively.

GRAMMAR for binary diseases

We define the GBVs as
(2)
Then, model (1) becomes
(3)
where a is the GLMM [11]. Under the assumption that |$({\mathbf{a}}_{\mathbf{1}},{\mathbf{a}}_{\mathbf{2}})\sim{N}_m(\mathbf{0},\mathbf{I}{\sigma}_a^2)$| with minor |${\sigma}_a^2$| for each marker, the GBVs are turned into random effects and |$\mathbf{g}\sim{N}_n(\mathbf{0},\mathbf{K}{\sigma}_g^2)$| with genomic variance of traits |${\sigma}_g^2=m{\sigma}_a^2$| and the genomic relationship matrix (GRM) K [24]. Without need to invert GRM, the GBVs can be estimated with the following GBLUP equations [4]
(4)
with
where y is a binary phenotype, and |${h}^2=\frac{\sigma_g^2}{\sigma_g^2+1}$| is the unknown genomic heritability of liability with the residual variance of 1 to be assumed in GLMM, which can be estimated in advance using the REML for GLMM [11, 18].
Unlike the normally distributed quantitative traits, the residuals for binary traits cannot be directly obtained due to the difference in scale between the predictors and response variables. Within the framework of GRAMMAR, thus, we eliminated polygenic effects on the binary phenotype by regarding the estimated GBVs |$\hat{\mathbf{g}}$| as a known predictor in the following GLM:
(5)
with a regression item |${\mathbf{z}}_{\mathrm{SNP}}{a}_{\mathrm{SNP}}$| of the SNP tested.
With the iteratively re-weighted least square method [5], we obtained the maximum likelihood estimate for the SNP effect as:
(6)
where W is calculated based on model (5).
The test statistic to infer the association of the SNP with binary traits is generally formulated by
(7)
which is subject to the chi-squared distribution with the 1 degree of freedom.

Optimization of genomic control

In GRAMMAR for binary traits, replacement of polygenic effects excluding QTNs with GBVs deflates the test association statistics, which yields a high false-negative rate. By regulating the downward genomic heritability, we can more accurately estimate the polygenic effects with the GBLUP Equation (4). The polygenic heritability less than genomic heritability is determined by optimizing genomic control for association tests. Such an optimization for GRAMMAR can be summarized in the following steps:

  • (1) set the searching open interval of h2 to (0, 1);

  • (2) estimate the GBVs |$\hat{\mathbf{g}}$| using Equation (4);

  • (3) statistically infer the genetic effect for each SNP by the chi-squared statistic (7);

  • (4) calculate the genome-wide chi-squared mean or statistical probability for each SNP;

  • (5) plot the quantile-quantile (Q-Q) profile for genome-wide statistical probabilities;

  • (6) update h2 with Brent’s method [25];

  • (7) repeat step (2)–(6) until the genome-wide chi-squared mean reaches 1.0 plus or yields a satisfactory Q-Q plot, as proposed in [26].

Joint association analysis

After optimizing genomic control for GRAMMAR, we jointly analyzed multiple QTN candidates to improve the statistical power to detect QTNs for binary disease traits. Multiple QTN candidates can be chosen within the interval of significance level 0.05 to the Bonferroni corrected criterion [27], but the number of QTN candidates should be limited to be no greater than the population size. Taking the polygenic effects estimated above as a known predictor, we adopted backward regression to select the QTNs from the following multiple GLM:
(8)

Given the Bonferroni corrected significance level, non-significant QTN effects would be eliminated stepwisely from the model (8) according to the corrected statistic (7).

Optimizing genomic control in mixed model associations with binary diseases was written in C language on R platform. A user-friendly Optim-GRAMMAR software with the options of a test at once and joint association analysis was developed, which is freely available at https://github.com/RunKingProgram/Optim-GRAMMAR.

Simulations

Two genomic datasets of human [28] and maize [11] samples were used to simulate the adaptability of GRAMMAR for binary traits to population structure. The maize population has a more complex structure than the human population. Then, 300 000 SNPs for both 12 000 people and 2640 maize were extracted through higher quality control. In whole simulations, control and case samples were constrained to 1:1 for the maize population, and 3000 cases were selected from the human population with low incidence rate of 5% simulated in advance. QTNs were distributed randomly over the entire SNPs, whose additive effects were sampled from a gamma distribution with shape = 1.66 and scale = 0.4. Given the genomic heritability of liability, phenotypes of control (0) and case (1) can be generated from the genomic logit model (1).

In addition to population structure, the number of QTNs, genome heritability and sampling number of SNPs were considered as experimental factors in the simulations. Under the optimized genomic control infinitely close to 1.0, the ROC profiles can be plotted by statistical powers to detect the QTNs relative to a given series of Type I errors. Statistical powers are defined as the percentage of identified QTNs that have the maximum test statistic among their 20 closest neighbors over the total number of simulated QTNs. Simulations were repeated 50 times, and in each simulation, the positions and effects of QTNs simulated were varied and the average results were recorded.

Results

Statistical properties of Optim-GRAMMAR for binary traits

Based on the two genomic datasets, we simulated phenotypes controlled by 40, 200 and 1000 QTNs at the low (0.2), moderate (0.5) and high (0.8) genomic heritability, respectively. The statistical properties of Optim-GRAMMAR using a test at once for binary traits were investigated by comparing it with GRAMMAR, GMMAT, LTMLM and SAIGE. The Q-Q and ROC profiles are displayed selectively in Figures 1 and 2, respectively, and in Supplementary Figures 1S and 2S, respectively, in detail. The genomic controls are estimated in Supplementary Table 1S. Making genomic control infinitely close to 1.0, Optim-GRAMMAR achieved almost the same statistical power to detect QTNs as the GMMAT, which approximates the exact GLMM, irrespective of how many QTNs and heritabilities are simulated. Among Optim-GRAMMAR and the four competing methods, GRAMMAR had the lowest genomic controls and statistical power, and for GRAMMAR, the population structure was more complex and the false negative rate was larger. Although LTMLM achieved the highest statistical power to detect QTNs for all simulated phenotypes for the maize dataset, and SAIGE demonstrated a higher statistical power for the dataset controlled by 1000 QTNs at the genomic heritability of 0.2. A strong false positive error rate was observed for SAIGE. In the human dataset, there were no distinct differences in the statistical properties between GRAMMAR-lambda and the four competing methods, although GRAMMAR provided some false negative errors.

Comparison in the Q-Q profiles between Optim-GRAMMAR and the four competing methods. The simulated phenotypes are controlled by 200 QTNs with the low, moderate and high heritabilities in human and maize. The Q-Q profiles for all simulated phenotypes are reported in Supplementary Figure 1S.
Figure 1

Comparison in the Q-Q profiles between Optim-GRAMMAR and the four competing methods. The simulated phenotypes are controlled by 200 QTNs with the low, moderate and high heritabilities in human and maize. The Q-Q profiles for all simulated phenotypes are reported in Supplementary Figure 1S.

Comparison in the ROC profiles between Optim-GRAMMAR and the four competing methods. The ROC profiles are plotted using the statistical powers to detect QTNs relative to the given series of Type I errors. Here, the simulated phenotypes are controlled by 200 QTNs with the low, moderate and high heritabilities in human and maize. The ROC profiles for all simulated phenotypes are reported in Supplementary Figure 2S.
Figure 2

Comparison in the ROC profiles between Optim-GRAMMAR and the four competing methods. The ROC profiles are plotted using the statistical powers to detect QTNs relative to the given series of Type I errors. Here, the simulated phenotypes are controlled by 200 QTNs with the low, moderate and high heritabilities in human and maize. The ROC profiles for all simulated phenotypes are reported in Supplementary Figure 2S.

After optimization for genomic control, Optim-GRAMMAR jointly analyzed multiple QTN candidates chosen from a test at once at a significance level of 0.05. For convenience for comparison, we analyzed the statistical powers obtained with one test at a time and joint analyses together. By backward regression analysis, Optim-GRAMMAR evidently exhibited improved statistical power. In contrast, LTMLM was inferior to joint analysis of Optim-GRAMMAR in the terms of statistical power, even with the highest false positive rates.

Calculation of GRMs and GCs with the sampling markers

To investigate the effects of sampling markers on Optim-GRAMMAR, we randomly drawn 3 K, 5 K, 10 K, 20 K and 25 K SNPs from the entire genomic markers to calculate GRM. Changes in the genomic control at the varied sampling levels of SNPs are depicted in Figure 3 for Optim-GRAMMAR, GRAMMAR, GMMAT and SAIGE. Because LTMLM cannot sample SNPs, it was not included in the comparison. No competing method stably controlled the positive/negative false errors using less than 25 K sampling SNPs, besides SAIGE for human phenotypes. Specifically, GMMAT gradually controlled the positive false errors as the sampling markers increased; GRAMMAR controlled the negative false rate by sampling less markers, while SAIGE produced serious false negative errors in the complex maize population. In comparison, Optim-GRAMMAR still retained a high statistical power to detect QTNs through almost perfect genomic control, even using less than 3000 sampling markers (see Supplementary Figure 3S and Figure 4). In optimizing genomic control (GC), additionally, to avoid genome-wide scan, we also evaluated the GC each round by sampling 10 K, 30 K, 50 K and 100 K SNPs randomly. It was concluded that at those optimal GCs calculated from sampling markers, genome-wide GCs were all closed to or slightly above 1.0 and corresponding Q-Q profiles did not change (results not shown).

Changes in genomic controls with the number of sampling SNPs for Optim-GRAMMAR and the three competing methods. Genomic control is calculated by averaging genome-wide test statistics. The simulated phenotypes are controlled by 40, 200 and 1000 QTNs with the moderate heritability in human and maize.
Figure 3

Changes in genomic controls with the number of sampling SNPs for Optim-GRAMMAR and the three competing methods. Genomic control is calculated by averaging genome-wide test statistics. The simulated phenotypes are controlled by 40, 200 and 1000 QTNs with the moderate heritability in human and maize.

Changes in ROC profiles with the number of sampling SNPs for Optim-GRAMMAR. The simulated phenotypes are controlled by 40, 200, 1000 QTNs with the moderate heritability in human and maize.
Figure 4

Changes in ROC profiles with the number of sampling SNPs for Optim-GRAMMAR. The simulated phenotypes are controlled by 40, 200, 1000 QTNs with the moderate heritability in human and maize.

Application of Optim-GRAMMAR to WTCCC study

We were authorized to re-analyze the Wellcome trust case–control consortium (WTCCC) study 1 [29]. There were the 11 985 cases from six common diseases and 3004 shared controls, genotyped at a total of 490 032 SNPs. For each dataset, a standard quality control (QC) procedure was performed: SNPs with MAFs <0.01 and HWE > 0.05 were excluded, and individuals with missing rates >0.01 were also excluded. After the QC process, the number of samples and SNPs used for generalized mixed model association analyses were 5002 individuals (1998 cases and 3004 controls) and 409 642 SNPs for bipolar disorder (BD), 4992 individuals (1988 cases and 3004 controls) and 409 516 SNPs for coronary artery disease (CAD), 5003 individuals (1999 cases and 3004 controls) and 409,924 SNPs for rheumatoid arthritis (RA), 5005 individuals (2001 cases and 3004 controls) and 409 742 SNPs for hypertension (HT), 5004 individuals (2000 cases and 3004 controls) and 40 9674 SNPs for type I diabetes (T1D), and 5003 individuals (1999 cases and 3004 controls) and 409 805 SNPs for type II diabetes (T2D). All data analyses were performed in a CentOS Linux sever with 2.60 GHz Intel(R) Xeon(R) 40 CPUs E5–2660 v3 and 512 GB memory.

For the six common diseases, we implemented Optim-GRAMMAR using entire genomic markers and 5000 sampling SNPs, respectively, to estimate the GRM. The Q-Q and Manhattan profiles for the six common diseases are depicted in Supplementary Figures 4S and 5S obtained with the Optim-GRAMMAR using a test at once and the four competing methods used in simulations, while in Supplementary Figure 5S with the Optim-GRAMMAR using joint association analyses. The association analyses illustrated that (1) under perfect genomic control, Optim-GRAMMAR found the QTNs for each disease on each chromosome, and detected slightly more the QTNs than all the competing methods, except for GMMAT, of the detected QTNs, most were overlapped among the competing models; and (2) in Optim-GRAMMAR, joint association analyses detected more QTNs than a test at once, and several new genes were found to be associated with the analyzed diseases by gene annotation. As compared to Optim-GRAMMAR, GRAMMAR detected less QTNs with the lowest genomic control among all the methods, while GMMAT yielded more SNPs whose -log(p) exceeded the Bonferroni corrected thresholds for CAD, T1D, T2D and HT, but it produced the highest false positive errors (genomic controls). Additionally, LTMLM estimated the abnormal genomic heritabilities for CAD, BD, T2D and HT, with unstable genomic controls.

Further, we conducted strict QC for each dataset, as done in [16] for estimating genomic heritability. Despite this, the missing heritabilities could not be normally estimated for BD and HT. As shown in Supplementary Figures 6S and 7S, all methods exhibited clear and comparable association results, except for GRAMMAR. Interestingly, both LTMLM and GMMAT seriously underestimated the genomic heritability for each disease after strict QC. In summary, Optim-GRAMMAR could efficiently and robustly map QTNs for binary diseases and did not depend on estimation of genomic heritability and QC for genomic datasets.

Computational costs (time complexities) were summarized in Table 1 for the five competing association methods. Theoretically, Optim-GRAMMAR and SEIGE had distinct advantage over other GLMM-based association methods in computational complexity. For each dataset with standard QC, we recorded the running times from input of genotypes and phenotype to output of mapping QTNs for all the methods. Table 2 shows that Optim-GRAMMAR reduced the computing time by dozens of times with the lowest memory footprint.

Table 1

Computational costs of the five competing association methods

MethodCalculating GRMEstimating heritabilityEstimating GBVsAssociation test
GMMATO(mn2)O(in3)O(m2n)
LTMLMO(mn2)O(in3)O(smn)O(m2n)
SAIGEO(il2n)O(iln)O(imn)
GRAMMARO(mn2)O(in3)O(mn)O(imn)
Optim-GRAMMARO(k2n)O(ikn)O(imn)
MethodCalculating GRMEstimating heritabilityEstimating GBVsAssociation test
GMMATO(mn2)O(in3)O(m2n)
LTMLMO(mn2)O(in3)O(smn)O(m2n)
SAIGEO(il2n)O(iln)O(imn)
GRAMMARO(mn2)O(in3)O(mn)O(imn)
Optim-GRAMMARO(k2n)O(ikn)O(imn)

k and l are numbers of sampling markers for Optim-GRAMMAR and SAIGE, respectively; s and i are rounds of Bayesian sampling for LTMLM and iterative solving for GLM or GLMM.

Table 1

Computational costs of the five competing association methods

MethodCalculating GRMEstimating heritabilityEstimating GBVsAssociation test
GMMATO(mn2)O(in3)O(m2n)
LTMLMO(mn2)O(in3)O(smn)O(m2n)
SAIGEO(il2n)O(iln)O(imn)
GRAMMARO(mn2)O(in3)O(mn)O(imn)
Optim-GRAMMARO(k2n)O(ikn)O(imn)
MethodCalculating GRMEstimating heritabilityEstimating GBVsAssociation test
GMMATO(mn2)O(in3)O(m2n)
LTMLMO(mn2)O(in3)O(smn)O(m2n)
SAIGEO(il2n)O(iln)O(imn)
GRAMMARO(mn2)O(in3)O(mn)O(imn)
Optim-GRAMMARO(k2n)O(ikn)O(imn)

k and l are numbers of sampling markers for Optim-GRAMMAR and SAIGE, respectively; s and i are rounds of Bayesian sampling for LTMLM and iterative solving for GLM or GLMM.

Table 2

Computing time/memory footprint (Min/Gb) of the four competing methods for the six diseases

DiseaseLTMLMSAIGEGMMATOptim-GRAMMARaOptim-GRAMMARb
CAD26.14/33.115.60/1.26.78/2.12.47/0.40.91/0.6
RA27.46/33.115.19/1.26.82/2.12.56/0.40.96/0.6
T1D27.31/33.114.16/1.26.78/2.12.38/0.40.93/0.6
T2D25.83/33.115.22/1.26.54/2.12.41/0.40.94/0.6
HT28.61/33.114.67/1.26.57/2.13.02/0.40.89/0.6
BD26.88/33.114.85/1.26.74/2.12.04/0.40.91/0.6
DiseaseLTMLMSAIGEGMMATOptim-GRAMMARaOptim-GRAMMARb
CAD26.14/33.115.60/1.26.78/2.12.47/0.40.91/0.6
RA27.46/33.115.19/1.26.82/2.12.56/0.40.96/0.6
T1D27.31/33.114.16/1.26.78/2.12.38/0.40.93/0.6
T2D25.83/33.115.22/1.26.54/2.12.41/0.40.94/0.6
HT28.61/33.114.67/1.26.57/2.13.02/0.40.89/0.6
BD26.88/33.114.85/1.26.74/2.12.04/0.40.91/0.6

aRepresents using entire markers; bSampling markers for Optim-GRAMMAR. All computations were performed on a computing platform with a 3.20 GHz Intel(R) Core(R) 6-core CPU i7 8700 and 64 GB memory.

Table 2

Computing time/memory footprint (Min/Gb) of the four competing methods for the six diseases

DiseaseLTMLMSAIGEGMMATOptim-GRAMMARaOptim-GRAMMARb
CAD26.14/33.115.60/1.26.78/2.12.47/0.40.91/0.6
RA27.46/33.115.19/1.26.82/2.12.56/0.40.96/0.6
T1D27.31/33.114.16/1.26.78/2.12.38/0.40.93/0.6
T2D25.83/33.115.22/1.26.54/2.12.41/0.40.94/0.6
HT28.61/33.114.67/1.26.57/2.13.02/0.40.89/0.6
BD26.88/33.114.85/1.26.74/2.12.04/0.40.91/0.6
DiseaseLTMLMSAIGEGMMATOptim-GRAMMARaOptim-GRAMMARb
CAD26.14/33.115.60/1.26.78/2.12.47/0.40.91/0.6
RA27.46/33.115.19/1.26.82/2.12.56/0.40.96/0.6
T1D27.31/33.114.16/1.26.78/2.12.38/0.40.93/0.6
T2D25.83/33.115.22/1.26.54/2.12.41/0.40.94/0.6
HT28.61/33.114.67/1.26.57/2.13.02/0.40.89/0.6
BD26.88/33.114.85/1.26.74/2.12.04/0.40.91/0.6

aRepresents using entire markers; bSampling markers for Optim-GRAMMAR. All computations were performed on a computing platform with a 3.20 GHz Intel(R) Core(R) 6-core CPU i7 8700 and 64 GB memory.

Discussion

Development of the GRAMMAR for GLMM association was an essential prerequisite for extending the Optim-GRAMMAR to rapidly optimize mixed model association analysis for binary traits. For genomic GLMM, however, no binary residuals could be produced because of the scale difference between binary phenotype and predictors. Thus, we considered the GBVs estimated in advance as a known predictor in genomic logit regression and then executed association tests for candidate markers. This ensured that GRAMMAR had the lowest computing complexity for association tests for binary traits among the existing GLMM-based association methods [17–19]. Because GRAMMAR for binary diseases produces high false negative rates for quantitative traits distributed normally, we optimized the genomic control for GRAMMAR by regulating downward genomic heritability to underestimate the GBVs with GBLUP equations for GLMM. Thus, Optim-GRAMMAR solved the GBLUP equations and performed association tests with simple logit regression only for several iterations, thus improving the computational efficiency for genome-wide GLMM association analysis.

Several GLMM-based association methods such LTMLM, GMMAT and SAIGE have simplified genome-wide mixed model association analysis for binary traits to a certain extent, but they are more appropriate to handle the less complex populations such as human datasets [18, 19]. Moreover, the heritability for binary diseases could not be robustly and precisely estimated using genomic markers [14–16], which also limited the efficient application of these association methods. In contrast, because Optim-GRAMMAR does not need to directly estimate genomic heritability, it can powerfully and robustly map QTNs for binary traits in complexly structured populations. Within the framework of Optim-GRAMMAR, further, joint analysis for the candidate quantitative trait nucleotides chosen by multiple testing significantly improved statistical power to detect QTNs with almost perfect genomic control.

The Optim-GRAMMAR extremely simplified genome-wide GLMM association analysis for binary traits in large-scale population. For a genomic dataset containing m SNPs genotyped on n individuals, Optim-GRAMMAR for binary traits took only the computing complexity of O(mn2) to build the GRM and O(imn) for association tests with i rounds to optimize genomic controls. Especially, the inverse of GRM could not be calculated in solving the mixed model equations [4]. When number of sampling markers (k) used to calculate the GRM was far less than n, we could solve genetic effects of the sampled markers using ridge regression [30]:
and then estimated GBVs by |${\mathbf{Z}}^{\mathrm{T}}\hat{\mathbf{a}}$|⁠. As a result, the computing complexity to build the information matrix would reduce to O(k2n), as in FaST-LMM-Select [31]. At the same time, if we evaluated genomic control using m1 sampled markers at each iteration, then the computing complexity for association tests would reduce to O(im1n). For the simulated 8 million SNPs on 400 000 individuals based on genomic data in UK Biobank [32], Optim-GRAMMAR required only 5.1 h to analyze single binary phenotype by sampling 5000 SNPs to calculate GRM, 0.9 h of which to build the information matrix and optimize genomic controls, while SAIGE did about 534 h 15, and GMMAT did not work for such a large dataset.

Data availability

The datasets can be available at https://www.panzea.org/%21#genotypes/cctl for maize, and http://www.wtccc.org.uk/ with the authorization for human genomic data.

Key Points
  • GRAMMAR is extended to handle binary diseases by considering genomic breeding values as a known predictor in logit regression.

  • The Optim-GRAMMAR can effectively correct confounders with polygenic effects in association tests, controlling false negative errors.

  • Within the framework of Optim-GRAMMAR, joint association analysis greatly improves statistical power to detect QTNs.

  • The Optim-GRAMMAR can handle large-scale data by using sampling markers to evaluate polygenic effects and genomic controls.

Funding

National Key R&D Program of China (2018YFD0900201); National Natural Science Foundations of China (32072726); Special Scientific Research Funds for Central Non-profit Institutes, Chinese Academy of Fishery Sciences (2019ZY09).

Yuxin Song is currently a PhD student at Wuxi Fisheries College, Nanjing Agricultural University.

Li'ang Yang is a teaching assistant at College of Life Science, Northeast Agricultural University. Her research focuses on animal breeding and genetics.

Li Jiang is an associate professor at Research Centre for Aquatic biotechnology, Chinese Academy of Fishery Sciences. He is interested in breeding and genetics in aquaculture.

Zhiyu Hao is a research assistant at Institute of Animal Husbandry, Heilongjiang Academy of Agricultural Sciences. His research focuses on genome-wide association study.

Runqing Yang is a professor at Research Centre for Aquatic biotechnology, Chinese Academy of Fishery Sciences. He is interested in developing statistical methodology of genome-wide association study.

Pao Xu is a professor at Wuxi Fisheries College, Nanjing Agricultural University. His research focuses on breeding and genetics in aquaculture.

References

1.

Bulmer
 
MG
.
The effect of selection on genetic variability
.
Am Nat
 
1971
;
105
:
201
11
.

2.

Falconer
 
DS
.
Introduction to Quantitative Genetics
, 2nd edn.
London
:
Longman
,
1981
.

3.

Yu
 
JM
,
Pressoir
 
G
,
Briggs
 
WH
, et al.  
A unified mixed-model method for association mapping that accounts for multiple levels of relatedness
.
Nat Genet
 
2006
;
38
:
203
8
.

4.

Henderson
 
CR
.
Applications of Linear Models in Animal Breeding
.
Guelph
:
University of Guelph
,
1984
.

5.

Wedderburn
 
RWM
.
Quasi-likelihood functions, generalized linear models, and the gauss-newton method
.
Biometrika
 
1974
;
61
:
439
47
.

6.

McCullagh
 
P
,
Nelder
 
JA
.
Generalized Linear Models
, 2nd edn.
New York
:
Chapman and Hall
,
1989
.

7.

Chang
 
CC
,
Chow
 
CC
,
Tellier
 
LC
, et al.  
Second-generation PLINK: rising to the challenge of larger and richer datasets
.
Gigascience
 
2015
;
4
:
7
.

8.

Mefford
 
J
,
Witte
 
JS
.
The Covariate's dilemma
.
PLoS Genet
 
2012
;
8
:
e1003096
.

9.

Zaitlen
 
N
,
Lindstrom
 
S
,
Pasaniuc
 
B
, et al.  
Informed conditioning on clinical covariates increases power in case-control association studies
.
PLoS Genet
 
2012
;
8
:
e1003032
.

10.

Zaitlen
 
N
,
Pasaniuc
 
B
,
Patterson
 
N
, et al.  
Analysis of case-control association studies with known risk variants
.
Bioinformatics
 
2012
;
28
:
1729
37
.

11.

Breslow
 
NE
,
Clayton
 
DG
.
Approximate inference in generalized linear mixed models
.
J Am Stat Assoc
 
1993
;
88
:
9
25
.

12.

Patterson
 
HD
,
Thompson
 
R
.
Recovery of inter-block information when block sizes are unequal
.
Biometrika
 
1971
;
58
:
545
54
.

13.

Sorenrsen
 
D
,
Gianola
 
D
.
Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics
.
New York
:
Springer
,
2002
.

14.

Schall
 
R
.
Estimation in generalized linear models with random effects
.
Biometrika
 
1991
;
78
:
719
27
.

15.

Gilmour
 
AR
,
Anderson
 
RD
,
Rae
 
AL
.
The analysis of binomial data by a generalized linear mixed model
.
Biometrika
 
1985
;
72
:
593
9
.

16.

Lee
 
SH
,
Wray
 
NR
,
Goddard
 
ME
, et al.  
Estimating missing heritability for disease from genome-wide association studies
.
Am J Hum Genet
 
2011
;
88
:
294
305
.

17.

Hayeck
 
TJ
,
Zaitlen
 
NA
,
Loh
 
PR
, et al.  
Mixed model with correction for case-control ascertainment increases association power
.
Am J Hum Genet
 
2015
;
96
:
720
30
.

18.

Chen
 
H
,
Wang
 
C
,
Conomos
 
MP
, et al.  
Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models
.
Am J Hum Genet
 
2016
;
98
:
653
66
.

19.

Zhou
 
W
,
Nielsen
 
JB
,
Fritsche
 
LG
, et al.  
Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies
.
Nat Genet
 
2018
;
50
:
1335
41
.

20.

Kang
 
HM
,
Sul
 
JH
,
Service
 
SK
, et al.  
Variance component model to account for sample structure in genome-wide association studies
.
Nat Genet
 
2010
;
42
:
348
54
.

21.

Loh
 
PR
,
Tucker
 
G
,
Bulik-Sullivan
 
BK
, et al.  
Efficient Bayesian mixed-model analysis increases association power in large cohorts
.
Nat Genet
 
2015
;
47
:
284
90
.

22.

Aulchenko
 
YS
,
de
 
Koning
 
DJ
,
Haley
 
C
.
Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis
.
Genetics
 
2007
;
177
:
577
85
.

23.

Vanraden
 
PM
,
Tassell
 
CP
,
Van Wiggans
 
GR
, et al.  
Invited review: reliability of genomic predictions for north American Holstein bulls
.
J Dairy Sci
 
2009
;
92
:
16
24
.

24.

Vanraden
 
PM
.
Efficient methods to compute genomic predictions
.
J Dairy Sci
 
2008
;
91
:
4414
23
.

25.

Brent
 
RP
.
Algorithms for Minimization Without Derivatives
.
New Jersey
:
Prentice-Hall
,
1973
.

26.

Price
 
AL
,
Zaitlen
 
NA
,
David
 
R
, et al.  
New approaches to population stratification in genome-wide association studies
.
Nat Rev Genet
 
2010
;
11
:
459
63
.

27.

Hochberg
 
Y
,
Tamhane
 
AC
.
Multiple Comparison Procedures
.
New York
:
John Wiley & Sons, Inc.
,
1987
.

28.

Romay
 
MC
,
Millard
 
MJ
,
Glaubitz
 
JC
, et al.  
Comprehensive genotyping of the USA national maize inbred seed bank
.
Genome Biol
 
2013
;
14
:
R55
.

29.

Consortium
 
WTCC
.
Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls
.
Nature
 
2007
;
447
:
661
78
.

30.

Hoerl
 
AE
,
Kennard
 
RW
.
Ridge regression: biased estimation for nonorthogonal problems
.
Dent Tech
 
1970
;
12
:
55
67
.

31.

Jennifer
 
L
,
Christoph
 
L
,
Kadie
 
CM
, et al.  
Improved linear mixed models for genome-wide association studies
.
Nat Methods
 
2012
;
9
:
525
6
.

32.

Jiang
 
L
,
Zheng
 
Z
,
Qi
 
T
, et al.  
A resource-efficient tool for mixed model association analysis of large-scale data
.
Nat Genet
 
2019
;
51
:
1749
55
.

Author notes

Yuxin Song and Li’ang Yang authors have contributed equally to this study.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data