-
PDF
- Split View
-
Views
-
Cite
Cite
Ronald Yurko, Kathryn Roeder, Bernie Devlin, Max G’Sell, An approach to gene-based testing accounting for dependence of tests among nearby genes, Briefings in Bioinformatics, Volume 22, Issue 6, November 2021, bbab329, https://doi.org/10.1093/bib/bbab329
- Share Icon Share
Abstract
In genome-wide association studies (GWAS), it has become commonplace to test millions of single-nucleotide polymorphisms (SNPs) for phenotypic association. Gene-based testing can improve power to detect weak signal by reducing multiple testing and pooling signal strength. While such tests account for linkage disequilibrium (LD) structure of SNP alleles within each gene, current approaches do not capture LD of SNPs falling in different nearby genes, which can induce correlation of gene-based test statistics. We introduce an algorithm to account for this correlation. When a gene’s test statistic is independent of others, it is assessed separately; when test statistics for nearby genes are strongly correlated, their SNPs are agglomerated and tested as a locus. To provide insight into SNPs and genes driving association within loci, we develop an interactive visualization tool to explore localized signal. We demonstrate our approach in the context of weakly powered GWAS for autism spectrum disorder, which is contrasted to more highly powered GWAS for schizophrenia and educational attainment. To increase power for these analyses, especially those for autism, we use adaptive |$P$|-value thresholding, guided by high-dimensional metadata modeled with gradient boosted trees, highlighting when and how it can be most useful. Notably our workflow is based on summary statistics.
Introduction
More than 3000 human genome-wide association studies (GWAS) have examined over 1800 diseases and traits, with uneven success in discovering associations [1]. For schizophrenia, for example, 280 discoveries were recently announced, while, for genetically correlated autism spectrum disorder, a handful of loci have been discovered [2]. The difference largely is due to statistical power. To increase power, one might decrease the number of hypotheses tested and thus reduce the threshold for significance. A natural strategy is gene-based testing: SNPs are assigned to genes they occur in or nearby [3]; within this unit, test statistics for SNPs are aggregated; and, finally, significance is judged by the number of genes tested. By focusing tests on genes instead of SNPs dispersed throughout the genome, gene-based testing also has interpretability as an appealing feature. Power can also be enhanced by choosing false discovery rate (FDR) control for significance testing. These two options, gene-based testing and FDR control, are not mutually exclusive. H-MAGMA [4] combines them and also incorporates Hi-C data into its testing scheme. Likewise, when SNPs affect gene expression, these functional SNP-to-gene assignments can be modeled [5].

Schematic of our workflow to improve power with metadata and highlight interesting signals while adjusting for LD structure.
A related approach is to increase power by incorporating metadata about SNPs or genes in the targeting of multiple testing procedures; selective inference provides approaches to incorporating this information while maintaining valid FDR control. An early approach incorporated metadata directly through the use of P-value weights [6]. More recently, in the setting of SNP-based GWAS, we [7] implemented a data-driven approach to determine weights via the adaptive P-value thresholding (AdaPT) framework [8]. In brief [7], we improved power for detecting a subset of weakly correlated SNPs by using gradient boosted trees to model potentially informative metadata, such as known effects of SNPs on gene expression and on genetically correlated traits.
Here we explore the use of AdaPT in the context of gene-based tests for autism (ASD), schizophrenia (SCZ) and educational attainment (EA), placing special emphasis on how it enhances power to detect associations of genes with ASD. To do so, we utilize gene-based testing methods introduced to account for linkage disequilibrium (LD) among SNPs in a gene [9–11]. LD is not limited by gene boundaries, however. To the contrary, LD among SNPs falling in different genes is commonplace. This compromises the interpretability of current gene-based tests, obscuring the meaning of error guarantees with family-wise error rate and FDR controlling procedures. Because of the extensive and heterogeneous LD in the genome, in our prior SNP-based GWAS using AdaPT, we purposely selected quasi-independent SNPs for analysis. By contrast, for gene-based testing, we introduce an agglomerative algorithm to account for LD-induced correlation of test statistics. This algorithm directly groups genes into ‘loci’ for which between-loci test statistic correlation—based on LD—is bounded above. This reduces the set of tests to a collection of weakly correlated genes and loci. Importantly, this agglomerative algorithm can be used in any gene-based testing framework to highlight gene-based tests that are dependent.
We analyze results from three GWAS: ASD[2], SCZ[12] and EA[13]. Using AdaPT guided by loci metadata as our example, we are able to improve power to select ASD-associated genes and LD-defined loci with multiple genes while maintaining finite-sample FDR control. Improvements are more modest for the other phenotypes, due to the high power of their original GWAS. One novel feature of our analyses is that it groups genes into loci when their test statistics are expected to be highly dependent, due to LD. We complement this feature with graphical tools to examine the distribution of association signal within each such locus. The interactive visualization tool we develop uses R Shiny [14, 15] and plotly [16] for exploring and highlighting biological signals therein.
Our workflow for improving power to select associated genes and LD-defined loci follows (Figure 1): We first introduce the agglomerative algorithm. Then, we demonstrate our approach for detecting associations with in the AdaPT framework. For metadata, we use eQTL data from cortical tissue samples, gene co-expression networks [17] and GWAS results for the other phenotypes, which is motivated by the observation that all three are genetically correlated[18]. Relationships of metadata to gene-based statistic are uncovered by use of gradient boosted trees. We separate our results into two categories, technical features related to gene-based testing and implementation of AdaPT in this setting, and association results for these phenotypes and their implications, with special emphasis on ASD.
Methods
SNP-to-gene assignment and correlation between gene-level tests
We consider two different approaches for assigning |$n$| SNPs to a set of genes |$G$|. Following common practice, we use genomic location only for ‘Positional’ SNP assignment: SNP |$i$| is assigned to gene |$g$| if its genomic position is within gene |$g$|’s start and end positions (|$S^{Pos}_g$|). In some published analyses, the start and end positions are expanded slightly to include presumed regulatory regions, although we do not do so here. As an alternative approach, recognizing that some SNPs have documented effects on genes, such as eQTL effects, we use Positional SNP assignment and include in the set cis-eQTL SNP-gene pairs, which we dub eSNPs. We denote the collection of eSNPs for gene |$g$| as |$S^{eQTL}_g$| and call this ‘Positional + eSNPs’ SNP assignment.
After assigning SNPs to genes, each gene’s vector of SNP-level |$z$| statistics |$\boldsymbol{z}_g$| is modeled as multivariate normal (Gaussian) with mean 0 and LD-induced covariance |$\boldsymbol{\Sigma }_g$|. Following common practice [9], we assume |$\boldsymbol{\Sigma }_g$| is known using correlations from reference genotype data, specifically 503 individuals from the 1000 Genomes EUR sample as the reference data [19]. Gene-level testing frameworks [3, 9, 10] combine SNP-level signals into gene-level test statistics |$T_g$| while accounting for |$\boldsymbol{\Sigma }_g$|, the LD among SNPs in a gene.
For nearby gene sets |$g$| and |$g^{\prime}$|, these correlations can be quite strong. This can confound the interpretability—and even meaning—of the guarantees from multiple testing procedures. To avoid these issues, we will modify the construction of our gene sets to bound the pairwise correlation between the resulting gene sets.
Agglomerative LD loci testing
To account for substantially correlated test statistics, we introduce an agglomerative procedure to group highly correlated genes that are within |$w$| megabases (Mb) into sets of genes we refer to as LD loci or simply loci. Given an LD threshold |$r^2$|, we apply the following procedure to a set of genes |$G$| within a chromosome:
Compute |$\textrm{Cor}(T_{g}, T_{g^{\prime}})$| for all pairs of genes, |$g, g^{\prime} \in G$|, within |$w$| Mb of each other (using Equations 2 and 1).
Repeat the following until |$(\textrm{Cor}(T_{g}, T_{g^{\prime}}))^2 < r^2$| for all remaining pairs of genes/loci in |$G$|:
Find genes/loci |$\{g_*, g_*^{\prime}\} = \underset{g,g^{\prime} \in G, g \neq g^{\prime}}{\textrm{arg max}}\ \textrm{Cor}(T_{g}, T_{g^{\prime}})$|,
Merge |$\{g_*, g_*^{\prime}\}$| into locus |$g_{LD}$|,
Update |$G = G \setminus \{g_*, g_*^{\prime}\} \cup \{g_{LD}\}$|, and compute |$\textrm{Cor}(T_{g_{LD}}, T_{g^{\prime}})$| for all |$g^{\prime} \in G$| within |$w$| Mb of |$g_{LD}$|.
This is essentially agglomerative hierarchical clustering, but with a linkage determined by the LD-based correlation structure of the test statistics. We compute the quadratic test statistic, |$T_g$|, for each remaining gene/locus |$g \in G$|.
To generate the |$B$| samples, we use the Cholesky decomposition of |$\boldsymbol{\Sigma }_g = \textbf{L} \textbf{L}^{\textrm{T}}$|, where |$\textbf{L}$| is a lower triangular |$n_g \times n_g$| matrix. A single sample is generated by multiplying |$\textbf{L}$| with a vector of |$n_g$| independent, standard Gaussian random variables. Across |$B$| samples, the combined computational cost is |$O(n_g^3 + Bn_g^2)$|. Using an efficient implementation of these steps [11], for a gene/locus with |$n_g \approx $| 1000 SNPs, we can generate one million draws in ¡10 min with a 3.6 GHz Intel processor. Parallelization can be used to further increase speed.
Overview of GWAS data and eQTL sources
Our investigation focuses on reported GWAS |$z$| statistics, |$\{z_i, i = 1, \dots , n\}$| measuring SNP-level association with ASD [2], SCZ [12] and EA [13]. For GWAS results of one phenotype, we explore SNP-level association statistics from the other two GWAS as potential sources of metadata due to previous evidence of their genetic correlation [18]. We consider |$n = 5 238 256$| SNPs whose alleles could be aligned across all three phenotypes and with minor allele frequency (MAF) |$> 0.05$| based on the 1000 Genomes EUR sample reference data [19]. Also, for these SNPs, their hg19 variant locations could be converted to GRCh38 using the LiftOver utility from the UCSC Genome Browser (http://genome.ucsc.edu/). Probably due to a smaller sample size, ASD has lower power: 18 381 cases and 27 969 controls, in comparison to SCZ with 33 426 cases and 32 541 controls, and EA with |$\approx $| 1.1 million subjects (Supplementary Figure S1). Because we focus on detecting associations for ASD, a neurodevelopmental disorder, we leverage two different sources of cortical tissue to identify eSNPs for functional SNP-to-gene assignment. The first source of eSNPs was obtained from the BrainVar study of dorsolateral prefrontal cortex from 176 individuals sampled across a developmental span [20]. We identified 151 491 cis-eQTL SNP-gene pairs meeting BH |$\alpha \leq 0.05$| for at least one of the three sample sets: prenatal (112 individuals), postnatal (60 individuals), as well as across the complete study. This corresponds to 123 664 eSNPs associated with 6660 genes, with |$85\%$| of the eSNPs associated with one unique cis-eQTL gene pairing.
The second source is adult cortical tissue cis-eQTLs from the Genotype-Tissue Expression (GTEx) V7 project dataset [21]. Instead of using eQTLs as reported by GTEx, to be consistent with the BrainVar eQTL definition, we identified 414 405 cis-eQTL SNP-gene pairs meeting BH |$\alpha \leq 0.05$| for either Frontal Cortex BA9 or Anterior cingulate cortex BA24 samples based on the tissue specific files for all SNP-gene associations available at gtexportal.org. This resulted in 313 316 GTEx eSNPs associated with 9012 genes, where |$78\%$| of the eSNPs are associated with one gene. However, we observe an overlap of 55,313 cis-eQTL SNP-gene pairs with BrainVar, culminating in 510 583 unique cis-eQTL SNP-gene pairs with 370 749 eSNPs associated with 12 854 genes across the union of BrainVar and GTEx sources.
GENCODE version
We use GENCODE v21 [22] for our list of genes with their respective start and end positions based on genome assembly version GRCh38. This matches the version used in the BrainVar study, but differs from GTEx, which is based on v19. When identifying GTEx eQTLs, we removed 187 genes from GENCODE v19 that do not match Ensembl IDs in v21. This provides us with an initial list of |$G = 57 005$| candidate genes to potentially assign SNPs to, based on either positional or functional eSNP status.
Metadata
For each gene/locus |$g$|, we create a vector of metadata |$x_g$| collected independently of |$p_g$|. This process is completed in the same manner for both the Positional and Positional + eSNPs collections. First, we consider the number of SNPs assigned to a gene/locus, |$n_g = |S_g|$|, which can be viewed as statistical information relevant to the power of the quadratic test statistic. Additionally, we include one-sided |$z$| statistics, i.e., |$z_g = \Phi ^{-1}(1 - p_g)$|, constructed using the gene/locus-level |$P$|-values from independent GWAS results. For our target phenotype ASD, we use SCZ and EA GWAS |$z$| statistics, |$z_g^{SCZ}$| and |$z_g^{EA}$|, while for SCZ (EA) we use |$z_g^{EA}$| (|$z_g^{SCZ}$|) and |$z_g^{ASD}$| as metadata.
Given the set of eSNPs |$S_g^{eQTL}$| associated with single genes or genes in LD locus |$g$|, we summarize the expression level as the average absolute eQTL slope in a relevant source to obtain |$\bar{\beta }_g^{source}$| for five sources: three BrainVar developmental periods (pre-, post- and complete) and two adult GTEx cortical regions. Furthermore, we account for weighted gene co-expression network analysis (WGCNA) [17] modules by creating two sets of indicators, one set for the 20 modules reported in the BrainVar study and another for eight modules constructed using the GTEx cortical tissue samples. For simplicity, we also construct indicators denoting if gene/loci is not included in any of the modules.
We also include additional context about the gene/loci. Indicator variables determine four GENCODE biotypes: protein coding, antisense, long non-coding RNA and other. Using gnomAD v2.1.1 [23], we associate with each gene its loss-of-function observed / expected upper fraction (LOEUF) value, which indicates the gene’s tolerance to loss-of-function. Because a lower LOEUF score indicates strong selection against loss-of-function, we include the minimum LOEUF across all genes in an LD locus in our vector of metadata |$x_g$|.
AdaPT implementation
Given a collection of gene/locus-level |$P$|-values and metadata, |$(p_g, x_g)_{g \in G}$|, we apply AdaPT to select a subset of discoveries with FDR control at target level |$\alpha = 0.05$|. AdaPT is guaranteed finite-sample FDR control under the assumption of independent null |$P$|-values, and was demonstrated to maintain control in weak, positive correlated scenarios [7], such as the one considered here.
We incorporate metadata from the feature space |$x_g \in \mathcal{X}$| using XGBoost [24] (see Supplementary information for details). XGBoost is a popular implementation of gradient boosted trees, which constructs a flexible predictive function as a weighted sum of many simple trees, fit using a gradient descent procedure that minimizes a specified objective function. The two objective functions considered in our AdaPT context correspond to estimating the probability that a hypothesis is non-null, and the distribution of effect size for non-null hypotheses. XGBoost gives us flexibility to include many potentially useful covariates without being overly concerned about the functional form with which they enter the model or their marginal utility. However, overfitting in this context will lead to a loss of power. To find appropriate settings for the gradient boosted trees (number of trees, learning rate and maximum depth), we first ‘tune’ AdaPT’s performance with synthetic SCZ |$P$|-values that are aligned with the ASD |$P$|-values in the following manner:
Sort SCZ and ASD |$P$|-values: (|$p_{(1)}^{SCZ}, \dots , p_{(G)}^{SCZ}$|) and (|$p_{(1)}^{ASD}, \dots , p_{(G)}^{ASD}$|)
Replace SCZ with ASD |$P$|-values by matching order, i.e., replace |$p_{(1)}^{SCZ}$| with |$p_{(1)}^{ASD}$|, |$p_{(2)}^{SCZ}$| with |$p_{(2)}^{ASD}$|, |$\dots $|
This transforms the SCZ signal to match the weaker signal in ASD. We then proceed to apply AdaPT using these synthetic SCZ |$P$|-values to find candidate settings which yield the highest number of synthetic SCZ discoveries at FDR level |$\alpha = 0.05$|. Finally, for each phenotype and positional assignment, we use two cross-validation steps within AdaPT [7] to select from among these candidate settings to generate our AdaPT: XGBoost results using the adapt_xgboost_cv() function in the adaptMTR package (available at https://github.com/ryurko/adaptMT).
Kernel smoothing localization
Results
Assigning SNPs to genes and generating LD loci
We assign SNPs to genes using the two approaches: Positional, which assigns 2 779 780 SNPs to 40 581 genes; and Positional + eSNPs, which includes an additional 109 042 intergenic cortical tissue eSNPs resulting in 2 888 822 SNPs assigned to 41 301 genes. Next, we generate genes/loci based on the LD-induced correlation of gene-based test statistics using the agglomerative algorithm with window size |$w = 6$| Mb and with one of three thresholds for |$r^2$|, 0.25, 0.50, 0.75. The number of independent genes/loci decreases substantially as the threshold becomes more strict for both SNP assignment types (Figure 2). Even a relatively high threshold of |$r^2 = 0.75$| reduces the number of Positional + eSNPs (Positional) gene/locus tests from 41,301 (40,581) to 37,522 (37,114). We report the conservative threshold |$r^2 = 0.25$| in the body of the manuscript (see Supplementary information for results with |$r^2 \in \{0.50, 0.75\}$|). Due to the conservative threshold we combine 17 915 genes to form 4136 LD loci for Positional + eSNPs and 16 625 genes to form 3985 LD loci for the Positional approaches. Over 75% of these loci contain five or fewer genes while the largest is a chromosome 11 locus that groups over sixty genes, most of which encode olfactory receptors. Combined with the 23 386 and 23 956 individual genes that were not merged, this results in 27 522 and 27 941 genes/loci for testing. The reduction of independent testing units highlights the correlation among genes that is often ignored in gene-based testing. We then compute the gene/locus quadratic test statistics and |$P$|-values for each phenotype using the Monte Carlo-based approach in Equation 3 with |$B$| equal to 2 million simulations.

Comparison of the number of genes/loci following our agglomerative algorithm over a range of values for the induced LD threshold |$r^2 \in \{ 0.25, 0.50, 0.75 \}$| by positional type (in color). The initial number of genes is provided for reference (corresponding to |$r^2 = 1$|).
AdaPT models and results
To generate AdaPT: XGBoost results, we first tune the procedure based on synthetic SCZ |$P$|-values, which mimic the distribution of ASD P-values, to find optimal XGBoost settings. To avoid over-fitting, we consider shallow trees with maximum depth |$\in \{1, 2\}$|, while searching over the number of trees |$P \in 100,\dots , 450$| by increments of fifty and the learning rate |$\eta \in 0.03, \dots , 0.06$| by increments of 0.01. To analyze the real |$P$|-values for each phenotype and thereby select associated genes/loci, then, these top setting combinations (Supplementary Table S1) were used in the AdaPT cross-validation steps and while targeting FDR control at |$\alpha = 0.05$|.
To find genes/loci associated with each phenotype, our AdaPT: XGBoost implementation fits a mixture model that requires two parameters– the probability of association and the effect of each gene/locus on the phenotype— these are estimated separately for each step of the algorithm. Variables in the metadata inform on each of these parameters to different degrees; see Supplementary information and Supplementary Table S2 for details on measuring variable importance. For the Positional + eSNPs results, the number of SNPs per gene/locus and z-statistics for at least one genetically correlated trait are important predictors for all three phenotypes (Figure 3). SCZ and EA, in contrast to ASD, display increased importance for LOEUF and membership in a WGCNA module constructed from the GTEx cortical tissue samples (Figure 3) Lower LOEUF values, which indicate lower tolerance to loss-of-function, were more likely to be associated with SCZ and EA. We observe similar patterns of variable importance for the Positional results (Supplementary Figure S2).

The most important variables (top five) predicting association with phenotype as ranked by XGBoost for the Positional + eSNPs assignment of SNPs. Variables are sorted in order of importance, while the color of the bars denote the separate parameters for the AdaPT implementation: probability of association (orange) and of non-zero effect size (blue).
Comparing the number of genes/loci selected by AdaPT to baseline results of intercept-only versions of AdaPT and BH, there is a clear gain in gene/locus discovery by accounting for metadata through the AdaPT: XGBoost implementation, regardless of phenotype and SNP-to-gene assignment approach (Figure 4A, Supplementary Table S3, see also Supplementary information, Supplementary Figures S3 and S4 for results with LD threshold of |$r^2 \in \{0.50, 0.75 \}$|). Unsurprisingly, the number of associated genes/loci is much larger for SCZ and EA than for ASD, likely due to the lower power of the original ASD GWAS. For Positional ASD, we see that the intercept-only version of AdaPT fails to select any genes/loci due to the weak signal without inclusion of metadata (Figure 4A).

(A) Comparison of the number of selected genes/loci at FDR level |$\alpha = 0.05$| for each phenotype by positional assignment. AdaPT: XGBoost results are presented in comparison to BH and AdaPT: intercept-only baselines which do not account for metadata. (B) Comparison of the proportion of implicated genes that overlap between the two types of positional assignment results, based on the AdaPT: XGBoost results for each phenotype.
Comparison of phenotypic results
For ASD, analysis of Positional + eSNPs identifies 483 genes, of which 405 cluster in 47 loci and 78 are unclustered, whereas analysis of Positional SNPs alone yields 447 genes, of which 370 cluster in 54 loci (Supplementary Table S3). A substantial portion of these genes overlap (Figure 4B). While similar patterns emerge for SCZ and EA, the ratio of unclustered to clustered genes increases with increasing number of genes/loci detected: 0.193 for ASD, 0.414 for SCZ and 0.681 for EA. This presumably reflects greater power to detect small effects of a SNP on phenotype with larger sample size: decay of this signal tends to cause it to fall below the threshold of detection for SNPs in nearby genes. In contrast, the proportion of genes uniquely identified when eSNP information is included is substantially higher for ASD than it is for SCZ or EA (Figure 4B), again likely due to lower power for the ASD sample.
As expected, for all three phenotypes, the number of unclustered genes increases with increasing threshold |$r^2$| (Supplementary Table S4). If, however, we assume that signal should be sparsely distributed across the genome, then the sum of LD loci and unclustered genes, for |$r^2=0.25$|, should be a reasonable estimate of genes associated given the current data. This translates into 125 genes for ASD, 2277 for SCZ,and 6598 for EA, and correspondingly 0.30, 5.51 and 15.98% of the total 41 301 genes analyzed (including protein-coding and non-coding genes). For SCZ and EA, the total number of associated genes per chromosome declines strongly and linearly with chromosome size (Supplementary Figures S5–S7), which itself is correlated with the total number of genes per chromosome. This pattern, while more variable, is also seen in a breakdown of genes into protein-coding and other non-coding types (Supplementary Figure S6). For ASD, however, an unusually large number of genes are associated on chromosomes 3, 8, 15 and 17, relative to chromosome size, and the associated gene counts show only a modest relationship with chromosome size (Supplementary Figures S5–S7), presumably due to lower power and resulting lower number of associated genes per chromosome.
Exploring signal in selected genes/loci
LD loci, as we define them, are expected to exhibit correlated association signal. Nonetheless, the signal is unlikely to be distributed evenly across the locus, instead in many instances it will be concentrated near one or more SNPs generating the signal, depending on the LD pattern in the locus. The same is true for signal in non-clustered genes. To interactively explore localized signal within genes/loci, we developed an LD locus zoom application using R Shiny [14, 15] and plotly [16] (available here: https://ron-yurko.shinyapps.io/ld_locus_zoom/). This tool displays the gene/locus of interest, represents genes by their location therein, and highlights the association signal by kernel smoothing for positional SNP signals, including interpolation.
3.7 Mb deletion region in chromosome 17: One of the associated loci, roughly 500 Kb, falls in the well-known 3.7 Mb 17p11.2 deletion/duplication region associated with Smith-Magenis syndrome (deletion, OMIM:182290) [26] and Potocki-Lupski syndrome (duplication, OMIM:610883) [27]. The associated LD locus displays overlapping genes and signals from eSNPS and positional SNPs for ASD, SCZ and EA (Figure 5A). For reference, the gray dotted line denotes a point-wise |$95{th}$| percentile of 1000 simulations for squared null Gaussian random variables, |$z_{g^*}^2$| where |$z_{g^*} \sim \textrm{Normal}(\boldsymbol{0}_{g^*}, \boldsymbol{\Sigma }_{g^*})$|, given the LD structure |$\boldsymbol{\Sigma }_{g^*}$| of the selected LD locus |$g^*$|. Notably, a single gene in the locus has been associated with all three phenotypes to various degrees, RAI1 [28]. Known to be dosage-sensitive, both deletion and duplication of a single copy of the gene is sufficient to elevate the likelihood of ASD substantially and to diminish cognitive function strongly [28]. Curiously, only the smoothed signal for EA association peaks over RAI1, whereas it peaks over TOM1L1 for ASD and further proximal for SCZ. The peak association for EA over RAI1 also coincides with the SNP having the lowest P-value for association with this phenotype: this index SNP rs11655029 has P-value |$2.84 \times 10^{-9}$| and falls in an intron of RAI1. While RAI1 is a prime candidate for the target gene in this locus for ASD, due to prior evidence, the peak signal located over TOM1L2 is intriguing, especially because its index SNP rs4244599, while not strongly associated (P-value = 0.0018), is an eQTL for TOM1L2. The index SNP for SCZ, rs8082590, has P-value |$6.02 \times 10^{-8}$|, close to the traditional threshold for GWAS, and coincides roughly with its smoothed signal for association over GID4.

Gene/locus zoom displays for (A) locus falling in the chromosome 17 3.7 Mb Smith-Magenis syndrome region, (B)FOXP1 in chromosome 3 and (C) locus in chromosome 17 1 Mb inversion region. (A)-(C) The genes located within the locus are represented in rectangles denoting their respective start and end positions below the smooth display, arranged by position and size to prevent overlapping. The gene/locus level kernel smoothing of ASD signal (black line) is displayed along with SCZ (red line) and EA (blue line) kernel smoothing signal (both are normalized to appear on the same signal scale as ASD). The gray dotted line denotes a point-wise |$95{th}$| percentile of 1000 simulations for null simulations, and vertical bars correspond to eSNP signals (colored by their associated genes). A subset of genes are highlighted with colors in (A) and (B), with other genes represented by gray rectangles. Additionally, individual genes are labeled in (A): RAI1 and TOM1L2, (B): FOXP1 and (C): MAPT and KANSL1, with gene direction (+ indicates left to right, - indicates right to left).
FOXP1 in chromosome 3: An unclustered protein-coding gene on chromosome 3 highlights a set of associated SNPs in FOXP1, which has previously been strongly associated with intellectual disability, language impairment and ASD [29] (Figure 5B). The smoothed association signal for ASD falls near the gene’s 3’ end whereas, for SCZ and EA, it falls close to the gene’s 5’ region. It is possible that variation associated with ASD is regulating FOXP1’s expression quite differently than that for SCZ and EA. The index SNP for EA is GWAS-significant and the one for SCZ approaches it (rs55736314 and rs4677597 with P-values |$1.63 \times 10^{-16}$| and |$3.87 \times 10^{-7}$| respectively) and both fall closer to the 5’ region of FOXP1. For ASD, its index SNP rs7616330 carries a more modest signal (P-value |$2.26 \times 10^{-4}$|). Compared to the complete gene, it falls toward the 3’ end, but it also falls quite close to the 5’ start site of certain transcripts, such as ENST00000650387.1. Given the prominent role that FOXP1 has in ASD and cognitive function, we speculate that the differential location of signal for SCZ/EA versus ASD could trace to different transcripts and regulation of their expression.
1 Mb inversion region in chromosome 17: Another locus of interest is a 1.5 Mb region of chromosome 17, namely 17q21 (Figure 5C). This region of the genome is well known in human genetics because it comprises a 1.5 Mb inversion polymorphism [30, 31] and the inversion alleles, actually haplotypes, have been associated with a wide variety of neurodegenerative disorders, including Progressive Supranuclear Palsy [32], corticobasal degeneration [33], frontotemporal dementia [34] and other tauopathies [35]. In this locus, altered MAPT is well known to affect risk for late-life neurodegenerative disorders. Moreover, the inversion itself inhibits recombination, rendering alleles at SNPs across this region in high LD. Indeed, the complexity of this region inspires the interactive features of our application (conveyed in Supplementary Figures S8 and S9) with subsets of genes and their associated eSNPs. Our results suggest variation in the region is associated with all three phenotypes. The index SNP for EA exceeds the standard threshold for single SNP significance (rs74998289, P-value |$1.31 \times 10^{-17}$|), while for ASD it approaches it (rs12942300, P-value |$2.06 \times 10^{-6}$|). Variation in the region has been previously implicated in ASD susceptibility [36]; more recently, KANSL1 expression has been implicated in cognitive function and ASD [37]. While the smoothed association signals for ASD and EA show a peak over this gene, signal is distributed across many genes in this locus and gene-based analysis is unlikely to pinpoint any gene therein. A clue to the driver or drivers of association comes by comparison of panels A–C in Figure 5. In the inversion region (Figure 5C), eSNP signals are noticeably more enriched compared to the other two loci and indeed the index SNP for EA is an eSNP for KANSL1. As highlighted by the display (Figure 5C), however, careful statistical and molecular analyses will be required to pinpoint what variation and what genes influence each of the three phenotypes at this locus.
Enrichment analysis
A primary motivation for gene-based analyses is to garner insight into the biological mechanisms underlying the phenotype by evaluating the set of genes associated with it. A standard approach infers these mechanisms by gene-set enrichment analysis, which we will implement in two ways. First, we will use the FUMA GENE2FUNC tool [38] for gene ontology (GO) enrichment analysis of the AdaPT: XGBoost genes/loci. This analysis has the advantage of searching through myriad functional sets of genes in an agnostic fashion. Still, this could also be viewed as a disadvantage if, a priori, certain biological functions are likely to affect the phenotype. For example, for all three phenotypes analyzed here, rare variation has already provided evidence linking synaptic, epigenetic and transcription factor genes to them. Moreover, for ASD there exists a substantial set of genes implicated in risk by studies of rare variation. For these reasons, we implement a second gene-set enrichment analysis, specifically GSEA [39], which is a tool for testing if different sets of genes are enriched in a ranked gene list. We perform GSEA at the gene/locus-level ranked by their one-sided |$z$| statistics, using five different sets of genes/loci to test for enrichment at the top of the phenotype-specific ranked list: (1) brain expressed, (2) synaptic, (3) epigenetic, (4) transcription factors and (5) 102 ASD risk genes identified based on de novo and case-control variation [40] (see Supplementary information and Supplementary Table S5 for details on compilation of gene lists).
Another reason to use both approaches is that they allow us to handle a confounder, gene size, in different ways. In our analyses, gene/locus size (and the number of SNPs therein) is a predictor of association: larger genes/loci are more likely to be associated (Supplementary Figure S10). So enrichment analysis should control for gene size in some way. One approach is to contrast the associated set of genes with control genes after matching on gene size. We will use this approach in the FUMA analyses (see Supplementary information for details). However, a substantial portion of brain-expressed genes are synaptic, which are known to be among the largest genes in the genome. We reasoned that if these phenotypes were influenced by variation altering function of synaptic genes—which rare variant studies suggest they are [40–43]—then matching on gene size will tend to match associated synaptic genes with other synaptic genes, thereby over-matching and lowering the power to detect synaptic association. For this reason, in the GSEA analysis, we first remove the effect of gene size on the association |$z$| statistics by regressing them on gene size, then entering the residual |$z^{\prime}$| values into the GSEA analysis.
Rather than include all genes in the associated loci for the FUMA enrichment analysis, we used the kernel smoothing results to identify signal genes (see Supplementary information). For example, this reduces the number of ASD Positional + eSNPs genes from 483 to 464 signal genes. After matching for gene size, no GO terms show enrichment for associated ASD genes. Genes associated with SCZ display GO enrichment: 698 terms in biological processes, 163 terms in cellular components and 108 in molecular function. The EA genes display enrichment for 114 terms in biological processes, 64 in cellular components and 27 in molecular function (Supplementary Table S3). Using REVIGO [44] to summarize these terms, they highlight neuron projection, synaptic function, cell adhesion, cell cycle, chromosome organization and many more for both SCZ and EA (Supplementary Figures S11 and S12).
GSEA analysis of five gene sets (brain expressed, synaptic, epigenetic, transcription factor and ASD risk) finds ASD implicated genes enriched for two gene sets, synaptic and epigenetic genes (Figure 6, Supplementary Figure S13), whereas for SCZ and EA, all five gene sets are enriched (Supplementary Table S3).

Enrichment analysis of genes/loci ranked by one-sided ASD z statistics, adjusted for size, by positional assignment. (A) BH-adjusted GSEA |$P$|-values are displayed on the -log|$_10$| scale, with FDR target level |$\alpha = 0.05$| indicated by dashed gray line. Enrichment score for (B) synaptic and (C) epigenetic genes/loci with tick marks denoting gene/locus rank based on Positional + eSNPs ASD z statistics, adjusted for size, in descending order.
Conclusion
GWAS studies have identified myriad associations between SNPs and human phenotypes. With few exceptions, these associations are weak. One of the goals in this study is to improve power to detect weak associations through gene-based tests, while also accounting for LD among SNPs in different genes, which can induce correlations among tests for different genes. Although such correlation can elevate the false positive rate and obscure the interpretation of gene-based test results, most published studies have ignored it. To account for LD, we introduce a workflow (Figure 1) to aggregate genes into loci if the expected dependence of their test statistics exceeds a pre-specified threshold. This approach produces a notable reduction in the number of genes tested: nearly 4000 are aggregated even for a tolerant threshold for correlation of test statistics (aggregate if |$r^2>0.75$|; Figure 2).
We recommend practitioners use a stringent LD |$r^2$| threshold, such as the one we use to analyze the data reported here (|$r^2$| = 0.25). This ensures test statistics for genes/loci are largely independent, it will be appropriately conservative regarding reporting of the number of discoveries, and, for loci, it will highlight the complexity of causal inference therein. Practitioners will also need to set the window size to search for correlation among test statistics. For our analyses, we chose a large one, 6 Mb. This is computationally expensive, compared to a smaller window. A smaller window will cover most loci, with a few known exceptions including the chromosome 6p MHC region. One should also note that, like all gene-based test approaches that account for LD structure, our approach is sensitive to mismatch of the LD reference panel. Such a mismatch could impact our results in two ways: (1) the individual test-level statistics could have the inappropriate null distribution, (2) the induced correlations between gene-level test statistics will be incorrect, potentially leading to over/under-clustering of the genes. We believe that (1) is the larger concern, and it is a concern that is shared by any testing approach that aggregates SNP-level test statistics (e.g. MAGMA or VEGAS). Properly capturing this sensitivity to correlation in gene-level testing is a topic for future research.
With a collection of uncorrelated or only weakly correlated genes/loci, any gene-based test can be applied. To improve power, we adopt the FDR control framework for hypothesis testing and a particular implementation of it, AdaPT. AdaPT has the potential to increase power over the classical Benjamini–Hochberg (BH) procedure for FDR control by accounting for covariates, which collectively we call metadata, to inform on which genes are likely to be true positives even though their test statistics failed to cross the BH threshold for significance. Importantly, like BH, AdaPT also maintains finite-sample FDR control.
We applied AdaPT to data from three phenotypes, ASD, SCZ and EA, all of which are genetically correlated. Notably, the ASD GWAS was weakly powered compared to the well-powered GWAS for SCZ and EA. Although AdaPT increased power for every phenotype and setting, relative to BH, the largest improvement was achieved for ASD (Figure 4A, Supplementary Figures S3 and S4). We believe the most likely explanation for these contrasting results is the differential information content of the original GWAS. Even without the additional power from AdaPT, the BH-corrected GWAS for SCZ and EA identifies many genes/loci: there is little to be gained by analyses with more power. By contrast, there is far more to be gained by additional power for ASD analyses. Gene-based tests that incorporated the impact of SNPs on gene expression also increased power of gene discovery (Figure 4B, Supplementary Figure S5). Reflecting the genetic correlation, summary statistics for other phenotypes were always useful predictors for the AdaPT model (Figure 3, Supplementary Figure S2). Gene conservation also played an important role, as did the size of the gene/locus (Figure 3, Supplementary Figure S10).
That gene/locus size was a useful predictor for associated genes is of limited biological and genetic interest; however, it generates an interpretative challenge when the set of associated genes are evaluated for functional relevance, such as by GO enrichment analysis. One approach to enrichment analysis would match associated genes with control genes by size and we did this in our GO analyses. Such an approach, however, could be conservative. For example, a substantial portion of brain-expressed genes are synaptic, they tend to be large, and synaptic genes likely play a role in all three phenotypes analyzed here [40–43]. Yet, matching on gene size will tend to contrast associated synaptic genes with other synaptic genes, lowering power to detect their enrichment. For this reason, we also took a different approach to enrichment analysis. Specifically, we regressed the association statistics on size, then entered the residual value into a GSEA analysis. The end result, for ASD, is that synaptic genes were not enriched when genes were matched before GO analysis, while they were enriched in the GSEA analysis (Figure 6, Supplementary Figure S13). The same is true for a set of epigenetic genes, which include chromatin readers, remodelers and so on. Both results agree well with previous rare variant studies [40, 41]. On the other hand, for the well powered SCZ and EA studies, such enrichment emerges regardless of the way gene size is controlled (Supplementary Figures S11 and S12).
Here we demonstrated the gain in power of evaluating gene-based association statistics using AdaPT, guided by metadata, to detect genes affecting three specific phenotypes, ASD, SCZ and EA. As our results show, the greatest gain in power is achieved when the underlying study has intermediate power, neither too high nor too low. Power gains will be modest for highly powered studies and absent for very weakly powered studies. We believe that AdaPT, guided by metadata, can be applied to a wide variety of omics problems, although it will undoubtedly require some adaptation for the specific problem to be solved. The advantage of doing so is 2-fold, increased power and increased interpretability. We are especially interested in the latter, as a means of expediting our understanding of the etiology of human diseases, disorders and other phenotypes.
Although gene-based association draws attention to an important functional unit (or units for locus-based association), it does not inform on what variation drives the association. We developed an interactive visualization tool (https://ron-yurko.shinyapps.io/ld_locus_zoom/) for exploring the localization of association signal within associated genes/loci and generating hypotheses about mechanisms of action (Figure 5, Supplementary Figures S8 and S9). For example, using this tool to explore the association of FOXP1 [29] reveals that the pattern of association in the gene is different for ASD versus EA and SCZ (Figure 5B) and suggests that different patterns of gene expression could be important for these phenotypes. Another example is the 500Kb associated locus within the 17p11.2 deletion/duplication (Smith-Magenis syndrome) region (Figure 5A) [26, 27]. In this locus, all three phenotypes have been associated to varying degrees to one gene, RAI1 [28]. Curiously, however, only the association signal for EA peaks within this gene, whereas for ASD it unexpectedly maximizes over an adjacent gene. While we believe this tool can be a useful guide to researchers, this example also underscores a limitation of this approach: we cannot provide error rate guarantees at the localized level. Such an analysis will be a target of our future work in this challenging area.
We introduce an agglomerative algorithm to group genes into loci on the basis of LD.
We extend previous work to model metadata with gradient boosted trees and greatly improve our power to detect interesting genes/loci associated with ASD.
We develop an interactive locus zoom visualization tool to explore the localized signal within our selected genes/loci; it also accepts custom results uploaded by researchers.
Author Contributions Statement
R.Y., K.R., B.D. and M.G. conceived the study; R.Y. conducted the analyses; R.Y., K.R., B.D. and M.G. wrote and reviewed the manuscript.
Funding
National Institute of Mental Health (Grant R37MH057881), the Simons Foundation (Grant SFARI SF575097) and the National Science Foundation (Grant DMS 1613202).
Data Availability Statement
The data and code used in this manuscript are available at https://github.com/ryurko/Agglomerative-LD-loci-testing.
Ronald Yurko is pursuing a Ph.D. degree in Statistics & Data Science at Carnegie Mellon University. His research interests include statistical genetics, clustering and selective inference.
Kathryn Roeder is a UPMC chaired professor at Carnegie Mellon University. She is a member of the National Academy of Sciences. Her research interests include statistical genetics, multi-omics and neurodevelopment.
Bernie Devlin is a professor at University of Pittsburgh, School of Medicine. He leads the Computational Genomics Program and is a fellow of American Association for the Advancement of Science. His research interests include statistical and psychiatric genetics.
Max G'Sell is an assistant professor of Statistics in the Department of Statistics & Data Science at Carnegie Mellon University. His research focuses on questions at the intersection of statistical inference and machine learning.
References