Abstract

In genome-wide association studies (GWAS), it has become commonplace to test millions of single-nucleotide polymorphisms (SNPs) for phenotypic association. Gene-based testing can improve power to detect weak signal by reducing multiple testing and pooling signal strength. While such tests account for linkage disequilibrium (LD) structure of SNP alleles within each gene, current approaches do not capture LD of SNPs falling in different nearby genes, which can induce correlation of gene-based test statistics. We introduce an algorithm to account for this correlation. When a gene’s test statistic is independent of others, it is assessed separately; when test statistics for nearby genes are strongly correlated, their SNPs are agglomerated and tested as a locus. To provide insight into SNPs and genes driving association within loci, we develop an interactive visualization tool to explore localized signal. We demonstrate our approach in the context of weakly powered GWAS for autism spectrum disorder, which is contrasted to more highly powered GWAS for schizophrenia and educational attainment. To increase power for these analyses, especially those for autism, we use adaptive |$P$|-value thresholding, guided by high-dimensional metadata modeled with gradient boosted trees, highlighting when and how it can be most useful. Notably our workflow is based on summary statistics.

Introduction

More than 3000 human genome-wide association studies (GWAS) have examined over 1800 diseases and traits, with uneven success in discovering associations [1]. For schizophrenia, for example, 280 discoveries were recently announced, while, for genetically correlated autism spectrum disorder, a handful of loci have been discovered [2]. The difference largely is due to statistical power. To increase power, one might decrease the number of hypotheses tested and thus reduce the threshold for significance. A natural strategy is gene-based testing: SNPs are assigned to genes they occur in or nearby [3]; within this unit, test statistics for SNPs are aggregated; and, finally, significance is judged by the number of genes tested. By focusing tests on genes instead of SNPs dispersed throughout the genome, gene-based testing also has interpretability as an appealing feature. Power can also be enhanced by choosing false discovery rate (FDR) control for significance testing. These two options, gene-based testing and FDR control, are not mutually exclusive. H-MAGMA [4] combines them and also incorporates Hi-C data into its testing scheme. Likewise, when SNPs affect gene expression, these functional SNP-to-gene assignments can be modeled [5].

Schematic of our workflow to improve power with metadata and highlight interesting signals while adjusting for LD structure.
Figure 1

Schematic of our workflow to improve power with metadata and highlight interesting signals while adjusting for LD structure.

A related approach is to increase power by incorporating metadata about SNPs or genes in the targeting of multiple testing procedures; selective inference provides approaches to incorporating this information while maintaining valid FDR control. An early approach incorporated metadata directly through the use of P-value weights [6]. More recently, in the setting of SNP-based GWAS, we [7] implemented a data-driven approach to determine weights via the adaptive P-value thresholding (AdaPT) framework [8]. In brief [7], we improved power for detecting a subset of weakly correlated SNPs by using gradient boosted trees to model potentially informative metadata, such as known effects of SNPs on gene expression and on genetically correlated traits.

Here we explore the use of AdaPT in the context of gene-based tests for autism (ASD), schizophrenia (SCZ) and educational attainment (EA), placing special emphasis on how it enhances power to detect associations of genes with ASD. To do so, we utilize gene-based testing methods introduced to account for linkage disequilibrium (LD) among SNPs in a gene [9–11]. LD is not limited by gene boundaries, however. To the contrary, LD among SNPs falling in different genes is commonplace. This compromises the interpretability of current gene-based tests, obscuring the meaning of error guarantees with family-wise error rate and FDR controlling procedures. Because of the extensive and heterogeneous LD in the genome, in our prior SNP-based GWAS using AdaPT, we purposely selected quasi-independent SNPs for analysis. By contrast, for gene-based testing, we introduce an agglomerative algorithm to account for LD-induced correlation of test statistics. This algorithm directly groups genes into ‘loci’ for which between-loci test statistic correlation—based on LD—is bounded above. This reduces the set of tests to a collection of weakly correlated genes and loci. Importantly, this agglomerative algorithm can be used in any gene-based testing framework to highlight gene-based tests that are dependent.

We analyze results from three GWAS: ASD[2], SCZ[12] and EA[13]. Using AdaPT guided by loci metadata as our example, we are able to improve power to select ASD-associated genes and LD-defined loci with multiple genes while maintaining finite-sample FDR control. Improvements are more modest for the other phenotypes, due to the high power of their original GWAS. One novel feature of our analyses is that it groups genes into loci when their test statistics are expected to be highly dependent, due to LD. We complement this feature with graphical tools to examine the distribution of association signal within each such locus. The interactive visualization tool we develop uses R Shiny [14, 15] and plotly [16] for exploring and highlighting biological signals therein.

Our workflow for improving power to select associated genes and LD-defined loci follows (Figure 1): We first introduce the agglomerative algorithm. Then, we demonstrate our approach for detecting associations with in the AdaPT framework. For metadata, we use eQTL data from cortical tissue samples, gene co-expression networks [17] and GWAS results for the other phenotypes, which is motivated by the observation that all three are genetically correlated[18]. Relationships of metadata to gene-based statistic are uncovered by use of gradient boosted trees. We separate our results into two categories, technical features related to gene-based testing and implementation of AdaPT in this setting, and association results for these phenotypes and their implications, with special emphasis on ASD.

Methods

SNP-to-gene assignment and correlation between gene-level tests

We consider two different approaches for assigning |$n$| SNPs to a set of genes |$G$|⁠. Following common practice, we use genomic location only for ‘Positional’ SNP assignment: SNP |$i$| is assigned to gene |$g$| if its genomic position is within gene |$g$|’s start and end positions (⁠|$S^{Pos}_g$|⁠). In some published analyses, the start and end positions are expanded slightly to include presumed regulatory regions, although we do not do so here. As an alternative approach, recognizing that some SNPs have documented effects on genes, such as eQTL effects, we use Positional SNP assignment and include in the set cis-eQTL SNP-gene pairs, which we dub eSNPs. We denote the collection of eSNPs for gene |$g$| as |$S^{eQTL}_g$| and call this ‘Positional + eSNPs’ SNP assignment.

After assigning SNPs to genes, each gene’s vector of SNP-level |$z$| statistics |$\boldsymbol{z}_g$| is modeled as multivariate normal (Gaussian) with mean 0 and LD-induced covariance |$\boldsymbol{\Sigma }_g$|⁠. Following common practice [9], we assume |$\boldsymbol{\Sigma }_g$| is known using correlations from reference genotype data, specifically 503 individuals from the 1000 Genomes EUR sample as the reference data [19]. Gene-level testing frameworks [3, 9, 10] combine SNP-level signals into gene-level test statistics |$T_g$| while accounting for |$\boldsymbol{\Sigma }_g$|⁠, the LD among SNPs in a gene.

Consider the quadratic gene-level test statistics, |$T_{g} = \boldsymbol{z_g}^T \boldsymbol{z_g}$|⁠, featured in VEGAS [9, 10] and MAGMA (v1.08). Constructed in this way, the quadratic test statistic is merely the sum of the individual |$z_g^2$|’s at every SNP in the gene. Under the null model, if the |$|S_g|$| SNPs had been independent, then the test statistic |$T_g$| would have been approximately |$\chi ^2$| distributed with |$|S_g|$| degrees of freedom, and thus have an expectation of |$|S_g|$| with variance |$2|S_g|$|⁠. Here we are concerned with the case where the individual test statistics are dependent. The expectation is unchanged by dependence, but the variance in the sum now becomes |$\sum _{i \in S_g} \textrm{Var}(z_i^2) + 2 \sum \sum _{i < j \in S_g} \textrm{Cov}(z_i^2, z_j^2)$|⁠, which gives equation
(1)
Let |$S_g$| and |$S_{g^{\prime}}$| denote the sets of SNPs for |$g$| and |$g^{\prime}$|⁠, respectively. We can compute the induced correlation between the quadratic test statistics for two sets of SNPs, |$\textrm{Cor}(T_g, T_{g^{\prime}}) = \textrm{Cov}(T_g, T_{g^{\prime}}) / \sqrt{\textrm{Var}(T_g) \cdot \textrm{Var}(T_{g^{\prime}})}$|⁠, using the induced covariance between the test statistics,
(2)

For nearby gene sets |$g$| and |$g^{\prime}$|⁠, these correlations can be quite strong. This can confound the interpretability—and even meaning—of the guarantees from multiple testing procedures. To avoid these issues, we will modify the construction of our gene sets to bound the pairwise correlation between the resulting gene sets.

Agglomerative LD loci testing

To account for substantially correlated test statistics, we introduce an agglomerative procedure to group highly correlated genes that are within |$w$| megabases (Mb) into sets of genes we refer to as LD loci or simply loci. Given an LD threshold |$r^2$|⁠, we apply the following procedure to a set of genes |$G$| within a chromosome:

  1. Compute |$\textrm{Cor}(T_{g}, T_{g^{\prime}})$| for all pairs of genes, |$g, g^{\prime} \in G$|⁠, within |$w$| Mb of each other (using Equations 2 and 1).

  2. Repeat the following until |$(\textrm{Cor}(T_{g}, T_{g^{\prime}}))^2 < r^2$| for all remaining pairs of genes/loci in |$G$|⁠:

    • Find genes/loci |$\{g_*, g_*^{\prime}\} = \underset{g,g^{\prime} \in G, g \neq g^{\prime}}{\textrm{arg max}}\ \textrm{Cor}(T_{g}, T_{g^{\prime}})$|⁠,

    • Merge |$\{g_*, g_*^{\prime}\}$| into locus |$g_{LD}$|⁠,

    • Update |$G = G \setminus \{g_*, g_*^{\prime}\} \cup \{g_{LD}\}$|⁠, and compute |$\textrm{Cor}(T_{g_{LD}}, T_{g^{\prime}})$| for all |$g^{\prime} \in G$| within |$w$| Mb of |$g_{LD}$|⁠.

This is essentially agglomerative hierarchical clustering, but with a linkage determined by the LD-based correlation structure of the test statistics. We compute the quadratic test statistic, |$T_g$|⁠, for each remaining gene/locus |$g \in G$|⁠.

Because the resulting distribution |$T_g$| does not have a known closed-form solution, we use a Monte-Carlo based approach for computing the |$P$|-value |$p_g$| for the gene/locus to test the null hypothesis that its |$n_g$|-dimensional vector of SNP-level |$z$| statistics are not associated with trait status. We generate |$B$| draws of null, |$n_g$|-dimensional Gaussian random variables, |$\boldsymbol{z}_g^* \sim \textrm{Normal}(\boldsymbol{0}_g, \boldsymbol{\Sigma }_g)$|⁠. A quadratic test statistic |$T^*_b$| is calculated for each of the |$b \in [B]$| draws, resulting in an empirical |$P$|-value:
(3)

To generate the |$B$| samples, we use the Cholesky decomposition of |$\boldsymbol{\Sigma }_g = \textbf{L} \textbf{L}^{\textrm{T}}$|⁠, where |$\textbf{L}$| is a lower triangular |$n_g \times n_g$| matrix. A single sample is generated by multiplying |$\textbf{L}$| with a vector of |$n_g$| independent, standard Gaussian random variables. Across |$B$| samples, the combined computational cost is |$O(n_g^3 + Bn_g^2)$|⁠. Using an efficient implementation of these steps [11], for a gene/locus with |$n_g \approx $| 1000 SNPs, we can generate one million draws in ¡10 min with a 3.6 GHz Intel processor. Parallelization can be used to further increase speed.

Overview of GWAS data and eQTL sources

Our investigation focuses on reported GWAS |$z$| statistics, |$\{z_i, i = 1, \dots , n\}$| measuring SNP-level association with ASD [2], SCZ [12] and EA [13]. For GWAS results of one phenotype, we explore SNP-level association statistics from the other two GWAS as potential sources of metadata due to previous evidence of their genetic correlation [18]. We consider |$n = 5 238 256$| SNPs whose alleles could be aligned across all three phenotypes and with minor allele frequency (MAF) |$> 0.05$| based on the 1000 Genomes EUR sample reference data [19]. Also, for these SNPs, their hg19 variant locations could be converted to GRCh38 using the LiftOver utility from the UCSC Genome Browser (http://genome.ucsc.edu/). Probably due to a smaller sample size, ASD has lower power: 18 381 cases and 27 969 controls, in comparison to SCZ with 33 426 cases and 32 541 controls, and EA with |$\approx $| 1.1 million subjects (Supplementary Figure S1). Because we focus on detecting associations for ASD, a neurodevelopmental disorder, we leverage two different sources of cortical tissue to identify eSNPs for functional SNP-to-gene assignment. The first source of eSNPs was obtained from the BrainVar study of dorsolateral prefrontal cortex from 176 individuals sampled across a developmental span [20]. We identified 151 491 cis-eQTL SNP-gene pairs meeting BH |$\alpha \leq 0.05$| for at least one of the three sample sets: prenatal (112 individuals), postnatal (60 individuals), as well as across the complete study. This corresponds to 123 664 eSNPs associated with 6660 genes, with |$85\%$| of the eSNPs associated with one unique cis-eQTL gene pairing.

The second source is adult cortical tissue cis-eQTLs from the Genotype-Tissue Expression (GTEx) V7 project dataset [21]. Instead of using eQTLs as reported by GTEx, to be consistent with the BrainVar eQTL definition, we identified 414 405 cis-eQTL SNP-gene pairs meeting BH |$\alpha \leq 0.05$| for either Frontal Cortex BA9 or Anterior cingulate cortex BA24 samples based on the tissue specific files for all SNP-gene associations available at gtexportal.org. This resulted in 313 316 GTEx eSNPs associated with 9012 genes, where |$78\%$| of the eSNPs are associated with one gene. However, we observe an overlap of 55,313 cis-eQTL SNP-gene pairs with BrainVar, culminating in 510 583 unique cis-eQTL SNP-gene pairs with 370 749 eSNPs associated with 12 854 genes across the union of BrainVar and GTEx sources.

GENCODE version

We use GENCODE v21 [22] for our list of genes with their respective start and end positions based on genome assembly version GRCh38. This matches the version used in the BrainVar study, but differs from GTEx, which is based on v19. When identifying GTEx eQTLs, we removed 187 genes from GENCODE v19 that do not match Ensembl IDs in v21. This provides us with an initial list of |$G = 57 005$| candidate genes to potentially assign SNPs to, based on either positional or functional eSNP status.

Metadata

For each gene/locus |$g$|⁠, we create a vector of metadata |$x_g$| collected independently of |$p_g$|⁠. This process is completed in the same manner for both the Positional and Positional + eSNPs collections. First, we consider the number of SNPs assigned to a gene/locus, |$n_g = |S_g|$|⁠, which can be viewed as statistical information relevant to the power of the quadratic test statistic. Additionally, we include one-sided |$z$| statistics, i.e., |$z_g = \Phi ^{-1}(1 - p_g)$|⁠, constructed using the gene/locus-level |$P$|-values from independent GWAS results. For our target phenotype ASD, we use SCZ and EA GWAS |$z$| statistics, |$z_g^{SCZ}$| and |$z_g^{EA}$|⁠, while for SCZ (EA) we use |$z_g^{EA}$| (⁠|$z_g^{SCZ}$|⁠) and |$z_g^{ASD}$| as metadata.

Given the set of eSNPs |$S_g^{eQTL}$| associated with single genes or genes in LD locus |$g$|⁠, we summarize the expression level as the average absolute eQTL slope in a relevant source to obtain |$\bar{\beta }_g^{source}$| for five sources: three BrainVar developmental periods (pre-, post- and complete) and two adult GTEx cortical regions. Furthermore, we account for weighted gene co-expression network analysis (WGCNA) [17] modules by creating two sets of indicators, one set for the 20 modules reported in the BrainVar study and another for eight modules constructed using the GTEx cortical tissue samples. For simplicity, we also construct indicators denoting if gene/loci is not included in any of the modules.

We also include additional context about the gene/loci. Indicator variables determine four GENCODE biotypes: protein coding, antisense, long non-coding RNA and other. Using gnomAD v2.1.1 [23], we associate with each gene its loss-of-function observed / expected upper fraction (LOEUF) value, which indicates the gene’s tolerance to loss-of-function. Because a lower LOEUF score indicates strong selection against loss-of-function, we include the minimum LOEUF across all genes in an LD locus in our vector of metadata |$x_g$|⁠.

AdaPT implementation

Given a collection of gene/locus-level |$P$|-values and metadata, |$(p_g, x_g)_{g \in G}$|⁠, we apply AdaPT to select a subset of discoveries with FDR control at target level |$\alpha = 0.05$|⁠. AdaPT is guaranteed finite-sample FDR control under the assumption of independent null |$P$|-values, and was demonstrated to maintain control in weak, positive correlated scenarios [7], such as the one considered here.

We incorporate metadata from the feature space |$x_g \in \mathcal{X}$| using XGBoost [24] (see Supplementary information for details). XGBoost is a popular implementation of gradient boosted trees, which constructs a flexible predictive function as a weighted sum of many simple trees, fit using a gradient descent procedure that minimizes a specified objective function. The two objective functions considered in our AdaPT context correspond to estimating the probability that a hypothesis is non-null, and the distribution of effect size for non-null hypotheses. XGBoost gives us flexibility to include many potentially useful covariates without being overly concerned about the functional form with which they enter the model or their marginal utility. However, overfitting in this context will lead to a loss of power. To find appropriate settings for the gradient boosted trees (number of trees, learning rate and maximum depth), we first ‘tune’ AdaPT’s performance with synthetic SCZ |$P$|-values that are aligned with the ASD |$P$|-values in the following manner:

  1. Sort SCZ and ASD |$P$|-values: (⁠|$p_{(1)}^{SCZ}, \dots , p_{(G)}^{SCZ}$|⁠) and (⁠|$p_{(1)}^{ASD}, \dots , p_{(G)}^{ASD}$|⁠)

  2. Replace SCZ with ASD |$P$|-values by matching order, i.e., replace |$p_{(1)}^{SCZ}$| with |$p_{(1)}^{ASD}$|⁠, |$p_{(2)}^{SCZ}$| with |$p_{(2)}^{ASD}$|⁠, |$\dots $|

This transforms the SCZ signal to match the weaker signal in ASD. We then proceed to apply AdaPT using these synthetic SCZ |$P$|-values to find candidate settings which yield the highest number of synthetic SCZ discoveries at FDR level |$\alpha = 0.05$|⁠. Finally, for each phenotype and positional assignment, we use two cross-validation steps within AdaPT [7] to select from among these candidate settings to generate our AdaPT: XGBoost results using the adapt_xgboost_cv() function in the adaptMTR package (available at https://github.com/ryurko/adaptMT).

Kernel smoothing localization

Following the selection of interesting genes/loci, researchers may be interested in ‘zooming’ in on localized signals at the gene and/or SNP-level. For a selected gene/locus |$g^*$| and its corresponding set of SNPs |$S_{g^*}$|⁠, we smooth over the squared |$z$| statistics of the locus’ positional SNPs |$S_{g^*}^{Pos} \subseteq S_{g^*}$|⁠, given their genomic positions (BP) using the Nadaraya–Watson estimator:
(4)
in which |$K_h$| is a one-dimensional Gaussian kernel with bandwidth |$h$| selected separately for each gene/locus using generalized cross-validation (as implemented in the np package [25]). We then provide the option to display any subset of eSNPs |$S_{g^*}^{eQTL} \subseteq S_{g^*}$| separately as bars with their heights indicating individual SNP-level signal. This separation is due to the presence of intergenic eSNPs, however, any eSNPs that are also positionally assigned to genes, |$S_{g^*}^{Pos} \cap S_{g^*}^{eQTL}$|⁠, are included in the positional smoothing in Equation 4.

Results

Assigning SNPs to genes and generating LD loci

We assign SNPs to genes using the two approaches: Positional, which assigns 2 779 780 SNPs to 40 581 genes; and Positional + eSNPs, which includes an additional 109 042 intergenic cortical tissue eSNPs resulting in 2 888 822 SNPs assigned to 41 301 genes. Next, we generate genes/loci based on the LD-induced correlation of gene-based test statistics using the agglomerative algorithm with window size |$w = 6$| Mb and with one of three thresholds for |$r^2$|⁠, 0.25, 0.50, 0.75. The number of independent genes/loci decreases substantially as the threshold becomes more strict for both SNP assignment types (Figure 2). Even a relatively high threshold of |$r^2 = 0.75$| reduces the number of Positional + eSNPs (Positional) gene/locus tests from 41,301 (40,581) to 37,522 (37,114). We report the conservative threshold |$r^2 = 0.25$| in the body of the manuscript (see Supplementary information for results with |$r^2 \in \{0.50, 0.75\}$|⁠). Due to the conservative threshold we combine 17 915 genes to form 4136 LD loci for Positional + eSNPs and 16 625 genes to form 3985 LD loci for the Positional approaches. Over 75% of these loci contain five or fewer genes while the largest is a chromosome 11 locus that groups over sixty genes, most of which encode olfactory receptors. Combined with the 23 386 and 23 956 individual genes that were not merged, this results in 27 522 and 27 941 genes/loci for testing. The reduction of independent testing units highlights the correlation among genes that is often ignored in gene-based testing. We then compute the gene/locus quadratic test statistics and |$P$|-values for each phenotype using the Monte Carlo-based approach in Equation 3 with |$B$| equal to 2 million simulations.

Comparison of the number of genes/loci following our agglomerative algorithm over a range of values for the induced LD threshold $r^2 \in \{ 0.25, 0.50, 0.75 \}$ by positional type (in color). The initial number of genes is provided for reference (corresponding to $r^2 = 1$).
Figure 2

Comparison of the number of genes/loci following our agglomerative algorithm over a range of values for the induced LD threshold |$r^2 \in \{ 0.25, 0.50, 0.75 \}$| by positional type (in color). The initial number of genes is provided for reference (corresponding to |$r^2 = 1$|⁠).

AdaPT models and results

To generate AdaPT: XGBoost results, we first tune the procedure based on synthetic SCZ |$P$|-values, which mimic the distribution of ASD P-values, to find optimal XGBoost settings. To avoid over-fitting, we consider shallow trees with maximum depth |$\in \{1, 2\}$|⁠, while searching over the number of trees |$P \in 100,\dots , 450$| by increments of fifty and the learning rate |$\eta \in 0.03, \dots , 0.06$| by increments of 0.01. To analyze the real |$P$|-values for each phenotype and thereby select associated genes/loci, then, these top setting combinations (Supplementary Table S1) were used in the AdaPT cross-validation steps and while targeting FDR control at |$\alpha = 0.05$|⁠.

To find genes/loci associated with each phenotype, our AdaPT: XGBoost implementation fits a mixture model that requires two parameters– the probability of association and the effect of each gene/locus on the phenotype— these are estimated separately for each step of the algorithm. Variables in the metadata inform on each of these parameters to different degrees; see Supplementary information and Supplementary Table S2 for details on measuring variable importance. For the Positional + eSNPs results, the number of SNPs per gene/locus and z-statistics for at least one genetically correlated trait are important predictors for all three phenotypes (Figure 3). SCZ and EA, in contrast to ASD, display increased importance for LOEUF and membership in a WGCNA module constructed from the GTEx cortical tissue samples (Figure 3) Lower LOEUF values, which indicate lower tolerance to loss-of-function, were more likely to be associated with SCZ and EA. We observe similar patterns of variable importance for the Positional results (Supplementary Figure S2).

The most important variables (top five) predicting association with phenotype as ranked by XGBoost for the Positional + eSNPs assignment of SNPs. Variables are sorted in order of importance, while the color of the bars denote the separate parameters for the AdaPT implementation: probability of association (orange) and of non-zero effect size (blue).
Figure 3

The most important variables (top five) predicting association with phenotype as ranked by XGBoost for the Positional + eSNPs assignment of SNPs. Variables are sorted in order of importance, while the color of the bars denote the separate parameters for the AdaPT implementation: probability of association (orange) and of non-zero effect size (blue).

Comparing the number of genes/loci selected by AdaPT to baseline results of intercept-only versions of AdaPT and BH, there is a clear gain in gene/locus discovery by accounting for metadata through the AdaPT: XGBoost implementation, regardless of phenotype and SNP-to-gene assignment approach (Figure 4A, Supplementary Table S3, see also Supplementary information, Supplementary Figures S3 and S4 for results with LD threshold of |$r^2 \in \{0.50, 0.75 \}$|⁠). Unsurprisingly, the number of associated genes/loci is much larger for SCZ and EA than for ASD, likely due to the lower power of the original ASD GWAS. For Positional ASD, we see that the intercept-only version of AdaPT fails to select any genes/loci due to the weak signal without inclusion of metadata (Figure 4A).

(A) Comparison of the number of selected genes/loci at FDR level $\alpha = 0.05$ for each phenotype by positional assignment. AdaPT: XGBoost results are presented in comparison to BH and AdaPT: intercept-only baselines which do not account for metadata. (B) Comparison of the proportion of implicated genes that overlap between the two types of positional assignment results, based on the AdaPT: XGBoost results for each phenotype.
Figure 4

(A) Comparison of the number of selected genes/loci at FDR level |$\alpha = 0.05$| for each phenotype by positional assignment. AdaPT: XGBoost results are presented in comparison to BH and AdaPT: intercept-only baselines which do not account for metadata. (B) Comparison of the proportion of implicated genes that overlap between the two types of positional assignment results, based on the AdaPT: XGBoost results for each phenotype.

Comparison of phenotypic results

For ASD, analysis of Positional + eSNPs identifies 483 genes, of which 405 cluster in 47 loci and 78 are unclustered, whereas analysis of Positional SNPs alone yields 447 genes, of which 370 cluster in 54 loci (Supplementary Table S3). A substantial portion of these genes overlap (Figure 4B). While similar patterns emerge for SCZ and EA, the ratio of unclustered to clustered genes increases with increasing number of genes/loci detected: 0.193 for ASD, 0.414 for SCZ and 0.681 for EA. This presumably reflects greater power to detect small effects of a SNP on phenotype with larger sample size: decay of this signal tends to cause it to fall below the threshold of detection for SNPs in nearby genes. In contrast, the proportion of genes uniquely identified when eSNP information is included is substantially higher for ASD than it is for SCZ or EA (Figure 4B), again likely due to lower power for the ASD sample.

As expected, for all three phenotypes, the number of unclustered genes increases with increasing threshold |$r^2$| (Supplementary Table S4). If, however, we assume that signal should be sparsely distributed across the genome, then the sum of LD loci and unclustered genes, for |$r^2=0.25$|⁠, should be a reasonable estimate of genes associated given the current data. This translates into 125 genes for ASD, 2277 for SCZ,and 6598 for EA, and correspondingly 0.30, 5.51 and 15.98% of the total 41 301 genes analyzed (including protein-coding and non-coding genes). For SCZ and EA, the total number of associated genes per chromosome declines strongly and linearly with chromosome size (Supplementary Figures S5–S7), which itself is correlated with the total number of genes per chromosome. This pattern, while more variable, is also seen in a breakdown of genes into protein-coding and other non-coding types (Supplementary Figure S6). For ASD, however, an unusually large number of genes are associated on chromosomes 3, 8, 15 and 17, relative to chromosome size, and the associated gene counts show only a modest relationship with chromosome size (Supplementary Figures S5–S7), presumably due to lower power and resulting lower number of associated genes per chromosome.

Exploring signal in selected genes/loci

LD loci, as we define them, are expected to exhibit correlated association signal. Nonetheless, the signal is unlikely to be distributed evenly across the locus, instead in many instances it will be concentrated near one or more SNPs generating the signal, depending on the LD pattern in the locus. The same is true for signal in non-clustered genes. To interactively explore localized signal within genes/loci, we developed an LD locus zoom application using R Shiny [14, 15] and plotly [16] (available here: https://ron-yurko.shinyapps.io/ld_locus_zoom/). This tool displays the gene/locus of interest, represents genes by their location therein, and highlights the association signal by kernel smoothing for positional SNP signals, including interpolation.

3.7 Mb deletion region in chromosome 17: One of the associated loci, roughly 500 Kb, falls in the well-known 3.7 Mb 17p11.2 deletion/duplication region associated with Smith-Magenis syndrome (deletion, OMIM:182290) [26] and Potocki-Lupski syndrome (duplication, OMIM:610883) [27]. The associated LD locus displays overlapping genes and signals from eSNPS and positional SNPs for ASD, SCZ and EA (Figure 5A). For reference, the gray dotted line denotes a point-wise |$95{th}$| percentile of 1000 simulations for squared null Gaussian random variables, |$z_{g^*}^2$| where |$z_{g^*} \sim \textrm{Normal}(\boldsymbol{0}_{g^*}, \boldsymbol{\Sigma }_{g^*})$|⁠, given the LD structure |$\boldsymbol{\Sigma }_{g^*}$| of the selected LD locus |$g^*$|⁠. Notably, a single gene in the locus has been associated with all three phenotypes to various degrees, RAI1 [28]. Known to be dosage-sensitive, both deletion and duplication of a single copy of the gene is sufficient to elevate the likelihood of ASD substantially and to diminish cognitive function strongly [28]. Curiously, only the smoothed signal for EA association peaks over RAI1, whereas it peaks over TOM1L1 for ASD and further proximal for SCZ. The peak association for EA over RAI1 also coincides with the SNP having the lowest P-value for association with this phenotype: this index SNP rs11655029 has P-value |$2.84 \times 10^{-9}$| and falls in an intron of RAI1. While RAI1 is a prime candidate for the target gene in this locus for ASD, due to prior evidence, the peak signal located over TOM1L2 is intriguing, especially because its index SNP rs4244599, while not strongly associated (P-value = 0.0018), is an eQTL for TOM1L2. The index SNP for SCZ, rs8082590, has P-value |$6.02 \times 10^{-8}$|⁠, close to the traditional threshold for GWAS, and coincides roughly with its smoothed signal for association over GID4.

Gene/locus zoom displays for (A) locus falling in the chromosome 17 3.7 Mb Smith-Magenis syndrome region, (B)FOXP1 in chromosome 3 and (C) locus in chromosome 17 1 Mb inversion region. (A)-(C) The genes located within the locus are represented in rectangles denoting their respective start and end positions below the smooth display, arranged by position and size to prevent overlapping. The gene/locus level kernel smoothing of ASD signal (black line) is displayed along with SCZ (red line) and EA (blue line) kernel smoothing signal (both are normalized to appear on the same signal scale as ASD). The gray dotted line denotes a point-wise $95{th}$ percentile of 1000 simulations for null simulations, and vertical bars correspond to eSNP signals (colored by their associated genes). A subset of genes are highlighted with colors in (A) and (B), with other genes represented by gray rectangles. Additionally, individual genes are labeled in (A): RAI1 and TOM1L2, (B): FOXP1 and (C): MAPT and KANSL1, with gene direction (+ indicates left to right, - indicates right to left).
Figure 5

Gene/locus zoom displays for (A) locus falling in the chromosome 17 3.7 Mb Smith-Magenis syndrome region, (B)FOXP1 in chromosome 3 and (C) locus in chromosome 17 1 Mb inversion region. (A)-(C) The genes located within the locus are represented in rectangles denoting their respective start and end positions below the smooth display, arranged by position and size to prevent overlapping. The gene/locus level kernel smoothing of ASD signal (black line) is displayed along with SCZ (red line) and EA (blue line) kernel smoothing signal (both are normalized to appear on the same signal scale as ASD). The gray dotted line denotes a point-wise |$95{th}$| percentile of 1000 simulations for null simulations, and vertical bars correspond to eSNP signals (colored by their associated genes). A subset of genes are highlighted with colors in (A) and (B), with other genes represented by gray rectangles. Additionally, individual genes are labeled in (A): RAI1 and TOM1L2, (B): FOXP1 and (C): MAPT and KANSL1, with gene direction (+ indicates left to right, - indicates right to left).

FOXP1 in chromosome 3: An unclustered protein-coding gene on chromosome 3 highlights a set of associated SNPs in FOXP1, which has previously been strongly associated with intellectual disability, language impairment and ASD [29] (Figure 5B). The smoothed association signal for ASD falls near the gene’s 3’ end whereas, for SCZ and EA, it falls close to the gene’s 5’ region. It is possible that variation associated with ASD is regulating FOXP1’s expression quite differently than that for SCZ and EA. The index SNP for EA is GWAS-significant and the one for SCZ approaches it (rs55736314 and rs4677597 with P-values |$1.63 \times 10^{-16}$| and |$3.87 \times 10^{-7}$| respectively) and both fall closer to the 5’ region of FOXP1. For ASD, its index SNP rs7616330 carries a more modest signal (P-value |$2.26 \times 10^{-4}$|⁠). Compared to the complete gene, it falls toward the 3’ end, but it also falls quite close to the 5’ start site of certain transcripts, such as ENST00000650387.1. Given the prominent role that FOXP1 has in ASD and cognitive function, we speculate that the differential location of signal for SCZ/EA versus ASD could trace to different transcripts and regulation of their expression.

1 Mb inversion region in chromosome 17: Another locus of interest is a 1.5 Mb region of chromosome 17, namely 17q21 (Figure 5C). This region of the genome is well known in human genetics because it comprises a 1.5 Mb inversion polymorphism [30, 31] and the inversion alleles, actually haplotypes, have been associated with a wide variety of neurodegenerative disorders, including Progressive Supranuclear Palsy [32], corticobasal degeneration [33], frontotemporal dementia [34] and other tauopathies [35]. In this locus, altered MAPT is well known to affect risk for late-life neurodegenerative disorders. Moreover, the inversion itself inhibits recombination, rendering alleles at SNPs across this region in high LD. Indeed, the complexity of this region inspires the interactive features of our application (conveyed in Supplementary Figures S8 and S9) with subsets of genes and their associated eSNPs. Our results suggest variation in the region is associated with all three phenotypes. The index SNP for EA exceeds the standard threshold for single SNP significance (rs74998289, P-value |$1.31 \times 10^{-17}$|⁠), while for ASD it approaches it (rs12942300, P-value |$2.06 \times 10^{-6}$|⁠). Variation in the region has been previously implicated in ASD susceptibility [36]; more recently, KANSL1 expression has been implicated in cognitive function and ASD [37]. While the smoothed association signals for ASD and EA show a peak over this gene, signal is distributed across many genes in this locus and gene-based analysis is unlikely to pinpoint any gene therein. A clue to the driver or drivers of association comes by comparison of panels A–C in Figure 5. In the inversion region (Figure 5C), eSNP signals are noticeably more enriched compared to the other two loci and indeed the index SNP for EA is an eSNP for KANSL1. As highlighted by the display (Figure 5C), however, careful statistical and molecular analyses will be required to pinpoint what variation and what genes influence each of the three phenotypes at this locus.

Enrichment analysis

A primary motivation for gene-based analyses is to garner insight into the biological mechanisms underlying the phenotype by evaluating the set of genes associated with it. A standard approach infers these mechanisms by gene-set enrichment analysis, which we will implement in two ways. First, we will use the FUMA GENE2FUNC tool [38] for gene ontology (GO) enrichment analysis of the AdaPT: XGBoost genes/loci. This analysis has the advantage of searching through myriad functional sets of genes in an agnostic fashion. Still, this could also be viewed as a disadvantage if, a priori, certain biological functions are likely to affect the phenotype. For example, for all three phenotypes analyzed here, rare variation has already provided evidence linking synaptic, epigenetic and transcription factor genes to them. Moreover, for ASD there exists a substantial set of genes implicated in risk by studies of rare variation. For these reasons, we implement a second gene-set enrichment analysis, specifically GSEA [39], which is a tool for testing if different sets of genes are enriched in a ranked gene list. We perform GSEA at the gene/locus-level ranked by their one-sided |$z$| statistics, using five different sets of genes/loci to test for enrichment at the top of the phenotype-specific ranked list: (1) brain expressed, (2) synaptic, (3) epigenetic, (4) transcription factors and (5) 102 ASD risk genes identified based on de novo and case-control variation [40] (see Supplementary information and Supplementary Table S5 for details on compilation of gene lists).

Another reason to use both approaches is that they allow us to handle a confounder, gene size, in different ways. In our analyses, gene/locus size (and the number of SNPs therein) is a predictor of association: larger genes/loci are more likely to be associated (Supplementary Figure S10). So enrichment analysis should control for gene size in some way. One approach is to contrast the associated set of genes with control genes after matching on gene size. We will use this approach in the FUMA analyses (see Supplementary information for details). However, a substantial portion of brain-expressed genes are synaptic, which are known to be among the largest genes in the genome. We reasoned that if these phenotypes were influenced by variation altering function of synaptic genes—which rare variant studies suggest they are [40–43]—then matching on gene size will tend to match associated synaptic genes with other synaptic genes, thereby over-matching and lowering the power to detect synaptic association. For this reason, in the GSEA analysis, we first remove the effect of gene size on the association |$z$| statistics by regressing them on gene size, then entering the residual |$z^{\prime}$| values into the GSEA analysis.

Rather than include all genes in the associated loci for the FUMA enrichment analysis, we used the kernel smoothing results to identify signal genes (see Supplementary information). For example, this reduces the number of ASD Positional + eSNPs genes from 483 to 464 signal genes. After matching for gene size, no GO terms show enrichment for associated ASD genes. Genes associated with SCZ display GO enrichment: 698 terms in biological processes, 163 terms in cellular components and 108 in molecular function. The EA genes display enrichment for 114 terms in biological processes, 64 in cellular components and 27 in molecular function (Supplementary Table S3). Using REVIGO [44] to summarize these terms, they highlight neuron projection, synaptic function, cell adhesion, cell cycle, chromosome organization and many more for both SCZ and EA (Supplementary Figures S11 and S12).

GSEA analysis of five gene sets (brain expressed, synaptic, epigenetic, transcription factor and ASD risk) finds ASD implicated genes enriched for two gene sets, synaptic and epigenetic genes (Figure 6, Supplementary Figure S13), whereas for SCZ and EA, all five gene sets are enriched (Supplementary Table S3).

Enrichment analysis of genes/loci ranked by one-sided ASD z statistics, adjusted for size, by positional assignment. (A) BH-adjusted GSEA $P$-values are displayed on the -log$_10$ scale, with FDR target level $\alpha = 0.05$ indicated by dashed gray line. Enrichment score for (B) synaptic and (C) epigenetic genes/loci with tick marks denoting gene/locus rank based on Positional + eSNPs ASD z statistics, adjusted for size, in descending order.
Figure 6

Enrichment analysis of genes/loci ranked by one-sided ASD z statistics, adjusted for size, by positional assignment. (A) BH-adjusted GSEA |$P$|-values are displayed on the -log|$_10$| scale, with FDR target level |$\alpha = 0.05$| indicated by dashed gray line. Enrichment score for (B) synaptic and (C) epigenetic genes/loci with tick marks denoting gene/locus rank based on Positional + eSNPs ASD z statistics, adjusted for size, in descending order.

Conclusion

GWAS studies have identified myriad associations between SNPs and human phenotypes. With few exceptions, these associations are weak. One of the goals in this study is to improve power to detect weak associations through gene-based tests, while also accounting for LD among SNPs in different genes, which can induce correlations among tests for different genes. Although such correlation can elevate the false positive rate and obscure the interpretation of gene-based test results, most published studies have ignored it. To account for LD, we introduce a workflow (Figure 1) to aggregate genes into loci if the expected dependence of their test statistics exceeds a pre-specified threshold. This approach produces a notable reduction in the number of genes tested: nearly 4000 are aggregated even for a tolerant threshold for correlation of test statistics (aggregate if |$r^2>0.75$|⁠; Figure 2).

We recommend practitioners use a stringent LD |$r^2$| threshold, such as the one we use to analyze the data reported here (⁠|$r^2$| = 0.25). This ensures test statistics for genes/loci are largely independent, it will be appropriately conservative regarding reporting of the number of discoveries, and, for loci, it will highlight the complexity of causal inference therein. Practitioners will also need to set the window size to search for correlation among test statistics. For our analyses, we chose a large one, 6 Mb. This is computationally expensive, compared to a smaller window. A smaller window will cover most loci, with a few known exceptions including the chromosome 6p MHC region. One should also note that, like all gene-based test approaches that account for LD structure, our approach is sensitive to mismatch of the LD reference panel. Such a mismatch could impact our results in two ways: (1) the individual test-level statistics could have the inappropriate null distribution, (2) the induced correlations between gene-level test statistics will be incorrect, potentially leading to over/under-clustering of the genes. We believe that (1) is the larger concern, and it is a concern that is shared by any testing approach that aggregates SNP-level test statistics (e.g. MAGMA or VEGAS). Properly capturing this sensitivity to correlation in gene-level testing is a topic for future research.

With a collection of uncorrelated or only weakly correlated genes/loci, any gene-based test can be applied. To improve power, we adopt the FDR control framework for hypothesis testing and a particular implementation of it, AdaPT. AdaPT has the potential to increase power over the classical Benjamini–Hochberg (BH) procedure for FDR control by accounting for covariates, which collectively we call metadata, to inform on which genes are likely to be true positives even though their test statistics failed to cross the BH threshold for significance. Importantly, like BH, AdaPT also maintains finite-sample FDR control.

We applied AdaPT to data from three phenotypes, ASD, SCZ and EA, all of which are genetically correlated. Notably, the ASD GWAS was weakly powered compared to the well-powered GWAS for SCZ and EA. Although AdaPT increased power for every phenotype and setting, relative to BH, the largest improvement was achieved for ASD (Figure 4A, Supplementary Figures S3 and S4). We believe the most likely explanation for these contrasting results is the differential information content of the original GWAS. Even without the additional power from AdaPT, the BH-corrected GWAS for SCZ and EA identifies many genes/loci: there is little to be gained by analyses with more power. By contrast, there is far more to be gained by additional power for ASD analyses. Gene-based tests that incorporated the impact of SNPs on gene expression also increased power of gene discovery (Figure 4B, Supplementary Figure S5). Reflecting the genetic correlation, summary statistics for other phenotypes were always useful predictors for the AdaPT model (Figure 3, Supplementary Figure S2). Gene conservation also played an important role, as did the size of the gene/locus (Figure 3, Supplementary Figure S10).

That gene/locus size was a useful predictor for associated genes is of limited biological and genetic interest; however, it generates an interpretative challenge when the set of associated genes are evaluated for functional relevance, such as by GO enrichment analysis. One approach to enrichment analysis would match associated genes with control genes by size and we did this in our GO analyses. Such an approach, however, could be conservative. For example, a substantial portion of brain-expressed genes are synaptic, they tend to be large, and synaptic genes likely play a role in all three phenotypes analyzed here [40–43]. Yet, matching on gene size will tend to contrast associated synaptic genes with other synaptic genes, lowering power to detect their enrichment. For this reason, we also took a different approach to enrichment analysis. Specifically, we regressed the association statistics on size, then entered the residual value into a GSEA analysis. The end result, for ASD, is that synaptic genes were not enriched when genes were matched before GO analysis, while they were enriched in the GSEA analysis (Figure 6, Supplementary Figure S13). The same is true for a set of epigenetic genes, which include chromatin readers, remodelers and so on. Both results agree well with previous rare variant studies [40, 41]. On the other hand, for the well powered SCZ and EA studies, such enrichment emerges regardless of the way gene size is controlled (Supplementary Figures S11 and S12).

Here we demonstrated the gain in power of evaluating gene-based association statistics using AdaPT, guided by metadata, to detect genes affecting three specific phenotypes, ASD, SCZ and EA. As our results show, the greatest gain in power is achieved when the underlying study has intermediate power, neither too high nor too low. Power gains will be modest for highly powered studies and absent for very weakly powered studies. We believe that AdaPT, guided by metadata, can be applied to a wide variety of omics problems, although it will undoubtedly require some adaptation for the specific problem to be solved. The advantage of doing so is 2-fold, increased power and increased interpretability. We are especially interested in the latter, as a means of expediting our understanding of the etiology of human diseases, disorders and other phenotypes.

Although gene-based association draws attention to an important functional unit (or units for locus-based association), it does not inform on what variation drives the association. We developed an interactive visualization tool (https://ron-yurko.shinyapps.io/ld_locus_zoom/) for exploring the localization of association signal within associated genes/loci and generating hypotheses about mechanisms of action (Figure 5, Supplementary Figures S8 and S9). For example, using this tool to explore the association of FOXP1 [29] reveals that the pattern of association in the gene is different for ASD versus EA and SCZ (Figure 5B) and suggests that different patterns of gene expression could be important for these phenotypes. Another example is the 500Kb associated locus within the 17p11.2 deletion/duplication (Smith-Magenis syndrome) region (Figure 5A) [26, 27]. In this locus, all three phenotypes have been associated to varying degrees to one gene, RAI1 [28]. Curiously, however, only the association signal for EA peaks within this gene, whereas for ASD it unexpectedly maximizes over an adjacent gene. While we believe this tool can be a useful guide to researchers, this example also underscores a limitation of this approach: we cannot provide error rate guarantees at the localized level. Such an analysis will be a target of our future work in this challenging area.

Key Points
  • We introduce an agglomerative algorithm to group genes into loci on the basis of LD.

  • We extend previous work to model metadata with gradient boosted trees and greatly improve our power to detect interesting genes/loci associated with ASD.

  • We develop an interactive locus zoom visualization tool to explore the localized signal within our selected genes/loci; it also accepts custom results uploaded by researchers.

Author Contributions Statement

R.Y., K.R., B.D. and M.G. conceived the study; R.Y. conducted the analyses; R.Y., K.R., B.D. and M.G. wrote and reviewed the manuscript.

Funding

National Institute of Mental Health (Grant R37MH057881), the Simons Foundation (Grant SFARI SF575097) and the National Science Foundation (Grant DMS 1613202).

Data Availability Statement

The data and code used in this manuscript are available at https://github.com/ryurko/Agglomerative-LD-loci-testing.

Ronald Yurko is pursuing a Ph.D. degree in Statistics & Data Science at Carnegie Mellon University. His research interests include statistical genetics, clustering and selective inference.

Kathryn Roeder is a UPMC chaired professor at Carnegie Mellon University. She is a member of the National Academy of Sciences. Her research interests include statistical genetics, multi-omics and neurodevelopment.

Bernie Devlin is a professor at University of Pittsburgh, School of Medicine. He leads the Computational Genomics Program and is a fellow of American Association for the Advancement of Science. His research interests include statistical and psychiatric genetics.

Max G'Sell is an assistant professor of Statistics in the Department of Statistics & Data Science at Carnegie Mellon University. His research focuses on questions at the intersection of statistical inference and machine learning.

References

1.

MacArthur
J
,
Bowler
E
,
Cerezo
M
, et al.
The new nhgri-ebi catalog of published genome-wide association studies (gwas catalog)
.
Nucleic Acids Res
2017
;
45
(
D1
):
D896
901
.

2.

Grove
J
,
Ripke
S
,
Als
TD
, et al.
Identification of common genetic risk variants for autism spectrum disorder
.
Nat Genet
2019
;
51
(
3
):
431
44
.

3.

de Leeuw
CA
,
Mooij
JM
,
Heskes
T
, et al.
Magma: Generalized gene-set analysis of gwas data
.
PLoS Comput Biol
2015
;
11
(
4
):
1
19
.

4.

Sey
NYA
,
Hu
B
,
Mah
W
, et al.
A computational tool (h-magma) for improved prediction of brain-disorder risk genes by incorporating brain chromatin interaction profiles
.
Nat Neurosci
2020
;
23
(
4
):
583
93
.

5.

Gerring
ZF
,
Gamazon
ER
,
Derks
EM
, et al.
A gene co-expression network-based analysis of multiple brain tissues reveals novel genes and molecular pathways underlying major depression
.
PLoS Genet
2019
;
15
(
7
):e1008245.

6.

Genovese
CR
,
Roeder
K
,
Wasserman
L
.
False discovery control with P-value weighting
.
Biometrika
2006
;
93
(
3
):
509
24
.

7.

Yurko
R
,
G’Sell
M
,
Roeder
K
, et al.
A selective inference approach for false discovery rate control using multiomics covariates yields insights into disease risk
.
Proc Natl Acad Sci
2020
.

8.

Lei
L
,
Fithian
W
.
Adapt: an interactive procedure for multiple testing with side information
.
J R Stat Soc Series B Stat Methodology
2018
;
80
(
4
):
649
79
.

9.

Liu
JZ
,
McRae
AF
,
Nyholt
DR
, et al.
A versatile gene-based test for genome-wide association studies
.
Am J Hum Genet
2010
;
87
(
1
):
139
45
.

10.

Mishra
A
,
Macgregor
S
.
Vegas2: Software for more flexible gene-based testing
.
Twin Res Hum Genet
2015
;
18
(
1
):
86
91
.

11.

Yurko
R
,
Roeder
K
,
Devlin
B
, et al.
H-magma, inheriting a shaky statistical foundation, yields excess false positives
.
Ann Hum Genet
in press
.

12.

Ruderfer
DM
,
Ripke
S
,
McQuillin
A
, et al.
Genomic dissection of bipolar disorder and schizophrenia, including 28 subphenotypes
.
Cell
2018
;
173
(
7
):
1705
15
.

13.

Lee
JJ
,
Wedow
R
,
Okbay
A
, et al.
Gene discovery and polygenic prediction from a 1.1-million-person gwas of educational attainment
.
Nat Genet
2018
;
50
(
8
):
1112
.

14.

R Core Team
.
R: A Language and Environment for Statistical Computing
.
Vienna, Austria
:
R Foundation for Statistical Computing
,
2020
.

15.

Winston
Chang
,
Joe
Cheng
,
JJ
Allaire
,
Yihui
Xie
, and
Jonathan
McPherson
.
shiny: Web Application Framework for R
, 2020.
R package version 1.4.0.2
.

16.

Sievert
C
.
Interactive Web-Based Data Visualization with R, plotly, and shiny
.
Chapman and Hall/CRC
,
2020
.

17.

Zhang
B
,
Horvath
S
.
A general framework for weighted gene co-expression network analysis a general framework for weighted gene co-expression network analysis
.
Stat Appl Genet Mol Biol
2005
;
4
(
1
).

18.

Weiner
DJ
,
Wigdor
EM
,
Ripke
S
, et al.
Polygenic transmission disequilibrium confirms that common and rare variation act additively to create risk for autism spectrum disorders
.
Nat Genet
2017
;
49
(
7
):
978
85
.

19.

1000 Genomes Project Consortium and others
.
An integrated map of genetic variation from 1,092 human genomes
.
Nature
2012
;
491
(
7422
):
56
.

20.

Werling
DM
,
Pochareddy
S
,
Choi
J
, et al.
Whole-genome and rna sequencing reveal variation and transcriptomic coordination in the developing human prefrontal cortex
.
Cell Rep
2020
;
31
(
1
):107489.

21.

GTEx Consortium and others
.
The genotype-tissue expression (gtex) pilot analysis: multitissue gene regulation in humans
.
Science
2015
;
348
(
6235
):
648
60
.

22.

Harrow
J
,
Frankish
A
,
Gonzalez
JM
, et al.
Gencode: the reference human genome annotation for the encode project
.
Genome Res
2012
;
22
(
9
):
1760
74
.

23.

Karczewski
KJ
,
Francioli
LC
,
Tiao
G
, et al.
The mutational constraint spectrum quantified from variation in 141,456 humans
.
Nature
2020
;
581
,
434
7809
,
43
.

24.

Chen
T
,
Guestrin
C
.
Xgboost: A scalable tree boosting system
. In:
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
,
KDD
, Vol.
16
.
ACM
,
2016
,
785
94
.

25.

Hayfield
T
,
Racine
JS
.
Nonparametric econometrics: The np package
.
J Stat Softw
2008
;
27
(
5
).

26.

Seranski
P
,
Heiss
NS
,
Dhorne-Pollet
S
, et al.
Transcription mapping in a medulloblastoma breakpoint interval and smith–magenis syndrome candidate region: identification of 53 transcriptional units and new candidate genes
.
Genomics
1999
;
56
(
1
):
1
11
.

27.

Neira-Fresneda
J
,
Potocki
L
.
Neurodevelopmental disorders associated with abnormal gene dosage: Smith–magenis and potocki–lupski syndromes
.
Journal of Pediatric Genetics
2015
;
4
(
03
):
159
67
.

28.

Carmona-Mora
P
,
Walz
K
.
Retinoic acid induced 1, rai1: a dosage sensitive gene related to neurobehavioral alterations including autistic behavior
.
Curr Genomics
2010
;
11
(
8
):
607
17
.

29.

Hamdan
FF
,
Daoud
H
,
Rochefort
D
, et al.
De novo mutations in foxp1 in cases with intellectual disability, autism, and language impairment
.
The American Journal of Human Genetics
2010
;
87
(
5
):
671
8
.

30.

Stefansson
H
,
Helgason
A
,
Thorleifsson
G
, et al.
A common inversion under selection in europeans
.
Nat Genet
2005
;
37
(
2
):
129
37
.

31.

Steinberg
KM
,
Antonacci
F
,
Sudmant
PH
, et al.
Structural diversity and african origin of the 17q21. 31 inversion polymorphism
.
Nat Genet
2012
;
44
(
8
):
872
80
.

32.

Höglinger
GU
,
Melhem
NM
,
Dickson
DW
, et al.
Identification of common variants influencing risk of the tauopathy progressive supranuclear palsy
.
Nat Genet
2011
;
43
(
7
):
699
.

33.

Kouri
N
,
Ross
OA
,
Dombroski
B
, et al.
Genome-wide association study of corticobasal degeneration identifies risk variants shared with progressive supranuclear palsy
.
Nat Commun
2015
;
6
(
1
):
1
7
.

34.

Furukawa
K
,
Wang
Y
,
Yao
PJ
, et al.
Alteration in calcium channel properties is responsible for the neurotoxic action of a familial frontotemporal dementia tau mutation
.
J Neurochem
2003
;
87
(
2
):
427
36
.

35.

Silva
M
,
Haggarty
SJ
.
Tauopathies: Deciphering disease mechanisms to develop effective therapies
.
Int J Mol Sci
2020
;
21
(
23
):8948.

36.

Cantor
RM
,
Kono
N
,
Duvall
JA
, et al.
Replication of autism linkage: fine-mapping peak at 17q21
.
The American Journal of Human Genetics
2005
;
76
(
6
):
1050
6
.

37.

Arbogast
T
,
Iacono
G
,
Chevalier
C
, et al.
Mouse models of 17q21. 31 microdeletion and microduplication syndromes highlight the importance of kansl1 for cognition
.
PLoS Genet
2017
;
13
(
7
):e1006886.

38.

Watanabe
K
,
Taskesen
E
,
Van Bochoven
A
, et al.
Functional mapping and annotation of genetic associations with fuma
.
Nat Commun
2017
;
8
(
1
):
1
11
.

39.

Subramanian
A
,
Tamayo
P
,
Mootha
VK
, et al.
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
.
Proc Natl Acad Sci
2005
;
102
(
43
):
15545
50
.

40.

Satterstrom
FK
,
Kosmicki
JA
,
Wang
J
, et al.
Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism
.
Cell
2020
;
180
(
3
):
568
84
.

41.

De Rubeis
S
,
He
X
,
Goldberg
AP
, et al.
Synaptic, transcriptional and chromatin genes disrupted in autism
.
Nature
2014
;
515
(
7526
):
209
15
.

42.

Fromer
M
,
Pocklington
AJ
,
Kavanagh
DH
, et al.
De novo mutations in schizophrenia implicate synaptic networks
.
Nature
2014
;
506
(
7487
):
179
84
.

43.

Kurki
MI
,
Saarentaus
E
,
Pietiläinen
O
, et al.
Contribution of rare and common variants to intellectual disability in a sub-isolate of northern finland
.
Nat Commun
2019
;
10
(
1
):
1
15
.

44.

Supek
F
,
Bošnjak
M
,
Škunca
N
, et al.
Revigo summarizes and visualizes long lists of gene ontology terms
.
PloS one
2011
;
6
(
7
):e21800.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data