Evans K Cheruiyot, Tingyan Yang, Allan F McRae, GWAS significance thresholds in large cohorts of European ancestry, Genetics, 2025, iyaf056, https://doi.org/10.1093/genetics/iyaf056
Abstract
While the P-value threshold of 5 × 10⁻⁸ remains the standard for genome-wide association studies (GWAS) in humans and other species, it needs to be updated to reflect the current era of large-scale GWAS, where sample sizes in the tens of thousands are used to discover genetic associations at loci with smaller minor allele frequencies. In this study, we used a dataset of 348,501 individuals of European ancestry from the UK Biobank to determine the GWAS thresholds required for multiple testing corrections when considering rare and common variants in additive and dominance GWAS models. Additionally, we employed conditional and joint analysis to quantify the proportion of false significant hits in the GWAS results for 72 traits in the UK Biobank when applying the traditional GWAS cutoff vs our newly proposed P-value thresholds. Overall, the results indicate that the conventional GWAS significance threshold of 5 × 10⁻⁸ yields a false-positive rate of between 20% and 30% in GWAS that utilize large sample sizes and less common variants. Instead, a more stringent GWAS P-value threshold of 5 × 10⁻⁹ is needed when rare variants (with minor allele frequency > 0.1%) are included in the association test for both additive and dominance models within the European ancestry population. However, further validation across diverse datasets and study designs is needed to evaluate the broader applicability of this proposed threshold.
Introduction
The widely accepted P-value threshold for genome-wide association studies (GWAS) of common variants in humans is 5 × 10⁻⁸ (Panagiotou et al. 2012; Fadista et al. 2016; Marees et al. 2018). This threshold was developed based on the smaller sample sizes and marker densities available during the early days of GWAS (International HapMap Consortium 2005; Dudbridge and Gusnanto 2008) and has been highly successful in discovering many reproducible genetic variants associated with common traits and diseases. However, it is important to note that the choice of GWAS thresholds can be context-dependent, varying based on study-specific factors such as sample size, the number of variants tested, and population structure (Fadista et al. 2016; Kanai et al. 2016; Asif et al. 2021).
Two remarkable advances in the GWAS era necessitate revisiting and updating this P-value threshold. First, the sample sizes used in GWAS have increased steadily over the years, reaching over 5 million individuals in recent studies (Yengo et al. 2022). Second, with the advances in large-scale genotyping and next-generation sequencing technologies, GWA studies have focused not only on common variants (MAF > 5%) but also on low-frequency (1% < MAF < 5%) and rare (MAF < 1%) variants, which have been shown to contribute substantially to complex trait heritability (Lee et al. 2014).
Several studies have investigated GWAS significance thresholds in the context of deep phenotyping and a wide allele frequency spectrum (Xu et al. 2014; Fadista et al. 2016; Kanai et al. 2016; Asif et al. 2021; Chen et al. 2021). However, these studies used relatively small sample sizes in their simulations to derive GWAS thresholds and are thus outdated for GWAS in large biobank cohorts. In addition, large sample sizes are providing power to investigate nonadditive modes of association. In particular, the traditional GWAS cutoff (5 × 10⁻⁸) has also been used for dominance models (Palmer et al. 2023; Zhu et al. 2023). However, different GWAS thresholds for additive and dominance models would be expected since the extent of linkage disequilibrium (LD) tagging across variants differs between the 2 models (Zhu et al. 2023).
Various statistical approaches have been used to determine GWAS significance thresholds: the Bonferroni correction (which assumes independent tests and is thus overly conservative, considering that genetic variants are in LD) (Risch and Merikangas 1996), permutation and bootstrapping (resampling) (International HapMap Consortium 2005; Kanai et al. 2016; Lin 2019), and Bayesian methods (Wellcome Trust Case Control Consortium 2007). Other methods that calculate the “effective number of tests” based on LD pruning or eigenvalue decomposition of phenotypes and genotypes (Cheverud 2001; Duggal et al. 2008; Sobota et al. 2015) have also been proposed. Although the permutation-based method is computationally demanding, it remains a “gold standard” in GWAS multiple testing correction because it preserves the correlation structure of the data and does not rely on the assumption of marker independence (Dudbridge and Gusnanto 2008; Hoggart et al. 2008; Asif et al. 2021).
Materials and methods
We simulated 2 sets of 1,000 phenotypes to reflect: (1) the ideal statistical scenario where the phenotype is normally distributed (or has undergone rank-based inverse normal transformation) and (2) a trait with skewness. A normally distributed phenotype serves as a common proxy for the assumption of normally distributed residuals in the GWAS regression model. To simulate skewness, we used body mass index (BMI) measurements from the UK Biobank participants (regardless of inclusion in the genetic data). BMI data were standardized within sex to have a mean of zero and unit variance, and values with scaled BMI > 6 (∼5 interquartile ranges) were removed. Then, we randomly sampled the scaled BMI without replacement to generate 1,000 independent traits.
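The two phenotype sets described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' scripts; the function names and the toy BMI values in the usage note are assumptions introduced here.

```python
import random
import statistics


def simulate_normal_trait(n_individuals, rng):
    """Draw one standard-normal phenotype (scenario 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_individuals)]


def standardize_within_sex(bmi, sex):
    """Scale BMI to mean 0 and unit variance separately within each sex."""
    scaled = [0.0] * len(bmi)
    for group in set(sex):
        idx = [i for i, s in enumerate(sex) if s == group]
        values = [bmi[i] for i in idx]
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)
        for i in idx:
            scaled[i] = (bmi[i] - mean) / sd
    return scaled


def permuted_bmi_trait(scaled_bmi, n_individuals, rng, max_scaled=6.0):
    """Drop scaled BMI values above max_scaled, then sample without
    replacement so the trait is unlinked from any genotype (scenario 2)."""
    kept = [v for v in scaled_bmi if v <= max_scaled]
    return rng.sample(kept, n_individuals)
```

For example, with a hypothetical right-skewed BMI vector, `permuted_bmi_trait(standardize_within_sex(bmi, sex), n, random.Random(0))` returns one null trait for `n` genotyped individuals; repeating the call 1,000 times would give the skewed phenotype set.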
To estimate GWA significance thresholds across different minor allele frequency (MAF) scenarios, the genotype data were pruned based on 4 MAF cutoffs [>0.1%, >0.5%, >1%, and >5%]. A total of 13,035,294, 9,603,317, 8,545,459, and 6,098,223 autosomal variants remained following MAF filtering at these thresholds, respectively.
For each of the simulated phenotypes [N = 1,000 × 2 models], we ran additive and dominance GWAS analyses. We used the fastGWA method implemented in GCTA software [option --fastGWA-lr] (Yang et al. 2011) to run additive linear models, while the dominance GWAS was performed using PLINK version 2 software [--glm dominant] (Purcell et al. 2007). No covariates were included in any of the GWAS models.
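As a rough sketch of how the two runs could be assembled per simulated trait, the helpers below build argument lists for the two programs. Only `--fastGWA-lr` and `--glm dominant` are taken from the text; the remaining flags (`--bfile`, `--pheno`, `--maf`, `--out`), the `plink2` executable name, the file names, and the `allow-no-covar` modifier (which PLINK 2 expects when no covariates are supplied) are assumptions and should be checked against the installed versions.

```python
def fastgwa_additive_cmd(bfile, pheno, out, maf):
    """Argument list for an additive linear-regression GWAS with GCTA fastGWA."""
    return [
        "gcta64",
        "--bfile", bfile,      # PLINK binary genotype prefix (assumed input format)
        "--fastGWA-lr",        # linear regression option named in the text
        "--pheno", pheno,      # phenotype file for one simulated trait
        "--maf", str(maf),     # MAF filter, e.g. 0.001 for MAF > 0.1%
        "--out", out,
    ]


def plink2_dominance_cmd(bfile, pheno, out, maf):
    """Argument list for a dominance GWAS with PLINK 2."""
    return [
        "plink2",
        "--bfile", bfile,
        "--pheno", pheno,
        "--maf", str(maf),
        # 'dominant' is the modifier given in the text; 'allow-no-covar'
        # is an assumption added because no covariates were fitted.
        "--glm", "dominant", "allow-no-covar",
        "--out", out,
    ]


# Hypothetical file names; the lists could be passed to subprocess.run.
print(" ".join(fastgwa_additive_cmd("ukb_eur", "trait_0001.phen", "add_0001", 0.001)))
print(" ".join(plink2_dominance_cmd("ukb_eur", "trait_0001.phen", "dom_0001", 0.001)))
```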
To identify genome-wide significance thresholds for each of the MAF cutoffs, we examined the distribution of the minimum P-values (Pmin) for each simulated trait, separately for additive and dominance models. The 95th percentile of the distribution, i.e. the cutoff that Pmin surpasses for only 5% of the null traits, was defined as the genome-wide significance threshold for each MAF filtering scenario, corresponding to a genome-wide type I error rate of 5%. The false-positive rate when using the traditional GWAS significance threshold for each scenario was calculated as the proportion of traits for which Pmin surpassed the genome-wide significance threshold of 5 × 10⁻⁸.
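The thresholding step can be sketched as below, assuming one minimum P-value per null trait has already been extracted from the GWAS output. The function names are illustrative, and the toy data (minimum of a hypothetical 1,000,000 independent tests, which follows a Beta(1, M) distribution) merely stand in for real results.

```python
import random


def empirical_threshold(min_pvalues, alpha=0.05):
    """Return the alpha-quantile of the per-trait minimum P-values
    (equivalently, the 1 - alpha percentile on the -log10 scale),
    using linear interpolation between order statistics."""
    ordered = sorted(min_pvalues)
    if not ordered:
        raise ValueError("at least one minimum P-value is required")
    position = alpha * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


def false_positive_rate(min_pvalues, cutoff=5e-8):
    """Proportion of null traits with at least one variant below the cutoff."""
    return sum(p < cutoff for p in min_pvalues) / len(min_pvalues)


# Toy illustration only: Pmin over M independent uniform P-values ~ Beta(1, M).
rng = random.Random(1)
n_independent_tests = 1_000_000  # hypothetical, not a value from the study
toy_pmin = [rng.betavariate(1.0, n_independent_tests) for _ in range(1000)]
print(empirical_threshold(toy_pmin))
print(false_positive_rate(toy_pmin))
```

With real data the variants are correlated through LD, which is why the threshold is read off the empirical Pmin distribution rather than computed from the raw number of variants.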
Results and discussion
We used a permutation procedure to estimate the GWAS significance thresholds using the UK Biobank (version 3) dataset (Bycroft et al. 2018). We restricted our analyses to a cohort of 348,501 unrelated individuals with European ancestry. Our results revealed that a more stringent GWAS cutoff than the commonly used threshold (5 × 10⁻⁸) is required to control the false-positive rate if variants with MAF of <5% are included in the association test (Table 1). The traditional GWAS P-value cutoff of 5 × 10⁻⁸ is still valid when considering common (>5% MAF) variants in the additive GWA model using large sample sizes. However, a stringent GWA P-value threshold of 8.87 × 10⁻⁹ is needed when rare variants (MAF > 0.1%) are included in an additive GWA model (Fig. 1 and Table 1). A more stringent GWA correction is required to guard against false positives when the phenotype is not normally distributed. This was demonstrated using our BMI-based trait, where the P-value cutoff was determined to be 5.94 × 10⁻⁹ for MAF > 0.1% (Fig. 1). Using the traditional threshold of 5 × 10⁻⁸ leads to substantial type I error, particularly with low MAF thresholds (Table 2). When rare variants (MAF > 0.1%) are included in the association tests, we observed a false-positive rate of approximately 22% for a perfectly normal trait in the additive model, increasing to 30% with some phenotype skewness.

Fig. 1. The distribution of the minimum P-value (Pmin) from the additive GWAS models. We extracted Pmin from the GWA results for each of the (a) 1,000 simulated normally distributed phenotypes and (b) permuted raw BMI phenotypes from UK Biobank. The genotypes used in GWA were from the European cohort in the UK Biobank (N = 348,501). The dotted line corresponds to the traditional genome-wide significance cutoff (5 × 10⁻⁸), while the vertical solid lines represent the new empirical GWAS cutoffs after pruning genotypes based on the >5% (blue), >1% (green), >0.5% (orange), and >0.1% (gray) minor allele thresholds. The new GWAS thresholds represent the 95th percentile of the Pmin distribution (α = 0.05).
Table 1. Estimated GWAS significance thresholds for different MAF cutoffs obtained from additive and dominance models.
Model | MAF cutoff (a) | P-value significance threshold | Meff |
---|---|---|---|
Additive GWA (normal) | >0.1% | 8.87 × 10⁻⁹ | 5,636,978 |
Additive GWA (normal) | >0.5% | 1.74 × 10⁻⁸ | 2,873,563 |
Additive GWA (normal) | >1% | 2.80 × 10⁻⁸ | 1,785,714 |
Additive GWA (normal) | >5% | 4.93 × 10⁻⁸ | 1,014,198 |
Additive GWA (raw BMI) | >0.1% | 5.94 × 10⁻⁹ | 8,417,508 |
Additive GWA (raw BMI) | >0.5% | 1.48 × 10⁻⁸ | 3,378,378 |
Additive GWA (raw BMI) | >1% | 1.74 × 10⁻⁸ | 2,873,563 |
Additive GWA (raw BMI) | >5% | 3.34 × 10⁻⁸ | 1,497,005 |
Dominance (normal) | >0.1% | 7.44 × 10⁻⁹ | 6,720,430 |
Dominance (normal) | >0.5% | 1.54 × 10⁻⁸ | 3,246,753 |
Dominance (normal) | >1% | 1.95 × 10⁻⁸ | 2,564,102 |
Dominance (normal) | >5% | 3.56 × 10⁻⁸ | 1,404,494 |
Dominance (raw BMI) | >0.1% | 5.71 × 10⁻⁹ | 8,756,567 |
Dominance (raw BMI) | >0.5% | 1.27 × 10⁻⁸ | 3,937,007 |
Dominance (raw BMI) | >1% | 1.39 × 10⁻⁸ | 3,597,122 |
Dominance (raw BMI) | >5% | 2.75 × 10⁻⁸ | 1,818,181 |
BMI, body mass index; GWA, genome-wide association.
(a) After pruning SNPs based on the 0.1, 0.5, 1, and 5% MAF cutoffs, there were 13,035,294, 9,603,317, 8,545,459, and 6,098,223 SNPs remaining, respectively. The effective number of tests (Meff) was calculated as 0.05/P-value significance threshold; the thresholds shown are 0.05/Meff to 3 significant figures.
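The relation between the significance threshold and the effective number of tests in the Table 1 footnote can be checked with a short calculation. The helper names below are illustrative; the only inputs taken from the text are the 5% alpha, the Meff of 5,636,978 listed for the additive model (normal trait, MAF > 0.1%), and the 10 million independent tests mentioned in the discussion.

```python
def effective_tests(p_threshold, alpha=0.05):
    """Meff implied by a genome-wide significance threshold."""
    return alpha / p_threshold


def threshold_from_meff(meff, alpha=0.05):
    """Bonferroni-style threshold implied by an effective number of tests."""
    return alpha / meff


# Additive model, normal trait, MAF > 0.1% (Meff from Table 1).
print(threshold_from_meff(5_636_978))   # roughly 8.87e-09

# 10 million independent tests at 5% alpha.
print(threshold_from_meff(10_000_000))  # roughly 5e-09

# The traditional cutoff corresponds to about one million independent tests.
print(effective_tests(5e-8))
```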
Table 2. Type I error rates (%) associated with including common (MAF > 5%) and rare variants in GWAS based on the traditional threshold of 5 × 10⁻⁸.
Model | Trait | MAF > 0.1% | MAF > 0.5% | MAF > 1% | MAF > 5% |
---|---|---|---|---|---|
Additive GWA | Simulated normal | 22.2 | 13.1 | 9.5 | 5.1 |
Additive GWA | Raw BMI | 29.5 | 16.6 | 13.3 | 7.3 |
GWA, genome-wide association; MAF, minor allele frequency; simulated normal, normally distributed trait; raw BMI, body mass index exhibiting skewness.
These findings are consistent with previous studies showing that a more stringent P-value cutoff is required when the MAF pruning is relaxed to accommodate rare variants in GWAS (Fadista et al. 2016). Our results are comparable with the GWA significance threshold for MAF ≥ 0.1% obtained from simulations using data from ∼140,000 individuals of European ancestry (Kemp et al. 2017). This indicates that for MAF > 0.1%, the effect of increasing sample size is attenuated by the time sample sizes reach hundreds of thousands (Asif et al. 2021). As such, our point estimates are likely applicable even for large GWAS (Yengo et al. 2022). Taken together, we recommend a more stringent GWAS P-value cutoff of 5 × 10⁻⁹ if rare variants (MAF > 0.1%) are considered in the association test for an additive model, equivalent to 10 million independent tests at 5% alpha (Table 1). This conservative threshold is motivated by our use of skewed BMI data in the simulations, as other traits may exhibit even greater skewness, resulting in higher false-positive rates (Supplementary Tables S1 and S2). While this threshold will provide stringent control of type I error for GWAS in European individuals for many traits, it is important to note that GWAS thresholds are context-dependent, and studies may benefit from the establishment of study-specific thresholds if they deviate from the conditions used in our simulations.
We also investigated the P-value threshold when applying a dominance model in GWAS (Fig. 2). It is common for gene mapping studies to assume the typical GWA P-value threshold (5 × 10⁻⁸) for both additive and dominance GWAS models, particularly for common variants (Palmer et al. 2023). However, a more stringent threshold is needed to minimize false-positive results under the dominance model, even for common variants, because dominance LD tagging captures less variation than additive LD tagging in the genome (Zhu et al. 2015). Our results show that a lower P-value threshold of 3.56 × 10⁻⁸ is needed when mapping common variants (MAF > 5%) in a dominance GWAS (Fig. 2). An even smaller P-value cutoff is ideal if the phenotype is skewed, as shown by our simulations based on the UK Biobank BMI trait, where we found a P-value threshold of 2.75 × 10⁻⁸ (Fig. 2 and Table 1). When rare variants (MAF > 0.1%) are included in the GWAS, we found a P-value cutoff of 7.44 × 10⁻⁹ for a normally distributed trait and 5.71 × 10⁻⁹ for a slightly skewed phenotype (Fig. 2 and Table 1).

Fig. 2. The distribution of the minimum P-value (Pmin) from the dominance GWAS models. Pmin was extracted from the GWA results based on the (a) 1,000 simulated normally distributed phenotypes and (b) permuted raw BMI phenotypes from UK Biobank. The genotypes used were from the European cohort in the UK Biobank (N = 348,501). The dotted line corresponds to the traditional genome-wide significance cutoff (5 × 10⁻⁸), while the vertical solid lines represent the new empirical GWAS cutoffs after pruning genotypes based on the >5% (blue), >1% (green), >0.5% (orange), and >0.1% (gray) minor allele thresholds. The new GWAS thresholds represent the 95th percentile of the Pmin distribution (α = 0.05).
To quantify the number of independent significant GWAS signals, we employed the PLINK “clumping” approach (Purcell et al. 2007) and conditional and joint (COJO) analysis (Yang et al. 2012) implemented in the GCTA software (Yang et al. 2011). COJO was performed using the following command: gcta64 --bfile [LD reference] --maf 0.001 --cojo-file [trait name] --cojo-slct --cojo-p [GWAS cutoff] --thread-num 5 --out [file name], while PLINK was run with these parameters: --clump [trait name] --clump-p1 [GWAS cutoff] --clump-r2 0.01 --clump-kb 250 --out [file name]. We quantified the “false significance rate (FSR)”, defined as the proportion of tests in the GWA results from UK Biobank that passed the traditional significance threshold (5 × 10⁻⁸) but failed to reach the new empirical GWAS threshold from our simulations. Here, we quantified the FSR from the most stringent empirical GWAS thresholds, obtained by filtering genotypes based on the MAF cutoff of >0.1%, for the additive GWAS simulations only. COJO performs conditional analysis of GWA summary statistics and does not require individual-level genotypes. To run COJO, we downloaded GWA summary statistics for 72 quantitative traits in the UK Biobank (GWAS round 2). The summary statistics for the raw and inverse-rank normalized (irnt) phenotypes were available for these traits. We used a random subset of 10,000 unrelated Europeans from the UK Biobank as the LD reference in the COJO analysis. We ran COJO per chromosome on the GWA summary data for both sexes using the default settings. We calculated the FSR for each trait as the percentage difference in the number of independent GWAS signals between the traditional GWA significance threshold (5 × 10⁻⁸) and the new empirical GWA threshold.
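The per-trait FSR calculation described above reduces to a percentage difference between two counts of independent signals. The sketch below assumes those counts have already been obtained (for example, from COJO or clumping output at each cutoff); the function name and the example counts are hypothetical.

```python
def false_significance_rate(n_traditional, n_empirical):
    """Percentage of independent signals that pass the traditional cutoff
    but not the stricter empirical cutoff.

    n_traditional: independent signals at the traditional threshold
    n_empirical:   independent signals at the empirical threshold
    """
    if n_traditional <= 0:
        raise ValueError("no signals at the traditional threshold")
    if n_empirical > n_traditional:
        raise ValueError("the stricter threshold cannot yield more signals")
    return 100.0 * (n_traditional - n_empirical) / n_traditional


# Hypothetical counts for one trait: 200 signals at the traditional cutoff,
# 160 of which also pass the empirical cutoff.
print(false_significance_rate(200, 160))  # 20.0
```

Because COJO is rerun at each cutoff, the two counts come from separate selections; the guard against `n_empirical > n_traditional` is a simplifying assumption of this sketch.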
By applying the most stringent empirical GWA correction from the additive model based on the MAF > 0.1% cutoff (8.87 × 10⁻⁹) in the COJO analysis (see Materials and methods), we found that the proportion of false significant results for the inverse normalized traits ranged from 5.56% (glycated hemoglobin) to 51.43% (microalbumin in urine), with a mean of 18.26% across traits (Fig. 3 and Supplementary Table S1). Similarly, applying the updated P-value correction from the simulations for the raw phenotypes (5.94 × 10⁻⁹) in COJO yielded a slightly higher mean proportion of false significant results of 20% across non-normalized traits. Besides microalbumin in urine, the other normalized traits with the highest FSR included creatinine and sodium in urine and fluid intelligence score (Fig. 3). We found similar levels of FSR when using the PLINK “clumping” approach, with mean false significant results across traits of 20.95 and 23.29% for normalized (irnt) and non-normalized (raw) traits, respectively (Supplementary Table S2). Given that the move to large meta-analyses leaves few suitable cohorts with power for replication, these results highlight the need for properly controlled type I error to avoid follow-up studies on potential false-positive associations.

Fig. 3. Top 30 traits with the highest proportion of associations (FSR) that passed the traditional threshold (5 × 10⁻⁸) but failed to reach the updated threshold for the additive GWAS. The FSR for the irnt traits and their corresponding raw untransformed phenotypes are represented by black and gray colors, respectively. The FSR was calculated as the percentage of minimum P-values (Pmin) passing the typical GWAS cutoff (5 × 10⁻⁸) from the GWAS of the 1,000 simulated traits. The red vertical dashed line represents the mean FSR for the irnt traits.
The debate over the most appropriate GWAS thresholds for different scenarios has persisted for decades (Dudbridge and Gusnanto 2008; Panagiotou et al. 2012; Fadista et al. 2016; Chen et al. 2021). While we have proposed increasing the stringency of P-value thresholds to reduce false positives, this approach could inadvertently increase the false-negative rate. Empirical data suggest that relaxing the traditional GWAS threshold (5 × 10⁻⁸) could yield a substantial fraction of true “borderline” discoveries, albeit at the cost of a higher false discovery rate that may reach up to 50% (Panagiotou et al. 2012). This observation aligns with our findings on the FSR. Although less stringent (higher P-value) thresholds may mitigate false negatives, the relative cost and wasted effort associated with follow-up research on false discoveries could be substantial, which may justify the call for stringent GWAS thresholds (Ioannidis et al. 2011; Panagiotou et al. 2012). Future studies that integrate replication efforts, functional validation, and context-specific thresholds will be essential for balancing discovery potential with robust and replicable findings (Panagiotou et al. 2012; Abdellaoui et al. 2023).
Notably, our empirical GWA threshold applies only to the European ancestry population, which we used in the simulations because a large sample size was available for GWAS. Future studies on other populations, e.g. those of African ancestry, are required given that LD structure is population specific and, therefore, different GWAS correction thresholds are needed (Kanai et al. 2016; Pulit et al. 2017). More generally, we recognize that the choice of GWAS thresholds may be context-dependent, and further validation across diverse datasets, populations, and study designs is warranted to evaluate the broader applicability of this proposed threshold.
In summary, using simulations with large sample sizes, we have established that the traditional GWAS significance P-value of 5 × 10⁻⁸ is insufficient for multiple testing corrections in current GWAS that use large sample sizes and less common variants. Instead, a stringent GWAS correction P-value of 5 × 10⁻⁹ is needed when rare variants (MAF > 0.1%) are considered in the association test for both additive and dominance models in the European ancestry population. Adopting the new P-value threshold for rare variants is critical for guarding against type I errors in future GWAS for European ancestry.
Data availability
This work used genotype and phenotype data from UK Biobank Resource under project 12505. UKB data can be accessed upon request once a research project has been submitted and approved by the UKB committee. The data analyses were conducted using publicly available tools (GCTA and PLINK), with results visualized using the R programming language. Detailed descriptions of the analytical procedures are provided in the Materials and Methods section to facilitate reproducibility of the results. The scripts for running GWAS and generating visualizations are available at: https://github.com/mcraelab/gwas_threshold. UK Biobank GWAS data, https://www.nealelab.is/uk-biobank (Neale Lab).
Supplemental material available at GENETICS online.
Funding
Allan McRae is the recipient of an Australian Research Council Australian Fellowship (project number FT200100837) funded by the Australian Government. The views expressed herein are those of the authors and are not necessarily those of the Australian Government or Australian Research Council. This research has been conducted using the UK Biobank Resource under project 12505.
Author contributions
AFM conceived and designed the study. EKC, TY, and AFM conducted data analysis. EKC wrote the manuscript. All authors reviewed, revised and approved the final manuscript for publication.
Author notes
Conflicts of interest: The authors declare no competing interests.