Are all models wrong? Falsifying binary formation models in gravitational-wave astronomy using exceptional events

ABSTRACT

As the catalogue of gravitational-wave transients grows, several entries appear ‘exceptional’ within the population. Tipping the scales with a total mass of |$\sim 150 \,{\rm M}_\odot$|⁠, GW190521 likely contained black holes in the pair-instability mass gap. The event GW190814, meanwhile, is unusual for its extreme mass ratio and the mass of its secondary component. A growing model-building industry has emerged to provide explanations for such exceptional events, and Bayesian model selection is frequently used to determine the most informative model. However, Bayesian methods can only take us so far. They provide no answer to the question: does our model provide an adequate explanation for exceptional events in the data? If none of the models we are testing provide an adequate explanation, then it is not enough to simply rank our existing models – we need new ones. In this paper, we introduce a method to answer this question with a frequentist p-value. We apply the method to different models that have been suggested to explain the unusually massive event GW190521: hierarchical mergers in active galactic nuclei and globular clusters. We show that some (but not all) of these models provide adequate explanations for exceptionally massive events like GW190521.

black hole physics, gravitational waves, methods: statistical

1 INTRODUCTION

The LIGO (Aasi et al. 2015), Virgo (Acernese et al. 2015), and KAGRA (Akutsu et al. 2019) gravitational-wave collaborations have published |$\sim 100$| confident compact-binary mergers so far (Abbott et al. 2023). Several of these events exhibit unusual properties. One such exceptional event is GW190521 (Abbott et al. 2020a), with component masses |$m_1 = 85^{+21}_{-14}\text{ } {\rm M}_\odot$| and |$m_2 = 66^{+17}_{-18}\text{ } {\rm M}_\odot$| (90 per cent credible intervals). The formation of these high-mass black holes is difficult to explain in field-binary models due to pair-instability processes (e.g. Heger & Woosley 2002; Woosley, Blinnikov & Heger 2007; Belczynski et al. 2016; Woosley 2017, 2019; Woosley & Heger 2021), which is expected to produce a black-hole mass gap of approximately |$50{\!-\!}135 \text{ } {\rm M}_{\odot }$|⁠. However, the precise bounds of this gap are unknown due especially to uncertainties in nuclear-reaction rates (e.g. Farmer et al. 2019; Costa et al. 2021) as well as envelope retention and mass fallback upon core collapse (e.g. Winch et al. 2024).

Several competing formation channels have been proposed to produce extreme-mass events like GW190521: hierarchical mergers in dense star clusters (e.g. Fragione, Loeb & Rasio 2020; Romero-Shaw et al. 2020; Arca-Sedda et al. 2021; Dall’Amico et al. 2021; Kimball et al. 2021; Liu & Lai 2021; Mapelli et al. 2021) or in accretion discs around active galactic nuclei (AGNs; e.g. Palmese et al. 2021; Tagawa et al. 2021; Samsing et al. 2022; Vajpeyi et al. 2022; Morton et al. 2023), the binary evolution of Population III stars (e.g. Liu & Bromm 2020; Safarzadeh & Haiman 2020; Kinugawa, Nakamura & Nakano 2021; Tanikawa et al. 2021) or mergers of primordial black holes (e.g. De Luca et al. 2021; Chen, Yuan & Huang 2022; Clesse & García-Bellido 2022).

There are different ways to determine whether a model is consistent with a catalogue of gravitational-wave events, each one corresponding to a different question. These questions can be subtly different, emphasizing different ways in which a model can fail to account for the data. Thus, we advocate a multipronged approach, which considers a variety of complementary questions. For example, Mould et al. (2023) calculate Bayes factors between different population models for the same event, and thus obtain a relative measure of a models suitability. Meanwhile, Essick et al. (2022) propose an extension to leave-one-out outlier tests, which emphasizes the self-consistency of parametrized models analysing different subsets of data. Finally, Payne & Thrane (2023) introduce the concept of the ‘maximum population likelihood’ as a tool for testing a model’s overall goodness of fit for a gravitational-wave catalogue.

In this paper, we build on the foundation laid by the above studies, but explore a different aspect of model testing. We assess whether a particular model can plausibly explain the most exceptional events observed, without directly comparing it to other models. In particular, we build on the work of Fishbach, Farr & Holz (2020a), which outlines how one may perform a posterior predictive check on the most massive event observed, to check consistency with an inferred population model.¹ We extend this method to account for the measurement uncertainty associated with high-mass events by factoring in the posterior distributions of extreme-event parameters, rather than using their maximum likelihood estimates.

We consider many simulated population catalogues, drawn from the model of interest. We take extremal events from these catalogues and calculate a Bayesian evidence. By computing the Bayesian evidence for many extremal events from different catalogue realizations, we construct a distribution to which we may compare. We use this distribution to calculate a statistic in the form of a p-value, which describes how consistent the observed exceptional event is with extremal simulated events.

In Section 2, we develop the statistical framework used in our analysis. In Section 3, we demonstrate a use of our method by assessing the capability of several AGN models and one globular cluster model with hierarchical mergers to produce an event as extreme in total mass as GW190521. We find that it is feasible for both AGN models with sufficiently high natal black hole masses and globular cluster models with zero natal black hole spins to produce GW190521-like events. This suggests that population models that allow for hierarchical mergers can explain the most massive gravitational-wave events we have observed thus far, and are capable of bridging the proposed pair-instability mass gap (see also e.g. Fragione et al. 2020; Kimball et al. 2021; Anagnostou, Trenti & Melatos 2022).

2 METHOD

Our starting point is some population model of interest M that defines the prior distribution for parameter(s) |$\theta$|⁠,

$$\begin{eqnarray} \pi (\theta | M), \end{eqnarray}$$

(1)

for example, the distribution for total mass of binary black hole systems. This distribution, in turn, implies a distribution of detected events, which may have a different shape due to selection effects,

$$\begin{eqnarray} \pi (\theta | M, \text{det}) \propto \pi (\theta | M) \, p_\text{det}(\theta) . \end{eqnarray}$$

(2)

Here, |$p_\text{det}(\theta)$| is the probability that an event with parameters |$\theta$| is detected.

Since we are interested in extremal events, we next consider a catalogue consisting of N events. Each catalogue has an apparently most extreme event such that the maximum likelihood estimator for that event |$\widehat \theta$| is larger (or smaller) than all other events,

$$\begin{eqnarray} \widehat \theta _i \lessgtr \widehat \theta _{j \ne i} . \end{eqnarray}$$

(3)

There is some distribution of |$\theta$| for these apparently most extreme events,

$$\begin{eqnarray} \pi (\theta _\text{ext}| M, \text{det}, N) . \end{eqnarray}$$

(4)

For example, this distribution could represent the distribution of total mass for the apparently most-massive event in a catalogue of |$N=100$| events.

In the limit where |$\theta _\text{ext}$| is measured precisely, then the maximum likelihood estimator approaches the true parameter value

$$\begin{eqnarray} \widehat \theta _\text{ext} \rightarrow \theta _\text{ext}, \end{eqnarray}$$

(5)

and so the apparently most extreme event is the most extreme event.² In this limit, one may calculate a p-value for an extremal event with |$\widehat \theta _\text{ext}$| in a catalogue with N events, as in Fishbach et al. (2020a),

$$\begin{eqnarray} p \equiv \int _{\Gamma _1} {\rm d}\theta \, \pi (\theta _\text{ext}| M, \text{det}, N), \end{eqnarray}$$

(8)

$$\begin{eqnarray} \Gamma _1 = \lbrace \pi (\theta _\text{ext}| M, \text{det}, N) \lt \pi (\widehat \theta _\text{ext}| M, \text{det}, N) \rbrace . \end{eqnarray}$$

(9)

This p-value corresponds to the fraction of draws from |$\pi (\theta _\text{ext}|M,\text{det}, N)$| that produce a prior probability density smaller than what we obtain for the observed value of |$\widehat \theta _\text{ext}$|⁠. If the observed value of total mass is unusual – either too small or too big – then |$\pi (\widehat \theta _\text{ext}|M,\text{det},N)$| will be small, and so the resulting p-value will be accordingly small. On the other hand, if the observed total mass is typical given the model, then |$\pi (\widehat \theta _\text{ext}|M,\text{det},N)$| is large, and so p is |${\cal O}(1)$|⁠.

In reality, we do not measure |$\theta _\text{ext}$| precisely. Rather, each measurement is characterized with a precision determined by the likelihood function |${\cal L}({\rm d}| \theta _\text{ext})$|⁠, which describes the likelihood of observing gravitational-wave event data given the distribution of model parameters for an extremal event. In order to include measurement uncertainty, we define a new metric, ‘the normalized evidence’:

$$\begin{eqnarray} \overline{\mathcal {Z}} &\equiv & \overline{\pi (\theta _\text{ext}| M, \text{det}, N)} \\ &= & \frac{ \int {\rm d}\theta \, {\cal L}({\rm d}| \theta _\text{ext}) \pi (\theta _\text{ext}| M, \text{det}, N) }{ \int {\rm d}\theta \, {\cal L}({\rm d}| \theta _\text{ext}) \pi (\theta _\text{ext}| U) }, \end{eqnarray}$$

(10)

where the overline denotes an average over measurement uncertainty, and |$\pi (\theta | U)$| is a prior uniform in the parameter of interest.³ In the high signal-to-noise ratio (SNR) limit, the likelihood function becomes a delta function peaked at the true parameter values |$\theta _\text{true}$|⁠, so

$$\begin{eqnarray} \overline{\mathcal {Z}} \rightarrow \frac{ \int {\rm d}\theta \, \delta (\theta _\text{ext}-\theta _\text{true}) \pi (\theta _\text{ext}| M, \text{det}, N) }{ \int {\rm d}\theta \, \delta (\theta _\text{ext}-\theta _\text{true}) \pi (\theta _\text{ext}| U) }, \end{eqnarray}$$

(11)

which simplifies to

$$\begin{eqnarray} \overline{\mathcal {Z}} \rightarrow \frac{ \pi (\theta _\text{true}| M, \text{det}, N) }{ \pi (\theta _\text{true}| U) } . \end{eqnarray}$$

(12)

In the high SNR limit, |$\overline{\mathcal {Z}}$| approaches the ratio of prior probability densities at |$\theta _\text{true}$|⁠. Thus, the denominator of equation (10) ensures |$\overline{\mathcal {Z}}$| tracks the parameter of interest. By constructing an empirical distribution of |$\overline{\mathcal {Z}}$| with simulated data – denoted |$\pi (\overline{\mathcal {Z}}| M, \text{det}, N)$| – we can again calculate a p-value quantifying if the observed value |${\overline{\mathcal {Z}}}$| is unusual compared to the distribution expected from simulation:

$$\begin{eqnarray} p \equiv \int _{\Gamma _2} {\rm d}\overline{\mathcal {Z}}^{\prime } \, \pi (\overline{\mathcal {Z}}^{\prime }|M, \text{det}, N), \end{eqnarray}$$

(13)

$$\begin{eqnarray} \Gamma _2 = \bigg \lbrace \pi (\overline{\mathcal {Z}}^{\prime }| M, \text{det}, N) \lt \pi (\overline{\mathcal {Z}}| M, \text{det}, N)\bigg \rbrace . \end{eqnarray}$$

(14)

If the observed value of total mass is unusual – after taking into account measurement uncertainty – then |${\overline{\mathcal {Z}}}$| will be small, and so the resulting p-value will be accordingly small, whereas if |${\overline{\mathcal {Z}}}$| is large, p is |${\cal O}(1)$|⁠. Moreover, this procedure reproduces equation (8) in the high signal-to-noise ratio limit. For example, take a single observed event, at an SNR of either 10 or 1000. These events will have different support for the same model, described by their posterior distributions ‘leaking’ into the model distribution, and should therefore falsify this model to different degrees. It is this support that is captured by equation (13), and not by the maximum likelihood estimator in equation (8).

3 DEMONSTRATION

To demonstrate the above formalism, we test the ability of different models to explain the total mass |$m_\text{tot}$| of the event GW190521. We consider four models:

Gayathri et al. (2023) propose a model for binary black hole formation in an AGN disk with hierarchical mergers. They adopt a one-parameter model parametrized by the maximum allowed natal black hole mass, |$m_\text{max}$|⁠, while fiducial values of AGN parameters are chosen. A seed black hole distribution is generated following a Saltpeter mass function for each model (with a different value of |$m_\text{max}$|⁠), and neutron stars are assumed to be normally distributed in mass with a mean of |$1.49 \,{\rm M}_\odot$| and a standard deviation of |$0.19 \,{\rm M}_\odot$|⁠. These compact objects are then allowed to migrate through an accretion disc around a supermassive black hole, encountering other objects and undergoing subsequent mergers. We consider three variations of this model, with |$m_{\rm max} = 15,\,\,50,\,\,\text{and} \,\, 75\,\,{\rm M}_\odot$|⁠. These three models are labelled ‘AGN’.
Rodriguez et al. (2019)’s model for binary black hole assembly in globular clusters. They use a coupled modelling technique, in which stars are evolved from their zero-age main sequence births using a binary stellar evolution package (Hurley, Pols & Tout 2000; Hurley, Tout & Pols 2002; Chatterjee et al. 2010), while their movement through a globular cluster is also simulated using the C luster Monte Carlo package (Joshi, Rasio & Zwart 2000; Pattabiraman et al. 2013). Eventually, some of these stars become potential GW sources as they form compact objects and undergo binary formation and subsequent mergers. In this paper, we consider the model variation with natal black hole spin |$\chi _{\text{birth}}=0$|⁠. This model is labelled ‘Globular Cluster’.⁴

For each model, we simulate |$N=100$| events for the LIGO H1 and L1 observatories operating at design sensitivity. We create |$n=100$| catalogues of simulated events (⁠|$10^4$| events in total). We assign a spin magnitude to each component drawn from a Gaussian distribution centred at 0.7 and truncated at 0 and 1, to reflect their dynamical origin (e.g. Kimball et al. 2020).

Each event is injected into Gaussian noise colored by the amplitude spectral density noise curves of design-sensitivity LIGO. Each detection has network SNR |$\gt 12$|⁠.

To find the apparently most-massive event in each catalogue, we perform fast parameter estimation on the ten events with the largest true mass. We use the nested sampler dynesty (Speagle 2020) as implemented in Bilby (Ashton et al. 2019) with only 100 live points, which is adequate to estimate the maximum likelihood value of total mass. Our runs use the IMRPhenomPv2 waveform approximant (Schmidt, Hannam & Husa 2012). We choose boundary values of the component mass priors to include the range of masses in the injection set. Due to noise fluctuations, the event with the largest true mass may not be the apparently most-massive event, but tests suggest we can reliably find the apparently most-massive event among these ten. We find that the truly most-massive event is the apparently most-massive event around 80 per cent of the time, while the second truly most-massive event is the apparently most-massive event around 10 per cent of the time. Few events below the third truly most-massive event are chosen as the apparently most-massive event.

We use importance sampling to reweight the result (see e.g. Thrane & Talbot 2019) to obtain the marginal likelihood that we would have obtained using |$\pi (\theta _\text{ext}| M_{\text{true}}, \text{det}, N)$| – the distribution of truly most-massive events in each catalogue. We estimate |$\pi (\theta _\text{ext}| M_{\text{true}}, \text{det}, N)$| by fitting the distribution of total mass and mass ratio posteriors of the truly most-massive events for each model with a kernel density estimator. We then calculate |${\overline{\cal Z}}$| (equation 10) for each event. The value of |${\overline{\cal Z}}$| characterizes how extreme each is compared to the distribution of truly most-massive events.

However, a small value of |${\overline{\cal Z}}$| may indicate either an unusually low-mass event or unusually high-mass event. Therefore, we select the apparently most-massive event in each catalogue as follows: if any events have a higher max-likelihood total-mass estimate greater than the median total mass of |$\pi (\theta _\text{ext}| M_\text{true}, \text{det}, N)$|⁠, we select the event of these with the lowest |${\overline{\cal Z}}$| to be the apparently most massive. If no such events exist, we select the event with the highest max-likelihood total-mass estimate to be the apparently most massive. This ensures that we do not select low-mass outliers. We demonstrate this method in further detail in Appendix A.

The distributions of maximum total mass for each model are shown in Fig. 1. For the apparently most-massive event in each catalogue, we calculate |${\overline{\cal Z}}$| (equation 10) with the nested sampler dynesty as implemented in Bilby with 1000 live points, using the same noise realization for each event as before. We use importance sampling to reweight the result to obtain the marginal likelihood that we would have obtained using |$\pi (\theta _\text{ext}| M_{\text{app}}, \text{det}, N)$|⁠, the distribution of apparently most-massive events, which we once again estimate using a kernel density estimator.

$The distribution of total mass $m_\text{tot}$ for different models. The solid curves are the original models while the histograms show the total-mass distribution of the of the apparently most-massive event in each simulated catalogue. We consider three variations of the AGN model from Gayathri et al. (2023) with different maximum black hole masses $m_\text{max}$ (top left, top right, bottom left). We include a globular cluster model from Rodriguez et al. (2019) (bottom right).$

Figure 1.

The distribution of total mass |$m_\text{tot}$| for different models. The solid curves are the original models while the histograms show the total-mass distribution of the of the apparently most-massive event in each simulated catalogue. We consider three variations of the AGN model from Gayathri et al. (2023) with different maximum black hole masses |$m_\text{max}$| (top left, top right, bottom left). We include a globular cluster model from Rodriguez et al. (2019) (bottom right).

Open in new tab Download slide

As an example, in Fig. 2 we plot |$\ln \overline{\cal Z}$| versus the maximum-likelihood value of |$m_\text{tot}$| for the apparently most-massive event in each simulated catalogue, for the |$m_\text{max} = 50 \,{\rm M}_\odot$| AGN model. The most common value of |$m_\text{tot}$| is |$\approx 130\,\mathrm{M_\odot }$|⁠, consistent with expectations from Fig. 1. As expected, these typical values of |$m_\text{tot}$| tend to produce relatively large values of |$\ln \overline{\cal Z}$|⁠.

$Lognormalized evidence $\ln {\overline{\mathcal {Z}}}$ versus max-a-posteriori total-mass estimate for the apparently most-massive events of the $m_\text{max} = 50\, {\rm M}_\odot$ AGN model. Lower $\ln {\overline{\mathcal {Z}}}$ values correspond to catalogues that are relatively unusual – either because the maximum total mass is either unusually small or unusually large. The highest values of $\ln {\overline{\mathcal {Z}}}$ correspond to typical catalogues with usual values for the maximum total mass. The lognormalized evidence for GW190521, calculated using this model, is also plotted against the max-a-posteriori total mass of GW190521.$

Figure 2.

Lognormalized evidence |$\ln {\overline{\mathcal {Z}}}$| versus max-a-posteriori total-mass estimate for the apparently most-massive events of the |$m_\text{max} = 50\, {\rm M}_\odot$| AGN model. Lower |$\ln {\overline{\mathcal {Z}}}$| values correspond to catalogues that are relatively unusual – either because the maximum total mass is either unusually small or unusually large. The highest values of |$\ln {\overline{\mathcal {Z}}}$| correspond to typical catalogues with usual values for the maximum total mass. The lognormalized evidence for GW190521, calculated using this model, is also plotted against the max-a-posteriori total mass of GW190521.

Open in new tab Download slide

Continuing with the |$m_\text{max} = 50\, {\rm M}_\odot$| AGN model as an example, we use the empirical distribution of |$\ln \overline{\cal Z}$| (shown in orange in Fig. 3) to calculate a p-value for GW190521 (⁠|$\ln \overline{\cal Z}=-6.23$|⁠; black-dashed line). Since 99 per cent of the simulations produce |$\ln \overline{\cal Z}$| values larger than the one obtained for GW190521, the p-value is |$p=0.01$| (disfavoured at the two-sigma level). This implies the total mass of GW190521 is very unusual (at the two sigma level) compared to the distribution of expected apparently most-massive events after |$N=100$| events and given the Gayathri et al. (2023) model.

$Probability distribution of lognormalized evidence $(\ln {\overline{\mathcal {Z}}})$ for distributions of the apparently most-massive, detectable events drawn from equation (4) (histograms), against $\ln {{\overline{\mathcal {Z}}}^{\prime }}$, the lognormalized evidence for GW190521 (black-dashed line), for AGN models with $m_\text{max} = 15 \text{} \,{\rm M}_\odot \text{ (top left), } 50\, \text{} {\rm M}_\odot \text{ (top right),}\text{ and } 75 \text{}\, {\rm M}_\odot \text{ (bottom left)}$, as well as a globular cluster model with $\chi _\text{birth} = 0$ (bottom right). The p-value calculation in equation (13) is the integral over the region to the left of $\ln {{\overline{\mathcal {Z}}}^{\prime }}$.$

Figure 3.

Probability distribution of lognormalized evidence |$(\ln {\overline{\mathcal {Z}}})$| for distributions of the apparently most-massive, detectable events drawn from equation (4) (histograms), against |$\ln {{\overline{\mathcal {Z}}}^{\prime }}$|⁠, the lognormalized evidence for GW190521 (black-dashed line), for AGN models with |$m_\text{max} = 15 \text{} \,{\rm M}_\odot \text{ (top left), } 50\, \text{} {\rm M}_\odot \text{ (top right),}\text{ and } 75 \text{}\, {\rm M}_\odot \text{ (bottom left)}$|⁠, as well as a globular cluster model with |$\chi _\text{birth} = 0$| (bottom right). The p-value calculation in equation (13) is the integral over the region to the left of |$\ln {{\overline{\mathcal {Z}}}^{\prime }}$|⁠.

Open in new tab Download slide

We repeat this process for the |$m_\text{max} = 15\, \text{ and } \,75\, {\rm M}_\odot$| AGN models, as well as the globular cluster model. We summarize the results in Table 1 (see also Fig. 3). We find |$p\simeq 0$| for the |$m_\text{max} = 15 \,{\rm M}_\odot$| AGN model, indicating this model almost never produces an event as massive as GW190521. We find the |$m_\text{max} = 75\, {\rm M}_\odot$| AGN model has |$p = 0.61$|⁠, suggesting the model produces GW190521-like events often. Finally, we find the globular cluster model has |$p = 0.12$|⁠, suggesting that it is reasonable to draw a GW190521-like event from this model.

Table 1.

Open in new tab

Associated p-value for each model based on AGN natal black hole mass |$m_\text{max}$| or globular cluster (GC) scenario.

Model	\|$15 \,{\rm M}_\odot$\|	\|$50 \,{\rm M}_\odot$\|	\|$75 {\rm M}_\odot$\|	GC
p-Value	\|$\sim 0$\|	0.01	0.61	0.12

Table 1.

Open in new tab

Associated p-value for each model based on AGN natal black hole mass |$m_\text{max}$| or globular cluster (GC) scenario.

Model	\|$15 \,{\rm M}_\odot$\|	\|$50 \,{\rm M}_\odot$\|	\|$75 {\rm M}_\odot$\|	GC
p-Value	\|$\sim 0$\|	0.01	0.61	0.12

4 DISCUSSION

We have demonstrated the use of our statistical framework, which makes use of the ‘normalized evidence’ |$\overline{\mathcal {Z}}$| in testing several models of binary formation using the exceptional event GW190521. We rule out the AGN model with |$m_\text{max} = 15 \,\text{} {\rm M}_\odot$|⁠. The AGN model with |$m_\text{max} = 50\, \text{} {\rm M}_\odot$| is disfavoured in explaining GW190521 (at the two-sigma level). The AGN model with |$m_\text{max} = 75\, \text{} {\rm M}_\odot$| and the globular cluster model plausibly provide adequate explanations for GW190521.

In order to compare our method with the |$m_\text{tot}$| max-likelihood-estimated p-value calculation from Fishbach et al. (2020a), we carry out a simulation. We consider outlier events for the |$m_\text{max} = 50 \,{\rm M}_\odot$| AGN model and consider two different situations. In both cases, we observe an outlier in total mass with maximum-likelihood value of |$m_\text{tot} = 165 {\rm M}_\odot$|⁠. However, in one case, the network signal-to-noise ratio is relatively low (SNR |$=12$|⁠) while in the other case the signal-to-noise ratio is high (SNR |$=45$|⁠).

In the Fishbach et al. (2020a) framework, these two events are assigned the same p-value because they both have the same maximum-likelihood value for |$m_\text{tot}$|⁠. These p-values, however, do not fold in all the available information. Since the posterior for |$m_\text{tot}$| narrows with increasing signal-to-noise ratio, we can be relatively more confident that the SNR |$=45$| event is incompatible with the AGN model because the total mass is more assuredly measured outside the predicted range of the model. Since our approach includes information about the likelihood width, the SNR |$=45$| event yields a lower p-value than the SNR |$=12$| event.

To illustrate, we inject two events with different distances into coloured Gaussian noise and perform parameter estimation as above. The results are illustrated in Fig. 4. The SNR |$=45$| event yields |$p = 0.01$|⁠, indicating the |$m_\text{max} = 50 \,{\rm M}_\odot$| AGN model poorly explains the event. The SNR |$=12$| event yields |$p=0.05$|⁠, which also disfavours the |$50\, {\rm M}_\odot$| AGN model, but with less statistical significance.

$Lognormalized evidence $\ln {\overline{\mathcal {Z}}}$ versus max-a-posteriori total-mass estimate for the apparently most-massive events of the $m_{\rm max} = 50\, {\rm M}_\odot$ AGN model, as in Fig. 2. The red and blue stars indicate the $\ln {\overline{\mathcal {Z}}}$ calculated for an event with the same peak total-mass likelihood of $165 {\rm M}_\odot$, but with a network SNR of 12 and 45. The low-SNR event’s posterior leaks into the model distribution to a greater extent than the high-SNR event, and is therefore assigned a higher p-value. The high-SNR event is more confidently in the tail of the distribution, with a lower associated p-value.$

Figure 4.

Lognormalized evidence |$\ln {\overline{\mathcal {Z}}}$| versus max-a-posteriori total-mass estimate for the apparently most-massive events of the |$m_{\rm max} = 50\, {\rm M}_\odot$| AGN model, as in Fig. 2. The red and blue stars indicate the |$\ln {\overline{\mathcal {Z}}}$| calculated for an event with the same peak total-mass likelihood of |$165 {\rm M}_\odot$|⁠, but with a network SNR of 12 and 45. The low-SNR event’s posterior leaks into the model distribution to a greater extent than the high-SNR event, and is therefore assigned a higher p-value. The high-SNR event is more confidently in the tail of the distribution, with a lower associated p-value.

Open in new tab Download slide

For future work, we propose to apply this method more broadly in order to falsify models as explanations for exceptional events. Key questions include the following:

Which models are consistent with the most massive binary black holes?
Which models are consistent with the most extreme spins?
Which models are consistent with the most extreme mass ratios, such as GW190814 (Abbott et al. 2020b)?
Similarly, which models are consistent with the smallest secondary masses, like the |$m_2=2.6 \,{\rm M}_\odot$| object observed in GW190814 (Abbott et al. 2020b)?

There are many ways to evaluate the explanatory power of different population models, each of which provides the answer to a different question. The Bayes factor can be used to determine which of two population models provides a better description of the data (e.g. Mould et al. 2023). We note it may be useful to determine if none of the models tested provide an adequate fit – something we do not learn from the Bayes factor, which is a relative comparison of models.

Leave-one-out analyses can be used to determine if one event is a statistical outlier with respect to the rest of the population (e.g. Essick et al. 2022). Our method may be similarly extended to examine whether fits to the entire observed population can explain the most exceptional events observed. If the model is fit to the data and still struggles to explain an exceptional event, it is clearly inadequate.

Finally, the ‘maximum population likelihood’ may be used to test a model’s goodness of fit to observed events (Payne & Thrane 2023). This method characterizes a model’s fit to an entire catalogue of events, requiring parameter estimation on all events in many simulated catalogues. Our method focuses only on exceptional events, and thus vastly reduces the total computation required.

Our framework is an example of ‘model criticism’ (Romero-Shaw, Thrane & Lasky 2022). It is a posterior predictive check in which one infers a model distribution from observed data, then compares samples from that distribution back to data, to check for consistency (e.g. Fishbach et al. 2020b). Specifically, we perform a posterior predictive check on a model for a single observed extreme event. This requires adopting a frequentist approach, as opposed to an entirely Bayesian framework, as well as comparisons of events’ ‘normalized evidences’, rather than comparisons of their posteriors.

We note that a similar approach to model criticism was recently adopted by Amendola et al. (2024), who developed a model-testing framework using a |$p{\text{-}}$|value calculated over a distribution of Bayes factors. They applied this statistic to cosmological supernovae 1a data, and found their method similarly avoids shortcomings of purely Bayesian or frequentist approaches. The multidisciplinary applicability of such a statistic highlights its potential to improve model testing not only in gravitational wave data, but across diverse scientific fields.

ACKNOWLEDGEMENTS

We thank Thomas Callister, Matthew Mould and the referee for their helpful comments. This is LIGO document #P2400175. We acknowledge support from the Australian Research Council (ARC) Centres of Excellence CE170100004 and CE230100016, as well as ARC LE210100002, and ARC DP230103088. LP receives support from the Australian Government Research Training Program. SS is a recipient of an ARC Discovery Early Career Research Award (DE220100241). This material is based upon work supported by NSF’s LIGO Laboratory which is a major facility fully funded by the National Science Foundation. The authors are grateful for computational resources provided by the LIGO Laboratory and supported by National Science Foundation Grants PHY-0757058 and PHY-0823459.

This research has made use of data or software obtained from the Gravitational Wave Open Science Center (gw-openscience.org), a service of LIGO Laboratory, the LIGO Scientific Collaboration, the Virgo Collaboration, and KAGRA. LIGO Laboratory and Advanced LIGO are funded by the United States National Science Foundation (NSF) as well as the Science and Technology Facilities Council (STFC) of the United Kingdom, the Max-Planck-Society (MPS), and the State of Niedersachsen/Germany for support of the construction of Advanced LIGO and construction and operation of the GEO600 detector. Additional support for Advanced LIGO was provided by the Australian Research Council. Virgo is funded, through the European Gravitational Observatory (EGO), by the French Centre National de Recherche Scientifique (CNRS), the Italian Istituto Nazionale di Fisica Nucleare (INFN) and the Dutch Nikhef, with contributions by institutions from Belgium, Germany, Greece, Hungary, Ireland, Japan, Monaco, Poland, Portugal, Spain. The construction and operation of KAGRA are funded by Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan Society for the Promotion of Science (JSPS), National Research Foundation(NRF), Ministry of Science and ICT (MSIT) in Korea, and Academia Sinica (AS) and the Ministry of Science and Technology (MoST) in Taiwan.

DATA AVAILABILITY

The data underyling this article are publicly available at https://www.gw-openscience.org.

Footnotes

We compare and contrast our method with the technique from Fishbach, Essick & Holz (2020b) at the top of Section 4.

In this limit,

$$\begin{eqnarray} \pi (\theta | M, \text{det}, N) = \Phi (\theta | M, \text{det})^{N-1} \, \pi (\theta | M, \text{det}), \end{eqnarray}$$

(6)

where |$\Phi$| is a cumulative density function

$$\begin{eqnarray} \Phi (\theta | M, \text{det}) = \int _{\theta _\text{min}}^\theta {\rm d}\theta ^{\prime } \, \pi (\theta ^{\prime }| M, \text{det}) . \end{eqnarray}$$

(7)

Astute readers may notice that |$\overline{\mathcal {Z}}$| is equivalent to a Bayes factor comparing the model with the prior given in equation (4) to a prior uniform in total mass.

While we abbreviate our two models as ‘AGN’ and ‘Globular Cluster’, it should go without saying that they many are other versions of binary formation models involving AGNs and globular clusters.

REFERENCES

Acernese

et al. ,

2015

Class. Quantum Grav.

024001

Month:	Total Views:
November 2024	86
December 2024	45
January 2025	42
February 2025	25
March 2025	23
April 2025	27

Article Contents

Are all models wrong? Falsifying binary formation models in gravitational-wave astronomy using exceptional events

ABSTRACT

1 INTRODUCTION

2 METHOD

3 DEMONSTRATION

4 DISCUSSION

ACKNOWLEDGEMENTS

DATA AVAILABILITY

Footnotes

REFERENCES

APPENDIX A: CHOOSING THE APPARENTLY MOST-MASSIVE EVENT IN EACH CATALOGUE

Citations

Views

Altmetric

Email alerts

Astrophysics Data System

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only