Abstract

We derive a Bayesian framework for incorporating selection effects into population analyses. We allow both for measurement uncertainty in individual observations and, crucially, for selection biases affecting which events are observed, and show how to extract the parameters of the underlying distribution based on a set of observations sampled from this distribution. We illustrate the performance of this framework with an example from gravitational-wave astrophysics, demonstrating that the mass ratio distribution of merging compact-object binaries can be extracted from Malmquist-biased observations with substantial measurement uncertainty.

1 INTRODUCTION

The problem of extracting the distributional properties of a population of sources based on a set of observations drawn from that distribution is a common one, frequently labelled as hierarchical modelling (e.g. Hogg, Myers & Bovy 2010; Bovy, Hogg & Roweis 2011 call this ‘extreme deconvolution’). In practical applications, one often has to deal with selection effects: the observed population will have a Malmquist bias (Malmquist 1922, 1925) whereby the loudest or brightest sources are most likely to be detected, and it is necessary to correct for this bias in order to extract the true source population (e.g. Farr et al. 2014; Foreman-Mackey, Hogg & Morton 2014). In other applications, significant measurement uncertainties in the individual observations must be accounted for (e.g. Farr & Mandel 2018). Of course, these two complications – measurement uncertainties and selection effects – are often present simultaneously.

There have been multiple attempts to address the problem of population-based inference with both selection effects and significant measurement uncertainties. The earliest correct published solution to this problem, as far as we are aware, belongs to Loredo (2004). However, despite the availability of this solution, it is easy to be lured into a seemingly straightforward but incorrect derivation. The most common mistake is the modification of the model population distribution to account for the selection function, i.e. the inclusion of the probability of detecting a particular event only as a multiplicative term in the probability of observing that event. This detection probability is usually included as the probability marginalized over all realizations of the data, ignoring the fact that we know the particular data realization that has been observed. For a given data realization the probability that a source is detected, which is a property purely of the data, is by definition equal to one for any data set associated with an observation we are analysing. On the other hand, as shown below, it is critical to include the detection probability in the normalization factor to account for the different numbers of events expected to be observed under different population models.

We sketched out the correct approach to including selection effects in Mandel, Farr & Gair (2016) (which is superseded by the present manuscript) and Abbott et al. (2016). Other correct applications in the literature include Fishbach & Holz (2017), Fishbach, Holz & Farr (2018), and Feeney et al. (2019). Here, we expand and clarify the earlier treatment of Loredo (2004) by presenting two different approaches to solving this problem below: a bottom-up and a top-down derivation, showing that they yield the same result. Some among us find one or the other approach clearer, and we hope that including both will benefit readers.

We illustrate the derived methodology with two examples. The first is the classic example of measuring a luminosity function with a flux-limited survey. The second is an example from gravitational-wave astronomy: the measurement of the mass ratio of merging binary neutron stars. We show that ≳ 1000 observations at a signal-to-noise ratio (SNR) of ≳ 20 will be necessary to accurately measure the mass ratio distribution. This feat, which can be accomplished with third-generation ground-based gravitational-wave detectors, could elucidate the details of neutron star formation.

2 PROBLEM STATEMENT AND NOTATION

We consider a population of events or objects, each described by a set of parameters $\vec{\theta}$. These parameters represent the characteristics of individual events. For example, in the case of compact binary coalescences observed by LIGO and Virgo these would include the masses, spin magnitudes and spin orientations of the two components, the location of the source on the sky, the distance of the source, the orientation and eccentricity of the binary orbit, etc. The distribution of events in the population is described via parameters $\vec{\lambda}$, so that the number density of objects follows $\frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}(\vec{\lambda}) = N\, p_\textrm{pop}(\vec{\theta}|\vec{\lambda}')$. In the gravitational-wave context, these parameters could represent properties of the population like the slope of the mass function of black holes in compact binaries, or the shape of the spin magnitude distribution, or the mixing fractions of different subpopulations. They could also represent physical ingredients used in population synthesis calculations, for example the parameters of the initial mass function, stellar metallicity distribution or stellar winds and the properties of common envelope evolution or of the distribution of supernova kicks. In this second case, the distribution of the individual event properties $p_\textrm{pop}(\vec{\theta}|\vec{\lambda}')$ could be obtained from the output of population synthesis codes for that particular choice of input physics. We have separated $\vec{\lambda}$ into the overall normalization for the number or rate of events, $N$, and the set of parameters describing the shape of the distribution alone, $\vec{\lambda}'$. For instance, if the underlying distribution is modelled as a multidimensional Gaussian, $\vec{\lambda}'$ would consist of the mean vector and covariance matrix; alternatively, a non-parametric distribution could be described with a (multidimensional) histogram, in which case $\vec{\lambda}'$ represents the weights of various histogram bins.

This distribution is sampled by drawing a set of $N_\mathrm{obs}$ ‘observed events’ with true parameters $\{\vec{\theta}_i\}$, for $i \in [1, N_\mathrm{obs}]$. For each object in the population we make a noisy measurement of $\vec{\theta}_i$, represented by a likelihood function relating the measured data, $\vec{d}_i$, to the parameters of the event, $\vec{\theta}$: $p\left(\vec{d}_i \mid \vec{\theta}_i\right)$.

Moreover, based on the observed data, some objects are classed as ‘observable’ and others are ‘unobservable.’ For example, a survey may impose a per-pixel or per-aperture threshold on the flux for inclusion of point sources in a catalogue, or a gravitational wave detector may only report events whose SNR rises above some predetermined threshold. This detection probability can be estimated empirically for a search pipeline via a large injection campaign. In some cases, it can be modelled analytically; for example, for low-mass compact binaries, the gravitational-wave strain in the frequency domain is proportional to the 5/6 power of the chirp mass $M_\mathrm{c}$, so the detection probability scales as the surveyed volume,¹ $\propto M_\mathrm{c}^{15/6}$. Throughout this article, we will assume that whether or not an event is counted as a detection is a property only of the data for each object and so there exists an indicator function $\mathbf{I}(\vec{d})$ that is equal to 1 for ‘observable’ objects that would be classified as detections and 0 otherwise; this is by far the most common case for astronomical observations.²

Our ultimate goal is to determine the population properties $\vec{\lambda}$. Of course, we cannot uniquely reconstruct $\vec{\lambda}$ using a limited set of observations with selection biases and measurement uncertainties. The best we can do is compute the posterior probability on $\vec{\lambda}$, the distribution on distributions, given the observations, which, in the usual Bayesian formalism, is given by
$$p(\vec{\lambda}|\{\vec{d}_i\}) = \frac{p(\{\vec{d}_i\}|\vec{\lambda})\, \pi(\vec{\lambda})}{p(\{\vec{d}_i\})}, \qquad (1)$$
where $p(\{\vec{d}_i\}|\vec{\lambda})$ is the likelihood of observing the data set given the population properties, $\pi(\vec{\lambda})$ is the prior on $\vec{\lambda}$ and the evidence $p(\{\vec{d}_i\})$ is the integral of the numerator over all $\vec{\lambda}$. This evidence can be used to select between different models for representing the distribution, as in Farr et al. (2011). In the next two sections, we present two alternative ways of deriving $p(\{\vec{d}_i\}|\vec{\lambda})$.

3 BOTTOM-UP DERIVATION

First, we follow the bottom-up approach of deriving the likelihood for obtaining a particular set of observations given the population parameters, by starting with a simple problem without either measurement uncertainties or selection effects and gradually building up the problem complexity. For the moment, we assume that we are only interested in the shape of the population distribution, and ignore the normalization, or rate, of objects in the population; we discuss estimation of both the rate and shape of a population at the end of this section and in Section 4.

In the absence of measurement uncertainties, the data can be directly converted into event parameters $\{\vec{\theta}_i\}$, for $i \in [1, N_\mathrm{obs}]$. The total probability of making this particular set of independent observations is
$$p(\{\vec{\theta}_i\}|\vec{\lambda}') = \prod_{i=1}^{N_\mathrm{obs}} \frac{p_\textrm{pop}(\vec{\theta}_i|\vec{\lambda}')}{\int \mathrm{d}\vec{\theta}\; p_\textrm{pop}(\vec{\theta}|\vec{\lambda}')}. \qquad (2)$$
The normalization factor here accounts for the overall probability of making an observation given a particular choice of $\vec{\lambda}$ (it will be equal to 1 if $p_\textrm{pop}$ is properly normalized, but we keep the normalization term for completeness).
In practice, there is often a selection bias involved: some events are easier to observe than others. This can be characterized by a detection probability $p_\textrm{det}(\vec{\theta})$. We assume for now that when systems are observed their parameters can be measured perfectly, i.e. we directly measure $\{\vec{\theta}_i\}$. This effectively says that the noise in the measurement is negligible³ and the selection effects can be applied directly to the event parameters: $p_\textrm{det}(\vec{\theta}) = \mathbf{I}(\vec{\theta})$, i.e. events are either always detected or never detected depending on their parameters. With the selection effect included, equation (2) becomes (e.g. Chennamangalam et al. 2013; Farr et al. 2015)
$$p(\{\vec{\theta}_i\}|\vec{\lambda}') = \prod_{i=1}^{N_\mathrm{obs}} \frac{\mathbf{I}(\vec{\theta}_i)\, p_\textrm{pop}(\vec{\theta}_i|\vec{\lambda}')}{\int \mathrm{d}\vec{\theta}\; p_\textrm{det}(\vec{\theta})\, p_\textrm{pop}(\vec{\theta}|\vec{\lambda}')} = \prod_{i=1}^{N_\mathrm{obs}} \frac{p_\textrm{pop}(\vec{\theta}_i|\vec{\lambda}')}{\int \mathrm{d}\vec{\theta}\; p_\textrm{det}(\vec{\theta})\, p_\textrm{pop}(\vec{\theta}|\vec{\lambda}')}, \qquad (3)$$
where the second equality follows because, by definition, $\mathbf{I}(\vec{\theta})=1$ for any event we have included among the set of detections.
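As a concrete check of equation (3), the following minimal Python sketch evaluates this likelihood for a one-dimensional toy problem with a Gaussian population and a hard threshold on $\theta$; the function name and the Gaussian/threshold choices are illustrative assumptions, not part of the framework above.

```python
import numpy as np
from scipy import stats

def log_likelihood_perfect(theta_obs, mu, sigma, theta_min):
    """Log of equation (3) for perfectly measured parameters theta_obs,
    a Gaussian population N(mu, sigma), and detection requiring theta > theta_min."""
    # Numerator: population density evaluated at each (perfectly measured) event
    log_numerator = stats.norm.logpdf(theta_obs, loc=mu, scale=sigma).sum()
    # Denominator: detectable fraction of the population, analytic for this toy model
    detectable_fraction = stats.norm.sf(theta_min, loc=mu, scale=sigma)
    return log_numerator - len(theta_obs) * np.log(detectable_fraction)
```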
In general, we do not have the luxury of directly measuring the parameters of a given event, $\vec{\theta}_i$. Instead, we measure the data set $\vec{d}_i$ which encodes these parameters but also includes some random noise. For a given data set and search pipeline, we assume that the detectability is deterministic: if the data exceed some threshold (e.g. a threshold on the SNR), then the event is detectable, otherwise it is not. In other words, the detection probability for a given set of parameters introduced earlier is, in fact, an integral over the possible data sets given those parameters:
$$p_\textrm{det}(\vec{\theta}) = \int_{\vec{d} > \mathrm{threshold}} \mathrm{d}\vec{d}\; p(\vec{d}|\vec{\theta}). \qquad (4)$$
In the gravitational-wave context, detection is usually well approximated as a cut on the observed SNR, so this detection probability is the integral of the likelihood of the observed SNR over values above the threshold. There are two stochastic components to the observed SNR – fluctuations in the detector noise which change the observed SNR relative to the intrinsic SNR, and fluctuations in the intrinsic SNR due to variations in the source parameters. For an example of the latter, the expected signal amplitude is a strong function of the mass – a selection effect that is critical to consider when inferring the underlying distribution of binary black hole masses from the observed events (Abbott et al. 2016; Fishbach & Holz 2017). As another example, the intrinsic SNR also depends on extrinsic parameters of the binary, i.e. the sky location and orientation of the system. That dependence is largely encoded in the distribution of the parameter Θ described in Finn & Chernoff (1993). The function $p_\textrm{det}(\vec{\theta})$ encodes both these types of intrinsic selection effect, plus marginalization over instrumental noise fluctuations.
Using equation (4), we can write the probability of observing a particular data set (where ‘observing’ implies that the data are above the threshold, hence included as one of our $N_\mathrm{obs}$ observations) given the assumed distribution parametrized by $\vec{\lambda}'$ as
$$p(\vec{d}_i|\vec{\lambda}') = \frac{\int \mathrm{d}\vec{\theta}\; p(\vec{d}_i|\vec{\theta})\, p_\textrm{pop}(\vec{\theta}|\vec{\lambda}')}{\alpha(\vec{\lambda}')}, \qquad (5)$$
where the normalization factor $\alpha(\vec{\lambda}')$ is given by
$$\alpha(\vec{\lambda}') = \int \mathrm{d}\vec{\theta}\; p_\textrm{det}(\vec{\theta})\, p_\textrm{pop}(\vec{\theta}|\vec{\lambda}') = \int \mathrm{d}\vec{\theta} \int_{\vec{d} > \mathrm{threshold}} \mathrm{d}\vec{d}\; p(\vec{d}|\vec{\theta})\, p_\textrm{pop}(\vec{\theta}|\vec{\lambda}'). \qquad (6)$$
This normalization factor can be interpreted as the fraction of events in the Universe that would be detected for a particular population model, characterized by the population parameters $\vec{\lambda}'$.
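In practice $\alpha(\vec{\lambda}')$ rarely has a closed form and is typically estimated by Monte Carlo, e.g. via an injection campaign. The sketch below illustrates such an estimate for a toy one-dimensional model; the callables (`sample_population`, `simulate_data`, `is_detected`) and the Gaussian/threshold choices in the usage example are assumptions for illustration only.

```python
import numpy as np

def estimate_alpha(sample_population, simulate_data, is_detected, n_draws=100_000, rng=None):
    """Monte Carlo estimate of equation (6): the fraction of events drawn from
    p_pop(theta | lambda') whose simulated data would pass the detection cut."""
    rng = np.random.default_rng(rng)
    theta = sample_population(n_draws, rng)   # draws from p_pop(theta | lambda')
    data = simulate_data(theta, rng)          # noisy data realizations from p(d | theta)
    return np.mean(is_detected(data))

# Usage example with a 1D Gaussian population, additive unit-variance noise,
# and a threshold on the observed value (all toy choices, not from the paper):
alpha = estimate_alpha(
    sample_population=lambda n, rng: rng.normal(1.0, 2.0, n),
    simulate_data=lambda theta, rng: theta + rng.normal(0.0, 1.0, theta.shape),
    is_detected=lambda d: d > 2.0,
)
```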
Thus, in the presence of both measurement uncertainty and selection effects, equations (2) and (3) become:
$$p(\{\vec{d}_i\}|\vec{\lambda}') = \prod_{i=1}^{N_\mathrm{obs}} \frac{\int \mathrm{d}\vec{\theta}\; p(\vec{d}_i|\vec{\theta})\, p_\textrm{pop}(\vec{\theta}|\vec{\lambda}')}{\alpha(\vec{\lambda}')}. \qquad (7)$$
The presence of the likelihood $p(\vec{d}_i|\vec{\theta})$ in this equation is a reminder that we do not have a perfect measurement of the parameters of a given event. The likelihood can be rewritten in terms of the posterior probability density function (PDF) $p(\vec{\theta}_i|\vec{d}_i)$ that is computed in the course of single-event parameter estimation using some assumed prior $\pi(\vec{\theta})$:
$$p(\vec{d}_i|\vec{\theta}) = \frac{p(\vec{\theta}|\vec{d}_i)\, p(\vec{d}_i)}{\pi(\vec{\theta})}. \qquad (8)$$
Thus, each term of the product in equation (7) is a normalized convolution integral of the population with the posterior PDF (Mandel 2010).
In practice, the posterior PDF $p(\vec{\theta}_i|\vec{d}_i)$ is often discretely sampled with $S_i$ samples from the posterior, $\{{}^j\vec{\theta}_i\}$, for $j \in [1, S_i]$. Because the samples are drawn according to the posterior, the parameter space volume associated with each sample is inversely proportional to the local PDF, $\mathrm{d}^j\vec{\theta}_i \propto \left[p({}^j\vec{\theta}_i | \vec{d}_i)\right]^{-1}$. This allows us to easily replace the integral in equation (7) with a discrete sum over PDF samples:
$$p(\{\vec{d}_i\}|\vec{\lambda}') \propto \prod_{i=1}^{N_\mathrm{obs}} \frac{1}{\alpha(\vec{\lambda}')}\, \frac{1}{S_i} \sum_{j=1}^{S_i} \frac{p_\textrm{pop}({}^j\vec{\theta}_i|\vec{\lambda}')}{\pi({}^j\vec{\theta}_i)}. \qquad (9)$$
Finally, the posterior on the underlying population parameters $\vec{\lambda}'$ is given by substituting equation (9) into equation (1):
$$p(\vec{\lambda}'|\{\vec{d}_i\}) \propto \pi(\vec{\lambda}') \prod_{i=1}^{N_\mathrm{obs}} \frac{1}{\alpha(\vec{\lambda}')\, S_i} \sum_{j=1}^{S_i} \frac{p_\textrm{pop}({}^j\vec{\theta}_i|\vec{\lambda}')}{\pi({}^j\vec{\theta}_i)}. \qquad (10)$$
Of course, if we are interested in the distribution of a single parameter, we can marginalize equation (10) in the usual way, by integrating over the remaining parameters.
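The following Python sketch shows how equations (9) and (10) might be evaluated from per-event posterior samples; the function signature, the `log_*` callables, and the use of log-space arithmetic are implementation assumptions, not prescriptions from the text.

```python
import numpy as np

def log_posterior_shape(lam, posterior_samples, log_p_pop, log_pi_interim,
                        log_alpha, log_prior_lam):
    """Log of equation (10): population parameters lam given per-event
    posterior samples drawn under an interim prior pi(theta).

    posterior_samples     : list of arrays, samples {theta_i^j} for each event i
    log_p_pop(theta, lam) : log population density p_pop(theta | lam)
    log_pi_interim(theta) : log of the interim prior used in per-event parameter estimation
    log_alpha(lam)        : log of the selection normalization alpha(lam)
    """
    log_post = log_prior_lam(lam)
    for samples in posterior_samples:
        # Monte Carlo average of p_pop / pi over this event's posterior samples
        log_weights = log_p_pop(samples, lam) - log_pi_interim(samples)
        log_post += np.logaddexp.reduce(log_weights) - np.log(len(samples))
        # One factor of 1/alpha per observed event
        log_post -= log_alpha(lam)
    return log_post
```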

We have so far described inference based on the shape of the distribution $p_\textrm{pop}(\vec{\theta}|\vec{\lambda}')$ while ignoring the overall normalization. This is appropriate when the overall normalization on the population counts is not interesting, or when the bulk of the information comes from the distribution properties rather than the detection rate (a single data point). This is a reasonable assumption in the gravitational-wave context, where the astrophysical uncertainty on the rates of compact object mergers covers several orders of magnitude. While inferring the rate is of great interest, models may not predict it with sufficient precision for that measurement to have strong constraining power.

In contexts in which the expected number of detections $N_\mathrm{det}$ can be predicted, this can be readily included in the framework. The probability of making $N_\mathrm{obs}$ observations is given by the Poisson distribution
$$p(N_\mathrm{obs}|\vec{\lambda}) = N_\mathrm{det}^{N_\mathrm{obs}}\, \mathrm{e}^{-N_\mathrm{det}}. \qquad (11)$$
Here, the usual $N_\mathrm{obs}!$ term in the denominator is absent because the events are distinguishable by their data; in any case, as a normalization term that depends on the data only, it would not impact inference on $\vec{\lambda}$. The expected number of detections once selection effects are included is (cf. equation 23):
$$N_\mathrm{det}(\vec{\lambda}) = N\, \alpha(\vec{\lambda}'). \qquad (12)$$
The posterior on the population parameters with the rate included becomes
$$p(\vec{\lambda}', N|\{\vec{d}_i\}) \propto \pi(\vec{\lambda}', N)\; \mathrm{e}^{-N_\mathrm{det}} \left[\prod_{i=1}^{N_\mathrm{obs}} \frac{1}{\alpha(\vec{\lambda}')\, S_i} \sum_{j=1}^{S_i} \frac{p_\textrm{pop}({}^j\vec{\theta}_i|\vec{\lambda}')}{\pi({}^j\vec{\theta}_i)}\right] N_\mathrm{det}^{N_\mathrm{obs}}. \qquad (13)$$
Note that if a prior $\pi(N) \propto 1/N$ is assumed on the intrinsic event number or rate (Fishbach et al. 2018), equation (13) can be marginalized over $N$ to again yield equation (10) up to a normalization constant, which depends only on the number of observed events and would not impact inference on model parameters:
$$p(\vec{\lambda}'|\{\vec{d}_i\}) \propto (N_\mathrm{obs}-1)!\; \pi(\vec{\lambda}') \prod_{i=1}^{N_\mathrm{obs}} \frac{1}{\alpha(\vec{\lambda}')\, S_i} \sum_{j=1}^{S_i} \frac{p_\textrm{pop}({}^j\vec{\theta}_i|\vec{\lambda}')}{\pi({}^j\vec{\theta}_i)}, \qquad (14)$$
where we used
$$\int_0^\infty \mathrm{d}N\; N^{N_\mathrm{obs}-1}\, \mathrm{e}^{-N\alpha(\vec{\lambda}')} = \frac{(N_\mathrm{obs}-1)!}{\left[\alpha(\vec{\lambda}')\right]^{N_\mathrm{obs}}}. \qquad (15)$$
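A sketch of how the rate term of equation (13) might be added on top of the shape-only posterior above; the function decomposition and argument names are assumptions for illustration, and any explicit prior on $N$ would be supplied separately.

```python
import numpy as np

def log_posterior_with_rate(lam, n_expected, n_obs, log_shape_posterior, log_alpha):
    """Log of equation (13) up to a constant: multiply the shape-only posterior of
    equation (10) (which already carries the prior on lam and the 1/alpha^{N_obs}
    factors) by the Poisson term N_det^{N_obs} exp(-N_det), with N_det = N * alpha.
    A prior on the intrinsic number N can be added to the returned value if desired."""
    alpha = np.exp(log_alpha(lam))
    n_det = n_expected * alpha
    return log_shape_posterior(lam) + n_obs * np.log(n_det) - n_det
```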

4 TOP-DOWN DERIVATION

Alternatively, we consider a top-down calculation. If we have observed a representative sample from the population (i.e. a ‘fair draw’), then the appropriate (unnormalized) joint distribution for the parameters $\{\vec{\theta}_i\}_{i=1}^{N_\mathrm{total}}$ and observations $\{\vec{d}_i\}$ of the $i = 1, \ldots, N_\mathrm{total}$ objects given the parameters $\vec{\lambda}$ describing the population (again, $\vec{\lambda}$ are all parameters describing the population, including the rate, while $\vec{\lambda}'$ are parameters that only describe the shape of the population) is
$$p\left(\{\vec{d}_i, \vec{\theta}_i\}_{i=1}^{N_\mathrm{total}} \mid \vec{\lambda}\right) \propto \mathrm{e}^{-N(\vec{\lambda})} \prod_{i=1}^{N_\mathrm{total}} p(\vec{d}_i|\vec{\theta}_i)\, \frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}(\vec{\theta}_i), \qquad (16)$$
where
$$N(\vec{\lambda}) \equiv \int \mathrm{d}\vec{d}\, \mathrm{d}\vec{\theta}\; p(\vec{d}|\vec{\theta})\, \frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}(\vec{\theta}) \qquad (17)$$
is the expected number of objects in the population.⁴ This is the standard likelihood for a hierarchical analysis of an inhomogeneous Poisson process (Loredo & Wasserman 1995; Hogg et al. 2010; Mandel 2010; Youdin 2011; Foreman-Mackey et al. 2014; Farr et al. 2015; Barrett et al. 2018).
If some objects are classed as ‘observable’ (indexed by $i$) and others are ‘unobservable’ (indexed by $j$), the complete set of observations partitions into two subsets of cardinality $N_\mathrm{obs}$ and $N_\mathrm{nobs}$:
$$p\left(\{\vec{d}_i, \vec{\theta}_i\}, \{\vec{d}_j, \vec{\theta}_j\} \mid \vec{\lambda}\right) \propto \mathrm{e}^{-N(\vec{\lambda})} \left[\prod_{i=1}^{N_\mathrm{obs}} p(\vec{d}_i|\vec{\theta}_i)\, \frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}(\vec{\theta}_i)\right] \left[\prod_{j=1}^{N_\mathrm{nobs}} p(\vec{d}_j|\vec{\theta}_j)\, \frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}(\vec{\theta}_j)\right]. \qquad (18)$$
Again, a key point is that we can perform this partitioning simply by examining the data obtained for each object.
It is common for the data associated with ‘non-observable’ objects to be completely censored; that is, it often does not appear in a catalogue or otherwise at all. In this case, it is appropriate to marginalize over the parameters and (unknown) data for the ‘non-observable’ objects. Doing so destroys the distinguishability inherent in the inhomogeneous Poisson distribution, so we must introduce a factor of $N_\mathrm{nobs}!$ to account for the overcounting:
$$p\left(\{\vec{d}_i, \vec{\theta}_i\}, N_\mathrm{nobs} \mid \vec{\lambda}\right) \propto \mathrm{e}^{-N(\vec{\lambda})}\, \frac{\left[N_\mathrm{ndet}(\vec{\lambda})\right]^{N_\mathrm{nobs}}}{N_\mathrm{nobs}!} \prod_{i=1}^{N_\mathrm{obs}} p(\vec{d}_i|\vec{\theta}_i)\, \frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}(\vec{\theta}_i), \qquad (19)$$
where
$$N_\mathrm{ndet}(\vec{\lambda}) \equiv \int \mathrm{d}\vec{d}\, \mathrm{d}\vec{\theta}\; \left[1 - \mathbf{I}(\vec{d})\right] p(\vec{d}|\vec{\theta})\, \frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}(\vec{\theta}) \qquad (20)$$
is the expected number of non-detections in the population model. Stopping here we would have a model similar to the ones discussed in Messenger & Veitch (2013) (though that reference did not discuss rate estimation); however, it is common to not even know how many non-detected objects there were in a given survey or data set. In this case we must marginalize – sum, since counting is a discrete operation – over the unknown number of non-detections, $N_\mathrm{nobs}$, yielding
$$p\left(\{\vec{d}_i, \vec{\theta}_i\} \mid \vec{\lambda}\right) \propto \mathrm{e}^{-N(\vec{\lambda})}\, \mathrm{e}^{N_\mathrm{ndet}(\vec{\lambda})} \prod_{i=1}^{N_\mathrm{obs}} p(\vec{d}_i|\vec{\theta}_i)\, \frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}(\vec{\theta}_i), \qquad (21)$$
or
$$p\left(\{\vec{d}_i, \vec{\theta}_i\} \mid \vec{\lambda}\right) \propto \mathrm{e}^{-N_\mathrm{det}(\vec{\lambda})} \prod_{i=1}^{N_\mathrm{obs}} p(\vec{d}_i|\vec{\theta}_i)\, \frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}(\vec{\theta}_i), \qquad (22)$$
where $N_\mathrm{det}$ – the complement of $N_\mathrm{ndet}$ – is the expected number of detections under the population model:
$$N_\mathrm{det}(\vec{\lambda}) \equiv N(\vec{\lambda}) - N_\mathrm{ndet}(\vec{\lambda}) = \int \mathrm{d}\vec{d}\, \mathrm{d}\vec{\theta}\; \mathbf{I}(\vec{d})\, p(\vec{d}|\vec{\theta})\, \frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}(\vec{\theta}). \qquad (23)$$
This equation is the posterior for a hierarchical analysis of the number density and properties of objects from a data set subject to selection effects (e.g. Gair, Tang & Volonteri 2010; Youdin 2011; Fishbach et al. 2018; Wysocki, Lange & O’Shaughnessy 2018).

This is the same result we derived in Section 3. Each multiplicative term in the numerator of equation (13) from Section 3 is the integral $\int \mathrm{d}\vec{\theta}\; p(\vec{d}_i|\vec{\theta})\, p_\textrm{pop}(\vec{\theta}|\vec{\lambda}')$, approximated as a Monte Carlo sum over the posterior samples. The denominator of equation (13) is $\left[\alpha(\vec{\lambda}')\right]^{N_\mathrm{obs}}$. Meanwhile, $\alpha = N_\mathrm{det}/N$ according to equation (12), which is identical to equation (23) from this section. With the substitution $p_\textrm{pop}(\vec{\theta}|\vec{\lambda}') = (\mathrm{d}N/\mathrm{d}\vec{\theta})/N$, the entire fraction in equation (13) is identical to the product over events in equation (22), with the event parameters $\vec{\theta}_i$ marginalized over, divided by $N_\mathrm{det}^{N_\mathrm{obs}}$, which cancels the last term of equation (13). Thus, we see that equations (13) and (22) are equivalent up to the choice of priors.

As in Section 3, if we re-parametrize $\frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}$ so that we can write
$$\frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}(\vec{\theta}) = N\, p\left(\vec{\theta} \mid \vec{\lambda}'\right), \qquad (24)$$
with $p\left(\vec{\theta} \mid \vec{\lambda}'\right)$ integrating to 1 over the population for any value of the new parameters $\vec{\lambda}'$, impose a prior $p(N) \propto 1/N$, and marginalize over $N$, we arrive at the treatment of selection functions for estimating population distributions from Loredo (2004) and Abbott et al. (2016). This correspondence only holds with a $1/N$ scale-invariant prior on the number of objects in the population (see Fishbach et al. 2018 and equation 14 above); other priors are, of course, possible, but will not marginalize to the population analysis above.

Note that the commonly employed technique of modifying $\frac{\mathrm{d}N}{\mathrm{d}\vec{\theta}}$ to account for the selection function is not correct, and will lead to biased results when the selection depends only on the observed data.

5 HOW IMPORTANT IS IT TO INCLUDE SELECTION EFFECTS?

It is natural to ask how many events you will need to observe before the incorrect treatment of selection effects starts to influence the results. Any incorrect analysis, i.e. writing down a posterior distribution that is not consistent with the true data-generating process, will lead to a bias in the result and might also change the inferred posterior uncertainty. Asymptotically, the bias remains constant while the uncertainty decreases like the square root of the number of events. Therefore after sufficient observations have been made the result will be inconsistent with the true parameter values. The number of events that can be observed before the bias becomes important depends both on what particular ‘wrong method’ is being used and on the specific problem under consideration. One plausible wrong method is that selection effects will be ignored completely, but more often selection effects are included in an incorrect way. For example, one might write down the likelihood for an individual detected event as
$$p(\vec{d}_i \mid \vec{\lambda}, \mathrm{det}) = \int \mathrm{d}\vec{\theta}\; p(\vec{d}_i, \mathrm{det} \mid \vec{\theta})\, p_\textrm{pop}(\vec{\theta}|\vec{\lambda}),$$
which acknowledges that we have only used detected events (indicated by the flag ‘det’). Then an incorrect assumption is made that the specific data generation process and the question of whether or not the event is detected are independent, so that the first term can be factorized as
$$p(\vec{d}_i, \mathrm{det} \mid \vec{\theta}) = p(\vec{d}_i|\vec{\theta})\, p_{\rm det}(\vec{\theta}).$$
This differs from the true result in two ways – the normalization term $1/\alpha(\vec{\lambda})$ is missing, and there is an extra factor of $p_{\rm det}(\vec{\theta})$ in the numerator.
A slightly more astute practitioner might realize that the selection bias modifies the probability distribution for the parameters of observed events so that this becomes
$$p_\textrm{pop}(\vec{\theta}|\vec{\lambda}, \mathrm{det}) = \frac{p_{\rm det}(\vec{\theta})\, p_\textrm{pop}(\vec{\theta}|\vec{\lambda})}{\alpha(\vec{\lambda})},$$
but then fail to also condition the likelihood $p(d|\vec{\theta})$ on detection and use
$$p(\vec{d}_i \mid \vec{\lambda}, \mathrm{det}) = \frac{1}{\alpha(\vec{\lambda})} \int \mathrm{d}\vec{\theta}\; p(\vec{d}_i|\vec{\theta})\, p_{\rm det}(\vec{\theta})\, p_\textrm{pop}(\vec{\theta}|\vec{\lambda}),$$
which includes the correct normalization factor but still has the additional $p_{\rm det}(\vec{\theta})$ in the numerator. In this latter case, the differences will only become apparent once a sufficient number of events with $p_{\rm det}(\vec{\theta})$ significantly different from 1 have been observed. The number of events required would scale like the inverse of the fraction of the observable parameter space where selection effects are important, although the exact number of events would also depend on how much information those events contained about $\vec{\lambda}'$, i.e. how much the properties of those events depend on the properties of the population.

In the former case, every event contributes to a mistake in inference as the factor $1/\alpha(\vec{\lambda}')$ is also missing. The number of events required before the error becomes apparent will then depend on how strongly this factor varies with the population parameters, which depends on the particular inference problem. For example, in the case of inferring the slope of the black hole mass function from binary black hole mergers observed by LIGO, this would be a strong effect: shallower mass functions give more high-mass events, which are visible to greater distances, so a higher proportion of the total population lies within the LIGO detector horizon (see for example Fishbach & Holz 2017). However, in the case of inferring the Hubble constant using binary neutron star observations with counterparts, the natural prior on the distance distribution is uniform in comoving volume and, since mass redshifting and non-Euclidean cosmological corrections are negligible within the current LIGO horizon, the selection effect is largely independent of the Hubble constant (Abbott et al. 2017b).

To be concrete, in the example that will be described in the next section, we repeated the analysis using the former of these wrong methods (as a worst-case scenario) and we show the results of that analysis as dashed lines in Fig. 3. That figure shows the probability–probability plot, i.e. the fraction of times the true parameters lie at a particular significance level over many experiments. For true and modelled distributions that are both Gaussians with common standard deviation $\sigma$ but means that differ by a bias $b$, the amount by which the p–p plot deviates from the diagonal depends on $b/\sigma$ (see discussion in Gair & Moore 2015). We see that, for that specific example, with 10 events the bias is already evident in the p–p plot, but at a level consistent with $b/\sigma < 1$; there is a bias, but it is smaller than the typical statistical error. For 100 events the effect is much more pronounced and consistent with $b/\sigma \sim$ a few, so for 100 events the result will be appreciably biased. These numbers are for a specific problem and the threshold for inclusion of selection effects to avoid bias will vary from problem to problem. It is therefore important to always include selection effects properly in the analysis, unless there is a good reason to believe that they can be ignored, which typically could only be assessed by doing the analysis including selection effects anyway.
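The scale of the bias from the first wrong method can be seen in a toy model along the lines of Section 3: a Gaussian population with a hard detection threshold and perfect measurements. The sketch below (all numerical choices are arbitrary illustrations, not the binary neutron star example of the next section) compares the maximum-likelihood population mean recovered with the correct likelihood of equation (3) against one that ignores selection entirely.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu_true, sigma, thresh, n_events = 1.0, 1.0, 0.0, 1000

# Draw a population and keep only 'detected' events (theta above threshold);
# measurements are taken as perfect for simplicity.
theta = rng.normal(mu_true, sigma, size=50 * n_events)
theta_det = theta[theta > thresh][:n_events]

mu_grid = np.linspace(0.0, 2.5, 501)
# Correct likelihood of equation (3): population density divided by alpha^{N_obs}
log_like_correct = np.array([
    stats.norm.logpdf(theta_det, mu, sigma).sum()
    - len(theta_det) * np.log(stats.norm.sf(thresh, mu, sigma))
    for mu in mu_grid])
# Wrong likelihood: selection ignored entirely
log_like_wrong = np.array([
    stats.norm.logpdf(theta_det, mu, sigma).sum() for mu in mu_grid])

print("correct MLE:", mu_grid[np.argmax(log_like_correct)])  # close to mu_true
print("wrong MLE:  ", mu_grid[np.argmax(log_like_wrong)])    # biased high
```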

6 AN ILLUSTRATION: MEASURING A LUMINOSITY FUNCTION WITH A FLUX-LIMITED SURVEY

Measuring a luminosity function from a flux-limited survey is a classic problem in astronomy that deals with selection effects (see e.g. Malmquist 1922). Here we apply the method discussed in the previous sections to a toy-model, but illustrative, version of this problem.

Suppose the luminosity function of our objects can be modelled by a Schechter function (Schechter 1976):
$$\frac{\mathrm{d}N}{\mathrm{d}L} = \frac{\Lambda}{L_*\, \Gamma(1+\alpha)} \left(\frac{L}{L_*}\right)^{\alpha} \exp\left(-\frac{L}{L_*}\right), \qquad (25)$$
with $\alpha > -1$ and $L_* > 0$ parameters controlling the shape of the distribution and $\Lambda$ the expected number of objects in the survey volume (i.e. the overall normalization).
Somewhat unrealistically, we suppose we can measure distances to objects perfectly, but that we typically measure fluxes (and therefore luminosities) with $\sigma_L \simeq 5$ per cent uncertainty and that the measurement process results in a lognormal likelihood function:
$$p(L_\mathrm{obs} \mid L) = \frac{1}{\sqrt{2\pi}\, \sigma_L\, L_\mathrm{obs}} \exp\left[-\frac{\left(\ln L_\mathrm{obs} - \ln L\right)^2}{2\sigma_L^2}\right]. \qquad (26)$$
We assume a Euclidean universe, so in appropriate units a flux limit for detection of $F_\mathrm{th}$ implies a probability of detection for an observed luminosity of
$$p_\mathrm{det}(L_\mathrm{obs}, z) = \begin{cases} 1 & L_\mathrm{obs} \geq 4\pi z^2 F_\mathrm{th}, \\ 0 & \text{otherwise}, \end{cases} \qquad (27)$$
where $z$ is the redshift (distance) to the object. For computational efficiency, we assume that our objects are uniformly distributed in $z$ for $0 \leq z \leq 2$ (this assumption reduces the number of unobservable objects compared to a more realistic volumetric distribution). We choose true values of the parameters in this model to be $\Lambda = 100$, $L_* = 1$, $\alpha = -1/2$, and $F_\mathrm{th} = 1/(4\pi)$; this latter choice means that the detection probability for an $L_*$ object at $z = 1$ is 50 per cent. For these choices, one draw of a random universe produces the distribution of observed and true luminosities shown in Fig. 1. In this particular draw, we observed 24 objects and missed 80 in our survey.
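For concreteness, a random universe of this kind could be drawn as in the short sketch below; it assumes the Schechter, lognormal, and flux-cut forms written above and an arbitrary random seed, so the exact counts will differ from the particular draw quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
Lambda, L_star, alpha, sigma_L = 100, 1.0, -0.5, 0.05
F_th = 1.0 / (4.0 * np.pi)

n_total = rng.poisson(Lambda)
# Schechter-distributed luminosities: (L/L*)^alpha exp(-L/L*) is a Gamma(alpha + 1) shape
L_true = L_star * rng.gamma(shape=alpha + 1.0, size=n_total)
z = rng.uniform(0.0, 2.0, size=n_total)                # uniform-in-z toy distribution
L_obs = L_true * rng.lognormal(mean=0.0, sigma=sigma_L, size=n_total)  # 5 per cent lognormal scatter
detected = L_obs >= 4.0 * np.pi * z**2 * F_th          # flux-limit cut applied to observed values

print(f"observed {detected.sum()} objects, missed {(~detected).sum()}")
```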
Figure 1. The distribution of observed (blue) and true (orange) luminosities for a draw from the model discussed in Section 6. Due to selection effects, the distribution of observed luminosities peaks at higher luminosity and falls more rapidly at low luminosity than the true distribution of sources.

Applying the ‘top-down’ methodology to this problem, the crucial integral in equation (23) is not analytically tractable, even though both the population distribution and the selection function are simple functions. We must evaluate this integral numerically. We choose to do this by sampling over the un-observed population and associated data (subject to the constraint that the fluxes associated to the unobserved population are always below $F_\mathrm{th}$) in an MCMC at the same time as we sample the properties of the population and observed objects. That is, we explicitly implement equation (18) as our posterior density, summing over the (unknown) number of non-detected systems. Sampling over the unobserved population with this posterior is a method for numerically evaluating the selection integral. Code and results implementing this model in the stan sampler (Carpenter et al. 2017) can be found at https://github.com/farr/SelectionExample. One result of the sampling is an estimate of the luminosity function parameters $L_*$ and $\alpha$; a joint posterior on these parameters appears in Fig. 2. The analysis also recovers with similar accuracy the expected number of objects in the survey volume ($\Lambda$), improved estimates of each object’s intrinsic luminosity (informed by the population model), and luminosity distributions of the set of objects too dim to be observed by the survey, as a by-product of the selection function modelling.
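An alternative to sampling the unobserved population is to estimate the selection integral of equation (23) directly by Monte Carlo at each point in population-parameter space; the sketch below does this for the toy survey (the function name, sample size, and default arguments are illustrative choices, not taken from the linked stan code).

```python
import numpy as np

def expected_n_det(Lambda, L_star, alpha, sigma_L=0.05, F_th=1.0 / (4.0 * np.pi),
                   n_mc=200_000, rng=None):
    """Monte Carlo estimate of equation (23) for the toy flux-limited survey:
    draw (L, z, L_obs) from the population and measurement model, and count
    the fraction of draws that pass the flux cut."""
    rng = np.random.default_rng(rng)
    L = L_star * rng.gamma(alpha + 1.0, size=n_mc)       # Schechter luminosities
    z = rng.uniform(0.0, 2.0, size=n_mc)                 # uniform-in-z distances
    L_obs = L * rng.lognormal(0.0, sigma_L, size=n_mc)   # lognormal measurement scatter
    detected_fraction = np.mean(L_obs >= 4.0 * np.pi * z**2 * F_th)
    return Lambda * detected_fraction
```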

Figure 2. Marginal posterior distribution for the $L_*$ and $\alpha$ parameters of the luminosity function (see equation 25) from the model and data described in Section 6. The black lines indicate the true values of the parameters.

7 AN ILLUSTRATION: MEASURING THE MASS RATIO OF BINARY NEUTRON STARS

Do all or the majority of merging binary neutron stars have mass ratios very close to unity? Is the answer to this question redshift- or metallicity-dependent? These questions are an important science driver for third-generation gravitational-wave detectors.⁵ Here, we examine how many neutron star binary mergers must be detected in order to measure the mass-ratio distribution, providing an illustration of the methodology described in the previous sections.

The binary neutron star mass ratio distribution is sensitive to the mass ejections associated with neutron star formation in a supernova and the velocity kicks that neutron stars receive at birth. For example, fig. 3 of Vigna-Gómez et al. (2018) illustrates the differences in the mass ratio distributions under different assumptions about mass fallback and natal kicks. Since models show a preference for equal mass ratios $q = m_2/m_1$, we assume a simple single-parameter form for the intrinsic mass ratio distribution:
$$p_\textrm{pop}(\eta \mid \lambda) \propto \exp\left(-\frac{1/4 - \eta}{\lambda}\right), \qquad 0 < \eta \leq \frac{1}{4}, \qquad (28)$$
where $\eta = q/(1 + q)^2$ is the symmetric mass ratio. We use the symmetric mass ratio because it tends to have more symmetric error bars than $q$; when component masses are equal, $q = 1$ and $\eta = 0.25$.
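A small helper, consistent with the single-parameter form in equation (28), for drawing symmetric mass ratios and converting them back to $q$; the inverse-CDF sampler and function names are implementation choices of this sketch.

```python
import numpy as np

def sample_eta(lam, n, rng=None):
    """Draw symmetric mass ratios from p(eta) ∝ exp(-(1/4 - eta)/lam) on (0, 1/4],
    via inverse-CDF sampling of the truncated exponential in x = 1/4 - eta."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(size=n)
    # CDF of x on [0, 1/4): F(x) = (1 - exp(-x/lam)) / (1 - exp(-0.25/lam))
    x = -lam * np.log(1.0 - u * (1.0 - np.exp(-0.25 / lam)))
    return 0.25 - x

def eta_to_q(eta):
    """Convert symmetric mass ratio eta = q/(1+q)^2 to q = m2/m1 <= 1."""
    return (1.0 - 2.0 * eta - np.sqrt(1.0 - 4.0 * eta)) / (2.0 * eta)
```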
The likelihood function on the data is, in general, quite complex (Veitch et al. 2015), and depends on a multitude of other parameters, such as spins, which must then be marginalized over to obtain $p(\vec{d}|\eta)$. We will approximate the problem by viewing the data as a point estimate of the symmetric mass ratio $\hat{\eta}$ (one can think of it as a maximum-likelihood estimate) with a Gaussian likelihood function given by
$$p(\hat{\eta} \mid \eta) = \frac{1}{\sqrt{2\pi}\, \sigma_\eta} \exp\left[-\frac{(\hat{\eta} - \eta)^2}{2\sigma_\eta^2}\right]. \qquad (29)$$
We use a simple Fisher-information-matrix analysis with a noise power spectral density shape representative of a potential third-generation detector⁶ to estimate the expected measurement uncertainty $\sigma_\eta$. We follow Poisson & Will (1995) in using frequency-domain post-Newtonian waveforms, which can be analytically differentiated and are adequate for binary neutron star analysis, allowing us to rapidly estimate the accuracy of inference. We do not impose priors, and we include a spin–orbit coupling term but ignore the spin–spin coupling term, as suggested by Poisson & Will (1995). We derive the following simple fit to the measurement uncertainty on $\eta$ for a signal from a canonical $1.4 + 1.4\,\mathrm{M}_\odot$ binary with non-spinning components as a function of the event SNR $\rho$:
(30)
This fit is accurate to better than 10 per cent for ρ > 18. The inverse of the Fisher information matrix is no longer a good estimate for the covariance matrix at lower values of ρ where the linear signal approximation breaks down, the log-likelihood ceases to be well approximated by a quadratic (Vallisneri 2008), and the prior constraints on variables strongly correlated with η, such as the spin parameters, become increasingly important. In any case, the mass ratio constraints become very poor at low ρ; for example, despite a ρ of 32, the mass ratio of the binary neutron star merger GW170817 could only be constrained to q ∈ [0.4, 1] at 90 per cent confidence (Abbott et al. 2017a).
The SNR at a given distance scales as $M_\mathrm{c}^{5/6}$, where $M_\mathrm{c} \propto \eta^{3/5}$ is the chirp mass, consistent with the inspiral amplitude scaling. We assume that the distance $D$ to the event is drawn from a $p(D) \propto D^2$ distribution consistent with a flat, isotropic universe, and is known perfectly. With this simplification, the SNR as used in equation (30) follows
$$\rho \propto \frac{M_\mathrm{c}^{5/6}}{D} \propto \frac{\eta^{1/2}}{D}. \qquad (31)$$
The observed SNR $\hat{\rho}$ follows the same scaling, but with the dependence on the data $\hat{\eta}$, not the true event mass ratio $\eta$. In line with comments on the validity of the Fisher information matrix we will use a detection threshold $\hat{\rho} \geq 18$ in this simplified treatment; the detectability conditioned on the observed data is thus independent of the source properties.
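Putting the pieces together, a mock detected catalogue for this toy model might be generated as sketched below. The reference SNR normalization and the $\sigma_\eta \propto 1/\rho$ stand-in for the fit of equation (30) are placeholder choices for illustration; they are not the values used in the paper.

```python
import numpy as np

def draw_detected_catalogue(lam, n_events, rho_ref=18.0, rho_th=18.0, rng=None):
    """Toy mock catalogue for Section 7: eta from the truncated-exponential
    population, distance with p(D) ∝ D^2 on [0, 1], SNR ∝ eta^(1/2)/D, and a
    cut on the *observed* SNR.  rho_ref (the SNR of an equal-mass source at the
    maximum distance) and the sigma_eta scaling below are placeholders."""
    rng = np.random.default_rng(rng)
    detected = []
    while len(detected) < n_events:
        u = rng.uniform()
        eta = 0.25 + lam * np.log(1.0 - u * (1.0 - np.exp(-0.25 / lam)))  # inverse CDF of eq. (28)
        D = rng.uniform() ** (1.0 / 3.0)                                  # p(D) ∝ D^2
        rho = rho_ref * np.sqrt(eta / 0.25) / D                           # scaling of equation (31)
        sigma_eta = 0.01 * (20.0 / rho)       # stand-in 1/rho scaling for the fit of equation (30)
        eta_hat = rng.normal(eta, sigma_eta)                              # Gaussian likelihood, equation (29)
        rho_hat = rho_ref * np.sqrt(max(eta_hat, 1e-6) / 0.25) / D        # observed SNR uses eta_hat
        if rho_hat >= rho_th:
            detected.append((eta_hat, sigma_eta, D))
    return detected
```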

We test the self-consistency of the inference on λ by creating 100 mock populations with random values of λ drawn from the flat prior λ ∈ [0, 0.1]. For each population, we compute the posterior distribution on λ following the methodology described above. We then ask for the quantile of the true value of λ within this posterior. Fig. 3 shows the cumulative distribution of this quantile value, the so-called p–p plot. If posteriors are self-consistent, we expect the truth to fall within the X per cent Bayesian credible interval X per cent of the time, i.e. the p–p plot should be diagonal (e.g. Cook, Gelman & Rubin 2006; Sidery et al. 2014; Veitch et al. 2015). We confirm that the p–p plot is consistent with the diagonal within statistical fluctuations.
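The p–p test itself only needs the posterior quantile of the true $\lambda$ from each mock catalogue; a minimal plotting helper (names and styling are arbitrary) is sketched below.

```python
import numpy as np
import matplotlib.pyplot as plt

def pp_plot(quantiles, ax=None):
    """Probability-probability plot: cumulative fraction of simulations in which
    the true parameter lies below a given posterior quantile.  `quantiles` holds,
    for each mock catalogue, the posterior quantile of the true lambda."""
    ax = ax or plt.gca()
    q = np.sort(np.asarray(quantiles))
    cdf = np.arange(1, len(q) + 1) / len(q)
    ax.plot(q, cdf, label=f"{len(q)} mock catalogues")
    ax.plot([0, 1], [0, 1], "k--", label="expected for self-consistent posteriors")
    ax.set_xlabel("posterior quantile of the true $\\lambda$")
    ax.set_ylabel("cumulative fraction")
    ax.legend()
    return ax
```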

Figure 3. The p–p plot of the cumulative distribution of the quantile of the true value of λ within its posterior as estimated from 10 (solid orange curve) and 100 (solid blue curve) mock data sets. These are consistent with the diagonal (dashed black line). For comparison we show the corresponding results, as dashed lines, from using one particular wrong method, as described in Section 5.

Figure 4. The width of the 90 per cent credible interval Δλ as a function of the number of detections; the true value of λ is 0.05 in all mock catalogues. The fluctuations relative to the $\Delta\lambda \propto N_\mathrm{det}^{-1/2}$ trend are due to the stochastic nature of the detected sample.

Having tested the method and its implementation, we now analyse the uncertainty in the inferred value of λ. This time, we fix the value of λ at λ = 0.05 when generating mock data catalogues, but vary the number of simulated events, with a subset of the events labelled as detectable. We compute the width of the 90 per cent credible interval on λ, defined here as stretching from the 5th to the 95th percentile of the posterior. In Fig. 4, we plot this width Δλ against the number of detectable events.

We find that ∼1000 detections at ρ ≥ 18 are necessary in order to measure λ to an accuracy δλ ≈ 0.01. Distributions with λ = 0.01 and λ = 0.02 yield median values of η (q) of 0.243 (0.71) and 0.236 (0.62), respectively, so at least a thousand detections are required in order to make meaningful inference on the mass ratio distribution with a view to distinguishing evolutionary models. An even greater number of detections would be required in each of several redshift bins in order to search for redshift-dependent changes in the mass ratio distribution – perhaps $\mathcal{O}(10000)$, given the plausible variation of the mass ratio distribution with redshift.

ACKNOWLEDGEMENTS

IM and WF thank Tom Loredo for useful discussions and the Statistical and Applied Mathematical Sciences Institute, partially supported by the National Science Foundation under Grant DMS-1127914, for hospitality. IM’s work was performed in part at Aspen Center for Physics, which is supported by National Science Foundation grant PHY-1607611; IM’s visit there was partially supported by a grant from the Simons Foundation. IM thanks Stephen Justham, Vicky Kalogera, and Fred Rasio for discussions related to the illustrative example. We thank Arya Farahi for alerting us to a typo in the manuscript, and the anonymous referee for a number of insightful comments.

Footnotes

1. In practice, there are very weak deviations from this power law due to the imperfect – noisy – measurement of signal amplitude.

2. An example where the selection may be parameter- rather than data-dependent is in surveys of objects that have been selected based on data in yet other surveys; Maggie Lieu pointed us to X-ray selected populations of galaxy clusters in a weak-lensing catalogue. This can still be treated within the framework proposed here, by considering the combined likelihood for both data sets and marginalizing over the ‘discarded’ data from the survey used for selection.

3. This is a very artificial model since all detectors have noise and the reason that $p_\textrm{det}(\vec{\theta})$ is not equal to one is because of that noise. However, it serves to illustrate the basic idea.

4. The rationale for writing this as a double integral, when the integral over $\vec{d}$ is in fact trivial – since the likelihood is normalized over $\vec{d}$ – will become apparent below.

5. This was identified as a key goal during an ongoing study commissioned by the Gravitational Wave International Committee (GWIC), https://gwic.ligo.org/3Gsubcomm/charge.shtml.

6. We assume that the noise spectral density is proportional to the LIGO A+ design, https://dcc.ligo.org/LIGO-T1800042/public.

REFERENCES

Abbott B. P. et al., 2016, Phys. Rev. X, 6, 041015

Abbott B. P. et al., 2017a, Phys. Rev. Lett., 119, 161101

Abbott B. P. et al., 2017b, Nature, 551, 85

Barrett J. W., Gaebel S. M., Neijssel C. J., Vigna-Gómez A., Stevenson S., Berry C. P. L., Farr W. M., Mandel I., 2018, MNRAS, 477, 4685

Bovy J., Hogg D. W., Roweis S. T., 2011, Ann. Appl. Stat., 5, 1657

Carpenter B. et al., 2017, J. Stat. Softw., 76, 1

Chennamangalam J., Lorimer D. R., Mandel I., Bagchi M., 2013, MNRAS, 431, 874

Cook S. R., Gelman A., Rubin D. B., 2006, J. Comput. Graph. Stat., 15, 675

Farr W. M., Mandel I., 2018, Science, 361, aat6506

Farr W. M., Sravan N., Cantrell A., Kreidberg L., Bailyn C. D., Mandel I., Kalogera V., 2011, ApJ, 741, 103

Farr W. M., Mandel I., Aldridge C., Stroud K., 2014, preprint (arXiv:1412.4849)

Farr W. M., Gair J. R., Mandel I., Cutler C., 2015, Phys. Rev. D, 91, 023005

Feeney S. M., Peiris H. V., Williamson A. R., Nissanke S. M., Mortlock D. J., Alsing J., Scolnic D., 2019, Phys. Rev. Lett., 122, 061105

Finn L. S., Chernoff D. F., 1993, Phys. Rev. D, 47, 2198

Fishbach M., Holz D. E., 2017, ApJ, 851, L25

Fishbach M., Holz D. E., Farr W. M., 2018, ApJ, 863, L41

Foreman-Mackey D., Hogg D. W., Morton T. D., 2014, ApJ, 795, 64

Gair J. R., Moore C. J., 2015, Phys. Rev. D, 91, 124062

Gair J. R., Tang C., Volonteri M., 2010, Phys. Rev. D, 81, 104014

Hogg D. W., Myers A. D., Bovy J., 2010, ApJ, 725, 2166

Loredo T. J., 2004, in Fischer R., Preuss R., Toussaint U. V., eds, AIP Conf. Proc. Vol. 735, Accounting for Source Uncertainties in Analyses of Astronomical Survey Data. Am. Inst. Phys., New York, p. 195

Loredo T. J., Wasserman I. M., 1995, ApJS, 96, 261

Malmquist K. G., 1922, Meddelanden fran Lunds Astronomiska Observatorium Serie I, 100, 1

Malmquist K. G., 1925, Meddelanden fran Lunds Astronomiska Observatorium Serie I, 106, 1

Mandel I., 2010, Phys. Rev. D, 81, 084029

Mandel I., Farr W. M., Gair J. R., 2016, available at: https://dcc.ligo.org/LIGO-P1600187/public

Messenger C., Veitch J., 2013, New J. Phys., 15, 053027

Poisson E., Will C. M., 1995, Phys. Rev. D, 52, 848

Schechter P., 1976, ApJ, 203, 297

Sidery T. et al., 2014, Phys. Rev. D, 89, 084060

Vallisneri M., 2008, Phys. Rev. D, 77, 042001

Veitch J. et al., 2015, Phys. Rev. D, 91, 042003

Vigna-Gómez A. et al., 2018, MNRAS, 481, 4009

Wysocki D., Lange J., O’Shaughnessy R., 2018, preprint (arXiv:1805.06442)

Youdin A. N., 2011, ApJ, 742, 38