ABSTRACT

Type Ia supernovae (SNe Ia) are standardizable candles whose observed light curves can be used to infer their distances, which can in turn be used in cosmological analyses. As the quantity of observed SNe Ia grows with current and upcoming surveys, increasingly scalable analyses are necessary to take full advantage of these new data sets for precise estimation of cosmological parameters. Bayesian inference methods enable fitting SN Ia light curves with robust uncertainty quantification, but traditional posterior sampling using Markov Chain Monte Carlo (MCMC) is computationally expensive. We present an implementation of variational inference (VI) to accelerate the fitting of SN Ia light curves using the BayeSN hierarchical Bayesian model for time-varying SN Ia spectral energy distributions. We demonstrate and evaluate its performance on both simulated light curves and data from the Foundation Supernova Survey with two different forms of surrogate posterior (a multivariate normal and a custom multivariate zero-lower-truncated normal distribution) and compare them with the Laplace Approximation and full MCMC analysis. To validate our variational approximation, we calculate the Pareto-smoothed importance sampling diagnostic and perform variational simulation-based calibration. The VI approximation achieves similar results to MCMC but with an order-of-magnitude speed-up for the inference of the photometric distance moduli. Overall, we show that VI is a promising method for scalable parameter inference that enables analysis of larger data sets for precision cosmology.

1 INTRODUCTION

1.1 Scientific context

Type Ia supernovae (SNe Ia) are well known as ‘standardizable candles’ – astronomical objects for which there are strong correlations between their absolute brightness and apparent features of their light curves (e.g. Rust 1975; Pskovskii 1977; Phillips 1993; Tripp 1998). They are widely used in precision cosmology, and played a key role in the discovery of dark energy (Riess et al. 1998; Perlmutter et al. 1999). In the current debate (for a review see e.g. Knox & Millea 2020; Freedman 2021) over the value of the Hubble–Lemaître constant |$H_0$|, the SN Ia distance ladder anchored by Cepheid variables (Riess et al. 2022) is in tension with the cosmic microwave background (Planck Collaboration VI 2020).

Over the past few decades, higher fidelity and greater volumes of data have been obtained for SNe Ia (e.g. Hamuy et al. 1996a; Riess et al. 1999; Astier et al. 2006; Frieman et al. 2008, 2015; Kessler et al. 2009b; Hicken et al. 2012; Foley et al. 2018; Scolnic et al. 2018; Brout et al. 2019; Abbott et al. 2024). This has allowed the sophistication of SN Ia models to grow from simple scaling relations and templates (e.g. Rust 1975; Pskovskii 1977; Phillips 1993; Hamuy et al. 1996b), to learned representations of light curves (e.g. Riess, Press & Kirshner 1995, 1996a; Jha, Riess & Kirshner 2007; Mandel et al. 2009; Burns et al. 2011; Mandel, Narayan & Kirshner 2011) and spectral energy distributions (SEDs; e.g. Guy et al. 2005, 2007; Conley et al. 2008; Saunders et al. 2018; Boone 2021; Kenworthy et al. 2021; Thorp et al. 2021; Mandel et al. 2022; Grayling et al. 2024).

For as long as SNe Ia have been used as standardizable candles for cosmology, the impact of dust in their host galaxies has been discussed extensively by astronomers (e.g. Branch & Tammann 1992; Sandage & Tammann 1993; Riess, Press & Kirshner 1996a, b; Phillips et al. 1999; Krisciunas et al. 2000, 2007; Mandel et al. 2011, 2017; Burns et al. 2014; Brout & Scolnic 2021; Popovic et al. 2021; Thorp & Mandel 2022; Karchev, Trotta & Weniger 2023; Grayling et al. 2024; Karchev et al. 2024; Thorp et al. 2024). Dust along the line of sight to a supernova or other astrophysical source acts to dim the light from the source. Preferential dimming of bluer wavelengths also causes reddening of the light (see e.g. Draine 2003; Salim & Narayanan 2020, for general reviews on astrophysical dust). Dust in the Milky Way Galaxy is extensively mapped (e.g. Schlegel, Finkbeiner & Davis 1998; Schlafly & Finkbeiner 2011; Schlafly et al. 2016), so it is possible to correct for its effects. However, the host galaxies of individual SNe Ia are not mapped in this way, so extinction due to host galaxy dust must be modelled probabilistically. This is one of the primary motivations for the BayeSN hierarchical SED model for SNe Ia (Thorp et al. 2021; Mandel et al. 2022; Grayling et al. 2024). Current implementations of the BayeSN model use the No U-Turn Sampler (NUTS; Hoffman & Gelman 2014) to perform full Bayesian inference given a set of SN Ia data. This has been applied to studies of host galaxy dust (Thorp et al. 2021; Thorp & Mandel 2022; Grayling et al. 2024), SN Ia siblings (Ward et al. 2023b), estimation of the Hubble–Lemaître constant (Dhawan et al. 2023), analyses of new near-infrared (NIR) SN Ia data sets (Jones et al. 2022; Peterson et al. 2023; Thorp et al. 2024), and lensed supernova time-delay estimation (Pierel et al. 2024). A simulation-based inference implementation of BayeSN has also been developed recently by Karchev et al. (2023, 2024). To ensure BayeSN remains scalable in anticipation of the large data sets expected from future surveys, we present here a variational implementation of the model for accelerated Bayesian inference of distances from SN Ia light curves. This is built on top of the Graphics Processing Unit (GPU)-accelerated BayeSN code developed by Grayling et al. (2024).

1.2 The need for speed

The data sets used in SN Ia cosmology have already been growing substantially over recent years (see e.g. Phillips et al. 2019; Dhawan et al. 2022; Aleo et al. 2023; Abbott et al. 2024). With new, powerful telescopes and space missions observing more astronomical objects than ever before, the field of astrophysics is generally becoming more data-driven. The Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) is expected to make over 32 trillion observations of over 20 billion galaxies and stars (LSST Dark Energy Science Collaboration 2018; Ivezić et al. 2019), and will be transformative for SN Ia and transient science. The Nancy Grace Roman Space Telescope will be similarly transformative for SN Ia cosmology at NIR wavelengths (Hounsell et al. 2018; Rose et al. 2021). Scalable analysis methods are critical to maximizing the scientific output from these projects. Moreover, the ‘industry standard’ pipelines and validation frameworks (e.g. Kessler et al. 2009a; Kessler & Scolnic 2017; Hinton & Brout 2020) that will be used in the core SN Ia cosmology analyses for next-generation surveys typically require methods that can be rapidly applied to tens to hundreds of thousands of simulated SNe for bias modelling.

Over the past two decades, parameter inference problems in astrophysics and cosmology have often been framed in a Bayesian way, and solved with Markov Chain Monte Carlo (MCMC) samplers and other related numerical methods. The development and uptake of user-friendly interfaces to general purpose sampling algorithms (e.g. Lewis & Bridle 2002; Feroz, Hobson & Bridges 2009; Foreman-Mackey et al. 2013; Handley, Hobson & Lasenby 2015; Speagle 2020) has greatly facilitated this. While posterior exploration with MCMC remains the gold-standard for Bayesian inference, it is often limited by computational time and resources.

Variational inference (VI; see Section 1.3 for an introductory overview) has been shown to improve computational efficiency by orders of magnitude compared to MCMC when applied to various astrophysical problems (e.g. Regier et al. 2019; Gunapati et al. 2022; Liu et al. 2023; Rizzato & Sellentin 2023). de Soto et al. (2024) use VI for parametric light curve fitting for real-time supernova classification, and Sánchez et al. (2021) use VI amortized with a neural network to fit SN Ia light curves. Villar (2022) also presents an amortized inference approach, using masked autoregressive flows (Papamakarios, Pavlakou & Murray 2017) to approximate the posterior distribution (see e.g. Rezende & Mohamed 2015). Alternative variational methods – particularly variational auto-encoders (VAEs) – have also been applied effectively to supernovae and other transients. Boone (2021) and Villar et al. (2021) present VAEs to learn latent representations in the context of supernova SED fitting and light-curve classification, respectively. With a VAE, the neural network weights of the encoder and decoder are optimized to learn a latent representation by minimizing the average reconstruction loss across all objects in the training set. In contrast, here we use a pre-trained physically motivated model (BayeSN) for SN Ia SEDs and light-curve data, and use VI to approximate the true posterior for each individual SN Ia separately.

1.3 Overview of variational inference

VI is a machine learning approach that approximates an intractable posterior, which cannot be directly sampled, with a simpler ‘surrogate’ distribution, also known as a ‘guide’ (Jordan et al. 1999). The parameters defining this surrogate distribution are then optimized to minimize the difference between the surrogate and true posteriors (Jordan et al. 1999). For data x, model parameters |$\boldsymbol {\phi }$|, and variational parameters |$\boldsymbol {\zeta }$|, the surrogate posterior |$q^{*}_{\boldsymbol {\zeta }}(\boldsymbol {\phi })$| is determined as follows:

(1) $q^{*}_{\boldsymbol{\zeta}}(\boldsymbol{\phi}) = \underset{q_{\boldsymbol{\zeta}}(\boldsymbol{\phi}) \in \mathcal{Q}}{\operatorname{arg\,min}} \; D_\text{KL}\!\left[ q_{\boldsymbol{\zeta}}(\boldsymbol{\phi}) \,\|\, p(\boldsymbol{\phi}\,|\,x) \right],$

where |$p(\boldsymbol {\phi }|x)$| is the true posterior, |$D_\text{KL}$| denotes the Kullback–Leibler divergence (Kullback & Leibler 1951), and |$\mathcal {Q}$| denotes a chosen family of distributions (Blei, Kucukelbir & McAuliffe 2017). As the true posterior |$p(\boldsymbol {\phi }|x)$| is intractable and thus cannot be computed directly, VI maximizes the evidence lower bound (ELBO; Jordan et al. 1999; Wainwright & Jordan 2008; Blei et al. 2017) to optimize the parameters |$\boldsymbol {\zeta }$| defining the surrogate posterior1:

(2) $q^{*}_{\boldsymbol{\zeta}}(\boldsymbol{\phi}) = \underset{\boldsymbol{\zeta}}{\operatorname{arg\,max}} \; \text{ELBO}(\boldsymbol{\zeta}),$
(3) $\text{ELBO}(\boldsymbol{\zeta}) = \mathbb{E}_{q_{\boldsymbol{\zeta}}(\boldsymbol{\phi})}\!\left[ \log p(\boldsymbol{\phi}, x) \right] - \mathbb{E}_{q_{\boldsymbol{\zeta}}(\boldsymbol{\phi})}\!\left[ \log q_{\boldsymbol{\zeta}}(\boldsymbol{\phi}) \right].$

The parameters |$\boldsymbol {\zeta }$| (hereafter referred to as the ‘variational parameters’) define the surrogate posterior |$q_{\boldsymbol {\zeta }}(\boldsymbol {\phi })$|⁠. In practice, these expectations over q are approximated by generating samples (‘particles’) of model parameters from the surrogate posterior for a given set of variational parameters and computing the Monte Carlo average over these samples. The variational parameters |$\boldsymbol {\zeta }$| are optimized with gradient descent using the stochastic variational inference algorithm (Hoffman et al. 2013; Kingma & Welling 2013; Wingate & Weber 2013; Ranganath, Gerrish & Blei 2014; Kucukelbir et al. 2017). Validation of variational inference algorithms (e.g. Yao et al. 2018; Huggins et al. 2020) can be conducted using simulation-based calibration (Cook, Gelman & Rubin 2006; Talts et al. 2018), Pareto-smoothed importance sampling (PSIS; Vehtari et al. 2024), or the Wasserstein distance (Kantorovich & Rubinstein 1958; Vaserstein 1969).
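
To make the particle-based optimization concrete, the following minimal sketch (not taken from the BayeSN codebase) estimates the ELBO of equation (3) by Monte Carlo for a diagonal Gaussian surrogate, using the reparameterization trick so that gradients with respect to the variational parameters flow through the samples; `log_joint` is a hypothetical stand-in for any differentiable |$\log p(\boldsymbol {\phi }, x)$|.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def elbo_estimate(key, zeta, log_joint, n_particles=5):
    """Monte Carlo ELBO for a diagonal Gaussian surrogate q_zeta.

    zeta = (mu, log_sigma) are the variational parameters; log_joint
    stands in for the model's log p(phi, x).
    """
    mu, log_sigma = zeta
    sigma = jnp.exp(log_sigma)
    # Reparameterization trick: phi = mu + sigma * eps with eps ~ N(0, I),
    # so gradients flow from the samples back to (mu, log_sigma).
    eps = jax.random.normal(key, (n_particles, mu.size))
    phi = mu + sigma * eps
    log_q = norm.logpdf(phi, loc=mu, scale=sigma).sum(axis=1)
    log_p = jax.vmap(log_joint)(phi)
    # ELBO = E_q[log p(phi, x)] - E_q[log q(phi)], averaged over particles.
    return jnp.mean(log_p - log_q)

# Gradient ascent on the ELBO with respect to the variational parameters:
elbo_grad = jax.grad(elbo_estimate, argnums=1)
```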

1.4 This work

This paper investigates whether VI can be an appropriate replacement for MCMC when using the BayeSN model (Mandel et al. 2022) to estimate the distances to SNe Ia from their light curves. We use stochastic variational inference (SVI) in NumPyro (Bingham et al. 2019; Phan, Pradhan & Jankowiak 2019) to fit for two different forms of surrogate posterior: a multivariate normal distribution, and a multivariate zero-lower-truncated normal (MVZLTN) distribution. We evaluate the performance of the two VI implementations, compared to MCMC and the Laplace approximation, on simulated and real data (from the Foundation survey; Foley et al. 2018; Jones et al. 2019). We explore the trade-off between computational efficiency and model accuracy and validate the performance of VI in each case (following Yao et al. 2018). In terms of parameter estimation, both of the tested VI posterior forms perform comparably to MCMC, but with an order of magnitude speed-up. We then discuss the potential implications of this new variational implementation of BayeSN and when VI should be used to approximate posterior distributions for inference of astrophysical parameters.

In Section 2, we review the details of the BayeSN model, and in Section 3 we describe our new VI implementation. In Section 4, we describe the real (Section 4.1) and simulated (Section 4.2) data. Our results are presented in Section 5, with additional discussion in Section 6, and concluding remarks in Section 7.

2 THE bayesn MODEL

2.1 Overview

The full details of the BayeSN model can be found in Mandel et al. (2022), with subsequent model developments discussed in Thorp et al. (2021), Ward et al. (2023b), and Grayling et al. (2024). In summary, BayeSN is a hierarchical Bayesian model for the SEDs of SNe Ia. BayeSN defines a forward model for SN Ia light curves including a fixed spectral template (Hsiao et al. 2007; Hsiao 2009), a functional principal component-based representation of intrinsic SED variation, a physically motivated parametric model for host galaxy dust extinction (Fitzpatrick 1999), and a model for residual spectro-temporal perturbations. Throughout this work, we use the version of the model trained by Thorp et al. (2021) – hereafter the T21 model – using data from the Foundation Supernova Survey (Foley et al. 2018; Jones et al. 2019). These data are described further in Section 4.1.

In the BayeSN model (following Mandel et al. 2022, equation 12), the host dust-extinguished rest-frame SED, |$S_\mathrm{s}(t,\lambda _\mathrm{r})$|, of a supernova s is defined by

(4) $S_\mathrm{s}(t,\lambda_\mathrm{r}) = S_0(t,\lambda_\mathrm{r}) \times 10^{-0.4\left[ M_0 + W_0(t,\lambda_\mathrm{r}) + \theta_1^s W_1(t,\lambda_\mathrm{r}) + \epsilon_\mathrm{s}(t,\lambda_\mathrm{r}) + \delta M_\mathrm{s} + A_V^s\, \xi(\lambda_\mathrm{r}; R_V) \right]},$

with reference to a spectral template |$S_0(t,\lambda _\mathrm{r})$| (Hsiao et al. 2007; Hsiao 2009), as a function of rest-frame phase t (relative to the time of maximum B-band brightness), and wavelength |$\lambda _\mathrm{r}$|. Here, the supernova-level latent parameters are: a light curve shape parameter/functional principal component score |$\theta _1^s$|; the V-band dust extinction |$A_V^s$|; a time- and wavelength-dependent residual perturbation function |$\epsilon _\mathrm{s}(t,\lambda _\mathrm{r})$|; and a ‘grey’ (t- and |$\lambda _\mathrm{r}$|-independent) magnitude offset |$\delta M_\mathrm{s}$|. BayeSN’s |$\theta _1^s$| parameter correlates with the slope of the SED surface and the width–luminosity relation in the optical (Mandel et al. 2022), and is closely related to the |$\Delta m_{15}$| parameter (i.e. the decline in magnitude in the 15 d following peak; see discussion in appendix E of Grayling et al. 2024). The population-level hyperparameters are: a warping function |$W_0(t,\lambda _\mathrm{r})$| applied to the mean SED; the first functional principal component |$W_1(t,\lambda _\mathrm{r})$|; and the Fitzpatrick (1999) dust law |$\xi (\lambda _\mathrm{r}; R_V)$|. The overall normalization factor is fixed to |$M_0\equiv -19.5$|. The dust law slope parameter can be assumed to be the same for all SNe (as in Mandel et al. 2022; Thorp et al. 2024), or allowed to vary on a supernova-by-supernova basis (as in T21; Thorp & Mandel 2022; Grayling et al. 2024; Thorp et al. 2024). The BayeSN model flux for a supernova s in a passband b is given by equation 6 in Mandel et al. (2022):

(5) $f_\mathrm{s}(t, b) = \frac{10^{-0.4\left(\mu_\mathrm{s} - \mathit{Z}\right)}}{1+z_\mathrm{s}} \int S_\mathrm{s}(t, \lambda_\mathrm{r})\, 10^{-0.4\, A_{V,s}^\text{MW}\, \xi^\text{MW}\!\left[(1+z_\mathrm{s})\lambda_\mathrm{r}\right]}\, \mathbb{B}_b\!\left[(1+z_\mathrm{s})\lambda_\mathrm{r}\right]\, \lambda_\mathrm{r}\, \mathrm{d}\lambda_\mathrm{r}.$

Here, |$\mu _\mathrm{s}$| is the SN distance modulus, |$z_\mathrm{s}$| is the redshift, |$A_{V,s}^\text{MW}$| is the Milky Way dust extinction along the SN’s line of sight (based on the Schlafly & Finkbeiner 2011 dust maps), |$\xi ^\text{MW}$| is the Milky Way dust extinction law (Fitzpatrick 1999), |$\mathit{ Z}\equiv 27.5$| is a fixed zero-point (equal to that of SNANA; Kessler et al. 2009a), and |$\mathbb {B}_b$| is the normalized transmission function of band b. The integral in equation (5) is taken over all rest-frame wavelengths observable by band b.

In practice, the functional parameters in the model, |$W_0(t,\lambda _\mathrm{r})$|, |$W_1(t, \lambda _\mathrm{r})$|, and |$\epsilon (t,\lambda _\mathrm{r})$|, are represented by natural cubic splines2 defined by vectors/matrices of knots |$\boldsymbol {W}_0$|, |$\boldsymbol {W}_1$|, and |$\boldsymbol {e}_\mathrm{s}$|. Additionally, the time of maximum may not be perfectly known a priori, meaning a correction factor |$\Delta t_\text{max}^s$| may be needed. This gives a set of supernova-level latent parameters |$\boldsymbol {\phi }_\mathrm{s} = (A_V^s, \mu _\mathrm{s}, \delta M_\mathrm{s}, \theta _1^s, \Delta t_\text{max}^s, \boldsymbol {e}_\mathrm{s})$|, and population-level hyperparameters |$\boldsymbol {H} = (\boldsymbol {W}_0, \boldsymbol {W}_1, R_V, \tau _A, \boldsymbol {\Sigma }_e, \sigma _0)$|. The latter three hyperparameters specify respectively the population distributions of |$A_V^s$|, |$\boldsymbol {e}_\mathrm{s}$|, and |$\delta M_\mathrm{s}$|:

(6) $P(A_V^s \,|\, \tau_A) = \tau_A^{-1} \exp\!\left(-A_V^s / \tau_A\right), \quad A_V^s \ge 0,$
(7) $P(\boldsymbol{e}_\mathrm{s} \,|\, \boldsymbol{\Sigma}_e) = \mathit{N}(\boldsymbol{e}_\mathrm{s} \,|\, \boldsymbol{0}, \boldsymbol{\Sigma}_e),$
(8) $P(\delta M_\mathrm{s} \,|\, \sigma_0) = \mathit{N}(\delta M_\mathrm{s} \,|\, 0, \sigma_0^2).$

The prior on |$\theta _1^s$| is a unit Gaussian, |$P(\theta _1^s) = \mathit{N}(\theta _1^s| 0,1)$|, and we adopt a broad and uninformative prior on distances |$P(\mu _\mathrm{s}) = \mathit{N}(\mu _\mathrm{s}|\mu _{\Lambda \text{CDM}}(z_\mathrm{s}), 5^2)$|, where |$\mu _{\Lambda \text{CDM}}(z_\mathrm{s})$| is the estimated distance under a flat Λ cold dark matter (|$\Lambda$|CDM) cosmology. In this work, we adopt a Gaussian prior on the time of maximum correction,3 |$P(\Delta t^s_\text{max})=\mathit{N}(\Delta t^s_\text{max}|0, 5^2)$|. We are interested in BayeSN as a distance estimator (cf. T21 §4.6; Mandel et al. 2022 §2.8) – i.e. given a supernova’s photometry, what is the posterior on |$\mu _\mathrm{s}$| (and the other latent parameters in |$\boldsymbol {\phi }_\mathrm{s}$|)? For this reason, we will fix the values of the hyperparameters to their posterior mean values, |$\hat{\boldsymbol {H}}$|, estimated by T21 from the full Foundation data set (Foley et al. 2018; Jones et al. 2019).4 For a supernova s, we will have N observed fluxes |$\hat{\boldsymbol {f}}_\mathrm{s} = (\hat{f}_{s,1},\dots ,\hat{f}_{s,N})^\top$|, with corresponding observation dates5 |$\hat{\boldsymbol {T}}_\mathrm{s} = (\hat{T}_{s,1},\dots ,\hat{T}_{s,N})^\top$|, measurement errors |$\hat{\boldsymbol {\sigma }}_\mathrm{s}=(\hat{\sigma }_{s,1},\dots ,\hat{\sigma }_{s,N})^\top$|, and rest-frame phases |$\hat{\boldsymbol {t}}_\mathrm{s} = (\hat{\boldsymbol {T}}_\mathrm{s}-\hat{T}_\text{ref}^s)/(1+z_\mathrm{s})$|. The phases are defined with reference to an estimated date of maximum light |$\hat{T}^s_\text{ref}$|. The posterior distribution of |$\boldsymbol {\phi }_\mathrm{s}$| is then given by

(9) $P(\boldsymbol{\phi}_\mathrm{s} \,|\, \hat{\boldsymbol{f}}_\mathrm{s}, \hat{\boldsymbol{t}}_\mathrm{s}, z_\mathrm{s}, \hat{\boldsymbol{H}}) \propto P(\hat{\boldsymbol{f}}_\mathrm{s} \,|\, \boldsymbol{\phi}_\mathrm{s}, \hat{\boldsymbol{t}}_\mathrm{s}, z_\mathrm{s}, \hat{\boldsymbol{H}})\, P(\boldsymbol{\phi}_\mathrm{s} \,|\, \hat{\boldsymbol{H}}),$

where

(10) $P(\hat{\boldsymbol{f}}_\mathrm{s} \,|\, \boldsymbol{\phi}_\mathrm{s}, \hat{\boldsymbol{t}}_\mathrm{s}, z_\mathrm{s}, \hat{\boldsymbol{H}}) = \prod_{n=1}^{N} \mathit{N}\!\left[ \hat{f}_{s,n} \,\middle|\, f\!\left(\hat{t}_{s,n} - \tfrac{\Delta t_\text{max}^s}{1+z_\mathrm{s}};\, b_n, \boldsymbol{\phi}_\mathrm{s}, \hat{\boldsymbol{H}}\right),\, \hat{\sigma}_{s,n}^2 \right],$

the passband for observation n is denoted by |$b_n$|⁠, |$\Delta t_\text{max}^s$| is a correction for the time of maximum,6 and the model flux |$f(\hat{t}_{s,n};b_n,\boldsymbol {\phi }_\mathrm{s}, \hat{\boldsymbol {H}})$| is evaluated using equations (4) and (5). This posterior distribution is the target of the variational approximation we develop in the subsequent sections of this paper.
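
For intuition, the likelihood in equation (10) maps naturally onto a probabilistic program. The following is a schematic NumPyro version of the single-SN model, where `bayesn_flux` and `mu_lcdm` are hypothetical stand-ins for the forward model of equations (4) and (5) and the |$\Lambda$|CDM distance estimate, the data arguments are placeholder names, and (anticipating Section 3) |$\mu$| and |$\delta M$| are combined into one sampling site:

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def single_sn_model(t_hat, bands, f_hat, sigma_hat, z, H_hat):
    """Schematic single-SN BayeSN posterior; not the production code."""
    A_V = numpyro.sample("A_V", dist.Exponential(1.0 / H_hat["tau_A"]))
    theta_1 = numpyro.sample("theta_1", dist.Normal(0.0, 1.0))
    # mu and delta M are summed into one parameter; the prior convolves
    # N(mu_LCDM(z), 5^2) with N(0, sigma_0^2).
    mu_plus_dM = numpyro.sample(
        "mu_plus_dM",
        dist.Normal(mu_lcdm(z), jnp.sqrt(5.0 ** 2 + H_hat["sigma_0"] ** 2)))
    dt_max = numpyro.sample("dt_max", dist.Normal(0.0, 5.0))
    e_s = numpyro.sample("e_s", dist.MultivariateNormal(
        jnp.zeros(24), covariance_matrix=H_hat["Sigma_e"]))
    # Model fluxes at the t_max-corrected rest-frame phases (equations 4-5).
    f_model = bayesn_flux(t_hat - dt_max / (1.0 + z), bands,
                          A_V, theta_1, mu_plus_dM, e_s, z, H_hat)
    # Gaussian likelihood of equation (10).
    numpyro.sample("f_obs", dist.Normal(f_model, sigma_hat), obs=f_hat)
```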

2.2 The challenge of dust extinction

The dust extinction parameter |$A_V$| can take on any value in the range [0,|$\infty$|⁠), and can be correlated with other parameters in the posterior distribution. The exponential prior in equation (6) is physically motivated, and favours low values.7 For SNe Ia with little-to-no dust extinction (⁠|$A_V$| close to zero), there can be considerable non-zero posterior density at |$A_V\approx 0$|⁠. However, negative values of |$A_V$| are unphysical, and should not be permitted. This presents a challenge for variational inference, as we need an approximate posterior with the following properties:

  (i) |$A_V$| constrained positive, all other parameters unbounded;

  (ii) potential correlation between all parameters (i.e. ‘full rank’); and

  (iii) non-negligible density allowed for |$A_V$| values arbitrarily close to zero.

A simple solution which satisfies points (i) and (ii) is to fit a multivariate normal to the joint posterior distribution of |$\log A_V$| and all other parameters. This is one of the approximations we will consider in subsequent sections of this paper. The disadvantage of this approach is that the approximate posterior marginal on |$A_V$| will be effectively lognormal, tending towards zero density as |$A_V\rightarrow 0$|. Underestimating the posterior density at |$A_V=0$| can potentially lead to overestimation of extinction in the low-dust limit. A multivariate normal across all parameters (including |$A_V$|) would satisfy points (ii) and (iii), but would allow for potentially problematic negative values of |$A_V$|. Points (i) and (iii) could be satisfied simultaneously by using a variational surrogate posterior that is a product of an independent truncated normal distribution on |$A_V$| and a multivariate normal for all other parameters. To satisfy all three desiderata, we propose using a variational guide where a zero-lower-truncated normal (ZLTN) distribution approximates the marginal posterior on |$A_V$|, and the conditional distribution of the remaining parameters given |$A_V$| is multivariate Gaussian (the exact form is defined in Section 3.3.2). For |$A_V$| far away from zero, the ZLTN distribution recovers the same behaviour as a non-truncated Gaussian distribution, but in the low-|$A_V$| limit its behaviour is more desirable. Using this distribution allows for a more flexible posterior approximation, while still providing the computational speed-up of VI over MCMC.

3 VARIATIONAL INFERENCE IMPLEMENTATION

In our VI implementation, we aim to fit 28 BayeSN model parameters: |$A_V, \mu +\delta M, \theta _1, \Delta t_\text{max}$|, and |$\boldsymbol {e}$|, where |$\boldsymbol {e}$| is a vector of 24 knot values describing the residual perturbation realization. For identifiability reasons, we treat the sum of |$\mu$| and |$\delta M$| as a single parameter, and separate them in post-processing. For each posterior form, we first provide a theoretical overview, followed by the implementation details. In general, the BayeSN model was implemented using the NumPyro probabilistic programming language. Within NumPyro, we use the SVI class to perform stochastic variational inference, and use the AutoGuide functionality to define the form of the surrogate posterior and optimize the variational parameters |$\boldsymbol {\zeta }$|.
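
A minimal sketch of this workflow is shown below, reusing the schematic `single_sn_model` from Section 2.1 and placeholder data arguments; the guide choice fixes the family |$\mathcal {Q}$| of equation (1), here a full-rank multivariate normal (Section 3.2):

```python
from jax import random
import numpyro
from numpyro.infer import SVI, Trace_ELBO
from numpyro.infer.autoguide import AutoMultivariateNormal

# The AutoGuide fixes the surrogate family; SVI optimizes its parameters.
guide = AutoMultivariateNormal(single_sn_model)
svi = SVI(single_sn_model, guide,
          numpyro.optim.Adam(step_size=0.005),
          loss=Trace_ELBO(num_particles=5))
result = svi.run(random.PRNGKey(0), 10_000,
                 t_hat, bands, f_hat, sigma_hat, z, H_hat)

# Draw samples from the optimized surrogate posterior q_zeta(phi).
samples = guide.sample_posterior(random.PRNGKey(1), result.params,
                                 sample_shape=(1000,))
```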

3.1 The Laplace Approximation

The Laplace Approximation models a posterior probability distribution as a Gaussian centred on the maximum a posteriori (MAP) estimate. Here, we approximate the posterior distribution over the parameter vector |$\boldsymbol {\phi } = \lbrace \log A_V, \mu + \delta M, \theta _1, \Delta t_\text{max}, \boldsymbol {e}\rbrace$| using the Laplace approximation, where a log transform is used to enforce a positivity constraint on |$A_V$|. For some multivariate posterior distribution |$p(\boldsymbol {\phi }| x)$|, with parameters |$\boldsymbol {\phi }$| and data x, we can perform a second-order Taylor expansion about the MAP estimate |$\hat{\boldsymbol {\phi }}$|:

(11) $\log p(\boldsymbol{\phi}\,|\,x) \approx \log p(\hat{\boldsymbol{\phi}}\,|\,x) + \left. \nabla \log p(\boldsymbol{\phi}\,|\,x) \right|_{\boldsymbol{\phi} = \hat{\boldsymbol{\phi}}}^{\top} (\boldsymbol{\phi} - \hat{\boldsymbol{\phi}}) - \frac{1}{2} (\boldsymbol{\phi} - \hat{\boldsymbol{\phi}})^{\top} \boldsymbol{\Xi}\, (\boldsymbol{\phi} - \hat{\boldsymbol{\phi}}).$

Here |$\nabla {\log p(\boldsymbol {\phi }| x)} \bigr \vert _{\boldsymbol {\phi } = \hat{\boldsymbol {\phi }}} = 0$| by the definition of MAP, and |$\boldsymbol {\Xi } = - \nabla \nabla {\log p(\boldsymbol {\phi }| x)} \bigr \vert _{\boldsymbol {\phi } = \hat{\boldsymbol {\phi }}}$| is the Hessian of the negative log-posterior at |$\hat{\boldsymbol {\phi }}$| (see e.g. Rasmussen & Williams 2006; Gelman et al. 2014). The posterior distribution can thus be approximated by a Gaussian distribution,

(12) $p(\boldsymbol{\phi}\,|\,x) \approx \mathit{N}(\boldsymbol{\phi} \,|\, \hat{\boldsymbol{\phi}}, \boldsymbol{\Xi}^{-1}).$

The Laplace Approximation was implemented using the AutoLaplaceApproximation class within NumPyro, with parameters initialized to the medians of their respective priors. The AutoLaplaceApproximation interface is the same as for all other AutoGuides in NumPyro, but the ELBO is maximized using a Delta-function guide, whose optimized location is the MAP estimate.
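
A minimal sketch of this usage, again assuming the schematic `single_sn_model` and placeholder data names from earlier:

```python
from jax import random
import numpyro
from numpyro.infer import SVI, Trace_ELBO, init_to_median
from numpyro.infer.autoguide import AutoLaplaceApproximation

# A Delta-function guide is optimized to the MAP; the Gaussian of
# equation (12) is then recovered from the Hessian at that point.
guide_map = AutoLaplaceApproximation(single_sn_model,
                                     init_loc_fn=init_to_median)
svi_map = SVI(single_sn_model, guide_map,
              numpyro.optim.Adam(step_size=0.01), loss=Trace_ELBO())
result_map = svi_map.run(random.PRNGKey(0), 15_000,
                         t_hat, bands, f_hat, sigma_hat, z, H_hat)
laplace_posterior = guide_map.get_posterior(result_map.params)  # an MVN
```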

3.2 The Multivariate Normal distribution

Here, we define the parameter vector |$\boldsymbol {\phi } = \lbrace \log A_V, \mu + \delta M, \theta _1, \Delta t_\text{max}, \boldsymbol {e}\rbrace$| (again using a log transform to constrain |$A_V > 0$|⁠) and approximate the joint posterior with a surrogate of the form:

(13) $q_{\boldsymbol{\zeta}}(\boldsymbol{\phi}) = \mathit{N}(\boldsymbol{\phi} \,|\, \boldsymbol{\mu}, \boldsymbol{\Sigma}),$

where the variational parameters are |$\boldsymbol {\zeta } = (\boldsymbol {\mu }, \boldsymbol {\Sigma })$|, with mean vector |$\boldsymbol {\mu }$| and covariance matrix |$\boldsymbol {\Sigma }$|. The multivariate normal guide with a full covariance matrix was implemented using the AutoMultivariateNormal class within NumPyro.

3.3 The Multivariate Zero-Lower-Truncated Normal distribution

Here, we define the generic multivariate zero-lower-truncated normal (MVZLTN) distribution as a multivariate distribution whose first variable is characterized by a ZLTN distribution (as defined below) and whose remaining variables follow a multivariate normal distribution conditional on the first. As described in Section 2.2, we use this distribution to impose a positivity constraint on |$A_V$| while maintaining a full covariance matrix between all model parameters |$\boldsymbol {\phi } = \lbrace A_V, \mu + \delta M, \theta _1, \Delta t_\text{max}, \boldsymbol {e}\rbrace$|. To allow for non-zero density at |$A_V=0$|, we do not impose a log transform on |$A_V$|.

3.3.1 1D ZLTN distribution

A generic one-dimensional truncated normal distribution is truncated below at |$\phi =a$| and above at |$\phi =b$|; the ZLTN distribution is the special case with |$a=0$| and |$b=\infty$|. The probability density function (PDF) has the form

(14) $\operatorname{ZLTN}(\phi \,|\, \mu, \sigma^2) = \begin{cases} \dfrac{1}{\sigma} \dfrac{\psi\!\left(\frac{\phi - \mu}{\sigma}\right)}{1 - \Psi\!\left(-\frac{\mu}{\sigma}\right)}, & \phi \ge 0, \\[6pt] 0, & \phi < 0, \end{cases}$

where |$\mu$| and |$\sigma$| are the mean and standard deviation of the untruncated distribution, and |$\psi (\phi)$| and |$\Psi (\phi)$| are respectively the PDF and cumulative distribution function (CDF) of a standard unit normal distribution.
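
A minimal JAX sketch of this density (ours, not from the BayeSN code) follows directly from equation (14), using the identity |$1 - \Psi (-\mu /\sigma) = \Psi (\mu /\sigma)$|:

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

def zltn_logpdf(phi, mu, sigma):
    """Log PDF of equation (14): a normal truncated to [0, inf)."""
    # Renormalize the Gaussian by the probability mass above zero.
    log_norm = jnp.log(norm.cdf(mu / sigma))
    return jnp.where(phi >= 0.0,
                     norm.logpdf(phi, loc=mu, scale=sigma) - log_norm,
                     -jnp.inf)
```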

3.3.2 Multivariate ZLTN distribution

We define an N-dimensional distribution over |$\boldsymbol {\phi } = (\phi _\mathrm{t},\phi _u^1,\dots ,\phi _u^{N-1})^\top$|⁠, characterized by a mean vector |$\boldsymbol {\mu }$| and a covariance matrix |$\boldsymbol {\Sigma }$|⁠. The first variable (⁠|$\phi _\mathrm{t}$|⁠) has positive support and is ZLTN distributed. We call this the truncated variable. The remaining |$N-1$| variables, |$\boldsymbol {\phi }_u=(\phi _u^1,\dots ,\phi _u^{N-1})^\top$|⁠, are untruncated and follow a multivariate normal distribution conditional on |$\phi _\mathrm{t}$|⁠. The mean and covariance parameters can be partitioned as:

(15) $\boldsymbol{\mu} = \begin{pmatrix} \mu_\mathrm{t} \\ \boldsymbol{\mu}_u \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{t}^2 & \boldsymbol{\Sigma}_{ut}^{\top} \\ \boldsymbol{\Sigma}_{ut} & \boldsymbol{\Sigma}_{uu} \end{pmatrix},$

where |$\mu _\mathrm{t}$| is a scalar, |$\boldsymbol {\mu }_u$| is a vector of length |$N-1$|⁠, |$\sigma _{t}$| is a scalar, |$\boldsymbol {\Sigma }_{ut}$| is a vector of length |$N-1$|⁠, and |$\boldsymbol {\Sigma }_{uu}$| is an |$(N-1) \times (N-1)$| matrix. Here |$\mu _\mathrm{t}$| and |$\sigma _{t}^2$| are parameters of a ZLTN distribution, and the rest are parameters of a multivariate normal distribution.

First, we define the marginal probability density of the truncated parameter |$\phi _\mathrm{t}$| as |$\operatorname{ZLTN}(\phi _\mathrm{t}|\, \mu _\mathrm{t}, \sigma _{\mathrm{ t}}^2)$|⁠. The conditional distribution of the remaining parameters is then:

(16) $P(\boldsymbol{\phi}_u \,|\, \phi_\mathrm{t}) = \mathit{N}(\boldsymbol{\phi}_u \,|\, \bar{\boldsymbol{\mu}}_u, \bar{\boldsymbol{\Sigma}}_{uu}),$

where the conditional mean is

(17) $\bar{\boldsymbol{\mu}}_u = \boldsymbol{\mu}_u + \boldsymbol{\Sigma}_{ut}\, \sigma_{t}^{-2} (\phi_\mathrm{t} - \mu_\mathrm{t}),$

and the conditional covariance is

(18) $\bar{\boldsymbol{\Sigma}}_{uu} = \boldsymbol{\Sigma}_{uu} - \boldsymbol{\Sigma}_{ut}\, \sigma_{t}^{-2}\, \boldsymbol{\Sigma}_{ut}^{\top}.$

The log probability density of a full N-dimensional parameter vector |$\boldsymbol {\phi }=(\phi _\mathrm{t},\boldsymbol {\phi }_u)^\top$| can be calculated as the sum of the log probability density of the truncated parameter, and that of all other variables conditioned on the truncated one:

(19) $\log \operatorname{MVZLTN}(\boldsymbol{\phi} \,|\, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \log \mathit{N}(\boldsymbol{\phi}_u \,|\, \bar{\boldsymbol{\mu}}_u, \bar{\boldsymbol{\Sigma}}_{uu}) + \log \operatorname{ZLTN}(\phi_\mathrm{t} \,|\, \mu_\mathrm{t}, \sigma_{t}^2).$

The first term on the right-hand side is the log probability density of the multivariate Gaussian distribution in equation (16), and the second term is the log probability density of a 1D ZLTN distribution (equation 14). Useful marginal probability densities over subvectors of |$\boldsymbol {\phi }$| are derived in Appendix A.
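
As a sketch (ours, under the same partitioning assumptions), equations (15)–(19) translate into a few lines of JAX, reusing `zltn_logpdf` from the Section 3.3.1 sketch:

```python
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal

def mvzltn_logpdf(phi, mu, Sigma):
    """Log density of equation (19) via the partition of equation (15)."""
    phi_t, phi_u = phi[0], phi[1:]
    mu_t, mu_u = mu[0], mu[1:]
    var_t, Sigma_ut, Sigma_uu = Sigma[0, 0], Sigma[1:, 0], Sigma[1:, 1:]
    # Conditional mean and covariance, equations (17) and (18).
    cond_mean = mu_u + Sigma_ut * (phi_t - mu_t) / var_t
    cond_cov = Sigma_uu - jnp.outer(Sigma_ut, Sigma_ut) / var_t
    return (multivariate_normal.logpdf(phi_u, cond_mean, cond_cov)
            + zltn_logpdf(phi_t, mu_t, jnp.sqrt(var_t)))
```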

3.3.3 Using multivariate ZLTN as a surrogate

In our application, we use the MVZLTN distribution constructed above, with a joint density computed as in equation (19), as the surrogate posterior distribution:

(20) $q_{\boldsymbol{\zeta}}(\boldsymbol{\phi}) = \operatorname{MVZLTN}(\boldsymbol{\phi} \,|\, \boldsymbol{\mu}, \boldsymbol{\Sigma}),$

where |$\phi _\mathrm{t} = A_V$| is the parameter whose distribution is truncated to be positive. The variational parameters are |$\boldsymbol {\zeta } = (\boldsymbol {\mu }, \boldsymbol {\Sigma })$| as defined in equation (15).

3.3.4 Custom autoguide implementation

We first extend the Distribution class in NumPyro to create a custom joint MVZLTN distribution, with the first variable following a ZLTN distribution and all other variables following a multivariate normal distribution conditional on the first. We override the sample() and log_prob() methods and implement new ones using the equations above. We then extend the AutoContinuous guide class and create a new type of AutoGuide class that has an MVZLTN posterior. Utilizing the AutoGuide functionality in NumPyro enables us to fit an MVZLTN posterior to any model.
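
A schematic version of such a Distribution subclass is sketched below; it is a simplified illustration (single draws only, no batched sample shapes) and not the production guide, and it reuses the `mvzltn_logpdf` sketch from Section 3.3.2:

```python
import jax.numpy as jnp
from jax import random
import numpyro.distributions as dist
from numpyro.distributions import constraints

class MVZLTN(dist.Distribution):
    """Schematic MVZLTN: ZLTN first coordinate, conditional Gaussian rest."""
    support = constraints.real_vector

    def __init__(self, loc, cov, validate_args=None):
        self.loc, self.cov = loc, cov
        super().__init__(event_shape=jnp.shape(loc),
                         validate_args=validate_args)

    def sample(self, key, sample_shape=()):
        key_t, key_u = random.split(key)
        # Draw the truncated coordinate first (equation 14) ...
        phi_t = dist.TruncatedNormal(self.loc[0], jnp.sqrt(self.cov[0, 0]),
                                     low=0.0).sample(key_t)
        # ... then the conditional Gaussian block (equations 16-18).
        cond_mean = (self.loc[1:]
                     + self.cov[1:, 0] * (phi_t - self.loc[0]) / self.cov[0, 0])
        cond_cov = (self.cov[1:, 1:]
                    - jnp.outer(self.cov[1:, 0], self.cov[1:, 0]) / self.cov[0, 0])
        phi_u = dist.MultivariateNormal(cond_mean, cond_cov).sample(key_u)
        return jnp.concatenate([phi_t[None], phi_u])

    def log_prob(self, value):
        return mvzltn_logpdf(value, self.loc, self.cov)
```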

4 DATA

To test our variational inference implementation of BayeSN, we fit both real and simulated SN Ia light curves.

4.1 Foundation Supernova Survey

As our test data set, we use the first data release of the Foundation Supernova Survey (Foley et al. 2018; Jones et al. 2019). These data include |$griz$| (|$\sim 3500$|–9500 Å) light curves of 180 spectroscopically normal SNe Ia observed with the Pan-STARRS telescope. The light curves have a fairly homogeneous time sampling, with observations following a nominal sequence of |$(-5, 0, 5, 10, 18, 26, 35)$| d relative to peak brightness (Foley et al. 2018). The filter set, cadence, and telescope used make these data a good representation of other current data sets to which a scalable Variational-BayeSN model will be directly applicable (e.g. the Young Supernova Experiment, Aleo et al. 2023; and the Pan-STARRS Medium Deep Survey, Rest et al. 2014; Scolnic et al. 2018; Villar et al. 2020). They are also a valuable precursor to the data expected from LSST. Additionally, previous BayeSN analyses (Thorp et al. 2021; Ward et al. 2023b) have made use of the Foundation data, and they form the training set of the T21 model used throughout this work. We limit ourselves here to the 157 Foundation SNe Ia used in T21.

4.2 Simulated data

As well as analysing real data from Foundation, we also perform tests using data simulated from the T21 BayeSN forward model. These data are simulated to closely mimic the properties of the real Foundation data, but also give us access to ground truth parameter values. To simulate our data, we sampled parameter values from the following priors8:

(21) $\theta_1 \sim \mathit{N}(0, 1),$
(22) $A_V \sim \operatorname{Exponential}(\tau_A),$
(23) $\boldsymbol{e} \sim \mathit{N}(\boldsymbol{0}, \boldsymbol{\Sigma}_e),$
(24) $\delta M \sim \mathit{N}(0, \sigma_0^2),$

where the distance modulus |$\mu$| was calculated from the sampled redshift value z (assuming a flat |$\Lambda$|CDM cosmology with |$H_0 = 73.24~\text{km}\, \text{s}^{-1}\, \text{Mpc}^{-1}$|, |$\Omega _0=0.28$|, |$\Omega _\Lambda =0.72$|; Riess et al. 2016). We then used the BayeSN forward model to simulate SEDs and ‘observed’ light curves using these sampled parameters, with simulated photometric errors of 0.05 mag, and a Gaussian error of 0.001 applied to the redshifts to mimic spectroscopic redshift error. In our fits, we fix redshift to this ‘observed’ value, rather than the true z (reflecting the way in which spectroscopic redshifts are treated in real analyses). In practice, this error only affects the redshift and time dilation of the SED model, and the effect is negligible at low z after integration through the bandpasses and the addition of photometric error.

The simulations followed an idealized version of the Foundation survey strategy: a cadence of four rest-frame days, with the earliest observation eight rest-frame days before peak and the latest 36 rest-frame days after peak in |$griz$| (with observations in the four bands being coincident in time). The nominal observing sequence in the Foundation survey (Foley et al. 2018) was |$[-5, 0, 5, 10, 18, 26, 35]$| d relative to peak (in the observer frame), meaning our simulations cover a comparable range of phases and have similar sampling close to the peak of the light curve. Our assumed photometric uncertainties of 0.05 mag are comparable to the mean photometric uncertainty of 0.045 mag for the full Foley et al. (2018) data set.
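
A heavily hedged sketch of this simulation set-up is given below. The hyperparameter values (`tau_A`, `sigma_0`, `Sigma_e`) and the redshift range are illustrative placeholders only (the actual simulations use the T21 posterior-mean hyperparameters), and `simulate_griz` is a hypothetical wrapper around the BayeSN forward model of equations (4) and (5):

```python
import numpy as np

rng = np.random.default_rng(42)
n_sne = 1000
phases = np.arange(-8, 37, 4)     # rest-frame days: -8, -4, ..., 36

# Placeholder hyperparameter values, for illustration only.
tau_A, sigma_0, Sigma_e = 0.2, 0.1, 0.01 * np.eye(24)

# Latent parameters drawn from the population priors (equations 21-24).
theta_1 = rng.normal(0.0, 1.0, n_sne)
A_V = rng.exponential(tau_A, n_sne)
delta_M = rng.normal(0.0, sigma_0, n_sne)
e = rng.multivariate_normal(np.zeros(24), Sigma_e, size=n_sne)

z_true = rng.uniform(0.01, 0.08, n_sne)          # illustrative range only
z_obs = z_true + rng.normal(0.0, 0.001, n_sne)   # spectroscopic-z scatter

for s in range(n_sne):
    # Hypothetical forward-model call returning griz magnitudes at `phases`.
    mags = simulate_griz(phases, theta_1[s], A_V[s], delta_M[s], e[s], z_true[s])
    mags += rng.normal(0.0, 0.05, mags.shape)    # 0.05 mag photometric error
```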

5 RESULTS AND VALIDATION

Both variational approximations (MVZLTN and multivariate normal) and the Laplace Approximation are iteratively optimized, where each iteration consists of a single loss calculation and optimizer update step. In all cases, we fit the Laplace Approximation first for 15 000 iterations, which in practice was found to be sufficient for finding the MAP estimate. For the MVZLTN and multivariate normal guides, we then initialize on the median values of |$\Delta t_\text{max}, \theta _1, A_V$|, and |$\mu$| from the Laplace approximation and train for a further 10 000 iterations with the new guide. We use the Adam optimizer (Kingma & Ba 2015), with a learning rate of 0.01 for the Laplace approximation and 0.005 for the MVZLTN and multivariate normal guides, to optimize the ELBO computed with five ‘particles’ – the number of samples drawn from the surrogate posterior to approximate the expectations in the ELBO (equation 3). After optimizing the variational parameters of the guide for a given supernova, we then draw 1000 parameter samples from the resulting approximate posterior distribution.
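
This two-stage schedule can be sketched as follows, reusing `guide_map`, `result_map`, and `single_sn_model` from the earlier sketches (placeholder names, not the production code):

```python
import jax.numpy as jnp
from jax import random
import numpyro
from numpyro.infer import SVI, Trace_ELBO, init_to_value
from numpyro.infer.autoguide import AutoMultivariateNormal

# Stage 1 (Laplace, 15 000 iterations at lr 0.01) is assumed already run.
laplace_draws = guide_map.sample_posterior(random.PRNGKey(0),
                                           result_map.params,
                                           sample_shape=(1000,))
init_vals = {name: jnp.median(draws, axis=0)
             for name, draws in laplace_draws.items()}

# Stage 2: re-fit with the richer guide, initialized at the Laplace medians.
guide = AutoMultivariateNormal(single_sn_model,
                               init_loc_fn=init_to_value(values=init_vals))
svi = SVI(single_sn_model, guide, numpyro.optim.Adam(step_size=0.005),
          loss=Trace_ELBO(num_particles=5))
result = svi.run(random.PRNGKey(1), 10_000,
                 t_hat, bands, f_hat, sigma_hat, z, H_hat)
```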

As a point of comparison, we also fit the same data with MCMC using the NUTS sampler in NumPyro (Hoffman & Gelman 2014). We use four chains with 250 warm-up samples and 250 additional samples in each, as is done by previous BayeSN analyses (T21; Thorp & Mandel 2022; Grayling et al. 2024), to ensure a sufficient effective sample size (computed by NumPyro based on the autocorrelation length of the chains; see Geyer 1992, 2011; Gelman et al. 2014; Hoffman & Gelman 2014). Post-warm-up, the NUTS sampler obtains effective samples at a rate of |$\approx 0.5$|–2 per iteration. Only the 250 post-warm-up samples in each chain are used, yielding 1000 total posterior samples (with effective sample sizes |$\approx 400$|–2000)9 for each SN.
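
For reference, the corresponding NumPyro configuration looks like the sketch below (again using the schematic `single_sn_model` and placeholder data names):

```python
from jax import random
from numpyro.infer import MCMC, NUTS

mcmc = MCMC(NUTS(single_sn_model), num_warmup=250, num_samples=250,
            num_chains=4)
mcmc.run(random.PRNGKey(0), t_hat, bands, f_hat, sigma_hat, z, H_hat)
mcmc.print_summary()                     # reports n_eff and R-hat per parameter
posterior_samples = mcmc.get_samples()   # 4 chains x 250 = 1000 draws
```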

5.1 Results on simulated data

Figs 1 and 2 show the results of fitting 1000 simulated supernovae with MVZLTN VI and multivariate normal VI compared to the MCMC fits and the true parameter values used to generate the simulated light curves. The top row compares both VI and MCMC fits against the truth, the middle row shows the residuals of both VI and MCMC compared to the truth, and the bottom row shows the residuals between VI and MCMC.

Figure 1.

A comparison of MVZLTN VI posterior fits (blue points) and MCMC posterior fits (red open circles) with the true parameter values for 1000 simulated SNe Ia for |$\mu$|, |$\theta _1$|, and |$A_V$|. The top row compares the median posterior value versus the true parameter value, the middle row shows the residuals (fit–true) for MCMC and MVZLTN VI, and the bottom row the residuals (VI–MCMC) as a function of the true parameter value. Black dotted lines in the bottom row denote the median residual (MVZLTN–MCMC) value. The red error bars in the bottom row indicate the typical posterior standard deviation derived from MCMC fits to SNe at a set of 10 representative values spanning each parameter range. Note that the true |$A_V$| is plotted on a |$\log _{10}$| scale to better show the behaviour at low |$A_V$|.

Figure 2.

A comparison of multivariate normal (MN) VI posterior fits (blue points) and MCMC posterior fits (red open circles) with the true parameter values for 1000 simulated SNe Ia for |$\mu$|, |$\theta _1$|, and |$A_V$|. The top row compares the median posterior value versus the true parameter value, the middle row shows the residuals (fit–true) for MCMC and multivariate normal VI, and the bottom row the residuals (VI–MCMC) as a function of the true parameter value. Black dotted lines in the bottom row denote the median residual (MN–MCMC) value. The red error bars in the bottom row indicate the typical posterior standard deviation derived from MCMC fits to SNe at a set of 10 representative values spanning each parameter range. Note that the true |$A_V$| is plotted on a |$\log _{10}$| scale to better show the behaviour at low |$A_V$|.

Both the VI and MCMC posterior distributions are able to recover the true parameter values well. The median VI–true residuals (seen in the second row of Figs 1 and 2), standardized by the corresponding VI standard deviations, are −0.04|$\sigma$|, 0.04|$\sigma$|, and 0.11|$\sigma$| for |$\mu$|, |$\theta _1$|, and |$A_V$|, respectively, for the MVZLTN posterior, and −0.07|$\sigma$|, −0.05|$\sigma$|, and 0.15|$\sigma$| for the multivariate normal posterior. For MCMC, the corresponding median standardized residuals are −0.04|$\sigma$|, 0.03|$\sigma$|, and 0.03|$\sigma$| for |$\mu$|, |$\theta _1$|, and |$A_V$|, respectively.

The residuals between the VI and MCMC parameter samples are small compared to the typical errors from the MCMC. The bottom row of Figs 1 and 2 shows the distribution of residual values (VI–MCMC) in the context of representative standard deviations of the MCMC marginal distributions for each of |$\mu$|⁠, |$\theta _1$|⁠, and |$A_V$|⁠. The median VI–MCMC residuals scaled by the MCMC standard deviations for |$\mu$|⁠, |$\theta _1$|⁠, and |$A_V$| are 0.002|$\sigma$|⁠, 0.02|$\sigma$|⁠, and 0.06|$\sigma$| for the MVZLTN posterior and −0.01|$\sigma$|⁠, −0.07|$\sigma$|⁠, and 0.07|$\sigma$| for the multivariate normal posterior.

At |$A_V \lesssim 0.01$|, the VI residuals are almost exclusively positive relative to the truth. This is to be expected, as we constrain the marginal distribution of |$A_V$| to be positive; for low values of true |$A_V$|, the fitted value can hardly fall below the truth without becoming negative. At larger values of true |$A_V$|, there is a more even distribution of positive and negative residual values.

For Variational-BayeSN, the best-case scenario is that the VI approximation matches the performance of the MCMC, which would indicate no trade-off between computational efficiency and the quality of the posterior approximation. However, this would also require the chosen form of the surrogate posterior to perfectly match that of the true posterior, which is unlikely in practice. Thus, some variance is to be expected both between the MCMC samples and the true parameters, and between the VI approximation and the MCMC samples.

As shown in the last row of Figs 1 and 2 and the numerical summaries presented above, both the MVZLTN and multivariate normal VI approximations slightly overestimate the marginal values of |$A_V$| compared to MCMC, while the multivariate normal VI also appears to underestimate |$\theta _1$|. Meanwhile, in both cases, there are negligible differences between the VI and MCMC estimates of the distance moduli |$\mu$|. The marginal distributions of the VI approximations will be further validated in Section 5.3.2.

5.2 Results on real data

We fit the Foundation data set (described in Section 4.1), which consists of 157 real SNe Ia. Fig. 3 shows the marginal posteriors of |$A_V$|, |$\theta _1$|, and |$\mu$| for SN2017cjv, a low-extinction SN Ia, from MCMC, MVZLTN VI, multivariate normal VI, and the Laplace approximation, while Fig. 4 shows the same for ASASSN-16ay, a highly extinguished SN Ia. For |$A_V$| far from zero, as in Fig. 4, the four different types of posterior approximation agree quite well and are virtually indistinguishable.

Figure 3.

(top) Zoomed-in plot of the estimated marginal posterior distributions over |$A_V$|⁠. For the VI and Laplace approximation, we plot the analytic form of the approximate posterior (see Appendix  A for the MVZLTN marginals). (bottom) Corner plot showing the posterior distributions from MCMC (red), MVZLTN VI (black), multivariate normal VI (blue), and Laplace Approximation (green), for SN2017cjv, an SN with low |$A_V$|⁠.

Figure 4.

Corner plot showing the posterior distributions from MCMC (red), MVZLTN VI (black), multivariate normal VI (blue), and Laplace Approximation (green), for ASASSN-16ay, a SN with high |$A_V$|⁠.

The top left subplot in Fig. 3 shows the marginal distribution of |$A_V$| for each of the four posterior approximations. As expected, the multivariate normal VI and the Laplace approximation show lognormal behaviour, with zero density as |$A_V$| approaches 0. The MCMC distribution shows non-negligible density as |$A_V$| approaches 0, which is most closely matched by the MVZLTN VI posterior. Of the variational approximations, the MVZLTN appears to model a low-extinction SN’s posterior the best.

Fig. 5 displays a Hubble diagram of all SNe in the Foundation data set fit with MVZLTN VI. The root-mean-squared error (RMSE) is 0.130 mag, which is consistent with that from the same analysis done by Thorp et al. (2021) using Hamiltonian Monte Carlo (HMC). We perform the same analysis with the multivariate normal VI, Laplace Approximation and MCMC, which yield RMSE values of 0.129, 0.127, and 0.130 mag, respectively.

5.3 Validation of VI performance

It is possible to optimize the parameters for a surrogate posterior of a given form using VI but still not approximate the true posterior well. We use the methods for validating the performance of a VI model presented by Yao et al. (2018) (hereafter Y18) to evaluate the quality of our variational approximations.

5.3.1 Pareto-smoothed importance sampling

The PSIS (Vehtari et al. 2024) diagnostic evaluates the goodness of fit of the surrogate posterior distribution compared to the true joint probability (Y18). For each of N posterior samples |$\boldsymbol {\phi }_{1},\dots ,\boldsymbol {\phi }_N$|⁠, the importance ratio |$r_n$| is calculated as:

(25) $r_n = \frac{p(\boldsymbol{\phi}_n, x)}{q_{\boldsymbol{\zeta}}(\boldsymbol{\phi}_n)},$

for data x, parameters |$\boldsymbol {\phi }_n$|, joint density |$p(\boldsymbol {\phi }_n, x)$|, and surrogate posterior |$q_\zeta (\boldsymbol {\phi }_n)$|. The M largest importance ratios – where |$M = \min (N/5, 3\sqrt{N})$| – are fitted to a generalized Pareto distribution, yielding a shape parameter |$\hat{k}$| (Vehtari et al. 2024). If |$\hat{k} < 0.5$|, this indicates that the variational approximation is a good approximation to the true joint density, while if |$0.5 < \hat{k} < 0.7$|, the approximation is still acceptable (Y18). A |$\hat{k} > 0.7$| suggests that the variational approximation has large uncertainty (Y18).
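
In practice, the diagnostic reduces to a few lines once the log densities are in hand. The sketch below assumes `log_joint_vals` and `log_q_vals` are precomputed arrays of |$\log p(\boldsymbol {\phi }_n, x)$| and |$\log q_\zeta (\boldsymbol {\phi }_n)$| over the surrogate draws, and uses ArviZ's Pareto-smoothing routine:

```python
import numpy as np
import arviz as az

# Log importance ratios of equation (25), for N = 5000 surrogate draws.
log_ratios = log_joint_vals - log_q_vals

# ArviZ Pareto-smooths the largest ratios and returns the shape k-hat.
smoothed_log_weights, k_hat = az.psislw(log_ratios)
print(f"k-hat = {float(k_hat):.2f}")  # <0.5 good; 0.5-0.7 acceptable; >0.7 poor
```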

To calculate the PSIS diagnostic for our MVZLTN VI fits, we fit the Foundation data set as described at the beginning of Section 5 and use |$N = 5000$| parameter samples from the surrogate posterior (more than the 2000 samples suggested by Vehtari et al. 2024) to calculate |$\hat{k}$|. Fig. 6 shows the distribution of |$\hat{k}$| values as a function of the difference in the median distance modulus |$\mu$| from the MVZLTN VI and MCMC fits. The values of the residuals do not seem to correlate with the |$\hat{k}$| diagnostic and, save for one outlier, appear approximately normally distributed around a |$\mu$| residual of zero. Vehtari et al. (2024) note that the performance of the PSIS diagnostic deteriorates as the number of parameters in the posterior increases, and Huggins et al. (2020) find that the diagnostic does not always correlate with the quality of the posterior approximation as intended. Overall, we do not find the PSIS |$\hat{k}$| diagnostic to be effective in assessing the quality of our variational approximations, at least in terms of producing a distance modulus point estimate; validating the performance of our VI implementation nevertheless remains important, so we consider an additional diagnostic below.

Figure 5.

(top) Hubble diagram of distance modulus |$\mu$| versus redshift z for 157 supernovae in the Foundation data set fit with MVZLTN VI. Black line indicates expected Hubble relation under flat |$\Lambda$|CDM with |$H_0 = 73.24~\text{km}\, \text{s}^{-1}\, \text{Mpc}^{-1}$| and |$\Omega _0 = 0.28$| (Riess et al. 2016). Error bars indicate |$1\sigma$| for each supernova. The RMSE for this data set is 0.13. (bottom) Hubble residuals (difference between fitted |$\mu$| values and expected Hubble relation shown in top panel) for 157 supernovae in Foundation data set. Black horizontal line at zero is for reference, while dotted black curves represent the peculiar velocity uncertainty envelope. Error bars represent |$1\sigma$|⁠.

Figure 6.

Median distance modulus |$\mu$| residual (MVZLTN VI |$-$| MCMC) versus PSIS |$\hat{k}$| diagnostic calculated from |$N = 5000$| parameter samples for the Foundation data set. Dotted line indicates |$\hat{k} = 0.7$|⁠, a threshold for a good variational approximation of the posterior (Vehtari et al. 2024).

5.3.2 Variational simulation-based calibration

The variational simulation-based calibration (VSBC) diagnostics determine whether a point estimate of a single parameter, derived from the marginal distribution of a variational approximation, is likely to be biased (Y18).

For our simulated results shown in Section 5.1, we generate 1000 true parameter values from the priors shown in Section 4, and simulate a set of SN Ia |$griz$| light curves from each set of these parameters, corresponding to a single simulated SN Ia. We then fit the posterior using each of the outlined methods, and generate 1000 samples from the posterior. To compute the VSBC diagnostics, we calculate |$p_{s\phi }$|, the probability of a posterior sample of model parameter |$\phi$| from the variational approximation for simulated SN Ia s being less than the true value of |$\phi$| (drawn from the priors listed in Section 4) that was used in creating the simulated data for SN Ia s. In practice, this is simply the fraction of our 1000 posterior samples that are below the true simulated parameter value:

(26) $p_{s\phi} = \frac{1}{1000} \sum_{i=1}^{1000} \mathbb{1}\!\left[ \phi_\mathrm{s}^{(i)} < \phi_\mathrm{s}^\text{true} \right],$

for simulated SN Ia |$s = {1,\dots ,1000}$| and model parameters |$\phi \in \lbrace \mu , \theta _1, A_V\rbrace$|⁠. We then investigate whether the distributions of these |$p_{s\phi }$| values are symmetric across all simulated SNe for each model parameter |$\phi$|⁠. Y18 note that if the distribution is not symmetric, then the point estimate derived from the VI approximation is unlikely to be close to the median value of the true posterior.

In each panel of Fig. 7, we show the histogram of |$p_{s\phi }$| for |$\phi = \mu$|, |$A_V$|, or |$\theta _1$| across all supernovae |$s=1,\dots ,1000$| for each type of posterior fit. To investigate the symmetry of each marginal distribution, we perform a two-sided Kolmogorov–Smirnov (KS) test (Kolmogorov 1933; Smirnov 1948; Massey 1951) on the distributions of |$p_{\phi }$| and |$1-p_{\phi }$|, as in Y18. The KS test yields a frequentist p-value, here denoted |$p_{\mathrm{ KS}}$|, which, if less than 0.05, suggests that |$p_{\phi }$| and |$1-p_{\phi }$| are unlikely to be drawn from the same distribution, and thus that the distribution of |$p_{\phi }$| is not symmetric. These |$p_{\mathrm{ KS}}$|-values are annotated on each subplot in Fig. 7. Note that this technically violates the requirement of the KS test that the two samples being compared are independent; however, we choose to follow the procedure for the VSBC diagnostic as described by Y18.
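
A minimal sketch of this calculation is given below, assuming `posterior_samples[s]` holds the 1000 draws of one parameter for simulated SN s and `true_values[s]` the value used to generate that SN's light curve:

```python
import numpy as np
from scipy.stats import ks_2samp

# Fraction of posterior draws below the truth, per SN: equation (26).
p = np.array([(posterior_samples[s] < true_values[s]).mean()
              for s in range(1000)])

# Symmetry check of Y18: two-sample KS test of p against 1 - p.
p_ks = ks_2samp(p, 1.0 - p).pvalue
print(f"p_KS = {p_ks:.3f}")  # p_KS < 0.05 flags an asymmetric, biased estimate
```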

Figure 7.

Results from VSBC diagnostic for MCMC, MVZLTN, multivariate normal, and Laplace Approximation posteriors, showing histograms of the probability of a single parameter sample from the posterior being less than the true parameter value, for 1000 simulated SNe. Dotted vertical line indicates mean probability, and solid black line indicates |$p=0.5$| for comparison.

A given marginal distribution from a variational approximation that closely resembles the true posterior would be symmetric about |$p_\phi = 0.5$|⁠, denoted by a black vertical line in Fig. 7. The results of the VSBC diagnostics show that there is a systematic offset in the point estimates from the Laplace approximation for all three parameters. While the two VI methods tend to overestimate |$A_V$|⁠, the marginal distributions of |$\theta _1$| and |$\mu$| remain centred around the true values. The VSBC diagnostics find that the MCMC marginal distributions are unbiased, suggesting both VI approximations slightly overestimate |$A_V$| when compared to the MCMC, but these offsets are small compared to the MCMC uncertainties, as seen in the bottom row of Figs 1 and 2 and the numerical summaries discussed in Section 5.1.

Whereas the PSIS diagnostic evaluates performance on the 28-dimensional joint posterior, the VSBC diagnostics solely operate on the marginal distribution of one parameter, ignoring other parameters even though they may covary. In practice, the most common application of SN Ia SED-fitting models such as BayeSN is to fit the distance–redshift relation for cosmological parameter estimation. For this, the posterior point estimate of the photometric distance modulus |$\mu$|, marginalized over all other light curve parameters, is needed, and Fig. 7 shows that these point estimates are unbiased for both VI approximations.

5.4 Computational performance

The runtime of our NumPyro code depends on how many chains and samples are run for MCMC, and on how many iterations are used in VI. We used the jax.vmap() method to vectorize the code to fit multiple SNe in parallel (Bradbury et al. 2018). The implementation of BayeSN in NumPyro by Grayling et al. (2024) enables the use of GPUs for further computational speed-up.
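
Schematically, the batching works as sketched below; `neg_elbo_single` is a hypothetical per-SN loss (its body elided, as in the SVI sketches above), and per-SN data and variational parameters are stacked along a leading axis:

```python
import jax
import jax.numpy as jnp

def neg_elbo_single(zeta, f_hat, sigma_hat, phases):
    """Negative ELBO for one SN; body elided for brevity."""
    ...

# vmap evaluates (and differentiates) every SN in a single batched call.
batched_loss = jax.vmap(neg_elbo_single)          # maps over axis 0
total_loss = lambda zetas, f, sig, t: jnp.sum(batched_loss(zetas, f, sig, t))
grad_fn = jax.jit(jax.grad(total_loss))           # one update step for all SNe
```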

Table 1 shows the total and per-SN runtimes for fitting the Foundation data set, which consists of 157 SNe Ia, and a simulated population of 1000 SNe, on both an Apple M2 Pro Central Processing Unit (CPU) and an NVIDIA Ampere A100 GPU. On CPU, MCMC was run in parallel across four cores (with one chain per core). In GPU mode, MCMC was run in parallel across four separate GPUs. In contrast, VI required only a single CPU core or GPU. The runtimes in Table 1 show that approximating the posterior with VI is about an order of magnitude faster, in addition to using fewer computational resources.

Table 1.

A table of runtimes (wall clock time) for fitting the Foundation data set (157 SNe) and a simulated population (1000 SNe). Per-SN runtimes are illustrative.*

Method              CPU^a                                        GPU^b
                    Foundation           Simulations             Foundation          Simulations
                    total      per SN    total       per SN      total     per SN    total      per SN
MCMC^c              3 h 48 m   1 m 27 s  16 h 24 m   1 m         4 m 42 s  2 s       11 m 20 s  0.7 s
Laplace Approx.^d   7 m 15 s   3 s       38 m 47 s   2 s         43 s      0.3 s     1 m 15 s   0.1 s
MultiNormal VI^d    26 m 37 s  10 s      2 h 24 m    9 s         42 s      0.3 s     3 m 2 s    0.2 s
MVZLTN VI^d         28 m 22 s  11 s      2 h 43 m    10 s        54 s      0.3 s     2 m 45 s   0.2 s

Note. *On GPU, the fractional speed-up is greater when fitting large batches of SNe. All SNe were fit in parallel, which leads to a considerable performance increase on GPU. However, this is not necessarily the case when running on CPU. For a like-for-like comparison between all methods, we ran parallel fits on both CPU and GPU for this work. For a small sample of objects being fit on CPU, faster per-SN times can be achieved by fitting objects in series, which the public release of BayeSN supports.

^a CPU runtimes are based on an Apple M2 Pro.
^b GPU runtimes are based on an NVIDIA Ampere A100.
^c MCMC CPU/GPU results used four parallel chains, with one core/GPU per chain.
^d VI/Laplace CPU results were run using a single core.


Note that while the per-SN time is listed for comparison between methods, light curves are fit in parallel and the listed time is not representative of the time it would take to fit a single supernova by itself. For the GPUs in particular, the per-SN runtime decreases when fitting larger SN populations. When fitting few SNe, it is likely faster to use the CPU. Each SN was fit as described at the beginning of Section 5.

6 DISCUSSION

We use both simulated and real data to evaluate the performance of our variational approximations. While the real-world use case of these methods is inferring parameters from observations with no known true values, fitting simulated data where the true parameters are known gives additional insight into the accuracy and limitations of both VI and MCMC. In practice, a well-behaved VI approximation should closely reproduce the MCMC posterior, and this is what we find on both simulated and real data.

There are several reasons to use VI over MCMC. The first is increased computational efficiency, in both time and computing resources. The variational approximation also provides an analytic expression for the posterior distribution, which enables several types of additional analyses, such as calculating importance sampling ratios or the Hessian matrix. An analytic expression for the posterior also enables robust uncertainty propagation, and can be used as a prior in downstream analyses. MCMC, nested sampling, and other posterior sampling methods are likely still the best option for models where the posterior is multimodal, cannot be assumed to take a particular form, or is otherwise difficult to express analytically. However, the compressed representation of the posterior in terms of the variational parameters |$\boldsymbol {\zeta }$| is also advantageous in terms of data volume, as saving many posterior samples is infeasible in the large-data regime (see e.g. discussion in Malz et al. 2018).
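
As a concrete illustration of one such analysis, the short sketch below computes Pareto-smoothed importance sampling ratios against an analytic multivariate normal surrogate. This is a minimal toy example, not the BayeSN pipeline: log_joint, q_mean, and q_cov are placeholder names, and the two-parameter model is invented for demonstration purposes.

import numpy as np
from scipy import stats
import arviz as az

rng = np.random.default_rng(42)

def log_joint(phi):
    # Placeholder for the model's unnormalized log-joint log p(phi, x); in
    # practice this would be the BayeSN log-density at the light-curve data.
    return stats.multivariate_normal.logpdf(phi, mean=[0.1, -0.2], cov=np.eye(2))

# Fitted variational parameters zeta: the mean and covariance of the surrogate q.
q_mean = np.zeros(2)
q_cov = np.eye(2)

phi = rng.multivariate_normal(q_mean, q_cov, size=2000)       # draws from q_zeta
log_q = stats.multivariate_normal.logpdf(phi, q_mean, q_cov)  # analytic log q_zeta(phi)
log_w = log_joint(phi) - log_q                                # log importance ratios

# Pareto-smooth the ratios; k-hat <= 0.7 indicates a reliable approximation.
smoothed_log_w, k_hat = az.psislw(log_w)
print(f"Pareto k-hat: {float(k_hat):.3f}")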

Here, we consider a particularly difficult use case with one parameter that is constrained to be non-negative. The performance of all of the different types of variational posteriors on the unconstrained parameters is quite good, and for most use cases a multivariate Gaussian guide or even the Laplace approximation will achieve similar results to MCMC in a fraction of the computational time (as seen in Fig. 4), while also providing an analytic posterior and robust uncertainties. This is especially promising in light of the expected quantity of data from the Vera C. Rubin Observatory/LSST in the coming years.

The ZLTN distribution is asymmetric, so when using it to approximate the marginal distribution of |$A_V$|, the mode and the median are not the same. In practice, we found that using the median of the posterior samples as the point estimate led to better results. In calculating point estimates of |$\mu$|, we marginalize over the other parameters. While it is most important to estimate the distance modulus |$\mu$| accurately for precision cosmology, this cannot be achieved without proper modelling of |$A_V$|: there is a posterior covariance between |$\mu$| and |$A_V$|, as both have a dimming effect. The degeneracy is not perfect, since |$A_V$| has a wavelength-dependent effect. Nevertheless, probabilistically modelling the impact of dust on the SN Ia SED is critical to accurate distance estimation.
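
To make this asymmetry concrete, the following minimal sketch evaluates the median and mode of a ZLTN using scipy's truncated normal; the values of mu_AV and sigma_AV are purely illustrative, not fitted BayeSN parameters.

import numpy as np
from scipy.stats import truncnorm

mu_AV, sigma_AV = 0.2, 0.3    # parameters of the underlying (untruncated) normal
a = (0.0 - mu_AV) / sigma_AV  # lower truncation at zero, in standardized units
zltn = truncnorm(a, np.inf, loc=mu_AV, scale=sigma_AV)

print(zltn.median())          # median of the ZLTN: the style of point estimate used here
print(max(mu_AV, 0.0))        # mode of the ZLTN, which differs when the density is skewed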

KL divergence is not symmetric (⁠|$\mathrm{ KL}[P|| Q] \ne \mathrm{ KL}[Q|| P]$|⁠), and this distinction can have important implications (Margossian, Pillaud-Vivien & Saul 2024). VI maximizes the ELBO, which is equivalent to minimizing the reverse KL divergence |$D_\text{KL}[q_\zeta (\phi)|| p(\phi | x)]$| from the surrogate posterior |$q_\zeta (\phi)$| to the true posterior |$p(\phi | x)$|. When approximating a multimodal distribution with a normal distribution, the reverse KL divergence is ‘mode-locking’, meaning that it will perfectly fit one mode of the distribution with zero density around the other modes (Minka 2005; Turner & Sahani 2011). By contrast, the forward KL divergence |$D_\text{KL}[p(\phi | x)|| q_\zeta (\phi)]$| is ‘mode-averaging’, meaning that it will approximate a multimodal distribution by spanning its density across all modes and any areas in between (Minka 2005; Turner & Sahani 2011). There are other algorithms, such as expectation propagation, that use the forward KL divergence to avoid getting ‘stuck’ in one mode of the posterior (Minka 2001). In this work, we are not approximating a multimodal posterior distribution, but this is an important consideration for generalized use of VI (Margossian et al. 2024).
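
As a reminder of why maximizing the ELBO minimizes the reverse KL divergence, the standard evidence decomposition (a textbook identity; see e.g. Blei et al. 2017), restated in the notation above, is

|$\log p(x) = \mathbb{E}_{q_\zeta(\phi)}\left[\log p(\phi, x) - \log q_\zeta(\phi)\right] + D_\text{KL}\left[q_\zeta(\phi)\,||\,p(\phi|x)\right],$|

where the expectation term is the ELBO. Since the evidence |$\log p(x)$| does not depend on the variational parameters |$\boldsymbol{\zeta}$|, maximizing the ELBO over |$\boldsymbol{\zeta}$| necessarily minimizes the reverse KL divergence.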

7 FUTURE WORK & CONCLUSIONS

In this work, we present a VI implementation of BayeSN using NumPyro, with multivariate normal and MVZLTN surrogate posteriors, and compare the results on both simulated and real data to MCMC and Laplace approximation posteriors. We evaluate the PSIS and VSBC diagnostics to validate the quality of our variational approximation, and show that VI can achieve similar results to MCMC with reduced runtime and computational resources.

All of the code for this work is publicly available at https://github.com/asmuzsoy/bayesn-VI-paper. The code for the custom AutoGuide used to fit the MVZLTN surrogate posterior can easily be modified to allow other types of custom posterior forms for a single variable while still maintaining a fully conditional multivariate distribution. The VI implementation has also been incorporated into the larger public BayeSN release by Grayling et al. (2024), available at https://github.com/bayesn/bayesn.
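
For orientation, the sketch below shows the standard NumPyro AutoGuide workflow that such custom guides build on: define a model, attach an automatic surrogate posterior, and optimize the ELBO with SVI. The toy model is invented for illustration and stands in for the BayeSN light-curve model; it is not the release code.

import jax.random as random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import SVI, Trace_ELBO
from numpyro.infer.autoguide import AutoMultivariateNormal

def model(obs=None):
    # Toy stand-in: a non-negative extinction-like parameter and a latent mean.
    A_V = numpyro.sample("A_V", dist.Exponential(1.0))
    mu = numpyro.sample("mu", dist.Normal(0.0, 10.0))
    numpyro.sample("y", dist.Normal(mu - A_V, 0.1), obs=obs)

guide = AutoMultivariateNormal(model)                  # multivariate normal surrogate
svi = SVI(model, guide, numpyro.optim.Adam(1e-2), Trace_ELBO())
result = svi.run(random.PRNGKey(0), 5000, obs=1.0)     # maximize the ELBO

# Draw from the fitted surrogate posterior q_zeta(phi).
samples = guide.sample_posterior(random.PRNGKey(1), result.params, sample_shape=(1000,))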

A promising future avenue would be to extend the application of VI to the BayeSN training process, where the model’s hyperparameters are inferred. The results presented here suggest that a substantial speed-up is possible when switching from MCMC to VI. An additional benefit would be that analytic approximate posteriors on the hyperparameters could be easily propagated into downstream analyses. A further extension that would be valuable in the LSST era is the inclusion of redshift as a free parameter, so that uncertainties associated with photometric host-galaxy redshifts can be marginalized over. This could easily be included in the existing BayeSN framework. Since photometric redshift PDFs (which would be used as a prior in a BayeSN light curve fit) are typically non-analytic and non-Gaussian, they would likely be provided in a summarized form such as quantiles (Malz et al. 2018). Frameworks have been developed for the inclusion of such priors in probabilistic programming languages (e.g. Perepolkin, Goodrich & Sahlin 2023).|$^{10}$|
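
As a sketch of how a quantile-summarized photometric redshift prior could be reconstructed in practice (the quantile levels and redshift values below are invented for illustration), a monotone interpolant of the stored quantiles can be differentiated to recover an approximate prior density over the covered probability range:

import numpy as np
from scipy.interpolate import PchipInterpolator

probs = np.array([0.05, 0.25, 0.50, 0.75, 0.95])        # stored quantile levels
z_quantiles = np.array([0.08, 0.10, 0.12, 0.15, 0.20])  # corresponding redshifts

# Monotone interpolation of the CDF F(z) from the quantiles; its derivative
# dF/dz is the (approximate) prior density, valid between the outer quantiles.
cdf_fn = PchipInterpolator(z_quantiles, probs)
z_grid = np.linspace(z_quantiles[0], z_quantiles[-1], 200)
pdf = cdf_fn.derivative()(z_grid)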

There are also alternative methods that could better model the dust extinction coefficient |$A_V$|. While we model its posterior with a ZLTN distribution, with all other variables normal conditional on |$A_V$|, the structure of the joint posterior could also be modelled with a copula for greater flexibility (Tran, Blei & Airoldi 2015; Han et al. 2016). Within astronomy, copulas have recently been used by Patil et al. (2023) to model the distribution of stellar chemical abundances in the Milky Way. Modelling the effect of dust extinction is critical to accurate estimation of distances to SNe Ia for precision cosmology.

There have been multiple examples of applying VAEs, which use neural networks to learn non-linear relationships, to extract physical parameters from transient light curves (Boone 2021; Villar et al. 2021). BayeSN explicitly models physical parameters such as the distance modulus |$\mu$| and dust extinction parameter |$A_V$|⁠, and includes additional parameters to model the remaining variance in the SED (see equation 4). A variation on BayeSN could combine these two approaches, explicitly modelling the SED dependence on physical parameters while allowing a neural network to learn a non-linear representation of the rest of the model. This would not be as interpretable as the current BayeSN model, but would leverage the non-linear representation power of neural networks while explicitly maintaining the known key physical parameters.

As the sizes of astronomical data sets increase, and in anticipation of the Rubin Observatory/LSST, variational approximations such as the one presented in this work could be key to accelerating data analysis pipelines that currently rely on slower sampling methods. In the era of precision cosmology, being able to use as much data as possible without draining expensive computational time and resources enables robust estimates and tighter constraints on cosmological parameters.

ACKNOWLEDGEMENTS

ASMU was supported by a Churchill Scholarship and a National Science Foundation Graduate Research Fellowship. ST was supported by the Cambridge Centre for Doctoral Training in Data-Intensive Science funded by the UK Science and Technology Facilities Council (STFC), and in part by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 101018897 CosmicExplorer). MG and KSM acknowledge funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (ERC Grant Agreement No. 101002652). This project has been made possible through the ASTROSTAT-II collaboration, enabled by the EU Horizon 2020, Marie Skłodowska-Curie Grant Agreement No. 873089. This manuscript is based on work that was originally developed in the MPhil thesis of ASMU at the University of Cambridge (Uzsoy 2022).

ASMU would like to thank Douglas Finkbeiner, Andrew Saydjari, Souhardya Sengupta, Tamara Broderick, Ryan Giordano, Gautham Narayan, Rick Kessler, Suhail Dhawan, Sam Ward, Ben Boyd, Erin Hayes, Ashley Villar, and the Villar Time-Domain Astrophysics group for helpful discussions and suggestions.

This work was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (www.csd3.cam.ac.uk), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/T022159/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk).

This work made use of the pyro (Bingham et al. 2019), numpyro (Phan et al. 2019), corner (Foreman-Mackey 2016), numpy (Harris et al. 2020), scipy (Virtanen et al. 2020), matplotlib (Hunter 2007), astropy (Astropy Collaboration 2013), extinction (Barbary 2021), arviz (Kumar et al. 2019) and jax (Bradbury et al. 2018) Python packages.

DATA AVAILABILITY

The data for the 180 supernovae in the Foundation DR1 cosmology sample (Foley et al. 2018; Jones et al. 2019) are publicly available at https://github.com/djones1040/Foundation_DR1.

The simulated data shown, as well as all code used in this work, are publicly available on GitHub at https://github.com/asmuzsoy/bayesn-VI-paper.

Footnotes

1. Note that in this work, |$\log$| refers to the natural logarithm (base e), while |$\log _{10}$| specifies base 10.

2. The Fitzpatrick (1999) dust law is also defined as a cubic spline.

3. This choice implies a 95 per cent prior probability that the original maximum date was estimated to within |$10\times (1+z_\mathrm{s})$| observer-frame days of the truth.

4. These hyperparameter estimates can be found in the BAYESN.T21 directory at https://github.com/bayesn/bayesn-model-files.

5. These are Modified Julian Dates (MJD) in observer-frame days.

6. Rather than fitting for the date of maximum |$T_\text{max}^s$| directly, we adopt a parametrization that is defined with reference to a prior estimate |$\hat{T}^s_\text{ref}$|. The correction factor is related to |$T_\text{max}^s$| by: |$\Delta t_\text{max}^s=(T^s_\text{max}-\hat{T}^s_\text{ref})/(1+z_\mathrm{s})$|.

7. The popular choice of an exponential prior (equation 6) on dust extinction was originated by Jha et al. (2007), motivated by the fact that viewing angles that produce high extinction must be rarer. Theoretical studies (Hatano, Branch & Deaton 1998; Commins 2004; Riello & Patat 2005) had also favoured extinction values following an exponential distribution for late-type SN Ia hosts (see also Holwerda et al. 2015). This assumption has been scrutinized by several recent works (Wojtak, Hjorth & Hjortlund 2023; Ward et al. 2023a).

8. We use the notation |$\text{Exponential}(\tau)$| to denote an exponential distribution with scale parameter |$\tau$|. Here, |$\tau _A$| is the population mean of |$A_V$|.

9. Effective sample sizes larger than the number of posterior samples (i.e. superefficiency) are possible for NUTS, which can generate chains with negative autocorrelation (see discussion in Vehtari et al. 2021; Stan Development Team 2024).

10. See also the earlier approach by B. Goodrich based on Chebyshev polynomials: https://github.com/bgoodri/StanCon2020. Preliminary investigations using the Stan implementation of BayeSN showed this to be viable in an MCMC analysis (S. M. Ward, private communication). This will be integrated into a future release of the NumPyro BayeSN code: https://github.com/bayesn/bayesn/pull/37.

REFERENCES

Abbott T. M. C., et al., 2024, ApJ, 973, L14
Aleo P. D., et al., 2023, ApJS, 266, 9
Astier P., et al., 2006, A&A, 447, 31
Astropy Collaboration, 2013, A&A, 558, A33
Barbary K., 2021, Astrophysics Source Code Library, record ascl:2102.026
Bingham E., et al., 2019, J. Mach. Learn. Res., 20, 1
Blei D. M., Kucukelbir A., McAuliffe J. D., 2017, J. Am. Stat. Assoc., 112, 859
Boone K., 2021, AJ, 162, 275
Bradbury J., et al., 2018, JAX: composable transformations of Python + NumPy programs, http://github.com/google/jax
Branch D., Tammann G. A., 1992, ARA&A, 30, 359
Brout D., Scolnic D., 2021, ApJ, 909, 26
Brout D., et al., 2019, ApJ, 874, 106
Burns C. R., et al., 2011, AJ, 141, 19
Burns C. R., et al., 2014, ApJ, 789, 32
Commins E. D., 2004, New Astron. Rev., 48, 567
Conley A., et al., 2008, ApJ, 681, 482
Cook S. R., Gelman A., Rubin D. B., 2006, J. Comput. Graph. Stat., 15, 675
de Soto K. M., et al., 2024, ApJ, 974, 169
Dhawan S., et al., 2022, MNRAS, 510, 2228
Dhawan S., Thorp S., Mandel K. S., Ward S. M., Narayan G., Jha S. W., Chant T., 2023, MNRAS, 524, 235
Draine B. T., 2003, ARA&A, 41, 241
Feroz F., Hobson M. P., Bridges M., 2009, MNRAS, 398, 1601
Fitzpatrick E. L., 1999, PASP, 111, 63
Foley R. J., et al., 2018, MNRAS, 475, 193
Foreman-Mackey D., 2016, J. Open Source Softw., 1, 24
Foreman-Mackey D., Hogg D. W., Lang D., Goodman J., 2013, PASP, 125, 306
Freedman W. L., 2021, ApJ, 919, 16
Friedman A. S., et al., 2015, ApJS, 220, 9
Frieman J. A., et al., 2008, AJ, 135, 338
Gelman A., Carlin J. B., Stern H. S., Dunson D. B., Vehtari A., Rubin D. B., 2014, Bayesian Data Analysis, 3rd edn. Chapman and Hall/CRC Texts in Statistical Science, CRC Press/Taylor and Francis, New York, NY
Geyer C. J., 1992, Stat. Sci., 7, 473
Geyer C. J., 2011, in Brooks S., Gelman A., Jones G. L., Meng X.-L., eds, Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC, New York, NY, p. 3
Grayling M., Thorp S., Mandel K. S., Dhawan S., Uzsoy A. S. M., Boyd B. M., Hayes E. E., Ward S. M., 2024, MNRAS, 531, 953
Gunapati G., Jain A., Srijith P. K., Desai S., 2022, Publ. Astron. Soc. Aust., 39, e001
Guy J., Astier P., Nobili S., Regnault N., Pain R., 2005, A&A, 443, 781
Guy J., et al., 2007, A&A, 466, 11
Hamuy M., et al., 1996a, AJ, 112, 2408
Hamuy M., Phillips M. M., Suntzeff N. B., Schommer R. A., Maza J., Smith R. C., Lira P., Aviles R., 1996b, AJ, 112, 2438
Han S., Liao X., Dunson D., Carin L., 2016, in Gretton A., Robert C. C., eds, Proc. Machine Learning Research, Vol. 51, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics. PMLR, Cadiz, Spain, p. 829
Handley W. J., Hobson M. P., Lasenby A. N., 2015, MNRAS, 453, 4384
Harris C. R., et al., 2020, Nature, 585, 357
Hatano K., Branch D., Deaton J., 1998, ApJ, 502, 177
Hicken M., et al., 2012, ApJS, 200, 12
Hinton S., Brout D., 2020, J. Open Source Softw., 5, 2122
Hoffman M. D., Gelman A., 2014, J. Mach. Learn. Res., 15, 1593
Hoffman M. D., Blei D. M., Wang C., Paisley J., 2013, J. Mach. Learn. Res., 14, 1303
Holwerda B. W., Reynolds A., Smith M., Kraan-Korteweg R. C., 2015, MNRAS, 446, 3768
Hounsell R., et al., 2018, ApJ, 867, 23
Hsiao E. Y., 2009, PhD thesis, Univ. Victoria
Hsiao E. Y., Conley A., Howell D. A., Sullivan M., Pritchet C. J., Carlberg R. G., Nugent P. E., Phillips M. M., 2007, ApJ, 663, 1187
Huggins J., Kasprzak M., Campbell T., Broderick T., 2020, in Chiappa S., Calandra R., eds, Proc. Machine Learning Research, Vol. 108, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. PMLR, p. 1792
Hunter J. D., 2007, Comput. Sci. Eng., 9, 90
Ivezić Ž., et al., 2019, ApJ, 873, 111
Jha S., Riess A. G., Kirshner R. P., 2007, ApJ, 659, 122
Jones D. O., et al., 2019, ApJ, 881, 19
Jones D. O., et al., 2022, ApJ, 933, 172
Jordan M. I., Ghahramani Z., Jaakkola T. S., Saul L. K., 1999, Mach. Learn., 37, 183
Kantorovich L., Rubinstein G. S., 1958, Vestnik Leningrad Univ., 13, 52
Karchev K., Trotta R., Weniger C., 2023, preprint (arXiv:2311.15650)
Karchev K., Grayling M., Boyd B. M., Trotta R., Mandel K. S., Weniger C., 2024, MNRAS, 530, 3881
Kenworthy W. D., et al., 2021, ApJ, 923, 265
Kessler R., Scolnic D., 2017, ApJ, 836, 56
Kessler R., et al., 2009a, PASP, 121, 1028
Kessler R., et al., 2009b, ApJS, 185, 32
Kingma D. P., Ba J., 2015, preprint (arXiv:1412.6980)
Kingma D. P., Welling M., 2013, preprint
Knox L., Millea M., 2020, Phys. Rev. D, 101, 043533
Kolmogorov A., 1933, G. Istituto Ital. Attuari, 4, 83
Krisciunas K., Hastings N. C., Loomis K., McMillan R., Rest A., Riess A. G., Stubbs C., 2000, ApJ, 539, 658
Krisciunas K., et al., 2007, AJ, 133, 58
Kucukelbir A., Tran D., Ranganath R., Gelman A., Blei D. M., 2017, J. Mach. Learn. Res., 18, 1
Kullback S., Leibler R. A., 1951, Ann. Math. Stat., 22, 79
Kumar R., Carroll C., Hartikainen A., Martin O., 2019, J. Open Source Softw., 4, 1143
LSST Dark Energy Science Collaboration, 2018, preprint
Lewis A., Bridle S., 2002, Phys. Rev. D, 66, 103511
Liu R., McAuliffe J. D., Regier J., LSST Dark Energy Science Collaboration, 2023, J. Mach. Learn. Res., 24, 1
Malz A. I., Marshall P. J., DeRose J., Graham M. L., Schmidt S. J., Wechsler R. (LSST Dark Energy Science Collaboration), 2018, AJ, 156, 35
Mandel K. S., 2011, PhD thesis, Astronomy Department, Harvard Univ.
Mandel K. S., Wood-Vasey W. M., Friedman A. S., Kirshner R. P., 2009, ApJ, 704, 629
Mandel K. S., Narayan G., Kirshner R. P., 2011, ApJ, 731, 120
Mandel K. S., Scolnic D. M., Shariff H., Foley R. J., Kirshner R. P., 2017, ApJ, 842, 93
Mandel K. S., Thorp S., Narayan G., Friedman A. S., Avelino A., 2022, MNRAS, 510, 3939
Margossian C. C., Pillaud-Vivien L., Saul L. K., 2024, preprint
Massey F. J. Jr., 1951, J. Am. Stat. Assoc., 46, 68
Minka T. P., 2001, in Breese J. S., Koller D., eds, UAI ’01: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, p. 362
Minka T., 2005, Technical Report MSR-TR-2005-173, Divergence measures and message passing
Papamakarios G., Pavlakou T., Murray I., 2017, in Guyon I., Luxburg U. V., Bengio S., Wallach H., Fergus R., Vishwanathan S., Garnett R., eds, Advances in Neural Information Processing Systems, Vol. 30, NIPS 2017. Curran Associates, Inc., Red Hook, NY
Patil A. A., Bovy J., Jaimungal S., Frankel N., Leung H. W., 2023, MNRAS, 526, 1997
Perepolkin D., Goodrich B., Sahlin U., 2023, Comput. Stat. Data Anal., 187, 107795
Perlmutter S., et al., 1999, ApJ, 517, 565
Peterson E. R., et al., 2023, MNRAS, 522, 2478
Phan D., Pradhan N., Jankowiak M., 2019, preprint
Phillips M. M., 1993, ApJ, 413, L105
Phillips M. M., Lira P., Suntzeff N. B., Schommer R. A., Hamuy M., Maza J., 1999, AJ, 118, 1766
Phillips M. M., et al., 2019, PASP, 131, 014001
Pierel J. D. R., et al., 2024, ApJ, 967, 50
Planck Collaboration VI, 2020, A&A, 641, A6
Popovic B., Brout D., Kessler R., Scolnic D., Lu L., 2021, ApJ, 913, 49
Pskovskii I. P., 1977, Sov. Astron., 21, 675
Ranganath R., Gerrish S., Blei D., 2014, in Kaski S., Corander J., eds, Proc. Machine Learning Res., Vol. 33, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics. PMLR, Reykjavik, Iceland, p. 814
Rasmussen C. E., Williams C. K. I., 2006, Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA
Regier J., Miller A. C., Schlegel D., Adams R. P., McAuliffe J. D., Prabhat, 2019, Ann. Applied Statistics, 13, 1884
Rest A., et al., 2014, ApJ, 795, 44
Rezende D., Mohamed S., 2015, in Bach F., Blei D., eds, Proc. Machine Learning Res., Vol. 37, Proceedings of the 32nd International Conference on Machine Learning. PMLR, Lille, France, p. 1530
Riello M., Patat F., 2005, MNRAS, 362, 671
Riess A. G., Press W. H., Kirshner R. P., 1995, ApJ, 438, L17
Riess A. G., Press W. H., Kirshner R. P., 1996a, ApJ, 473, 88
Riess A. G., Press W. H., Kirshner R. P., 1996b, ApJ, 473, 588
Riess A. G., et al., 1998, AJ, 116, 1009
Riess A. G., et al., 1999, AJ, 117, 707
Riess A. G., et al., 2016, ApJ, 826, 56
Riess A. G., et al., 2022, ApJ, 934, L7
Rizzato M., Sellentin E., 2023, MNRAS, 521, 1152
Rose B. M., et al., 2021, preprint
Rust B. W., 1975, Bull. Am. Astron. Soc., 7, 236
Salim S., Narayanan D., 2020, ARA&A, 58, 529
Sánchez A., Cabrera G., Huijse P., Förster F., 2021, in Machine Learning and the Physical Sciences Workshop, 35th Conference on Neural Information Processing Systems (NeurIPS)
Sandage A., Tammann G. A., 1993, ApJ, 415, 1
Saunders C., et al., 2018, ApJ, 869, 167
Schlafly E. F., Finkbeiner D. P., 2011, ApJ, 737, 103
Schlafly E. F., et al., 2016, ApJ, 821, 78
Schlegel D. J., Finkbeiner D. P., Davis M., 1998, ApJ, 500, 525
Scolnic D. M., et al., 2018, ApJ, 859, 101
Smirnov N., 1948, Ann. Math. Stat., 19, 279
Speagle J. S., 2020, MNRAS, 493, 3132
Stan Development Team, 2024, Stan Modelling Language Users Guide and Reference Manual v.2.34
Talts S., Betancourt M., Simpson D., Vehtari A., Gelman A., 2018, preprint
Thorp S., Mandel K. S., 2022, MNRAS, 517, 2360
Thorp S., Mandel K. S., Jones D. O., Ward S. M., Narayan G., 2021, MNRAS, 508, 4310 (T21)
Thorp S., Mandel K. S., Jones D. O., Kirshner R. P., Challis P. M., 2024, MNRAS, 530, 4016
Tran D., Blei D., Airoldi E. M., 2015, in Cortes C., Lawrence N., Lee D., Sugiyama M., Garnett R., eds, Advances in Neural Information Processing Systems, Vol. 28, NIPS 2015. Curran Associates, Inc., Red Hook, NY
Tripp R., 1998, A&A, 331, 815
Turner R. E., Sahani M., 2011, in Barber D., Cemgil A. T., Chiappa S., eds, Bayesian Time Series Models. Cambridge University Press, Cambridge, p. 104
Vaserstein L. N., 1969, Probl. Peredachi Inf., 5, 64
Vehtari A., Gelman A., Simpson D., Carpenter B., Bürkner P.-C., 2021, Bayesian Anal., 16, 667
Vehtari A., Simpson D., Gelman A., Yao Y., Gabry J., 2024, J. Mach. Learn. Res., 25, 1
Villar V. A., 2022, preprint
Villar V. A., et al., 2020, ApJ, 905, 94
Villar V. A., Cranmer M., Berger E., Contardo G., Ho S., Hosseinzadeh G., Lin J. Y.-Y., 2021, ApJS, 255, 24
Virtanen P., et al., 2020, Nat. Methods, 17, 261
Wainwright M. J., Jordan M. I., 2008, Found. Trends Mach. Learn., 1, 1
Ward S. M., Dhawan S., Mandel K. S., Grayling M., Thorp S., 2023a, MNRAS, 526, 5715
Ward S. M., et al., 2023b, ApJ, 956, 111
Wingate D., Weber T., 2013, preprint
Wojtak R., Hjorth J., Hjortlund J. O., 2023, MNRAS, 525, 5187
Yao Y., Vehtari A., Simpson D., Gelman A., 2018, in Dy J., Krause A., eds, Proc. Machine Learning Res., Vol. 80, Proceedings of the 35th International Conference on Machine Learning. PMLR, Cambridge, MA, p. 5581 (Y18)

APPENDIX A: MARGINAL DENSITIES OF THE MVZLTN

In this appendix, we present analytic results for the marginal probability densities of subsets of |$M< N$| variables of the N-dimensional MVZLTN distribution defined in Section 3.3.2. For the results in this section, we will consider an N-vector |$\boldsymbol {\phi }=(\phi _\mathrm{t}, \boldsymbol {\phi }_u)^\top$|⁠, with truncated variable |$\phi _\mathrm{t}$|⁠, and |$N-1$| untruncated variables |$\boldsymbol {\phi }_u$|⁠. The full |$\boldsymbol {\phi }$| vector will have an MVZLTN density,

|$p(\boldsymbol{\phi}) = \text{ZLTN}(\phi_\mathrm{t};\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2)\, N(\boldsymbol{\phi}_u;\, \boldsymbol{\mu}_{u|t}, \boldsymbol{\Sigma}_{uu|t}),$| (A1)

with the mean and covariance parameters partitioned as in equation (15), and with the conditional mean and covariance of |$\boldsymbol {\phi }_u|\phi _\mathrm{t}$| given by equations (17) and (18).

In the following subsections, we will derive all univariate and bivariate marginals. We have validated numerical implementations of these analytic expressions against Monte Carlo samples from the MVZLTN joint distribution. Marginal densities over |$2< M< N$| variables are straightforward generalizations of these results.
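
The following minimal sketch illustrates this kind of Monte Carlo validation for a two-dimensional MVZLTN (one truncated and one untruncated variable), sampling via the conditional definition of the distribution; all parameter values are illustrative.

import numpy as np
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(0)
mu_t, sigma_t = 0.3, 0.5                      # truncated-variable parameters
mu_u, Sigma_ut, Sigma_uu = 1.0, 0.2, 0.4**2   # untruncated mean, cross- and marginal covariance

# Sample phi_t ~ ZLTN, then phi_u | phi_t ~ Normal (cf. equations 17 and 18).
a = (0.0 - mu_t) / sigma_t
phi_t = truncnorm.rvs(a, np.inf, loc=mu_t, scale=sigma_t, size=100_000, random_state=rng)
phi_u = rng.normal(mu_u + (Sigma_ut / sigma_t**2) * (phi_t - mu_t),
                   np.sqrt(Sigma_uu - Sigma_ut**2 / sigma_t**2))

# Analytic ZLTN marginal of phi_t (equation A2): N(phi; mu_t, sigma_t^2) / Psi(mu_t / sigma_t).
grid = np.linspace(0.0, 2.5, 100)
analytic = norm.pdf(grid, mu_t, sigma_t) / norm.cdf(mu_t / sigma_t)
hist, _ = np.histogram(phi_t, bins=50, range=(0.0, 2.5), density=True)  # compare with `analytic`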

A1 Univariate marginal of the truncated variable

By definition, the marginal density of the truncated variable |$\phi _\mathrm{t}$| is just a univariate ZLTN distribution,

|$p(\phi_\mathrm{t}) = \text{ZLTN}(\phi_\mathrm{t};\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2),$| (A2)

as defined in equation (14).

A2 Bivariate marginal of the truncated variable and an untruncated variable

Marginalization over any number of untruncated variables can be achieved trivially, and yields a bivariate ZLTN. Marginalizing over all but the ith untruncated variable gives

|$p(\phi_\mathrm{t}, \phi_u^i) = \text{ZLTN}(\phi_\mathrm{t};\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2)\, N(\phi_u^i;\, \mu_{u|t}^i, \Sigma_{uu|t}^{i,i}),$| (A3)

where in the second term we have picked out the ith element of the conditional mean (equation 17) and the |$(i,i)$|th element of the conditional covariance matrix (equation 18).

A3 Univariate marginal of an untruncated variable

From equation (A3), we can derive the marginal of the ith untruncated variable, |$\phi _u^i$|⁠, by integrating over the truncated variable, |$\phi _\mathrm{t}$|⁠. We begin by expanding the first term in equation (A3) using the definition of the ZLTN (equation 14), expanding |$\mu _{u|t}^i$| using equation (17), and rewriting the second term as a Gaussian density over |$\phi _\mathrm{t}$|⁠:

|$p(\phi_u^i) = \int_0^\infty \frac{N(\phi_\mathrm{t};\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2)}{\Psi(\mu_\mathrm{t}/\sigma_\mathrm{t})}\, N\!\left(\phi_u^i;\, \mu_u^i + \Sigma_{ut}^i\sigma_\mathrm{t}^{-2}(\phi_\mathrm{t} - \mu_\mathrm{t}),\, \Sigma_{uu|t}^{i,i}\right) \mathrm{d}\phi_\mathrm{t}.$| (A4)

First, we handle the case of |$\Sigma _{ut}^i \ne 0$|⁠. The preceding expression can be written as:

|$p(\phi_u^i) = \frac{\sigma_\mathrm{t}^2}{|\Sigma_{ut}^i|\,\Psi(\mu_\mathrm{t}/\sigma_\mathrm{t})} \int_0^\infty N(\phi_\mathrm{t};\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2)\, N(\phi_\mathrm{t};\, y, s_\mathrm{t}^2)\, \mathrm{d}\phi_\mathrm{t},$| (A5)

where |$y = \mu _\mathrm{t} + \sigma _\mathrm{t}^2\times (\phi _u^i-\mu _u^i)/\Sigma _{ut}^i$| and |$s_\mathrm{t}^2 = \Sigma _{uu|t}^{i,i} \times \sigma _\mathrm{t}^4/(\Sigma _{ut}^i)^2$|⁠. We take the product of the two Gaussian densities, giving

|$p(\phi_u^i) = \frac{\sigma_\mathrm{t}^2\, N(y;\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2 + s_\mathrm{t}^2)}{|\Sigma_{ut}^i|\,\Psi(\mu_\mathrm{t}/\sigma_\mathrm{t})} \int_0^\infty N(\phi_\mathrm{t};\, \tilde{\mu}_\mathrm{t}, \tilde{\sigma}_\mathrm{t}^2)\, \mathrm{d}\phi_\mathrm{t},$| (A6)

where |$\tilde{\mu }_\mathrm{t} = \tilde{\sigma }_\mathrm{t}^2 (\sigma _\mathrm{t}^{-2} \mu _\mathrm{t} + s_\mathrm{t}^{-2} y)$| and |$\tilde{\sigma }_\mathrm{t}^2 = [\sigma _\mathrm{t}^{-2} + s_\mathrm{t}^{-2}]^{-1}$|⁠. Notice that the variable of interest |$\phi _u^i$| is contained in y and therefore |$\tilde{\mu }_\mathrm{t}$|⁠. The remaining integral will then have the solution |$[1-\Psi (-\tilde{\mu }_\mathrm{t}/\tilde{\sigma }_\mathrm{t})] = \Psi (\tilde{\mu }_\mathrm{t}/\tilde{\sigma }_\mathrm{t})$|⁠, giving

|$p(\phi_u^i) = \frac{\sigma_\mathrm{t}^2\, N(y;\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2 + s_\mathrm{t}^2)\, \Psi(\tilde{\mu}_\mathrm{t}/\tilde{\sigma}_\mathrm{t})}{|\Sigma_{ut}^i|\,\Psi(\mu_\mathrm{t}/\sigma_\mathrm{t})}.$| (A7)

In the case of |$\Sigma _{ut}^i = 0$|⁠, the marginalization simplifies considerably to:

|$p(\phi_u^i) = N(\phi_u^i;\, \mu_u^i, \Sigma_{uu|t}^{i,i}).$| (A8)

A4 Bivariate marginal of two untruncated variables

Finally, we will derive the bivariate joint density over the ith and jth untruncated variables, which we will include in a reduced untruncated vector |$\boldsymbol {\phi }_u^{\prime } = (\phi _u^i, \phi _u^j)^\top$|⁠. Similarly, we will let |$\boldsymbol {\mu }_u^{\prime } = (\mu _u^i, \mu _u^j)^\top$| and |$\boldsymbol {\Sigma }_{ut}^{\prime } = (\boldsymbol {\Sigma }_{ut}^i, \boldsymbol {\Sigma }_{ut}^j)^\top$|⁠, and will define a |$2 \times 2$| submatrix of the full conditional covariance,

|$\boldsymbol{\Sigma}_{uu|t}^{\prime} = \begin{pmatrix} \Sigma_{uu|t}^{i,i} & \Sigma_{uu|t}^{i,j} \\ \Sigma_{uu|t}^{j,i} & \Sigma_{uu|t}^{j,j} \end{pmatrix}.$| (A9)

Analogously to equation (A3), the trivariate marginal density over |$\phi _\mathrm{t}$| and |$\boldsymbol {\phi }_u^{\prime }$| will be given by

|$p(\phi_\mathrm{t}, \boldsymbol{\phi}_u^{\prime}) = \text{ZLTN}(\phi_\mathrm{t};\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2)\, N(\boldsymbol{\phi}_u^{\prime};\, \boldsymbol{\mu}_u^{\prime} + \boldsymbol{\Sigma}_{ut}^{\prime}\sigma_\mathrm{t}^{-2}(\phi_\mathrm{t} - \mu_\mathrm{t}),\, \boldsymbol{\Sigma}_{uu|t}^{\prime}).$| (A10)

First, we handle the case of |$\boldsymbol {\Sigma }_{ut}^{\prime } \ne \boldsymbol {0}$|⁠. The preceding expression can be written as:

|$p(\phi_\mathrm{t}, \boldsymbol{\phi}_u^{\prime}) = \frac{N(\phi_\mathrm{t};\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2)}{\Psi(\mu_\mathrm{t}/\sigma_\mathrm{t})}\, N(\boldsymbol{y};\, \phi_\mathrm{t}\boldsymbol{x},\, \boldsymbol{W}),$| (A11)

where we have explicitly expanded out the conditional mean (equation 17), and define for convenience: |$\boldsymbol {y} = \boldsymbol {\phi }_u^{\prime }-\boldsymbol {\mu }_u^{\prime } + \boldsymbol {\Sigma }_{ut}^{\prime } \mu _\mathrm{t}/\sigma _\mathrm{t}^2$|⁠, |$\boldsymbol {x} = \boldsymbol {\Sigma }_{ut}^{\prime } /\sigma _\mathrm{t}^2$|⁠, and |$\boldsymbol {W} = \boldsymbol {\Sigma }_{uu|t}^{\prime }$|⁠. We can recognize the second term in equation (A11) as being the likelihood for |$\phi _\mathrm{t}$| in a linear model, |$\boldsymbol {y}=\phi _\mathrm{t} \boldsymbol {x}$|⁠, with correlated Gaussian errors on |$\boldsymbol {y}$| with covariance |$\boldsymbol {W}$| (see e.g. Gelman et al. 2014, equation 14.11). We can separate out the |$\phi _\mathrm{t}$|-dependent part of this term to obtain

|$N(\boldsymbol{y};\, \phi_\mathrm{t}\boldsymbol{x},\, \boldsymbol{W}) = \frac{1}{2\pi|\boldsymbol{W}|^{1/2}} \exp\left[-\frac{1}{2}\left(\boldsymbol{y}^\top\boldsymbol{W}^{-1}\boldsymbol{y} - \frac{\hat{\phi}_\mathrm{t}^2}{s_\mathrm{t}^2}\right)\right] \exp\left[-\frac{(\phi_\mathrm{t} - \hat{\phi}_\mathrm{t})^2}{2 s_\mathrm{t}^2}\right],$| (A12)

where |$\hat{\phi }_\mathrm{t} = s_\mathrm{t}^2 \, \boldsymbol {x}^\top \boldsymbol {W}^{-1} \boldsymbol {y}$| and |$s_\mathrm{t}^2 = (\boldsymbol {x}^\top \boldsymbol {W}^{-1} \boldsymbol {x})^{-1}$| (see Mandel 2011, appendix B5), provided that these inverses are available. In the linear regression analogy, |$\hat{\phi }_\mathrm{t}$| and |$s_\mathrm{t}^2$| are the posterior mean and variance of |$\phi _\mathrm{t}$| under a flat improper prior (Gelman et al. 2014, §14.7). The |$\phi _\mathrm{t}$|-dependent part of equation (A11) can then be simplified into a product of two Gaussian densities, and integrated over |$\phi _\mathrm{t}$|⁠:

|$\int_0^\infty N(\phi_\mathrm{t};\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2) \exp\left[-\frac{(\phi_\mathrm{t} - \hat{\phi}_\mathrm{t})^2}{2 s_\mathrm{t}^2}\right] \mathrm{d}\phi_\mathrm{t} = \sqrt{2\pi s_\mathrm{t}^2} \int_0^\infty N(\phi_\mathrm{t};\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2)\, N(\phi_\mathrm{t};\, \hat{\phi}_\mathrm{t}, s_\mathrm{t}^2)\, \mathrm{d}\phi_\mathrm{t} = \sqrt{2\pi s_\mathrm{t}^2}\, N(\hat{\phi}_\mathrm{t};\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2 + s_\mathrm{t}^2) \int_0^\infty N(\phi_\mathrm{t};\, \tilde{\phi}_\mathrm{t}, \tilde{\tau}_\mathrm{t}^2)\, \mathrm{d}\phi_\mathrm{t} = \sqrt{2\pi s_\mathrm{t}^2}\, N(\hat{\phi}_\mathrm{t};\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2 + s_\mathrm{t}^2)\, \Psi(\tilde{\phi}_\mathrm{t}/\tilde{\tau}_\mathrm{t}).$| (A13)

Here, we have defined for convenience |$\tilde{\tau }_\mathrm{t}^{2} = [\sigma _\mathrm{t}^{-2} + s_\mathrm{t}^{-2}]^{-1}$| and |$\tilde{\phi }_\mathrm{t} = \tilde{\tau }_\mathrm{t}^2 (\sigma _\mathrm{t}^{-2} \mu _\mathrm{t} + s_\mathrm{t}^{-2} \hat{\phi }_\mathrm{t})$| when taking the product of the two Gaussian densities. The three steps taken in equation (A13) correspond closely to those taken in equations (A5)–(A7) when deriving the univariate marginal of |$\phi _u^i$|⁠. We can reintroduce the terms from equation (A12) that we omitted in equation (A13) to write the full marginal density of |$\boldsymbol {\phi }_u^{\prime }$| as

|$p(\boldsymbol{\phi}_u^{\prime}) = \frac{\sqrt{2\pi s_\mathrm{t}^2}}{2\pi|\boldsymbol{W}|^{1/2}\,\Psi(\mu_\mathrm{t}/\sigma_\mathrm{t})} \exp\left[-\frac{1}{2}\left(\boldsymbol{y}^\top\boldsymbol{W}^{-1}\boldsymbol{y} - \frac{\hat{\phi}_\mathrm{t}^2}{s_\mathrm{t}^2}\right)\right] N(\hat{\phi}_\mathrm{t};\, \mu_\mathrm{t}, \sigma_\mathrm{t}^2 + s_\mathrm{t}^2)\, \Psi(\tilde{\phi}_\mathrm{t}/\tilde{\tau}_\mathrm{t}),$| (A14)

where it is important to remember that |$\boldsymbol {y}$|⁠, |$\hat{\phi }_\mathrm{t}$|⁠, and |$\tilde{\phi }_\mathrm{t}$| are all functions of the two untruncated variables of interest |$\boldsymbol {\phi }_u^{\prime } = (\phi _u^i, \phi _u^j)^\top$|⁠.

In the special case of |$\boldsymbol {\Sigma }_{ut}^{\prime } = \boldsymbol {0}$|⁠, the marginalization over |$\phi _\mathrm{t}$| simplifies considerably:

|$p(\boldsymbol{\phi}_u^{\prime}) = N(\boldsymbol{\phi}_u^{\prime};\, \boldsymbol{\mu}_u^{\prime}, \boldsymbol{\Sigma}_{uu}^{\prime}),$| (A15)

where |$\boldsymbol {\Sigma }_{uu}^{\prime }$| is the |$2\times 2$| submatrix of |$\boldsymbol {\Sigma }_{uu}$| formed from its ith and jth rows and columns.
