Arrykrishna Mootoovaloo, Jaime Ruiz-Zapatero, Carlos García-García, David Alonso, Assessment of gradient-based samplers in standard cosmological likelihoods, Monthly Notices of the Royal Astronomical Society, Volume 534, Issue 3, November 2024, Pages 1668–1681, https://doi.org/10.1093/mnras/stae2138
ABSTRACT
We assess the usefulness of gradient-based samplers, such as the no-U-turn sampler (NUTS), by comparison with traditional Metropolis–Hastings (MH) algorithms, in tomographic |$3\times 2$| point analyses. Specifically, we use the Dark Energy Survey (DES) Year 1 data and a simulated data set for the Large Synoptic Survey Telescope (LSST) survey as representative examples of these studies, containing a significant number of nuisance parameters (20 and 32, respectively) that affect the performance of rejection-based samplers. To do so, we implement a differentiable forward model using jax-cosmo, and we use it to derive parameter constraints from both data sets using the NUTS algorithm implemented in numpyro, and the Metropolis–Hastings algorithm as implemented in cobaya. When quantified in terms of the effective number of samples taken per likelihood evaluation, we find a relative efficiency gain of |$\mathcal {O}(10)$| in favour of NUTS. However, this efficiency is reduced to a factor |$\sim 2$| when quantified in terms of computational time, since we find the cost of the gradient computation (needed by NUTS) relative to the likelihood to be |$\sim 4.5$| times larger for both experiments. We validate these results using analytical multivariate distributions (a multivariate Gaussian and a Rosenbrock distribution) with increasing dimensionality. Based on these results, we conclude that gradient-based samplers such as NUTS can be leveraged to sample high-dimensional parameter spaces in Cosmology, although the efficiency improvement is relatively mild for moderate (|$\mathcal {O}(50)$|) dimension numbers, typical of tomographic large-scale structure analyses.
1 INTRODUCTION
Cosmology has witnessed a transformative integration of machine learning methodologies into its toolbox. This has been partially prompted by the growing complexity of cosmological data sets, coupled with the increasingly intricate nature of the theoretical models needed to describe them. In this sense, emulation techniques have become important in modelling complicated functions in cosmology.
In very simple terms, emulation entails finding an approximate function which can model the quantity we are interested in. The idea of emulation is in fact an old concept. For instance, Eisenstein & Hu (1998) derived an analytic expression to describe the linear matter transfer function in the presence of cold dark matter, radiation, and baryons. However, deriving these types of expressions in general, purely in terms of cosmological parameters, requires significant human ingenuity, particularly in the presence of growing model complexity and ever more stringent accuracy requirements. Various types of emulators have been designed, each with its own advantages and disadvantages. Techniques such as polynomial regression, neural networks, Gaussian Processes (GPs), and genetic algorithms have been explored by different groups. For example, Fendt & Wandelt (2007) used polynomial regression to emulate the Cosmic Microwave Background (CMB) power spectra, while Habib et al. (2007) used Gaussian Processes, together with a compression scheme, to emulate the non-linear matter power spectrum from simulations. Recently, Aricò, Angulo & Zennaro (2021), Spurio Mancini et al. (2022), and Bonici, Bianchini & Ruiz-Zapatero (2024) developed neural network frameworks to emulate different power spectra. Moreover, Bartlett et al. (2023, 2024) used symbolic regression – a technique for finding mathematical expressions for the function of interest – to emulate the matter power spectrum. Finally, Mootoovaloo et al. (2022) introduced a Gaussian-Process-based approach to emulate both the linear and non-linear matter power spectra (and Mootoovaloo et al. (2020) explored the combination of emulation and compression in the context of weak lensing). This methodology is further discussed in Section 4.3.
Besides accelerating cosmological calculations, the availability of emulators also enables us to exploit differentiable parameter inference methods. These methods, which include Hamiltonian Monte Carlo (HMC) samplers and variational inference schemes, exploit knowledge of the likelihood derivatives to dramatically improve the acceptance rate of the sampler in high-dimensional spaces, or to obtain an approximate form for the marginal posterior (Duane et al. 1987; Blei, Kucukelbir & McAuliffe 2016). Our goal in this paper is to explore the extent to which differentiable methods can be used to accelerate the task of parameter inference in standard (e.g. power-spectrum-based) cosmological analyses. In particular, we will compare two sampling methods: NUTS, a gradient-based HMC sampler, and the state-of-the-art implementation of the Metropolis–Hastings algorithm in cobaya (Lewis & Bridle 2002; Hoffman & Gelman 2011). While both methods aim to efficiently explore parameter space, they differ in their approach to incorporating information about the target distribution. NUTS leverages the gradient of the log posterior to dynamically adjust its step size and mass matrix during the warmup phase, allowing it to adapt to the local geometry of the posterior distribution. In contrast, the Markov Chain Monte Carlo (MCMC) method in cobaya can learn the proposal covariance matrix during the run, but the resulting proposal may not be optimal. Alternatively, one can specify a covariance matrix for the proposal distribution, which requires prior knowledge and might not capture the full complexity of the target distribution. Understanding the trade-offs between gradient-based and non-gradient-based sampling methods is crucial for selecting the most appropriate approach depending on the characteristics of the problem at hand. Although related studies have been carried out recently in the literature, our aim in this work is to make as fair a comparison as possible between both sampling approaches, keeping all other aspects of the analysis (e.g. the hardware platform used, the level of parallelization, the usage of emulators) fixed and equal between both methods. Note that while we compare gradient-based against non-gradient-based samplers, Karamanis, Beutler & Peacock (2021) recently developed zeus, a non-gradient-based sampler, and compared it with another non-gradient-based sampler, emcee.
Our contributions in this work are as follows: (1) We integrate an emulator for the linear matter power spectrum in jax-cosmo (Campagne et al. 2023) and leverage its existing functionalities for computing power spectra for galaxy clustering and cosmic shear. (2) We take advantage of gradient-based samplers such as NUTS to sample the posterior of the cosmological and nuisance parameters using DES Year 1 data (Abbott et al. 2018) and a future LSST-like survey data. (3) We perform an in-depth assessment of whether differentiability is helpful in this context.
In Section 2, we describe the Gaussian Process framework used here. In Section 3, we elaborate on the different metrics used to assess the performance of the samplers. In Section 4, we use the DES Year 1 data to infer the cosmological and nuisance parameters via emulation and gradient-based sampling techniques, comparing the performance of standard and differentiable sampling methods. Moreover, in Section 5, we investigate how the effective sample size for different samplers varies as a function of model dimensionality, making use of analytical distributions. Finally, in Section 6, we look into the performance gain of differentiable samplers using a future LSST-like data set, before concluding in Section 7. In Appendix A, we also briefly review Hamiltonian Monte Carlo sampling techniques and their variant, the no-U-turn sampler (NUTS).
2 METHOD
In this section, we briefly describe the steps towards building the power spectrum emulator used in this work and the methods used to extract the derivatives of our model.
2.1 Gaussian Process emulation
A detailed description of GPs can be found in Rasmussen & Williams (2006), and we only outline the methodology briefly here. A GP is a collection of random variables, any finite subset of which follows a joint multivariate normal distribution. It is widely applicable not only to regression problems, but also to active learning scenarios.
Suppose we have a training set, |$\lbrace \textsf {X}, \boldsymbol{y}\rbrace$|. For simplicity, we will assume noise-free regression, such that the data covariance is |$\Sigma =\sigma ^{2}\mathbb {I}$|, where |$\sigma$| is a tiny value, of the order of |$10^{-5}$|. We will also assume a zero mean prior on the function we want to emulate, |$\boldsymbol{f}\sim \mathcal {N}(\boldsymbol{0},\, \textsf {K})$|. |$\textsf {K}$| is the kernel matrix, for which different functional forms can be assumed. A commonly used kernel is the radial-basis or squared-exponential function (RBF), which is given by:
where A is the amplitude of the kernel and |$\Lambda$| is typically a diagonal matrix which consists of the characteristic length-scales for each dimension. Throughout this work, we will use the RBF kernel.
Given the data and the prior of the function, the posterior distribution of the function can be obtained via Bayes’ theorem, that is,
where the denominator is the marginal likelihood, also a multivariate normal distribution, |$p(\boldsymbol{y}|\textsf {X})=\mathcal {N}(\boldsymbol{0},\, \textsf {K}+\Sigma)$|. It is a function of the kernel parameters, |$\boldsymbol{\nu }=\lbrace A,\, \Lambda \rbrace$|. The latter are determined by maximizing the marginal likelihood, which is equivalent to minimizing the negative log marginal likelihood, that is,
In general, we are interested in predicting the function at test points in parameter space. For a given test point, |$\boldsymbol{x}_{*}$|, the predictive distribution is another normal distribution with mean and variance given by:
where |$\boldsymbol{k}_{*}\in \mathbb {R}^{N_{\theta }}$| is the kernel computed between the test point and all the training points. Note that the mean is very quick to compute since |$\boldsymbol{\alpha }=(\textsf {K}+\Sigma)^{-1}\boldsymbol{y}$| is computed only once and is cached. It then requires |$\mathcal {O}(N_{\theta })$| operations to compute the mean. On the other hand, computing the predictive variance is expensive since it requires |$\mathcal {O}(N_{\theta }^{2})$| operations assuming the Cholesky factor of |$\textsf {K}+\Sigma$| is cached.
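To make the cost structure explicit, the following minimal sketch implements the RBF kernel and the cached-|$\boldsymbol{\alpha }$| predictive mean in jax. The function and variable names are ours, chosen for illustration; they are not taken from the actual emulator code.

```python
import jax.numpy as jnp

def rbf_kernel(X1, X2, amplitude, lengthscales):
    # Squared-exponential (RBF) kernel between two sets of points.
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return amplitude * jnp.exp(-0.5 * jnp.sum(diff**2, axis=-1))

def train_alpha(X, y, amplitude, lengthscales, sigma=1e-5):
    # Cache alpha = (K + Sigma)^{-1} y once; it is reused for every prediction.
    K = rbf_kernel(X, X, amplitude, lengthscales) + sigma**2 * jnp.eye(X.shape[0])
    return jnp.linalg.solve(K, y)

def predict_mean(x_star, X, alpha, amplitude, lengthscales):
    # Predictive mean k_*^T alpha: an O(N_theta) operation per test point.
    k_star = rbf_kernel(x_star[None, :], X, amplitude, lengthscales)[0]
    return jnp.dot(k_star, alpha)
```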
2.2 Gradient of the model
If we define |$\boldsymbol{\alpha } = (\textsf {K}+\Sigma)^{-1}\boldsymbol{y}$|, the mean function can be written as |$\bar{f}_{*}=\boldsymbol{k}_{*}^{\textrm {T}}\boldsymbol{\alpha }$|, meaning that the first derivative of the mean function with respect to the |$i^{\textrm {th}}$| parameter in the parameter vector |$\boldsymbol{\theta }_{*}$| (a test point in parameter space) is:
As shown in Mootoovaloo et al. (2022), analytical expressions for the first and second derivatives of the mean function can be derived. In this paper, we leverage automatic differentiation functionalities in jax to compute the first derivatives.
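Since the predictive mean above is written in jax, its derivative with respect to the test point follows from a single automatic-differentiation call (a sketch reusing the hypothetical predict_mean helper defined above):

```python
import jax

# Gradient of the GP predictive mean with respect to the test point x_star.
grad_mean = jax.grad(predict_mean, argnums=0)
# Example usage: dmu_dx = grad_mean(x_star, X, alpha, amplitude, lengthscales)
```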
Emulating an intermediate function, for example the linear matter power spectrum, offers various advantages. It has a lower dimensionality, meaning that few training points suffice to accurately model the function. Moreover, once the non-linear matter power spectrum is computed, the calculation of other power spectra, such as the shear power spectra, involves rather straightforward linear algebra (an integral of the product of the lensing kernel and the matter power spectrum). The latter can be achieved very quickly with existing libraries.
While the above describes how we can calculate the gradient of the GP with respect to the input parameters, another important quantity is the derivative of the log density of a probability distribution. For example, in a Hamiltonian Monte Carlo sampling scheme, if |$f(\boldsymbol{\Omega })$| is the target distribution and |$\boldsymbol{\Omega } \in \mathbb {R}^{p}$|, it is desirable to have quick access to |$-\frac{\partial \textrm {ln}f}{\partial \boldsymbol{\Omega }}$|. A Gaussian likelihood is adopted in most cosmological data analyses. In this spirit, if |$\boldsymbol{\mu }(\boldsymbol{\Omega })$| is the forward theoretical model which we want to compare with the data given a set of parameters |$\boldsymbol{\Omega }$|, and if we define the negative log-likelihood as |$\mathcal {L}\equiv -\textrm {log}\, p(\boldsymbol{x}|\boldsymbol{\Omega })$|, then
and assuming the covariance is not a function of the parameters, the first derivative of |$\mathcal {L}$| with respect to |$\boldsymbol{\Omega }$| is:
Note that the derivative of |$\mathcal {L}$| with respect to |$\boldsymbol{\Omega }$| is a p-dimensional vector. The above equation implies that we can easily compute the derivatives of the negative log-likelihood if we have access to the derivatives of the forward theoretical model with respect to the input parameters. In general, one can obtain the derivatives of a complex, expensive, and non-linear model using finite-difference methods. This is not optimal, since each step in a hybrid Monte Carlo scheme would then require p model evaluations, assuming a forward-difference or backward-difference method is adopted. Fortunately, with the emulator, approximate derivatives can be obtained (see equation 5).
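As a concrete sketch of this point, a Gaussian negative log-likelihood with a parameter-independent covariance can be differentiated with a single jax.grad call, provided the forward model itself is differentiable. The names below are illustrative and the constant normalization term is omitted.

```python
import jax
import jax.numpy as jnp

def neg_log_like(params, data, cov_inv, forward_model):
    # Gaussian negative log-likelihood (up to an additive constant),
    # assuming the covariance does not depend on the parameters.
    residual = data - forward_model(params)
    return 0.5 * residual @ cov_inv @ residual

# The full p-dimensional gradient in one call, instead of ~p forward-model
# evaluations with a finite-difference scheme.
grad_nll = jax.grad(neg_log_like, argnums=0)
```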
Table 1. Notation used in this work.

| Symbol | Meaning |
|---|---|
| |$N_{\theta }$| | Number of training points |
| |$N_{k}$| | Number of wavenumbers |
| |$N_{z}$| | Number of redshifts |
| |$\boldsymbol{x}$| | The data vector of size N |
| |$\boldsymbol{\theta }$| | Inputs to the emulator of size p |
| |$\Theta$| | Input training set of size |$N_{\theta }\times p$| |
| |$\boldsymbol{y}_{k}$| | Output of size |$N_{k}$| |
| |$\boldsymbol{y}_{G}$| | Output of size |$N_{z}$| |
| |$\textsf {Y}_{k}$| | Output training set of size |$N_{\theta }\times N_{k}$| |
| |$\textsf {Y}_{G}$| | Output training set of size |$N_{\theta }\times N_{z}$| |
Another quantity of interest is the second derivative of |$\mathcal {L}$| at any point in parameter space. This is essentially the Hessian matrix,
and differentiating equation (7) with respect to |$\Omega$|,
Note that |$\textsf {H}_{\Omega }\in \mathbb {R}^{p\times p}$|. The parameter vector |$\Omega =\hat{\Omega }$| which satisfies |$\partial \mathcal {L}/\partial \boldsymbol{\Omega }\, |_{\Omega =\hat{\Omega }} = \boldsymbol{0}$| corresponds to the maximum likelihood point. For an unbiased estimate of |$\Omega =\hat{\Omega }$|, the Fisher information matrix is the expectation of the Hessian matrix, that is, |$\textsf {F}_{\Omega }=\langle \textsf {H}_{\Omega }\rangle$|. The inverse of the Hessian matrix at this point is the covariance matrix of the parameters. See Tegmark, Taylor & Heavens (1997) for a detailed explanation.
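Assuming the hypothetical neg_log_like sketch from Section 2.2, the Hessian (and hence an estimate of the parameter covariance at the maximum-likelihood point) is available through a single jax.hessian call:

```python
import jax
import jax.numpy as jnp

# p x p Hessian of the negative log-likelihood; its inverse at the
# maximum-likelihood point gives the parameter covariance matrix.
hess_nll = jax.hessian(neg_log_like, argnums=0)
# Example usage:
# H = hess_nll(params_ml, data, cov_inv, forward_model)
# param_cov = jnp.linalg.inv(H)
```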
3 METRICS
This section describes the different metrics we will use to compare the performance of different samplers, both in terms of their ability to sample the posterior, and of the quality of the resulting parameter chains.
3.1 Kullback–Leibler divergence
Suppose we know the exact distribution, |$q(\boldsymbol{x})\sim \mathcal {N}(\boldsymbol{\mu },\, \textsf {C})$| and the distribution recovered by a given sampler |$p(\boldsymbol{x})\sim \mathcal {N}(\boldsymbol{\bar{x}},\, \hat{\textsf {C}})$|, and that both of them are multivariate normal. The level of agreement between both distributions can be quantified through the Kullback–Leibler (KL) divergence, which is defined as:
For two multivariate normal distributions, this can be computed analytically as:
where d is the number of variables. The KL divergence can be understood as a distance measure between distributions. For example, in equation (11), |$D_{\textrm {KL}}\rightarrow 0$| as |$\boldsymbol{\bar{x}} \rightarrow \boldsymbol{\mu }$| and |$\hat{\textsf {C}} \rightarrow \textsf {C}$|. This metric will only be used for the multivariate normal example in Section 5: computing the KL divergence for non-linear models would require numerical evaluation of high-dimensional integrals.
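For reference, the closed-form Gaussian KL divergence used later in Section 5 can be implemented directly; the sketch below assumes the convention |$D_{\textrm {KL}}(p\, \Vert \, q)$| with p the recovered distribution and q the exact one.

```python
import jax.numpy as jnp

def kl_gaussians(mu_p, cov_p, mu_q, cov_q):
    # D_KL(p || q) between two d-dimensional multivariate normal distributions.
    d = mu_p.shape[0]
    cov_q_inv = jnp.linalg.inv(cov_q)
    dmu = mu_q - mu_p
    trace_term = jnp.trace(cov_q_inv @ cov_p)
    quad_term = dmu @ cov_q_inv @ dmu
    logdet_term = jnp.linalg.slogdet(cov_q)[1] - jnp.linalg.slogdet(cov_p)[1]
    return 0.5 * (trace_term + quad_term - d + logdet_term)
```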
3.2 Effective sample size
The effective sample size is an approximate measure of the number of effectively uncorrelated samples in a given Markov chain. It is defined as:
where m is the number of chains, n is the length of each chain, and |$\rho _{t}$| is the autocorrelation of a sequence at lag t (Gelman et al. 2015). Given the samples |$\lbrace \theta _{i}\rbrace _{i=1}^{n}$|, the autocorrelation |$\rho _{t}$| at lag t can be calculated using
for a probability distribution function |$p(\theta)$| with mean |$\mu$| and variance |$\sigma ^{2}$|. Ultimately, we are interested in the computational cost of each effective sample, since we wish to minimize the time taken to obtain a reasonably sampled posterior distribution. For this reason, in this work we will use a scaled effective sample size, which takes into account the total number of likelihood evaluations:
where |$N_{\mathcal {L}}$| is the total number of calls of the likelihood function. In a typical (non-gradient) MCMC sampler, this is simply the number of times the posterior probability is evaluated, while in HMC, this would correspond to the total number of steps performed in all the leapfrog moves.
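A simplified sketch of this scaled effective sample size is given below. It pools the chains around a common mean and truncates the autocorrelation sum at the first negative lag, which is a cruder rule than the estimator of Gelman et al. (2015); it is meant only to illustrate the definition.

```python
import numpy as np

def effective_sample_size(chains, max_lag=1000):
    # chains has shape (m, n): m chains of length n for one parameter.
    m, n = chains.shape
    x = chains - chains.mean()
    var = x.var()
    # Chain-averaged autocorrelation at lags t = 1 ... max_lag - 1.
    rho = np.array([
        np.mean([x[j, :n - t] @ x[j, t:] / ((n - t) * var) for j in range(m)])
        for t in range(1, max_lag)
    ])
    cut = np.argmax(rho < 0) if np.any(rho < 0) else len(rho)
    return m * n / (1.0 + 2.0 * np.sum(rho[:cut]))

def scaled_ess(chains, n_likelihood_calls):
    # Effective samples per likelihood call; for HMC/NUTS, n_likelihood_calls
    # should count every leapfrog step.
    return effective_sample_size(chains) / n_likelihood_calls
```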
3.3 Potential scale reduction factor
We also report the potential scale reduction factor, R (Gelman & Rubin 1992) which measures the ratio of the average of the variance within each chain to the variance of all the samples across the chains. If the chains are in equilibrium, then the value of R should be close to one.
Let us assume that we have a set of samples, |$\theta _{ij}$|, where |$i\in [1,n]$| and |$j\in [1,m]$|. As in the previous section, n is the length of the chain and m is the number of chains. The between-chain variance can be calculated using:
where |$\bar{\theta }_{j}=\frac{1}{n}\sum _{i=1}^{n}\theta _{ij}$| and |$\bar{\theta }=\frac{1}{m}\sum _{j=1}^{m}\bar{\theta }_{j}$|. Furthermore, the within-chain variance is defined as:
where |$s_{j}^{2}=\frac{1}{n-1}\sum _{i=1}^{n}(\theta _{ij}-\bar{\theta }_{j})^{2}$|. The variance estimator of the within- and between- chain variances can be estimated as:
Then, the potential scale reduction factor is calculated as:
The sampling algorithm is usually started at different initial points in parameter space and in practice, |$R-1\sim 0.01$| is deemed to be a reasonable upper limit for convergence to be achieved.
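The quantities defined above translate into a few lines of code; the sketch below computes R for a single parameter from an array of shape (m, n).

```python
import numpy as np

def gelman_rubin(chains):
    # chains has shape (m, n): m chains of length n for a single parameter.
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    grand_mean = chain_means.mean()
    # Between-chain variance B and mean within-chain variance W.
    B = n / (m - 1) * np.sum((chain_means - grand_mean) ** 2)
    W = np.mean(chains.var(axis=1, ddof=1))
    # Pooled variance estimate and potential scale reduction factor.
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)
```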
4 DES ANALYSIS
This section presents our first comparison between HMC and MCMC using, as an example, the |$3 \times 2$| point analysis of the first-year data set from the Dark Energy Survey (DES-Y1).
4.1 Data
We use the data1 produced in the re-analysis of DES Y1 carried out by García-García et al. (2021). The data vector consists of tomographic angular power spectra, including galaxy clustering autocorrelations of the redMaGiC sample in five redshift bins, cosmic shear auto- and cross-correlations of the Y1 Gold sample in four redshift bins, and all the cross-correlations between the clustering and shear samples. See Fig. 1 for the two sets of redshift distributions. The methodology used to construct this data vector was described in detail in García-García et al. (2021), and will not be repeated here. Following García-García et al. (2021), we apply a physical scale cut of |$k_{\textrm {max}}=0.15\, {\rm Mpc}^{-1}$| for galaxy clustering, and an angular scale cut of |$\ell _{\textrm {max}}=2000$| for cosmic shear. This leads to a total of 405 data points.
Figure 1. Panels (a) and (b) show the estimated redshift distributions of the lens and source galaxies. In particular, we have five bins for the lens galaxies and four bins for the source galaxies. These bins are used in the calculation of the power spectra for galaxy clustering and cosmic shear. These distributions were estimated and made publicly available by Abbott et al. (2018).
Fig. 2 shows the shear–shear component of the data vector, including the measured bandpowers (blue points with error bars) and the best-fitting theoretical prediction (orange curve). Moreover, in Fig. 3, we show the joint correlation matrix for all the data points. The lower block shows the correlation among the data points for galaxy clustering, the middle block along the diagonal for the clustering-shear correlations, and the upper right block for cosmic shear. We assume a Gaussian likelihood of the form
where |$\boldsymbol{x}$| is the data vector, |$\boldsymbol{\mu }(\boldsymbol{\theta }, \boldsymbol{\beta })$| is the forward theoretical model as a function of the cosmological parameters, |$\boldsymbol{\theta }$| and the nuisance parameters, |$\boldsymbol{\beta }$|, and |$\textsf {C}$| is the data covariance matrix.
Figure 2. The data vector for the cosmic shear bandpowers is shown by the blue points with error bars; the assumed set of cosmological parameters is |$\sigma _{8}=0.841$|, |$\Omega _{\mathrm{ c}}=0.229$|, |$\Omega _{\mathrm{ b}}=0.043$|, |$h=0.717$|, |$n_{\mathrm{ s}}=0.960$|. This corresponds to the mean of the samples obtained when sampling the posterior distribution with NUTS and the emulator. The orange curves show the predicted cosmic shear power spectra. Since we have four tomographic redshift bins (see Fig. 1), we have a total of 10 auto- and cross-power spectra for cosmic shear.
Figure 3. The correlation matrix for the galaxy clustering and cosmic shear data. The block matrices along the diagonal show the correlation for the different combinations of probes (g and |$\gamma$|).
4.2 Theory
In this section, we discuss the forward model used to model the bandpowers. This also entails a set of nuisance parameters to account for the systematics in the model. Throughout this section, we will denote the normalized redshift distributions for the sources as |$p_{\gamma ,i}(z)$| and for the lens galaxies as |$p_{g,i}(z)$|. For example, for the source tomographic distribution i,
where |$n_{\gamma ,i}(z)$| is the unnormalized distribution.
4.2.1 Power spectra
Assuming a simple bias model for clustering, one which does not depend on redshift, the radial weight function for galaxy clustering in terms of the comoving radial distance, |$\chi$| is:
where |$b_{i}$| is the galaxy bias. Under the Limber approximation (Limber 1953; LoVerde & Afshordi 2008), the power spectrum for galaxy clustering can be written as:
where |$P_{\delta }(k,z)$| is the three-dimensional non-linear matter power spectrum and |$k_\ell =(\ell + {1}/{2})/\chi$|. On the other hand, the lensing efficiency for tomographic bin i is:
where c is the speed of light, |$H_{0}$| is the Hubble constant, a is the scale factor and
The power spectrum due to the correlation of the lens galaxy positions in bin i with the source galaxy shear in bin j is given by:
where |$m_{j}$| is the multiplicative shear bias. Moreover, the shear power spectrum is given by:
So far, the modelling framework includes the galaxy bias, |$b_{i}$|, for galaxy clustering and the multiplicative bias, |$m_{i}$|, for cosmic shear. Given that we have five tomographic bins for the lens galaxies and four tomographic bins for the source galaxies (see Fig. 1), the total number of galaxy biases and multiplicative biases is nine. In the next section, we elaborate on additional nuisance parameters which are accounted for in the modelling framework.
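As an illustration of how such projected spectra can be evaluated, the sketch below computes a generic Limber integral by trapezoidal quadrature. It is not the jax-cosmo implementation: the radial kernels, the |$P_{\delta }(k, z)$| callable, and the |$z(\chi)$| mapping are all assumed inputs supplied by the user.

```python
import jax.numpy as jnp

def limber_cl(ells, chi, q_i, q_j, pk_of_k_z, z_of_chi):
    # Generic Limber integral:
    #   C_ell ~ \int dchi q_i(chi) q_j(chi) / chi^2 P_delta(k_ell, z(chi)),
    # with k_ell = (ell + 1/2) / chi, on a common comoving-distance grid `chi`.
    z = z_of_chi(chi)

    def single_ell(ell):
        k_ell = (ell + 0.5) / chi
        integrand = q_i * q_j / chi**2 * pk_of_k_z(k_ell, z)
        # Trapezoidal rule over the chi grid.
        return jnp.sum(0.5 * (integrand[1:] + integrand[:-1]) * jnp.diff(chi))

    return jnp.array([single_ell(ell) for ell in ells])
```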
4.2.2 Intrinsic alignment and shifts
The uncertainty in the redshift distributions for both the lens and source galaxies is modelled as a shift in the redshift, that is,
hence leading to another set of nine nuisance parameters. The shifts are assumed to follow a Gaussian distribution and we adopt the same priors as employed by Abbott et al. (2018). Moreover, we also account for the intrinsic alignment contribution to the shear signal (Joachimi et al. 2011; Abbott et al. 2018; García-García et al. 2021). Intrinsic alignments are described through the non-linear alignment model of Hirata & Seljak (2004) and Bridle & King (2007). In this case, the intrinsic alignment (IA) can be simply included as an additive contribution with a radial kernel which is proportional to the tomographic redshift distributions
where
where |$z_{0}=0.62$|, |$\rho _{c} = 2.775\times 10^{11}\,\,h^2 \textrm {M}_{\odot } \textrm {Mpc}^{-3}$| is the critical density of the Universe, |$\Omega _{\mathrm{ m}}$| is the matter density, and |$D(z)$| is the growth factor. The two nuisance parameters in this model, |$A_{\textrm {IA}, 0}$| and |$\eta$|, are also inferred in the sampling procedure. In short, the total number of parameters in the assumed model is 25: five cosmological parameters and 20 nuisance parameters.
4.3 Emulation
The default method for computing the linear matter power spectrum in jax-cosmo is the fitting formula derived by Eisenstein & Hu (1998). An alternative approach, which we adopt in this work, is to emulate the linear matter power spectrum as calculated by a Boltzmann solver such as class (Blas, Lesgourgues & Tram 2011; Lesgourgues 2011). In particular, using an approach similar to that developed in Mootoovaloo et al. (2022), we decompose the linear matter power spectrum into two parts:
where |$P_{l}^{0}$| is the linear matter power spectrum evaluated at redshift |$z=0$|. The input training points (the cosmological parameters) are generated using Latin hypercube (LH) sampling, which ensures that the points cover the full parameter space evenly. In particular, the emulator is built over the redshift range |$z\in [0.0,\, 3.0]$| and the wavenumber range |$k\in [10^{-4},\, 50]$| in units of |$\textrm {Mpc}^{-1}$|. See Table 1 for the different notations used in this work. Moreover, we use |$N_{\theta } = 1000$| training points drawn according to the prior range shown in Table 2. In particular, we record the targets |$(G\textrm { and }P_{l}^{0})$| over |$N_{z}=20$| redshift values, equally spaced in linear scale over the redshift range, and |$N_{k}=30$| wavenumber values, equally spaced in logarithmic scale over the wavenumber range. This gives us two training sets, |$\textsf {Y}_{k}\in \mathbb {R}^{N_{\theta } \times N_{k}}$| and |$\textsf {Y}_{G}\in \mathbb {R}^{N_{\theta }\times N_{z}}$|. We then build 50 independent GP models (one for each of the |$N_{k}+N_{z}=50$| output values) as a function of the cosmological parameters.
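As an example of how such a training set can be generated, the sketch below draws |$N_{\theta }=1000$| Latin hypercube points over the prior ranges of Table 2 using scipy; the bounds are copied from the table and the seed is arbitrary.

```python
import numpy as np
from scipy.stats import qmc

# Prior bounds for (sigma_8, Omega_c, Omega_b, h, n_s), taken from Table 2.
lower = np.array([0.60, 0.07, 0.03, 0.64, 0.87])
upper = np.array([1.00, 0.50, 0.07, 0.82, 1.07])

sampler = qmc.LatinHypercube(d=5, seed=42)
unit_samples = sampler.random(n=1000)                  # N_theta points in [0, 1]^5
train_inputs = qmc.scale(unit_samples, lower, upper)   # rescale to the priors
```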
Table 2. Priors and inferred values of the cosmological parameters. We report the inferred mean and standard deviation for the different experiments we have run. As anticipated, the inferred means and standard deviations are in close agreement with each other irrespective of the sampler employed, given that we are using the same method for computing the different power spectra.
| Parameters | |$\Theta$| | Priors | NUTS |$\mu _{\textrm {emu}}$| | NUTS |$\sigma _{\textrm {emu}}$| | NUTS |$\mu _{\textrm {EH}}$| | NUTS |$\sigma _{\textrm {EH}}$| | MCMC |$\mu _{\textrm {emu}}$| | MCMC |$\sigma _{\textrm {emu}}$| | MCMC |$\mu _{\textrm {EH}}$| | MCMC |$\sigma _{\textrm {EH}}$| |
|---|---|---|---|---|---|---|---|---|---|---|
| Amplitude of density fluctuations | |$\sigma _{8}$| | |$\mathcal {U}[0.60,\, 1.00]$| | 0.840 | 0.064 | 0.828 | 0.063 | 0.840 | 0.062 | 0.830 | 0.061 |
| CDM density | |$\Omega _{c}$| | |$\mathcal {U}[0.07,\, 0.50]$| | 0.229 | 0.024 | 0.229 | 0.026 | 0.229 | 0.023 | 0.228 | 0.025 |
| Baryon density | |$\Omega _{b}$| | |$\mathcal {U}[0.03,\, 0.07]$| | 0.043 | 0.007 | 0.045 | 0.007 | 0.043 | 0.007 | 0.045 | 0.007 |
| Hubble parameter | h | |$\mathcal {U}[0.64,\, 0.82]$| | 0.719 | 0.051 | 0.711 | 0.050 | 0.719 | 0.049 | 0.712 | 0.048 |
| Scalar spectral index | |$n_{s}$| | |$\mathcal {U}[0.87,\, 1.07]$| | 0.957 | 0.056 | 0.958 | 0.056 | 0.958 | 0.054 | 0.959 | 0.054 |
As described in Section 2.1, we then use the Gaussian Process formalism to build an emulator which maps the cosmological parameters, |$\Theta \in \mathbb {R}^{N_{\theta }\times p}$|, where |$N_{\theta }=1000$| and |$p=5$| in our application, to each of the targets. The models are trained and we store |$\boldsymbol{\alpha } = (\textsf {K}+\Sigma)^{-1}\boldsymbol{y}$|. During the prediction phase, the mean prediction and the first derivative can be calculated using equations (3) and (5), respectively.
One can then use either the emulator or the fitting formula of Eisenstein & Hu (1998) to compute the linear matter power spectrum. Moreover, the existing jax-cosmo functionalities are untouched, implying that the non-linear matter power spectrum calculation proceeds via the Halofit fitting formula of Takahashi et al. (2012), which depends directly on the linear matter power spectrum. Furthermore, the shear and galaxy clustering power spectra can be calculated very quickly via numerical integration.
4.4 Inference
In this work, we will compare two samplers namely, NUTS and the default MCMC sampler in cobaya. Throughout this work, we will use the implementation of NUTS in numpyro. In numpyro, the sampling process is divided into two distinct phases: warmup and sampling.
During the warmup phase, the sampler explores the parameter space to adaptively tune its proposal distribution. It does so by generating a series of samples and adjusting its parameters based on these samples. This phase aims to reach a region of high probability density in the posterior distribution while minimizing the influence of the initial guess. In particular, it focuses on improving the sampler’s efficiency by adjusting its step sizes, trajectory lengths, or other parameters to achieve an optimal acceptance rate. The warmup phase is crucial for ensuring that subsequent samples are drawn from the target distribution effectively. In general, adapting the mass matrix during the warmup phase can be expensive, but once the mass matrix is learnt it is fixed, and the subsequent sampling can be fast.
Once the warmup phase is complete, the sampler enters the sampling phase. Here, the sampler generates samples from the posterior distribution according to the adapted proposal distribution. These samples are used for inference and analysis, such as estimating posterior means and variances.
The warmup phase is essential for the sampler to adapt to the characteristics of the target distribution, while the sampling phase focuses on generating samples for inference. Separating these phases allows numpyro to achieve efficient sampling while ensuring the quality of the final samples.
The NUTS sampler in numpyro involves setting a maximum depth parameter, which determines the maximum depth of the binary trees it evaluates during each iteration. The number of leapfrog steps taken is then constrained to be no more than |$2^{j}-1$|, where j is the maximum tree depth (see Fig. 1 in Hoffman & Gelman 2011). The sampler reports both the tree depth and the actual number of leapfrog steps computed, along with the parameters sampled during the process. These parameters provide insights into how the sampler explores the parameter space and allow users to monitor the efficiency of the sampling process. Adjusting the maximum depth parameter can help balance exploration and efficiency, ensuring that the sampler explores the posterior distribution effectively while avoiding excessive computational costs.
For NUTS, we set the maximum tree depth to eight and use an initial step size of 0.01. Moreover, we fix the number of warmup steps to 500. We also generate two chains consisting of 15 000 samples each.
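A minimal numpyro configuration matching these settings is sketched below. The model is a toy stand-in (uniform priors on two cosmological parameters and a Gaussian likelihood around a simple derived quantity), not the actual jax-cosmo |$3\times 2$| point likelihood.

```python
import jax
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(data=None):
    # Toy stand-in for the full 3x2pt likelihood: uniform priors on two of the
    # cosmological parameters (Table 2) and a Gaussian likelihood around a
    # simple derived quantity resembling S_8.
    sigma8 = numpyro.sample("sigma8", dist.Uniform(0.60, 1.00))
    omega_c = numpyro.sample("omega_c", dist.Uniform(0.07, 0.50))
    theory = sigma8 * jnp.sqrt(omega_c / 0.3)
    numpyro.sample("obs", dist.Normal(theory, 0.05), obs=data)

kernel = NUTS(model, step_size=0.01, max_tree_depth=8)
mcmc = MCMC(kernel, num_warmup=500, num_samples=15_000, num_chains=2)
mcmc.run(jax.random.PRNGKey(0), data=jnp.array(0.80))
samples = mcmc.get_samples()
```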
cobaya uses a Metropolis MCMC sampler as described by Lewis (2013), and one could also exploit its fast/slow parameter splitting. The idea is to decorrelate the fast and slow parameters so that sampling becomes more efficient. This can result in large performance gains when there are many fast parameters. Throughout this work, we do not use this feature, as both the emulator and the Eisenstein and Hu (EH) methods are sufficiently fast. Additionally, accurately quantifying the performance gain poses challenges in this context. When using cobaya’s MCMC, the sampler stops when either the specified number of samples is attained or the convergence criterion is met |$(R-1\le 0.01)$|. In some likelihood analyses, it is also possible to adopt an approximate likelihood by analytically marginalizing over the many nuisance parameters (Hadzhiyska et al. 2023; Ruiz-Zapatero et al. 2023). For cobaya’s MCMC, we run two separate chains and we set the number of MCMC samples to |$5\times 10^{5}$|.
In all experiments we have performed in this work, we have provided cobaya’s MCMC with a covariance matrix for the proposal distribution. In doing so, we are essentially providing information about how to explore the parameter space efficiently. This vastly helps the sampler to generate samples that are more likely to be accepted, leading to better sampling performance. In principle, one could also specify a mass matrix in numpyro for NUTS. The latter dynamically adapts its step size and mass matrix during the warmup phase. It uses the information from the gradient of the log posterior to tune these parameters. This adaptiveness allows NUTS to efficiently explore the parameter space without requiring explicit specification of the proposal distribution covariance.
4.5 Results
In this section, we explore the different inference made when using the emulator and the two samplers, NUTS and cobaya’s MCMC.
In Fig. 4, we show the accuracy of the emulator, which was then embedded into the jax-cosmo pipeline. For the range of redshifts and wavenumbers considered and the domain of the cosmological parameters, the quantities |$P_{l}$| and G are accurate to |$\le 1~{{\ \rm per\ cent}}$| and |$\le 0.01~{{\ \rm per\ cent}}$|, respectively. Recall that we are using 1000 LH samples to build the emulator. Generating the 1000 training points using class took around 1 h, while training the GPs took around 2 h on a desktop computer. The training of GPs is expensive because of the |$\mathcal {O}(N^{3})$| cost of computing the Cholesky factor and the |$\mathcal {O}(N^{2})$| cost of solving for |$\boldsymbol{\alpha }=\textsf {K}_{y}^{-1}\boldsymbol{y}$|. However, once they are trained and stored, prediction is very fast and computing the log-likelihood takes of the order of milliseconds. Moreover, one can use the fixed |$\boldsymbol{\alpha }$|s and the pre-trained kernel hyperparameters in any GP implementation, irrespective of whether it is based on numpy, pytorch, tensorflow, or jax. Finally, the priors are sufficiently broad that the emulation framework can be used for different probes, such as the weak lensing analysis considered here.
Figure 4. The left panel shows the accuracy for the linear matter power spectrum, evaluated at |$z=0$| over a wavenumber range of |$k\in [10^{-4},\, 50]$| in units of |$\textrm {Mpc}^{-1}$|, and the right panel shows the accuracy for the quantity G, evaluated over the redshift range of |$z\in [0.0,\, 3.0]$|. These quantities can be robustly calculated within an accuracy of 1 per cent.
Fig. 5 compares the marginalized 1D and 2D distributions of the cosmological parameters, |$\boldsymbol{\Theta }$| (see Table 2) using the emulator with cobaya’s MCMC and NUTS. The inferred cosmological parameters are shown in Table 2 for different setups.
Figure 5. The 1D and 2D marginalized posterior distributions of all the cosmological parameters. The green contours show the distribution when the emulator is used with cobaya’s MCMC sampler, while the solid black curves correspond to the setup where NUTS is used for sampling the posterior distribution. There is negligible difference in the posterior when comparing the MCMC in cobaya and NUTS. The left panel shows the posterior obtained when using the DES data, while the right panel shows the contours obtained with the simulated LSST-like data.
Under this configuration, the potential scale reduction factor is equal to 1.00 for all parameters when either the emulator or EH is used in jax-cosmo. Sampling the posterior with NUTS takes |$\sim 13$| h for two chains with numpyro using a single GPU. Alternatively, a single run using cobaya’s MCMC, whether with the emulator or EH in jax-cosmo, takes approximately 5 h to sample the posterior. Note that the chains generated by both samplers did not contain the same number of samples, and therefore the difference in time above is not reflective of their relative performance. Moreover, in order to quantify the difference between the inferred parameters with either sampler, we use the ‘difference of Gaussians’ statistic:
The maximum difference among the set of parameters considered in this experiment is |$\sim 0.1$|. We also compute the average of the scaled effective sample size, |$N_{\textrm {eff}}$|, to compare the samplers. We define the efficiency gain as:
The relative gain in efficiency when using NUTS compared to cobaya’s MCMC is |$\mathcal {O}(10)$|. When using Limberjack and NUTS in turing.jl, Ruiz-Zapatero et al. (2024) estimated a gain in |$N_{\textrm {eff}}$| of |$\sim 1.7$|, compared to the samples obtained using the Metropolis–Hastings algorithm implemented in cobaya (García-García et al. 2021). In addition, when using reverse-mode automatic differentiation (the default setup in numpyro), the ratio of the cost of a single gradient calculation to that of a single likelihood evaluation is |$\sim 4.5$| (either with the emulator or EH). The efficiency gain is therefore larger than the relative cost of the gradient evaluation. With julia, Ruiz-Zapatero et al. (2024) found this ratio to be |$\sim 5.5$| when using forward-mode automatic differentiation. Note that the differences in the values above can be attributed to the fact that different samplers will, in general, have different implementations. Taking into account the more expensive gradient evaluation, we find that the overall efficiency gain of NUTS with respect to cobaya’s MCMC, when measured in terms of computing time on the same platform, is |$\sim 2$|.
In Fig. 6, we show the joint posterior distribution of the |$S_{8}\equiv \sigma _{8}\sqrt{\Omega _{\mathrm{ m}}/0.3}$| and |$\Omega _{\mathrm{ m}}$| parameters when the posterior is sampled with cobaya’s MCMC and NUTS. We also use the Planck 2018 samples, which are publicly available, to show the same joint distribution. Note that we are using the baseline |$\Lambda$| cold dark matter (|$\Lambda$|CDM) chains with the baseline likelihoods for Planck. With the public Planck chains, |$S_{8}=0.834\pm 0.016$|. On the other hand, in our experiments, if we use the emulator with cobaya’s MCMC and NUTS, then |$S_{8}=0.797 \pm 0.035$| and |$S_{8}=0.797\pm 0.036$|, respectively. With EH, |$S_{8}=0.788\pm 0.033$| and |$S_{8}=0.788\pm 0.034$| with cobaya’s MCMC and NUTS, respectively.
Figure 6. The marginalized posterior distribution of the derived parameter |$S_{8}=\sigma _{8}\sqrt{\Omega _{\mathrm{ m}}/0.3}$| using the emulator. The black contours show the distribution obtained with the NUTS sampler, while the green contours correspond to the posterior obtained using cobaya’s MCMC. We also plot the publicly available Planck 2018 samples (in orange) of |$S_{8}$| and |$\Omega _{m}$|.
Despite the computational overhead of computing derivatives in NUTS, we observe a gain in the scaled effective sample size when comparing NUTS and cobaya’s MCMC. However, it raises the question of whether this advantage will persist in higher-dimensional problems. To address this, we investigate three additional cases: a multivariate normal distribution, the Rosenbrock function, and a future LSST-like system with 37 model parameters. This broader analysis aims to ascertain the scalability and effectiveness of these samplers across various dimensions and problem complexities.
5 ANALYTICAL FUNCTIONS
In this section, we will investigate how these metrics scale with other functions, such as the multivariate normal distribution and the Rosenbrock function.
5.1 Multivariate normal distribution
The expression for a multivariate normal distribution is:
For simplicity, we assume a zero mean and an identity matrix for the covariance of the multivariate normal distribution. The aim is to obtain samples of |$\boldsymbol{x}$| and to estimate the sample mean, |$\boldsymbol{\bar{x}}$| and covariance, |$\hat{\textsf {C}}$| as we increase the dimensionality of the problem. In the limit where we have a large number of unbiased samples of |$\boldsymbol{x}$|, then |$\boldsymbol{\bar{x}} \rightarrow \boldsymbol{\mu }$| and |$\hat{\textsf {C}} \rightarrow \textsf {C}$|.
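In numpyro, this target can be written as a one-line model and sampled with the same NUTS/MCMC configuration as in Section 4.4 (a sketch; the dimension d is varied across runs):

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def mvn_model(d=50):
    # d-dimensional multivariate normal target with zero mean and identity covariance.
    numpyro.sample("x", dist.MultivariateNormal(loc=jnp.zeros(d),
                                                covariance_matrix=jnp.eye(d)))
```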
5.2 Rosenbrock function
The next function we consider is the Rosenbrock function, which is given by:
where |$\boldsymbol{x}$| is a vector of size N and for this particular variant of the Rosenbrock function, N is even. |$\zeta$| is a factor which controls the overall shape of the final function. If it is set to zero, then, the function is simply analogous to the |$\chi ^{2}$| term in a multivariate normal distribution with mean one and covariance matrix equal to the identity matrix. For |$\zeta > 0$|, a quartic term is introduced in the overall function and this leads to banana-like posteriors. In both experiments (multivariate normal distribution and the Rosenbrock function), we set the step size and the maximum tree depth to 0.01 and 6, respectively. We use 500 warmup steps and generate two chains consisting of 5000 samples.
We fix |$\zeta = 9$| in our experiments. Some of the 2D joint posterior distributions follow banana-like shapes (see Fig. 7), which demonstrates the complexity of these functions, especially in high dimensions.
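The paper’s exact definition is given by the equation above (not reproduced here); the log density below should therefore be read only as one possible even-dimensional variant consistent with the description: a quadratic term centred on one for every component plus a |$\zeta$|-controlled quartic coupling between consecutive pairs.

```python
import jax.numpy as jnp

def rosenbrock_log_prob(x, zeta=9.0):
    # Assumed form (an illustration, not necessarily the paper's exact variant):
    # with zeta = 0 it reduces to the chi^2 of a unit-variance Gaussian centred on one.
    x_odd, x_even = x[0::2], x[1::2]
    quartic = zeta * jnp.sum((x_even - x_odd**2) ** 2)
    quadratic = jnp.sum((x - 1.0) ** 2)
    return -0.5 * (quadratic + quartic)
```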
Figure 7. The posterior distribution of the first six parameters of a 100-dimensional Rosenbrock function using cobaya’s MCMC and the NUTS sampler. In particular, this is obtained by setting |$\zeta =9$| in the function (see Section 5.2 for further details).
5.3 Results
In Fig. 8, we show how |$N_{\textrm {eff}}$| changes as we increase the dimensionality of the multivariate normal distribution (in red and blue for cobaya’s MCMC and NUTS, respectively). When evaluating the effective sample size per likelihood evaluation, cobaya’s MCMC consistently yields lower values compared to NUTS. This discrepancy arises from their respective sampling strategies. NUTS uses the gradient of the likelihood function to build its proposal distribution, leading to a higher acceptance rate compared to cobaya’s MCMC, whose proposal distribution only depends on the last accepted sample. Consequently, while cobaya’s MCMC requires more likelihood evaluations to achieve a comparable effective sample size to NUTS, the latter tends to provide more informative samples in fewer evaluations. Moreover, Fig. 9 shows how the KL divergence between the inferred distribution and the exact distribution changes as a function of the dimensionality. The KL divergence values for both samplers are comparable, except at |$d=10$| and |$d> 80$|, where it is greater for cobaya’s MCMC than for NUTS. However, the inferred values of the mean and standard deviation are close to zero and unity, respectively, with either sampler. In order to compute the KL divergence, we use the same number of samples (|$10^{4}$|) for both samplers, which is equal to the number of samples drawn using NUTS. The maximum difference between the expected statistics (mean and standard deviation) and the inferred ones across all dimensions is |$\sim 0.03$| with either sampler.
Figure 8. The scaled effective sample size as a function of the dimensionality of the multivariate normal distribution (red for cobaya’s MCMC and blue for NUTS) and of the Rosenbrock function (green for cobaya’s MCMC and purple for NUTS). |$N_{\textrm {eff}}$| for NUTS is always higher compared to cobaya’s MCMC over the dimensions considered here. Moreover, as expected, for a tricky function such as the Rosenbrock function, |$N_{\textrm {eff}}$| is always lower compared to the multivariate normal distribution case.
Figure 9. As elaborated in Section 3.1, we also compute the Kullback–Leibler divergence, |$D_{\textrm {KL}}$|, between the sampled distribution and the expected one. For the dimensions considered here |$(10-100)$|, most of the |$D_{\mathrm{ KL}}$| values are comparable, except at |$d=10$| and |$d> 80$|, where the |$D_{\textrm {KL}}$| for cobaya’s MCMC is greater than that of NUTS. Note that for a fair comparison, we use the same number of MCMC samples in both cases. The inferred mean and standard deviation in either case are close to 0 and 1, respectively.
In Fig. 8, we show the scaled effective sample size when we sample the Rosenbrock function with NUTS and cobaya’s MCMC (in purple and green, respectively) for different dimensions. Although the Rosenbrock function is more complex than the multivariate normal distribution, |$N_{\textrm {eff}}$| for NUTS remains superior to that of cobaya’s MCMC. Nevertheless, as depicted in Fig. 8, the |$N_{\textrm {eff}}$| for the Rosenbrock function is, as expected, lower than that for the multivariate normal distribution when using the same sampler. In the multivariate normal distribution example, we know the analytic expression for the function and hence we are able to use the analytic expression for the Kullback–Leibler divergence to estimate the distance between the inferred distribution and the expected distribution. For the Rosenbrock function, this is not the case, since we are dealing with a non-trivial function. However, as seen in Fig. 7, the mean and variance are roughly the same for every two dimensions. In this spirit, we calculate the mean of the difference between the mean of the samples and the expected mean, that is, |$\langle \mu _{*}-\mu \rangle$|, for all parameters. We do a similar calculation for the standard deviation, that is, |$\langle \sigma _{*}-\sigma \rangle$|. |$\mu _{*}$| and |$\sigma _{*}$| are the inferred mean and standard deviation, while |$\mu$| and |$\sigma$| are the expected mean and standard deviation. For |$\zeta =9$|, |$\mu \sim 0.45$| and |$\sigma \sim 0.39$| in the first dimension and |$\mu \sim 0.39$| and |$\sigma \sim 0.48$| in the second dimension. We find that with NUTS and cobaya’s MCMC, these differences are very close to zero in all dimensions, |$\langle \mu _{*}-\mu \rangle \lesssim 0.015$| and |$\langle \sigma _{*}-\sigma \rangle \lesssim 0.015$|.
Furthermore, in both the multivariate normal distribution and the Rosenbrock function examples, the potential scale reduction factor is close to one when either NUTS or cobaya’s MCMC is used to sample the function. However, with a tricky function such as the Rosenbrock function, we find that the potential scale reduction factor gets worse (|$R\sim 1.0-1.4$|) with cobaya’s MCMC as the dimensionality increases. Moreover, the acceptance probability when NUTS is used is always |$\gtrsim 0.7$| with either the multivariate normal or the Rosenbrock function. On the other hand, cobaya’s MCMC has an acceptance probability of |$\sim 0.3$| when sampling the multivariate normal distribution. With the Rosenbrock function, the acceptance probability varies from |$\sim 0.17$| to |$\sim 0.1$| as the dimensionality increases.
Based on the experiments performed in this section, we find that NUTS always produces more effective samples, irrespective of the function employed. Since the Rosenbrock function exhibits non-Gaussianity – characterized by its non-linear and asymmetric shape – a reduction in |$N_{\textrm {eff}}$| is expected for both samplers. For |$d\le 100$|, both samplers are able to recover the correct shape of the posterior distribution. However, NUTS is more likely to scale better to higher dimensions |$(d> 100)$| as a result of its consistently high |$N_{\textrm {eff}}$|.
6 LSST
Lastly, we look into a |$3\times 2$| point analysis using simulated data for a future LSST-like survey. The number of nuisance parameters is expected to be higher in order to account for the astrophysical and observational systematic uncertainties in the shear signal.
6.1 Data and model
Following The LSST Dark Energy Science Collaboration (2018) document, we specify ten tomographic bins for galaxy clustering, spaced by 0.1 in photo-z between |$0.2 \le z \le 1.2$|. For cosmic shear, we assume five tomographic bins (Leonard et al. 2023). We use a physical scale cut of |$k_{\rm max}=0.15\, {\rm Mpc}^{-1}$| for galaxy clustering, and a less conservative |$\ell _{\rm max}=3000$| for cosmic shear. The simulated data consist of the angular power spectra, |$C_{\ell }$|, with a Gaussian covariance given by the Knox formula. The |$C_\ell$| have been averaged over each of the |$\ell$|-bins. Both auto- and cross-power spectra are included in the analysis for shear–shear and galaxy–shear correlations, and only auto-power spectra are included for galaxy–galaxy correlations. The data vector, after applying the scale cuts, consists of 903 elements, |$\boldsymbol{x}\in \mathbb {R}^{903}$|, with a data covariance matrix |$\textsf {C}\in \mathbb {R}^{903\times 903}$|. In this setup, we now have
ten bias parameters, |$b_{i},\, i\in [1,10]$|,
ten shift parameters, |$\Delta z_{j}^{g},\, j\in [1,10]$|,
five multiplicative bias parameters, |$m_{k},\, k\in [1,5]$|,
five shift parameters, |$\Delta z_{t}^{\gamma },\, t\in [1,5]$| and
two parameters in the intrinsic alignment model (|$A_{\textrm {IA}}$|, |$\eta$|),
resulting in a total of 32 nuisance parameters. The total number of parameters in this experiment is 37, the five cosmological parameters (shown in Table 2) and 32 nuisance parameters.
In this experiment, for NUTS, we fix the step size to 0.01 and a maximum tree depth of eight. We also fix the number of warmup steps to 500 and generate two chains of 15 000 samples. For cobaya’s MCMC, we run two chains and set the number of samples to |$5\times 10^{5}$| and the convergence criterion to |$R-1\le 0.01$|.
6.2 Results
The Gelman–Rubin convergence test, represented by the potential scale reduction factor, consistently indicates convergence to the target distribution for all parameters when employing either NUTS or cobaya’s MCMC sampler. This convergence holds true regardless of the forward modelling technique used, whether it be the Eisenstein–Hu method or the emulation method.
Furthermore, when comparing the NUTS and cobaya’s MCMC samplers, the sampling efficiency, |$\gamma$| as calculated using equation (31), is found to be |$\mathcal {O}(10)$|. The relative cost of the gradient of the likelihood to the likelihood itself is |$\sim 4.5$|. Similar to the analysis of the DES Year 1 data (see Section 4), the overall gain is |$\sim 2$|.
The constraints obtained using NUTS and cobaya’s MCMC are comparable, with the maximum distance (see equation 30) being |$\sim 0.2$| if we use the emulator and |$\sim 0.1$| if we use EH method. The marginalized 1D and 2D posterior distribution of the cosmological parameters are shown in right panel of Fig. 5.
7 CONCLUSION
In this work, we have performed a quantitative assessment of different aspects related to emulation and gradient-based samplers.
In particular, we have integrated an emulator for the linear matter power spectrum in jax-cosmo. The emulator is both accurate (see Fig. 4) and fast, with a speed-up comparable to that obtained when using the Eisenstein & Hu method to calculate the linear matter power spectrum. Note that only 1000 training points were used to achieve an accuracy of |$\sim 1~{{\ \rm per\ cent}}$|, whereas deep learning frameworks are generally more data-hungry, requiring many more training points to reach a similar level of accuracy. Moreover, as shown in Fig. 5, the constraints obtained with different samplers such as NUTS and cobaya’s MCMC agree with each other (see Table 2 for numerical results). A notable advantage of the emulation technique is that, for a particular set of configurations (priors on the cosmological parameters, the range of wavenumbers considered, and the redshift domain), the emulator is trained once, stored, and can be coupled to any likelihood code, under the assumption that the scientific problem being investigated does not require configurations beyond those of the emulator.
In the DES analysis, the use of NUTS yields a sampling efficiency gain of approximately 10 for a system with 25 parameters and a non-trivial |$\sigma _{8}-\Omega _{\mathrm{ c}}$| degeneracy. Alongside assessing the effective sample size, the Gelman–Rubin statistics were calculated to confirm convergence across all chains in each experiment. While the efficiency gain favours NUTS in this scenario, it is not by a significant margin (roughly a factor of two once the relative cost of the gradient calculation is taken into account). This observation aligns with the findings of Ruiz-Zapatero et al. (2024).
Beyond the DES analysis, we also investigated the advantages of NUTS over standard non-gradient-based samplers such as cobaya’s MCMC as a function of dimensionality, finding that NUTS is preferable in high dimensions |$(d> 100)$|. The comparison, illustrated with the multivariate normal distribution, reveals several key advantages of NUTS in high-dimensional contexts. First, the small Kullback–Leibler divergence indicates that the chain converges efficiently to the expected results. Secondly, the Gelman–Rubin statistics remain close to 1.00, indicating convergence across chains. Thirdly, the scaled effective sample size consistently exceeds that of cobaya’s MCMC, demonstrating the effectiveness of NUTS. This analysis underscores the utility of NUTS in efficiently exploring complex parameter spaces, especially in high dimensions where methods like cobaya’s MCMC may struggle.
Furthermore, an in-depth examination using the Rosenbrock function confirms the sampling efficiency gain of NUTS. This suggests that NUTS is not only advantageous in terms of convergence and effective sample size but also provides improved exploration of complex, non-trivial functions. Overall, these findings highlight the contexts where NUTS outperforms traditional non-gradient based samplers, making it a valuable tool for Bayesian inference in a wide range of applications.
In exploring a 37-dimensional parameter inference problem with NUTS and cobaya’s MCMC for simulated data from a future LSST-like survey, we have also found that NUTS is more effective than cobaya’s MCMC by a factor of |$\sim 2$|. NUTS consistently exhibits higher sampling efficiency and provides a greater effective sample size per likelihood call compared to cobaya’s MCMC. This suggests that, while NUTS is better suited for handling complex parameter spaces with |$O(40)$| dimensions, the relative improvement factor to be expected on a given platform is mild (|$\sim 2$| rather than orders of magnitude). While the cost of the gradient calculation is a drawback of using NUTS, this computation can be accelerated by leveraging GPUs. GPUs excel at parallel processing tasks, including gradient computations, offering significant speed-ups over traditional CPU-based methods. By harnessing GPU acceleration, NUTS becomes more feasible for high-dimensional problems and large-scale cosmological analyses.
In order to fully exploit this possibility, more complete and sophisticated theory prediction frameworks will need to be developed, able to flexibly produce predictions for a wide range of observables of interest to current and future large-scale structure and CMB experiments. Various approaches to this problem have been initiated by the community, making use of tools such as jax (Campagne et al. 2023; Piras & Spurio Mancini 2023; Piras et al. 2024; Ruiz-Zapatero et al. 2024) and julia, and efforts to bring these frameworks to full maturity will significantly improve our ability to obtain both fast and robust parameter constraints from future data.
ACKNOWLEDGEMENTS
We thank Dr. Zafiirah Hosenie for reviewing this manuscript and providing useful feedback. We thank Prof. Alan Heavens and Prof. Andrew Jaffe for insightful discussions. AM was supported through the LSST Discovery Alliance (LSST-DA) Catalyst Fellowship project; this publication was thus made possible through the support of Grant 62192 from the John Templeton Foundation to LSST-DA. JR-Z was supported by UK Space Agency grants ST/W001721/1 and ST/X00208X/1. CG-G and DA were supported by the Beecroft Trust. We made extensive use of computational resources at the University of Oxford Department of Physics, funded by the John Fell Oxford University Press Research Fund.
DATA AVAILABILITY
The code and part of the data products underlying this article can be found at: https://github.com/Harry45/DESEMU/. The full data set and pre-trained GPs can be made available upon request. An unofficial implementation of the emulator implemented by Jean-Eric Campagne – originally developed by Mootoovaloo et al. (2022) – can be found at https://github.com/jecampagne/Jemu.
REFERENCES
APPENDIX A: HMC AND NUTS
A1 Hamiltonian Monte Carlo
HMC is a sampling technique developed by Duane et al. (1987), inspired by molecular dynamics. The motion of molecules follows Newton’s laws of motion and can be formulated as Hamiltonian dynamics. In short, we label the parameters governing the full theoretical model the ‘position variables’, and introduce an equal number of ‘momentum variables’. Starting from an initial state, a final state is proposed, related to the initial state through a Hamiltonian trajectory that uses the log-posterior as an effective potential and a kinetic term governed by a mass matrix. A clear benefit of this sampling scheme is that the proposed state has, by construction, a high probability of acceptance, due to energy conservation. Moreover, the sampler is guided by the momentum variables, and is therefore less prone to random-walk behaviour. We refer the reader to Neal (2011), who provides an in-depth and pedagogical review of HMC.
In cosmology, HMC has been adopted in various contexts, with the main goal of sampling the posterior distribution of some desired quantities, for example the power spectrum, in an efficient way. Hajian (2007) implemented an HMC scheme to sample cosmological parameters and found it to be more efficient by a factor of 10 compared to the existing MCMC method in cosmomc. Taylor, Ashdown & Hobson (2008) developed an HMC sampling scheme to sample the CMB power spectrum, while Jasche & Lavaux (2019) developed an ambitious sampling framework, Bayesian origin reconstruction from galaxies, built around the HMC sampler, to sample over millions of parameters. Most recently, Campagne et al. (2023), Ruiz-Zapatero et al. (2024), and Piras et al. (2024) have developed auto-differentiable frameworks to analyse angular power spectra with the aim of enabling gradient-based samplers.
Here, we describe an implementation of the HMC algorithm tailored towards using the mean and the first derivative of the posterior. The second derivative can also be used to tune the mass matrix, although one could instead resort to a short MCMC chain to estimate it. In this section, we assume that the mass matrix is updated using the second derivative as we sample the parameters of the system. Suppose |$\Omega \in \mathbb {R}^{d}$| is a |$d$|-dimensional position vector (the parameters of interest) and |$\boldsymbol{q}\in \mathbb {R}^{d}$| is a |$d$|-dimensional momentum vector. The full state space has |$2d$| dimensions and the system is described by the Hamiltonian |$\mathcal {H}(\Omega , \boldsymbol{q}) = U(\Omega) + K(\boldsymbol{q})$|.
In Bayesian applications, the potential energy is simply the negative log-posterior, that is, |$U(\Omega) = -\ln p(\Omega \mid \boldsymbol{x})$|,
while the kinetic energy term is |$K(\boldsymbol{q}) = \frac{1}{2}\boldsymbol{q}^{\rm T}\textsf {M}^{-1}\boldsymbol{q} + \frac{1}{2}\ln \det \textsf {M}$| (up to an additive constant), where |$\textsf {M}$| is the mass matrix.
Note that the determinant of the mass matrix is not ignored in this variant of HMC, since the mass matrix is updated after every leapfrog move. In other words, the mass matrix is in fact a function of position as we sample the parameter space. The partial derivatives of the Hamiltonian control the evolution of the position and momentum variables, that is, |$\dot{\Omega } = \partial \mathcal {H}/\partial \boldsymbol{q}$| and |$\dot{\boldsymbol{q}} = -\partial \mathcal {H}/\partial \Omega$|.
Crucially, as discussed by Neal (2011), three key aspects of this formulation are: (1) the Hamiltonian dynamics are reversible, (2) the Hamiltonian is preserved, and (3) the volume is preserved. In order to solve the system of differential equations, one typically resorts to numerical techniques, for example, Euler’s method. A better alternative is the leapfrog algorithm, which is summarized below:
|$\boldsymbol{q}(t+\epsilon /2) = \boldsymbol{q}(t) - \frac{\epsilon }{2}\, \partial _{\Omega }U\left[\Omega (t)\right]$|,
|$\Omega (t+\epsilon) = \Omega (t) + \epsilon \, \textsf {M}^{-1}\boldsymbol{q}(t+\epsilon /2)$|,
|$\boldsymbol{q}(t+\epsilon) = \boldsymbol{q}(t+\epsilon /2) - \frac{\epsilon }{2}\, \partial _{\Omega }U\left[\Omega (t+\epsilon)\right]$|,
where |$\epsilon$| is the step size parameter. The leapfrog algorithm is shown in Algorithm 1. In particular, one takes a half-step in the momentum at the beginning, followed by |$N_{\mathrm{ L}}$| full steps in the momentum and position variables, and a final half-step in momentum. The new proposed state is accepted according to Algorithm 2. The full HMC algorithm is presented in Algorithm 3.
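As a concrete illustration of these updates, the following is a minimal JAX sketch of a leapfrog integrator with a fixed mass matrix; it is a pedagogical stand-in, not the implementation used in numpyro.

```python
import jax
import jax.numpy as jnp

def leapfrog(position, momentum, log_post, eps, n_steps, inv_mass):
    """N_L leapfrog steps for H = U + K, with U = -log_post and a fixed mass matrix."""
    grad_u = jax.grad(lambda x: -log_post(x))              # dU/dOmega
    momentum = momentum - 0.5 * eps * grad_u(position)     # initial half step in momentum
    for _ in range(n_steps - 1):
        position = position + eps * inv_mass @ momentum    # full step in position
        momentum = momentum - eps * grad_u(position)       # full step in momentum
    position = position + eps * inv_mass @ momentum        # final full step in position
    momentum = momentum - 0.5 * eps * grad_u(position)     # final half step in momentum
    return position, momentum

# toy usage: a standard normal target in 3 dimensions
log_post = lambda x: -0.5 * jnp.sum(x ** 2)
q0, p0 = jnp.zeros(3), jnp.ones(3)
q1, p1 = leapfrog(q0, p0, log_post, eps=0.1, n_steps=10, inv_mass=jnp.eye(3))
```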
To ensure a good performance with the HMC method, an appropriate mass matrix, step size, and number of leapfrog moves are recommended. We briefly touch upon these concepts below.
A1.1 Mass matrix
If we can calculate the Hessian matrix, the latter can be used to estimate the mass matrix in the HMC. The inverse of this matrix, |$\textsf {H}^{-1}$|, gives an estimate of the covariance of the parameters at any point in parameter space. In fact, this covariance is used in the leapfrog algorithm to take a step in the parameters. In Section A1.2, we will look into how this Hessian matrix can be used to set the mass matrix, in order to ensure efficient sampling. Note that we do not adapt the mass matrix at every step of the sampling procedure, as this would be very expensive. For example, in the numpyro library, the mass matrix can be adapted during the warmup phase, and the learned mass matrix is then kept fixed throughout the sampling process (Phan, Pradhan & Jankowiak 2019).
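In numpyro, these two options (letting the warmup phase adapt a dense mass matrix, or supplying a pre-computed inverse mass matrix, e.g. from a Hessian or a short pilot chain) could be expressed as in the sketch below; the toy model and the identity matrix are placeholders.

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import NUTS

def model():
    # toy stand-in model with two parameters
    numpyro.sample("theta", dist.MultivariateNormal(jnp.zeros(2), jnp.eye(2)))

# Option 1: let the warmup phase learn a dense mass matrix, then keep it fixed
kernel_adapt = NUTS(model, dense_mass=True, adapt_mass_matrix=True)

# Option 2: supply a pre-computed inverse mass matrix (e.g. a Hessian-based
# estimate of the parameter covariance); the identity here is a placeholder
kernel_fixed = NUTS(model, inverse_mass_matrix=jnp.eye(2),
                    adapt_mass_matrix=False)
```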
A1.2 Step size
In the leapfrog algorithm, we also have to specify the step size, |$\epsilon$|. If we assume that the posterior distribution of the parameters is roughly Gaussian, then the derivative of the potential energy is approximately |$\textsf {H}\tilde{\Omega }$|, where |$\tilde{\Omega }$| is the difference between any point in parameter space and the mean. Hence, we can write a single application of the leapfrog method as
|$\left[\tilde{\Omega }(t+\epsilon),\, \boldsymbol{q}(t+\epsilon)\right]^{\rm T} = \textsf {Q}\left[\tilde{\Omega }(t),\, \boldsymbol{q}(t)\right]^{\rm T}$|,
where
|$\textsf {Q} = \begin{pmatrix} \mathbb {1} - \frac{\epsilon ^{2}}{2}\textsf {M}^{-1}\textsf {H} & \epsilon \textsf {M}^{-1} \\ -\epsilon \textsf {H} + \frac{\epsilon ^{3}}{4}\textsf {H}\textsf {M}^{-1}\textsf {H} & \mathbb {1} - \frac{\epsilon ^{2}}{2}\textsf {H}\textsf {M}^{-1} \end{pmatrix}$|.
The leapfrog move will only be stable if the eigenvalues of the matrix |$\textsf {Q}$| have squared magnitudes of the order of unity. Writing the problem in terms of the eigenvalues |$h$| of |$\textsf {M}^{-1}\textsf {H}$|, the eigenvalues |$\lambda$| of |$\textsf {Q}$| satisfy the characteristic equation
|$\lambda ^{2} - \left(2 - \epsilon ^{2} h\right)\lambda + 1 = 0$|,
and from the above equation, we can remove the dependence of |$\epsilon$| on |$\textsf {M}$| by setting it to the Hessian matrix, that is, |$\textsf {M}=\textsf {H}$|. Note that one would maximally decorrelate the target distribution if the mass matrix is set to the Hessian matrix. This can be seen from the duality and volume preservation of the |$(\Omega , \boldsymbol{q})$| phase space. A transformation of the kinetic energy (the momentum variables) leads to an equivalent change in the potential energy (the position variables).
In 1D, if the mass matrix is set to unity, Neal (2011) argues that a step size of |$\epsilon < 2\sigma$|, where |$\sigma$| is the width of the distribution, leads to stable trajectories. In higher dimensions, assuming we have set the mass matrix to the Hessian matrix, solving the characteristic equation (equation A6) leads to
|$\lambda = 1 - \frac{\epsilon ^{2}}{2} \pm \epsilon \sqrt{\frac{\epsilon ^{2}}{4} - 1}$|.
When |$\epsilon < 2$|, this leads to stable trajectories. In practice, we would set |$0< \epsilon < 2$|. Alternatively, libraries such as numpyro have the option to adapt the step size during the warmup phase via a primal dual averaging (PDA) scheme (Bingham et al. 2019; Phan et al. 2019).
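This stability condition can be checked numerically. The short script below builds the single-step leapfrog matrix written above for a toy positive-definite Hessian, sets |$\textsf {M}=\textsf {H}$|, and confirms that the eigenvalue moduli remain at unity for |$\epsilon < 2$| and grow beyond it; it assumes the matrix form of |$\textsf {Q}$| given above.

```python
import numpy as np

d = 5
rng = np.random.default_rng(1)
A = rng.normal(size=(d, d))
H = A @ A.T + d * np.eye(d)   # toy positive-definite Hessian
M_inv = np.linalg.inv(H)      # mass matrix set equal to the Hessian

def leapfrog_matrix(eps):
    """Single leapfrog step written as a linear map on (position, momentum)."""
    I = np.eye(d)
    top = np.hstack([I - 0.5 * eps**2 * M_inv @ H, eps * M_inv])
    bottom = np.hstack([-eps * H + 0.25 * eps**3 * H @ M_inv @ H,
                        I - 0.5 * eps**2 * H @ M_inv])
    return np.vstack([top, bottom])

for eps in (0.5, 1.5, 1.99, 2.1):
    lam = np.linalg.eigvals(leapfrog_matrix(eps))
    print(f"eps = {eps:4.2f}, max |lambda| = {np.abs(lam).max():.3f}")
```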
A very small value of |$\epsilon$| would increase the number of steps (and therefore of posterior gradient evaluations), while a large value may lead to numerically unstable trajectories. A similar prescription for choosing the mass matrix and setting the step size is provided in the appendix of Taylor et al. (2008), whose main objective was to sample the CMB power spectrum.
A1.3 Number of leapfrog moves
Another parameter which we have to set is the number of leapfrog moves. There is no straightforward way to choose this parameter. We would prefer to avoid periodic trajectories that end close to the starting position. We would also like to minimize the number of gradient computations, in order to reach the pre-specified number of samples earlier. If a bad step size is chosen, the Hamiltonian grows with the number of leapfrog moves and the probability of acceptance decreases considerably.
A2 No-U-turn sampler
NUTS has been designed to circumvent the need to manually tune the step size, |$\epsilon$|, and the number of leapfrog moves, |$N_\mathrm{L}$|.
In order to adaptively adjust the step size during the warmup phase, NUTS uses a technique referred to as PDA. PDA addresses this challenge by tuning the step size based on the behaviour of the sampler: it maintains a running average of the difference between the target acceptance probability and the acceptance statistic observed along the trajectories, and uses this average to update the step size so that it remains well matched to the local curvature of the log-posterior. The PDA algorithm ensures that the step size adjusts smoothly and efficiently, allowing NUTS to explore the parameter space effectively without requiring manual tuning from the user. By continuously updating the step size in this way, NUTS with PDA can achieve faster convergence and higher sampling efficiency compared to fixed-step-size methods.
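For concreteness, a schematic version of a single dual-averaging update, following the scheme of Hoffman & Gelman (2011) with their suggested constants, is given below; this is an illustration rather than numpyro's exact implementation.

```python
import numpy as np

def dual_averaging_step(state, accept_prob, m, target=0.8,
                        gamma=0.05, t0=10.0, kappa=0.75):
    """One dual-averaging update of log(step size) after warmup iteration m.

    state = (h_bar, log_eps_bar, mu): running acceptance statistic, averaged
    log step size, and the shrinkage point mu (typically log of 10x the
    initial step size).
    """
    h_bar, log_eps_bar, mu = state
    eta = 1.0 / (m + t0)
    h_bar = (1.0 - eta) * h_bar + eta * (target - accept_prob)   # acceptance statistic
    log_eps = mu - np.sqrt(m) / gamma * h_bar                    # proposed log step size
    w = m ** (-kappa)
    log_eps_bar = w * log_eps + (1.0 - w) * log_eps_bar          # averaged (final) value
    return (h_bar, log_eps_bar, mu), np.exp(log_eps)
```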
As discussed in the previous section, selecting the number of leapfrog moves can be tricky and the challenge is to find a metric to determine if the trajectory is too short, too long or just right. NUTS dynamically selects the trajectory length during its sampling process based on the “No-U-Turn” criterion. When NUTS starts exploring the parameter space from a given starting point, it extends its trajectory by recursively building a binary tree of states. At each step, NUTS evaluates whether to continue extending the trajectory in a particular direction or to stop. This decision is guided by the no-U-turn criterion, which checks whether the sampler is doubling back on itself. The trajectory length is determined dynamically as NUTS extends the tree. If the sampler is still exploring promising regions of the posterior distribution, it continues to double the trajectory length. However, if the sampler starts to turn back, indicating that it has likely overshot a relevant part of the distribution, the trajectory is truncated. By dynamically adjusting the trajectory length based on the no-U-turn criterion, NUTS ensures that it explores the parameter space efficiently without needing to specify the number of steps in advance. This adaptive behaviour allows NUTS to effectively explore complex and high-dimensional distributions, leading to faster convergence and higher sampling efficiency compared to fixed-length trajectory methods. For an in-depth explanation of the NUTS algorithm, we refer the reader to Hoffman & Gelman (2011).
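The termination test itself is compact. In its original form it checks whether either end of the trajectory has begun to move back towards the other; a minimal sketch, following Hoffman & Gelman (2011), is:

```python
import jax.numpy as jnp

def u_turn(theta_minus, r_minus, theta_plus, r_plus):
    """True when the trajectory between the two end points starts to double back."""
    dtheta = theta_plus - theta_minus
    return (jnp.dot(dtheta, r_minus) < 0.0) | (jnp.dot(dtheta, r_plus) < 0.0)
```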
Author notes
LSST-DA Catalyst Fellow.