Abstract

At the onset of the COVID-19 pandemic, various non-pharmaceutical interventions aimed to reduce infection levels, leading to multiple phases of transmission. The disease reproduction number, Rt, quantifies transmissibility and is central to evaluating these interventions. This article discusses hierarchical stochastic epidemic models with piece-wise constant Rt, suitable for capturing distinct epidemic phases and estimating disease magnitude. The timing and scale of Rt changes are inferred from data, while the number of phases is allowed to vary. The model uses Poisson point processes and Dirichlet process components to learn the number of phases, providing insight into epidemic dynamics. We test the models on synthetic data and apply them to freely available data from the UK, Greece, California, and New York. We estimate the true number of infections and Rt and independently validate this approach via a large seroprevalence study. The results show that key disease characteristics can be derived from publicly available data without imposing strong assumptions.

1 Introduction

At the start of 2020 the emergence of COVID-19, an infectious disease caused by the virus SARS-CoV-2, has placed health systems around the globe under immense pressure. In March 2020, the World Health Organization declared COVID-19 as a global pandemic, and as of the end of September 2022, more than 6.5 million have died due to illness or complications of it. At the beginning of the pandemic in the absence of available vaccines or suitable medication, the majority of governments around the globe resorted to Non-Pharmaceutical-Interventions (NPIs) in an attempt to stop the exponential spreading of the virus and reduce transmissibility. Such NPIs involved measures like work-from-home policies, school, and university closures, stay-at-home guidance for people in high-risk groups and full lockdowns.

These measures had an effect on reducing the transmissibility and resulted in spreading trajectories that could not be properly described by the standard epidemic models due to the resulting multiphasic nature of transmission. The first systematic technique to assess these interventions was due to Flaxman et al. (2020) who proposed a renewal equation model whose infection dynamics were modelled through a multilevel framework incorporating NPIs. We amend this model by inferring the points in time at which the transmissibility changes as well as the magnitude of infectiousness in a data-driven manner. We determine the number of phases by using appropriate stochastic processes based upon variations of the Poisson process (PP) and Dirichlet process (DP)-based priors via their stick-breaking constructions (Miller & Harrison, 2018; Sethuraman, 1994).

Several models have been proposed in the literature for the estimation of multiphasic infectious diseases, particularly COVID-19. Briefly, a stochastic Susceptible-Exposed-Infectious-Removed (SEIR) model with a regression framework for the effect of the NPIs on transmissibility is used in Knock et al. (2021) while Birrell et al. (2021), Li et al. (2021), and Chatzilena et al. (2022) use stochastic SEIR models where the transmission mechanism is described by a system of non-linear ordinary differential equations and the transmission rate is modelled by a diffusion process. Modelling the transmission rate as a random walk facilitates gradual and smooth changes in time. A piecewise linear quantile trend model was proposed by Jiang et al. (2021), a kernel-based SIR model distinguishing the different phases of the transmissibility in space was developed by Geng et al. (2021) while Wistuba et al. (2022) incorporated splines to estimate the reproduction number in Germany.

Simpler forms of deterministic and stochastic multiphasic epidemic models have been considered before. In the context of modelling SARS-CoV-2 transmission, (Flaxman et al., 2020) used an approach with a fixed number, date, and scale of the Rt change. Related work based upon variations of DP mixtures is presented in Hu and Geng (2021) and Creswell et al. (2023). In the former, the authors used a mixture of finite mixtures (MFM) model on a Susceptible-Infected-Recovered-Susceptible model, while in the latter the authors used a suitably modified Pitman-Yor process but only fitted it to the reported cases, thus dispensing with the effort to estimate the complete epidemic burden and the suitable adjustment for the reproduction number. The main advantage of this class of methods, based on piece-wise constant transmission rates, is the intuitive characterization of the epidemic in terms of multiple phases of transmissibility. The number and magnitude of the distinct phases are determined purely by data without forced assumptions on the effect of policy changes and NPIs. This approach should be central to a retrospective assessment of the NPIs: an evidence-based method for estimating the timing and effect of those interventions, minimizing the risk of introducing several types of bias.

The article is organized as follows. In Section 2, we define the proposed compartmental process, elucidate its equivalence with renewal process-based models and describe the observation regimes of the data. In Section 3, we complete the model definition by characterizing the complexity regimes. Section 4 assesses the proposed models via simulation experiments while Section 5 contains the application to data from California and New York state, the UK and Greece. The article concludes with a discussion.

2 Modelling disease transmission

2.1 Model definition and related characterizations

The methodology for modelling the time-varying disease transmissibility has been implemented under two distinct but equivalent models, the compartmental Susceptible-Infectious-Removed (SIR) model and the seemingly simpler time-since-infection model with population susceptibility reduction. Here, we define both models and delineate their equivalence.

The model assumes that the population has size n, is closed (demographic changes during the course of the epidemic are ignored) homogeneous and homogeneously mixing. In the stochastic SIR model, an infected individual makes contact with any other individual on day t at the points of a PP with time-varying intensity λtn. This scaling is commonly adopted as it makes the contact process independent of the size of the population (e.g. Andersson & Britton, 2000). If these (close) contacts of an infected individual occur with a susceptible individual they result in an infection. Each individual remains infectious for a random time period Y. All PPs in this construction are assumed to be independent. The disease reproduction number is defined as Rt=λt*E[Y], t=1,,T, where T is the time horizon of the study.

The infection rate λt will generally be assumed piece-wise constant in this article and further details are given below. For this model, let ct denote the number of new infections on day t. The expected number of new infections on day t+1 is given by:

(1)

where St represents the number of individuals that remain susceptible:

(2)

and It denotes the active set of infectives:

(3)

and I{Yj>ts} the indicator function of the event that the individual j, infected on day s, remains infectious on day t. This event is implicitly determined by the disease characteristics. Then, (1) can be rewritten as

(4)

since I{Yj>ts}P(Y>ts) and gs(t)=P(Y>ts)E[Y] denotes the generation interval which characterises the time from the infection of an individual until they generate their first infection, see for example (Champredon et al., 2018; Svensson, 2007, 2015). Such approximation works reasonably well in homogeneous populations but there are notable exceptions, including on spatial epidemics, see Mollison (1991) and Durrett and Levin (1994) for extensive discussion and counter examples.

Note that equation (4) is used in the commonly adopted technique of Cori et al. (2013) for estimating the instantaneous reproduction number and this approach gives a different justification to the derivation of the Stn correction factor in Bhatt et al. (2023). The Stn term accounts for the depletion of the susceptible population and is sometimes ignored when the aim of modelling disease transmissibility is somewhat different. We will use equation (4) for the subsequent analyses as it is computationally more efficient compared to the SIR structure. One should also consider potential ‘superspreading’ events when certain individuals infected unusually large numbers of secondary cases (Lipsitch et al., 2003; Shen et al., 2004). We account for this variability assuming that the individual reproduction number is gamma distributed with mean Rt and dispersion parameter kC, yielding ctNegativeBinomial(μct,kC) (Lloyd-Smith et al., 2005), where μct is equal to E[ct+1] of equation (4).

The reproduction number Rt is of great practical interest as it is used to assess if the epidemic is growing or shrinking. Here, we consider two distinct instances of reproduction number. The effective reproduction number Re(t)=St*Rt describes the expected number of secondary cases generated by an infectious individual. Then, Re(t)>1 and Re(t)<1 indicate that the epidemic is growing or shrinking, respectively and reducing Re(t) below unity is the typical target of public health authorities. In contrast, Rt quantifies contacts that may not always result in new infections, due to mixing with the immune proportion of the population. Therefore, Rt>1 does not necessarily mean that the epidemic is growing. A detailed discussion about reproduction numbers can be found in Pellis et al. (2022).

2.2 Observation regimes

We consider two distinct observation regimes, one where the observed number of cases corresponds to the total number of infections and one where the total number of infections is indirectly estimated, outlined below.

2.2.1 Observed infections

The regime where the total number of infections is observed may be of interest in its own right but may also be used for certain transmissible diseases, for example in the analysis of influenza-like-illness data when seroprevalence study information is available. Epidemic models are attractive for analysing such data and are naturally defined in terms of infector–infectee pair and the timing of such events. In reality, however, this type of data is rarely available. Disease monitoring is based on the daily reported infections, which are known to be susceptible to multiple problems, including a time lag between the timing of infection and symptom onset or testing positive.

In the case of COVID-19, a large proportion of the population experiences asymptomatic or mild disease (Ward et al., 2021) leading to severe under-reporting. Inference about the reproduction number can be robust when the reported cases are used if depletion of the susceptible population is accounted for, or if the observed proportion of cases remains constant over time. One way to validate this assumption is by sequentially performing seroprevalence studies to estimate the true disease prevalence and the proportion of unreported incidences. However, such information was not available in most countries. In the following subsection, we describe an alternative approach that dispenses with the need for this assumption.

2.2.2 Unobserved cases

The case where infections may not be directly observed has been studied in a different context by Demiris et al. (2014). In the case of the pandemic, it became immediately apparent that the observed number of infections only partially accounts for the complete epidemic burden. An alternative technique was proposed by Flaxman et al. (2020) where the true cases were estimated by back-calculating infections from the daily reported deaths which are likely less prone to under-reporting. This method has the additional advantage of yielding an estimate of St. We adopt this approach by introducing another hierarchical level into our model and the daily deaths are linked with the true cases via:

(5)

where dt and kD are the mean and dispersion parameters of the negative binomial distribution, respectively. IFR is the infection fatality ratio and π is the distribution of the time from infection to death, with π(i) denoting the probability an infected individual will die i days after becoming infected. Accurate estimates of the IFR and π are necessary for estimating incidence, which is treated here as a latent parameter. The IFR and π parameters may be calculated independently from external data or in a single stage, leveraging additional evidence from seroprevalence studies.

3 Epidemic complexity determination

The number of phases may be treated as a fixed but unknown integer or as a random quantity to be modelled and estimated from data. We describe two such frameworks in the following two subsections.

3.1 Deterministic number of phases

In Flaxman et al. (2020), the number of phases was a priori selected and the times when the reproduction number Rt changed were also predefined. The dates of these points were informed by the NPIs implemented by each government leading to a piece-wise constant reproduction number Rt, effectively assuming the immediate effect of those NPIs. We study the limitations of this approach in the case when someone makes inaccurate assumptions regarding the start or the end of an epidemic phase or indeed their number. We do this using the simulated data of the Section 4 and present the results in the Supplementary material. We follow Flaxman et al. (2020) considering Rt as a piece-wise constant function and infer the point in time and magnitude of Rt changes directly from the data. The number, K, of epidemic phases is investigated using models with different K values and the best model is selected using the Watanabe–Akaike information criterion (WAIC) (Watanabe, 2013) and approximate leave-one-out cross-validation (LOO) (Vehtari et al., 2017). Note that the time-ordering of the data suggests that k-step ahead forecasting would be a better predictive model selection process. However, model complexity renders such repeated evaluations practically infeasible and therefore we proceed with the commonly used approximation via information criteria. The model is defined as follows:

(6)

where f() is any suitable positive defined probability distribution. We initially search for the first point in time, T1, when the epidemic changes phase for the first time. This search spans the entire duration of our study. To avoid identifiability issues with timepoints Ti, each new point is determined by adding a positive quantity ei to the previous one.

3.2 Stochastic number of phases

Under the Bayesian paradigm, a natural but not trivial way is to treat the number of epidemic phases K as a parameter and learn its posterior distribution. The ‘reversible jump’ algorithm (e.g. Richardson & Green, 1997) could be used to explore the joint space of K and within-K models. Here we adopt a different approach and model K as a characteristic of two stochastic models, the PP and variations of the DP (Ferguson, 1973). For both processes, we use the stick-breaking representation, see Miller and Harrison (2018) and Sethuraman (1994) for the PP and DP respectively, facilitating inference for K. The directed acyclic graph (Figure 1) represents the general structure of our modelling framework.

Directed acyclic graph of the model. Ellipses denote parameters to be learned by the model. The number of phases K is estimated by the DP/PP model or via model selection criteria.
Figure 1.

Directed acyclic graph of the model. Ellipses denote parameters to be learned by the model. The number of phases K is estimated by the DP/PP model or via model selection criteria.

Estimating the number of phases of the epidemic and the associated date and magnitude of the Rt changes can lead to identifiability problems for Rt and its generative quantities, notably the total number of infections. In order to overcome such issues, we explore a multi-stage modelling procedure (e.g. Bhatt et al., 2023) for the stochastic number of phases models under the unobserved cases paradigm. At the first stage, the latent disease cases are estimated using a Gaussian Process (GP) model or a GP mixture with Student-t marginals (Heyde & Leonenko, 2005; Shah et al., 2014) and then the smoothed medians of these latent cases are treated as data with the likelihood given in (4). The GP model was used as our baseline since both models performed similarly based on WAIC. More details for the estimation of cases can be found in the supplementary material.

3.2.1 Poisson point process-based model

We consider that the arrival of new phases in the time horizon (0,T] is driven by a time-homogeneous PP with rate λ, with K growing linearly with time. Hence, following the first epidemic phase, the number, K1, of new phases follows a Poisson distribution with rate λ*T while the duration of each phase a priori follows an Exponential distribution with rate λ. We follow Miller and Harrison (2018) and use the representation:

(7)

truncating K at Kmax=100, far higher than data-supported estimates. We use an informative prior on λ with a mean of 0.02, so that the mean of the Poisson distribution for the number of phases λT is around 5 for 250 days. This is a broad estimate for the prior mean, following a visual inspection of the simulated daily deaths.

3.2.2 DP-based model

An alternative model for the number of phases is based on the DP and its stick-breaking construction. This is based on a stick of unit length, where one samples iteratively the vi from a Beta(1,θ) distribution as a portion vi of the remaining stick, deriving the weights of the DP:

(8)

where L is the truncation point of the DP, set here to 36. Here, K is increasing with the scaling parameter θ. While inherently very different, the PP and DP models share a similar intuition in their construction. The goal is to construct weights that represent the proportion of each epidemic phase over the complete duration of the study, T. These weights are then used in a Categorical distribution that assigns each daily observation to a specific phase.

4 Synthetic data experiments

Simultaneously learning the parameters and the dimension of a model is typically a challenging statistical task. We assess the performance of our methods by simulating epidemics of various characteristics for 250 days. The epidemic model defined in Section 2 was used for simulating daily infections and deaths. The population size was set at 108 with IFR=2%. The discretized infectious period and the infection-to-death interval are described in the supplementary material. The epidemic was simulated with 5 distinct increasing/decreasing phases resembling the observed COVID-19 outbreaks after the introduction of multiple NPIs. The time-varying reproduction number was set as follows:

Using the model in (6) and the daily deaths as data (unobserved infections regime) the lowest WAIC and LOO selected six phases (Table 1). Models with varying (5, 6, and 7) number of phases incorrectly identified the first 10 days of the simulation as a distinct phase (Figure 2). This can be attributed to the lack of information at the start, where we essentially only draw information from our prior distribution, a common issue in epidemic models. Following this period the model with six phases correctly identifies the different epidemic phases, including their timing and magnitude of change. The total daily infections (Figure 2) are also accurately recovered. Inference was initiated on the day that 10 cumulative deaths were observed. Plots for the other models may be found in the supplementary material.

Simulation and estimates based on observing deaths. (a) Simulated (triangles) and estimated daily infections with 95% credible intervals (line) and (b) Real (solid line) and estimated reproduction number Rt with 95% credible intervals (dashed line).
Figure 2.

Simulation and estimates based on observing deaths. (a) Simulated (triangles) and estimated daily infections with 95% credible intervals (line) and (b) Real (solid line) and estimated reproduction number Rt with 95% credible intervals (dashed line).

Table 1.

Fixed number of phases—simulated dataset

Number of phasesWAICLOO
Five phases3,635.03,635.1
Six phases2,252.92,253.4
Seven phases2,260.62,261.8
Number of phasesWAICLOO
Five phases3,635.03,635.1
Six phases2,252.92,253.4
Seven phases2,260.62,261.8
Table 1.

Fixed number of phases—simulated dataset

Number of phasesWAICLOO
Five phases3,635.03,635.1
Six phases2,252.92,253.4
Seven phases2,260.62,261.8
Number of phasesWAICLOO
Five phases3,635.03,635.1
Six phases2,252.92,253.4
Seven phases2,260.62,261.8

In addition to the findings that the models correctly select the right complexity, it is interesting to summarize the model behaviour when investigating model misspecification. Broadly, these findings may be summarized as follows; when we fix the number of phases to be smaller than the true one then the model is correctly recovering the early phases while it is averaging the final ones leading to poorly fitted models. In contrast, when fixing K to be larger than the true one, we recover the true patterns and get a good fit. Hence, overestimating the number of phases does not materially affect the recovery of the true signal. A list of detailed results is outlined in the supplement.

When fitting the models with a stochastic number of phases using as data the daily infections (observed infections regime), both the PP and DP models precisely estimate the number of epidemic phases, the time of change and the true Rt value (Figure 3). The model was run for 1,00,000 iterations and 8 chains. The analysis based on observing deaths is included in the supplementary material (unobserved infections regime—multi-stage approach). Briefly, the intermediate phases of the epidemic are well estimated while the first and final phases are recovered with noise. The level of smoothing of the noisy estimated cases affects the estimation of the reproduction number; the smoother the estimation of cases the smoother and with fewer phases the reproduction number is recovered.

True (solid line) and estimated reproduction number Rt with 95% credible intervals (dashed line) based on observing infections. (a) Dirichlet process model and (b) Poisson process model.
Figure 3.

True (solid line) and estimated reproduction number Rt with 95% credible intervals (dashed line) based on observing infections. (a) Dirichlet process model and (b) Poisson process model.

We utilized readily available software such as Rstan and Nimble within the R statistical programming language. In Rstan, we used the No-U-Turn Sampler (NUTS) algorithm for the model in (6), while in Nimble, we used a combination of the Random Walk Metropolis–Hastings sampler and a categorical sampler for the DP and PP models. The code replicating all the experiments may be found in https://github.com/pbarmpounakis/Multiphasic-stochastic-epidemic-models.

5 Real-data application

5.1 Data description and prepocessing

The models were fitted to daily reported deaths (unobserved cases regime) from two US states, California and New York, and two European countries, the UK and Greece. The data are accessible from John Hopkins University and ECDC and the time horizon ran to the end of June 2021 when many NPIs were lifted. Due to a lack of data availability, the model does not account for reinfections. The age-standardized IFR for each country was informed by the meta-analysis from COVID-19 Forecasting Team (2022) accounting for time, geography, and population characteristics. We allowed the IFR to vary over time, accounting for the age structure of those infected, the burden of health systems and amendments in treating the disease. The infection-to-death time and generation interval were given a Gamma distribution with (mean, standard deviation) set to (19, 8.5) and (6.5, 4.4) days, respectively following Flaxman et al. (2020).

5.2 Analyses and results

California was one of the first US states to report cases on the 26th of January, 2020. A state of emergency was declared on March 4, 2020, and mass/social gatherings were banned while a mandatory statewide stay-at-home order was issued on 19 March 2020. We fitted the model to daily deaths (unobserved cases regime) and using WAIC/LOO selected seven phases. Figures 4 and 5 suggest that Re(t) was reduced after imposing restrictions and fell below the critical value of 1 after April 2020 when school closure was decided for the remainder of the 2019–2020 academic year. The epidemic remained under control until the summer of 2020 when Re(t) jumped slightly above 1 following a gradual relaxation of measures. On 31 August 2020, a new set of measures called ‘Blueprint for a Safer Economy’ was applied and all models show that they were effective, alongside the gained immunity of the population, at reducing the effective reproduction number below one and keeping the epidemic under control until the first half of October 2020. All models estimate a sharp increase in Re(t), which resulted in an increase in the daily reported cases and deaths between November 2020 and January 2021. Nighttime curfew and regional stay-at-home orders were announced at the start of December 2020 whence Re(t) remained stable and began declining. The initiation of the vaccination program in early 2021 brought the epidemic under control with Re(t) remaining below 1.

Estimation of effective reproduction number Re(t) with 95% credible intervals (solid and dashed lines) based on observing deaths, fixed number of phases model. (a) California state. (b) New York state. (c) The UK and (d) Greece.
Figure 4.

Estimation of effective reproduction number Re(t) with 95% credible intervals (solid and dashed lines) based on observing deaths, fixed number of phases model. (a) California state. (b) New York state. (c) The UK and (d) Greece.

Estimation of effective reproduction number Re(t) with 95% credible intervals (solid and dashed lines) based on observing deaths, multi-stage approach. (a) California state—DP model. (b) California state—PP model. (c) New York state—DP model and (d) New York state—PP.
Figure 5.

Estimation of effective reproduction number Re(t) with 95% credible intervals (solid and dashed lines) based on observing deaths, multi-stage approach. (a) California state—DP model. (b) California state—PP model. (c) New York state—DP model and (d) New York state—PP.

New York state had, by 10 April 2020, more confirmed cases than any country outside the US and was heavily affected at the start of the pandemic, with daily recorded deaths reaching a thousand in April. On 15 March, all New York City schools were closed and on 20 March state-wide stay-at-home order was declared. As a result, the models show a drop of Re(t) below 1 from mid-March 2020 until August 2020 (Figures 4 and 5). The best-performing models based on WAIC and LOO had eight distinct phases. This model estimates that after the summer of 2020, Re(t) remained above 1 up until the start of 2021 with a small increase during November and the holiday season. The DP and PP models (unobserved cases regime multi-stage approach) show similar estimates for Re(t) (Figure 5).

For the UK, a model with nine phases was selected by WAIC and LOO. Until early March 2020, when a lockdown was imposed we estimate that Rt3.5 (Figure 4). These measures were lifted in early June and during the lockdown Re(t) remained below 1, and therefore under control. After the summer Re(t) increased above 1 and the so-called rule of six was imposed while on 5 November 2020, the second lockdown was announced. The number of reported deaths was reduced after the initiation of the vaccination program on 4 January 2021. Virtually identical estimates for the UK Re(t) are inferred by the DP and PP models (figures and additional details in the supplementary material).

We conducted an independent (or ‘external’) validation of the model performance based upon REACT-2, an antibody prevalence study conducted in the UK with the participation of more than 100,000 adults (Ward et al., 2021). This appeared like an optimal choice as it took place in early July 2020 when waning immunity was relatively unlikely. As such, it provides a reasonable estimate of the total disease burden up to that time. The estimated prevalence for the adult population (children were excluded) was 6.0% (95% CI: 5.8, 6.1) and our estimate for the whole population is 7.5% (95% credible intervals (Cr.I.): 5.7, 10.) (Figure 6) well compatible with that independent estimate.

Cumulative sum of estimated daily infections with 95% credible intervals (dashed lines) and the estimation of REACT-2 with 95% confidence intervals (solid lines) for the UK.
Figure 6.

Cumulative sum of estimated daily infections with 95% credible intervals (dashed lines) and the estimation of REACT-2 with 95% confidence intervals (solid lines) for the UK.

For Greece WAIC and LOO selected the eight-phases model. At the starting phase, we estimate Re(t)=3.36  (sd=0.88) and a decrease below 1 in the first half of March 2020 (Figure 4). On 10 March, the government suspended most activities, including educational, shopping, and recreational while a week later all nonessential movement was restricted. The Re(t) estimate remained below 1 until early June 2020 when it increased following the lifting of restrictions. During summer Re(t) remained over 1 until November 2020 since a case spike in October led to new measures. Similar estimates for the Re(t) are obtained by the DP and PP models (Supplementary material).

The computation time was similar for the PP and DP models with the DP being faster. More importantly, we get valuable insights into the effectiveness of the measures imposed by the governments. For New York and the UK, it appears that the school closures and the strict lockdowns predate the reductions in transmissibility, and we can assume that they were pivotal in achieving that. California and Greece adopted the measures before a large first wave, like other EU countries and US states. All regions were similar when these two measures were relaxed: multiple epidemic waves emerged and the estimated Re(t) remained above 1. These findings closely align with similar results documented in the existing literature, where authors found that the national lockdowns consistently helped with the spread of the disease (Flaxman et al., 2020; Knock et al., 2021).

The results of our simulation experiments corroborate the findings of the application to real data from different areas. The time-ordering of the data facilitates avoiding label-switching problems typically encountered when fitting mixture-type models. By selecting the number of phases we capture mortality changes in all the real-world examples (Figure 7). The DP and PP models can infer a higher number of phases but the conclusions are not materially affected. This observation is in line with Rousseau and Mengersen (2011) who show a generally stable behaviour of such so-called overfitted mixture models, theoretically verifying the robust behaviour of the developed models.

Reported (triangles) and estimated deaths with 95% credible intervals (solid and dashed lines) based on observing deaths, fixed number of phases model. (a) California state. (b) New York state. (c) The UK and (d) Greece.
Figure 7.

Reported (triangles) and estimated deaths with 95% credible intervals (solid and dashed lines) based on observing deaths, fixed number of phases model. (a) California state. (b) New York state. (c) The UK and (d) Greece.

6 Discussion

In this article, we propose three models for the transmission mechanism of infectious diseases with multiple epidemic phases. We use freely available data to estimate the points in time when transmissibility changes and the realized magnitude of the NPI effects. We focus on publicly available data, as data availability and usability were raised as concerns during the acute COVID-19 pandemic phase. Several potentially useful datasets, including those regarding the pressure on a country’s health system, such as the number of patients in ICU beds, may not be generally available.

A number of interventions were applied in overlapping time intervals, and identifiability issues can arise when trying to disentangle individual effects and the associated time lags. In particular, it is generally hard to estimate the timing of the effect and the possible time lag between cause and effect. In this work, we retrospectively assess the effect of the NPIs by comparing changes in the reproduction number with the dates that these measures were imposed. Such a purely evidence-based approach represents an alternative to forced assumptions that aim for (potentially unwarranted) causal statements about intervention effects. Our code is made freely available, and the reliance on publicly available data facilitates reproducibility and potential uptake by other researchers.

Selecting the number of phases requires multiple runs and the computation time can be an issue when nowcasting is essential for decision-making. Estimating the number of phases via the DP and PP models represents an alternative approach that is computationally efficient and statistically robust. The DP and PP models may overestimate the number of epidemic phases and this issue is discussed in detail in Rousseau and Mengersen (2011), see also Miller and Harrison (2013). In our setting, this effect essentially relates to the start and end of the epidemic and the inherent challenges of limited information. At the start of the epidemic, such uncertainty dictates that estimates should be interpreted with caution. In the end, this is less of an issue and is mostly due to the time lag between cases and deaths. When one is working with the observed infections these issues largely disappear and inference is generally accurate throughout the duration of the data as indicated by our simulation experiments.

In times of crisis, there is a demand for rapid predictions that facilitate decision support and one may employ the multi-stage approach of the DP or PP models to gauge the evolution of the transmissibility. After obtaining an initial estimate of the reproduction number from these models, a more focused exploration of the number of phases using the deterministic number of phases model could be conducted. This targeted investigation aims to minimize the need for extensive model fitting by narrowing down the range of potential values for the number of epidemic phases, K.

It is not apparent how our models can be optimally used for predicting new epidemic phases. Now-casting or short-term future predictions are typically performed under the assumption that the main conditions, including transmissibility, remain the same, or by simulating particular scenarios. In this article, we focus on the retrospective assessment of the imposed NPIs and the inferred quantities may be used in feeding forward simulations in a similar manner.

The models developed in this work assume a homogeneous and homogeneously mixed population similar to many studies on SARS-CoV-2 transmission. This assumption may be appropriate for large populations such as working at the state or country level since functional central limit theorems can reasonably be thought of as applicable (e.g. Andersson & Britton, 2000). Incorporating additional information or structure from data sources such as hospitalizations, seroprevalence studies, or vaccination programs is desirable in principle but may also pose challenges. The inclusion of data sources such as those discussed in Knock et al. (2021) would necessitate model expansion. Our models can naturally be extended when more detailed information is available and this is the subject of current research.

Acknowledgments

The authors are grateful to the two referees whose comments substantially improved the content and presentation of this work. We also wish to thank Kostas Kalogeropoulos and Petros Dellaportas for useful comments on an earlier version of this article and the editor and associate editor for a number of useful suggestions.

Funding

This article is part of the first author’s doctoral thesis, co-financed by Greece and the European Union (European Social Fund-ESF) through the Operational Programme ‘Human Resources Development, Education and Lifelong Learning’ in the context of the Act ‘Enhancing Human Resources Research Potential by undertaking a Doctoral Research’ Sub-action 2: State Scholarships Foundation (IKY, Greece) Scholarship Programme for PhD candidates in the Greek Universities.

Data availability

All the data used in the synthetic data experiments resulted from simulation studies, which were carefully designed to replicate realistic conditions and are detailed within the methodology to provide transparency and allow for potential replication by others. The data for the real-data application were derived from publicly available sources, no proprietary or restricted data were used. All real-world data are collectively available at COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University at GitHub.

Author contributions

P.B. and N.D. conceived the experiment(s); P.B. conducted the experiment(s); P.B. and N.D. analysed the results; P.B. and N.D. wrote and reviewed the manuscript.

Supplementary material

Supplementary material is available online at Journal of the Royal Statistical Society: Series C.

References

Andersson
 
H.
, &
Britton
 
T.
(
2000
).
Stochastic epidemic models and their statistical analysis
.
Lecture notes in statistics
(2000 ed.).
Springer
.

Bhatt
 
S.
,
Ferguson
 
N.
,
Flaxman
 
S.
,
Gandy
 
A.
,
Mishra
 
S.
, &
Scott
 
J. A.
(
2023
).
Semi-mechanistic Bayesian modelling of COVID-19 with renewal processes
.
Journal of the Royal Statistical Society Series A: Statistics in Society
,
186
(
4
),
601
615
.

Birrell
 
P.
,
Blake
 
J.
,
van Leeuwen
 
E.
,
Gent
 
N.
, &
De Angelis
 
D.
(
2021
).
Real-time nowcasting and forecasting of covid-19 dynamics in England: The first wave
.
Philosophical Transactions of the Royal Society B: Biological Sciences
,
376
(
1829
), Article 20200279.

Champredon
 
D.
,
Dushoff
 
J.
, &
Earn
 
D. J. D.
(
2018
).
Equivalence of the Erlang-distributed Seir epidemic model and the renewal equation
.
SIAM Journal on Applied Mathematics
,
78
(
6
),
3258
3278
.

Chatzilena
 
A.
,
Demiris
 
N.
, &
Kalogeropoulos
 
K.
(
2022
).
A modelling framework for the analysis of the transmission of SARS-CoV2
.

Cori
 
A.
,
Ferguson
 
N. M.
,
Fraser
 
C.
, &
Cauchemez
 
S.
(
2013
).
A new framework and software to estimate time-varying reproduction numbers during epidemics
.
American Journal of Epidemiology
,
178
(
9
),
1505
1512
.

COVID-19 Forecasting Team
(
2022
).
Variation in the COVID-19 infection-fatality ratio by age, time, and geography during the pre-vaccine era: A systematic analysis
.
Lancet
,
399
(
10334
),
1469
1488
.

Creswell
 
R.
,
Robinson
 
M.
,
Gavaghan
 
D.
,
Parag
 
K. V.
,
Lei
 
C. L.
, &
Lambert
 
B.
(
2023
).
A Bayesian nonparametric method for detecting rapid changes in disease transmission
.
Journal of Theoretical Biology
,
558
, Article 111351.

Demiris
 
N.
,
Kypraios
 
T.
, &
Smith
 
L. V.
(
2014
).
On the epidemic of financial crises
.
Journal of the Royal Statistical Society. Series A (Statistics in Society)
,
177
(
3
),
697
723
.

Durrett
 
R.
, &
Levin
 
S.
(
1994
).
The importance of being discrete (and spatial)
.
Theoretical Population Biology
,
46
(
3
),
363
394
.

Ferguson
 
T. S.
(
1973
).
A Bayesian analysis of some nonparametric problems
.
Annals of Statistics
,
1
(
2
),
209
230
.

Flaxman
 
S.
,
Mishra
 
S.
,
Gandy
 
A.
,
Unwin
 
H. J. T.
,
Mellan
 
T. A.
,
Coupland
 
H.
,
Whittaker
 
C.
,
Zhu
 
H.
,
Berah
 
T.
,
Eaton
 
J. W.
,
Monod
 
M.
,
Perez-Guzman
 
P. N.
,
Schmit
 
N.
,
Cilloni
 
L.
,
Ainslie
 
K. E. C.
,
Baguelin
 
M.
,
Boonyasiri
 
A.
,
Boyd
 
O.
,
Cattarino
 
L.
, …
Team
 
I. C. C.-. R.
(
2020
).
Estimating the effects of non-pharmaceutical interventions on covid-19 in Europe
.
Nature
,
584
(
7820
),
257
261
.

Geng
 
X.
,
Katul
 
G. G.
,
Gerges
 
F.
,
Bou-Zeid
 
E.
,
Nassif
 
H.
, &
Boufadel
 
M. C.
(
2021
).
A kernel-modulated sir model for covid-19 contagious spread from county to continent
.
Proceedings of the National Academy of Sciences of the United States of America
,
118
(
21
),
e2023321118
.

Heyde
 
C. C.
, &
Leonenko
 
N. N.
(
2005
).
Student processes
.
Advances in Applied Probability
,
37
(
2
),
342
365
.

Hu
 
G.
, &
Geng
 
J.
(
2021
).
Heterogeneity learning for sirs model: An application to the covid-19
.
Statistics and Its Interface
,
14
(
1
),
73
81
.

Jiang
 
F.
,
Zhao
 
Z.
, &
Shao
 
X.
(
2021
).
Modelling the covid-19 infection trajectory: A piecewise linear quantile trend model
.
Journal of the Royal Statistical Society: Series B, Statistical Methodology
,
84
(
5
),
1589
1607
.

Knock
 
E. S.
,
Whittles
 
L. K.
,
Lees
 
J. A.
,
Perez-Guzman
 
P. N.
,
Verity
 
R.
,
FitzJohn
 
R. G.
,
Gaythorpe
 
K. A. M.
,
Imai
 
N.
,
Hinsley
 
W.
,
Okell
 
L. C.
,
Rosello
 
A.
,
Kantas
 
N.
,
Walters
 
C. E.
,
Bhatia
 
S.
,
Watson
 
O. J.
,
Whittaker
 
C.
,
Cattarino
 
L.
,
Boonyasiri
 
A.
,
Djaafara
 
B. A.
, …
Baguelin
 
M.
(
2021
).
Key epidemiological drivers and impact of interventions in the 2020 SARS-CoV-2 epidemic in England
.
Science Translational Medicine
,
13
(
602
),
eabg4262
.

Li
 
Y. I.
,
Turk
 
G.
,
Rohrbach
 
P. B.
,
Pietzonka
 
P.
,
Kappler
 
J.
,
Singh
 
R.
,
Dolezal
 
J.
,
Ekeh
 
T.
,
Kikuchi
 
L.
,
Peterson
 
J.
,
Bolitho
 
A.
,
Kobayashi
 
H.
,
Cates
 
M. E.
,
Adhikari
 
R.
, &
Jack
 
R. L.
(
2021
).
Efficient Bayesian inference of fully stochastic epidemiological models with applications to COVID-19
.
Royal Society Open Science
,
8
(
8
), Article 211065.

Lipsitch
 
M.
,
Cohen
 
T.
,
Cooper
 
B.
,
Robins
 
J. M.
,
Ma
 
S.
,
James
 
L.
,
Gopalakrishna
 
G.
,
Chew
 
S. K.
,
Tan
 
C. C.
,
Samore
 
M. H.
,
Fisman
 
D.
, &
Murray
 
M.
(
2003
).
Transmission dynamics and control of severe acute respiratory syndrome
.
Science
,
300
(
5627
),
1966
1970
.

Lloyd-Smith
 
J. O.
,
Schreiber
 
S. J.
,
Kopp
 
P. E.
, &
Getz
 
W. M.
(
2005
).
Superspreading and the effect of individual variation on disease emergence
.
Nature
,
438
(
7066
),
355
359
.

Miller
 
J. W.
, &
Harrison
 
M. T.
(
2013
).
Inconsistency of Pitman-Yor process mixtures for the number of components
.

Miller
 
J. W.
, &
Harrison
 
M. T.
(
2018
).
Mixture models with a prior on the number of components
.
Journal of the American Statistical Association
,
113
(
521
),
340
356
.

Mollison
 
D.
(
1991
).
Dependence of epidemic and population velocities on basic parameters
.
Mathematical Biosciences
,
107
(
2
),
255
287
.

Pellis
 
L.
,
Birrell
 
P. J.
,
Blake
 
J.
,
Overton
 
C. E.
,
Scarabel
 
F.
,
Stage
 
H. B.
,
Brooks-Pollock
 
E.
,
Danon
 
L.
,
Hall
 
I.
,
House
 
T. A.
,
Keeling
 
M. J.
,
Read
 
J. M.
,
JUNIPER Consortium
, &
De Angelis
 
D
. (
2022
).
Estimation of reproduction numbers in real time: Conceptual and statistical challenges
.
Journal of the Royal Statistical Society: Series A (Statistics in Society)
,
185
(
S1
),
S112
S130
.

Richardson
 
S.
, &
Green
 
P. J.
(
1997
).
On Bayesian analysis of mixtures with an unknown number of components (with discussion)
.
Journal of the Royal Statistical Society: Series B, Statistical Methodology
,
59
(
4
),
731
792
.

Rousseau
 
J.
, &
Mengersen
 
K.
(
2011
).
Asymptotic behaviour of the posterior distribution in overfitted mixture models
.
Journal of the Royal Statistical Society: Series B, Statistical Methodology
,
73
(
5
),
689
710
.

Sethuraman
 
J.
(
1994
).
A constructive definition of Dirichlet priors
.
Statistica Sinica
,
4
(
2
),
639
650
.

Shah
 
A.
,
Wilson
 
A.
, &
Ghahramani
 
Z.
(
2014
).
Student-t processes as alternatives to Gaussian processes
.
Journal of Machine Learning Research
,
33
,
877
885
.

Shen
 
Z.
,
Ning
 
F.
,
Zhou
 
W.
,
He
 
X.
,
Lin
 
C.
,
Chin
 
D. P.
,
Zhu
 
Z.
, &
Schuchat
 
A.
(
2004
).
Superspreading SARS events, Beijing, 2003
.
Emerging Infectious Diseases
,
10
(
2
),
256
260
.

Svensson
 
Å.
(
2007
).
A note on generation times in epidemic models
.
Mathematical Biosciences
,
208
(
1
),
300
311
.

Svensson
 
Å.
(
2015
).
The influence of assumptions on generation time distributions in epidemic models
.
Mathematical Biosciences
,
270
,
81
89
.

Vehtari
 
A.
,
Gelman
 
A.
, &
Gabry
 
J.
(
2017
).
Practical Bayesian model evaluation using leave-one-out cross-validation and Waic
.
Statistics and Computing
,
27
(
5
),
1413
1432
.

Ward
 
H.
,
Atchison
 
C.
,
Whitaker
 
M.
,
Ainslie
 
K. E. C.
,
Elliott
 
J.
,
Okell
 
L.
,
Redd
 
R.
,
Ashby
 
D.
,
Donnelly
 
C. A.
,
Barclay
 
W.
,
Darzi
 
A.
,
Cooke
 
G.
,
Riley
 
S.
, &
Elliott
 
P.
(
2021
).
SARS-CoV-2 antibody prevalence in England following the first peak of the pandemic
.
Nature Communications
,
12
(
1
),
905
.

Watanabe
 
S.
(
2013
).
A widely applicable Bayesian information criterion
.
Journal of Machine Learning Research
,
14
(
1
),
867
897
.

Wistuba
 
T.
,
Mayr
 
A.
, &
Staerk
 
C.
(
2022
).
Estimating the course of the covid-19 pandemic in Germany via spline-based hierarchical modelling of death counts
.
Scientific Reports
,
12
(
1
),
9784
.

Author notes

Conflicts of interest: The authors declare that there is no conflict of interest.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data