Abstract

The transmission rate is a central parameter in mathematical models of infectious disease. Its pivotal role in outbreak dynamics makes estimating the current transmission rate and uncovering its dependence on relevant covariates a core challenge in epidemiological research as well as public health policy evaluation. Here, we develop a method for flexibly inferring a time-varying transmission rate parameter, modeled as a function of covariates and a smooth Gaussian process (GP). The transmission rate model is further embedded in a hierarchy to allow information borrowing across parallel streams of regional incidence data. Crucially, the method makes use of optional vaccination data as a first step toward modeling of endemic infectious diseases. Computational techniques borrowed from the Bayesian spatial analysis literature enable fast and reliable posterior computation. Simulation studies reveal that the method recovers true covariate effects at nominal coverage levels. We analyze data from the COVID-19 pandemic and validate forecast intervals on held-out data. User-friendly software is provided to enable practitioners to easily deploy the method in public health research.

1 Introduction

Infectious diseases present a threat to public health, and their potential impact is increasing with accelerating global travel and urbanization. Thus, it is important that we use each new disease outbreak as an opportunity to improve our preparedness for future pathogen outbreaks. One way to improve our capacity to fight future infectious disease outbreaks is to develop statistical methods which enable the use of surveillance data to (1) quantitatively estimate the relative impact of different public health interventions and (2) compute short-term, localized forecasts for the infectious disease burden. Methods satisfying these criteria would facilitate timely, targeted interventions by public health officials and would empower high-risk individuals to manage their exposure risk. We will refer to statistical methodologies developed with these aims as examples of epidemic regression analysis.

In this paper, we aim to develop a methodology tailored to the features and deficiencies exhibited by many of the large-scale epidemic surveillance data sources during the COVID-19 pandemic. We anticipate major sources of bias and uncertainty will be difficult to eliminate from rapid-response data collection efforts for novel and resurgent epidemics in the near future, especially in developing regions (Vasudevan et al., 2021), suggesting that ours and similarly motivated methodologies will remain relevant in making sense of such data sources. We illustrate use of this method through application to an assemblage of several contemporary data sources: the county-day resolution incidence data compiled by The New York Times (2021), the state-day resolution vaccination data compiled by Our World in Data (2022), and clinical data from several authors, as we elaborate in the Supporting information, Section C.1.

The conceptual starting point for many infectious disease models is the idea that the population can be partitioned into compartments (e.g., susceptible or infectious) and that new infections are proportional to both the susceptible and infectious populations (Hamer, 1906; Kermack & McKendrick, 1927). This line of reasoning was developed throughout the 20th century in deterministic and stochastic compartmental model formulations, as summarized by Hethcote (2000) and Andersson and Britton (2000), respectively. Efforts to improve forecasting of annual influenza disease burden have been a major driver of methodological innovation (e.g., Osthus & Moran, 2021 and references therein) though some of the associated methodology relies on recurring seasonal patterns. Other recent contributions to the literature can be organized as responding to two thematic challenges.

First, there is the challenge of devising a parsimonious model which is also plausible, particularly since we observe the sequential incidence data with subtle error processes (Tang et al., 2020). Adopting a state space structure allows us to flexibly and interpretably model a sampling density separately from the evolution of the population. Furthermore, we can prevent measurement error from contaminating infection dynamics, unlike observation-driven models (Cox et al., 1981). Well-known examples of state-space modeling include Shaman and Karspeck (2012) and Dukic et al. (2012). The compelling DBSSM model of Osthus et al. (2017) has been extended by Wang et al. (2020) and Zhou et al. (2020). However, many of these models have adopted unrealistic sampling models, assuming that surveillance data provide unbiased observations of the infectious compartment. For examples of more realistic sampling models, see Johndrow et al. (2020) or Quick et al. (2021).

Second, while the mathematical models of disease recapitulate certain realistic properties of epidemic transmission, they yield poor stand-ins for real-world transmission. For example, they typically anticipate a single wave of infection, whereas often multiple waves of infection are observed in practice (Patterson & Pyle, 1991; Orsted et al., 2013; Karako et al., 2021). The most intuitive solution to this problem is to allow transmission parameters to vary over time, often either the transmission rate or the basic reproduction number in particular.

Examples of flexible transmission rate modeling, along with attempts to associate changes with covariates, include Wallinga and Teunis (2004), Cori et al. (2013), and Thompson et al. (2019). These models suffer the drawback of placing no temporal correlation structure on the transmission rate, instead relying on windowed estimates to achieve smoothing behavior, and furthermore they are observation-driven. Hespanha et al. (2021) devised a forecast methodology for COVID-19 by modeling parameters with random walks.

In contrast, the “eSIR” model (Wang et al., 2020) is a state-space model based on the DBSSM which allows a time-varying transmission rate which can be associated with covariates. The eSIR authors published a software package of the same name, and their methodology has been deployed and extended in various contexts (Zhou et al., 2020; Ray et al., 2020); however, the time-varying transmission rate is assumed to have a known functional form as a simple combination of basis functions. Similar limitations are present in Johndrow et al. (2020) and others who model transmission rate as a piecewise function. Keller et al. (2022) developed a sophisticated methodology for estimating covariate effects with a smooth temporal deviation and a multiplicative process for state-evolution error.

Some large scientific collaborations succeeded in applying impressively complex models to effect estimation and forecasting tasks (e.g., Monod et al. 2021; IHME COVID-19 Forecasting Team, 2021), but these relied on ad hoc procedures for fitting and quantifying uncertainty. Moreover, there is limited software available to reproduce such analyses and the methods would be exceedingly difficult to port to other problems. The MERMAID methodology proposed in Quick et al. (2021) learns a smooth transmission rate as a function of covariates and spline terms but amplifies the likelihood contribution of seroprevalence data in order to achieve identifiability.

Recent and continued methodological advances appear to demand careful design of posterior computation algorithms. Johndrow et al. (2020) implemented an adaptive Metropolis algorithm (Haario et al., 2001) to facilitate rapid mixing of their iterative MCMC sampler. Zhou and Ji (2020) fitted a latent Gaussian process (GP) model using a Parallel Tempering MCMC algorithm (Woodard et al., 2009; Miasojedow et al., 2013), but did not incorporate a hierarchy to borrow information across parallel streams of incidence data, and reported challenges with covariate effect estimation. Keller et al. (2022) invoked HMC sampling algorithms implemented via Stan (Carpenter et al., 2017).

In this paper, we describe a novel model and computational algorithm for epidemic regression analysis. Our hierarchical, semi-parametric regression model is to our knowledge unique in the literature. We use latent GP priors to achieve smooth temporal random effects rather than the splines often used for that purpose. Moreover, several articles have discussed the potential value of hierarchical priors on covariate effects to enable borrowing of information across regions, but do not provide implementation. We achieve robust posterior computation for our model by taking advantage of techniques developed in the literature for Spatial Generalized Linear Mixed Models (SGLMMs) (Diggle et al., 1998). Our methodology is accessible to practitioners through an open-source, user-friendly R package smesir (2022), available on GitHub.

The remainder of the paper is organized as follows. In Section 2, we describe our data, notation, and model. We summarize the results of extensive simulation studies in Section 3 and highlight results from application to the COVID-19 epidemic in Section 4. We conclude with discussion of future work in Section 5. Additional results, sensitivity analyses, extended description of our methodology, and detailed description of our posterior computation algorithm are presented in the Supporting information.

2 Methods

2.1 Data and Notation

In this paper, we will be working with panel format incidence data, for which individual observations might be counts of new cases, new hospitalizations, or deaths that occur in a specific population during a given time interval. To avoid unnecessary loss of generality, we will refer to any such data as counts of incidence events or, simply, events.

We suppose that the data are observed for K non-overlapping regional populations over a sequence of T contiguous, equal-length time intervals. We then suppose the incidence data are arranged into a formula matrix formula of non-negative event counts. Additionally, we may optionally have P covariates of interest, arranged into a formula tensor formula of covariate values such that each of the formula covariate series is centered and scaled. If a vaccine is available, our model will take vaccinations into account via a formula matrix formula of non-negative counts of newly vaccinated individuals. We adopt the indexing convention that a contiguous slice of values formula may be abbreviated as formula. Bold notation will be used to indicate that indices have been suppressed for at least one dimension of a multidimensional variable, for example formula or formula. Finally, we will use bold formula to denote a vector of ones and formula as the indicator function.

We use an integer index over sequential observations, aligning indices, formula, to the specific times, formula, at which event counts were reported in our data. Epidemic event counts or covariate values reported at formula should reflect circumstances from the interval formula, and where no confusion is possible, we refer to these intervals by the index of their associated reporting time. Hence, formula and formula are referred to as observation time i and interval i, respectively. For sequential indices less than 1 or greater than T, we assume that the constant time-interval width is maintained. We define the regional outbreak times as the collection formula such that formula if the first case in region k was reported during interval i. Since the first reported cases should precede or coincide with reports of any other type of event, for each region k we assume formula for all formula.

Finally, we will adopt the convention of presenting Gaussian distributions (formula) in the mean-variance parameterization, half-Gaussian distributions (formula) in terms of the variance parameter of the corresponding unfolded Gaussian, gamma and inverse-gamma distributions (e.g., formula) in the shape-rate parameterization, exponential distributions (formula) in the rate parameterization, Poisson distributions (formula) in terms of their mean, and negative binomial distributions (formula) in terms of a mean and a dispersion parameter.

2.2 Review of Compartmental Epidemic Models and the Empirical Transmission Rate

A popular class of mathematical models for infectious disease is compartmental models. Abstractly, such models break a population into distinct compartments and specify rules by which portions of the population transition between compartments. A concrete example is the highly popular SIR model, named for its proposed division of the population into Susceptible, Infectious, and Recovered (immune) compartments (Kermack & McKendrick, 1927).

The transition rules of the original SIR model can be encoded as a system of discrete-time difference equations formula; formula; and formula; where β is the transmission rate, γ is the recovery rate, and t is an index over time. The variables formula, and formula represent the sizes of the compartments at index time t, measured as fractions of the population. Note the key term formula determines the portion of the population newly infected at time t in this model.

As discussed in Section 1, many fixed parameter compartmental models fail to anticipate the multiple waves of infection frequently observed in disease outbreaks. A simple way to induce the SIR model to admit these trajectories is by supposing the transmission intensity between susceptible and infectious populations can vary over time due to viral or host evolution, environmental, or behavioral factors. Following this line of reasoning, we replace the static parameter β with a dynamic empirical transmission rate parameter, formula. We usually refer to this variable simply as the transmission rate, but emphasize here that we are describing the realized rate at which a disease is being transmitted through the population, driven both by biology and behavior, and not the notion of an inherent “biological” transmission rate.

2.3 State-space Modeling and the Production Process

We call the proportional compartment populations (i.e., formula from the preceding discussion) the state of the population at time t. Unfortunately, measurements of the population state, such as seroprevalence studies to estimate formula, can be difficult to obtain and biologically unreliable (Muecksch et al., 2021). Instead we must often rely on incidence data to provide indirect information about the population state based on their relationship with state transitions, like the portion of the population newly infected during interval t.

In what follows, we will to refer to any probabilistic model for incidence data given recent infections as a production process. We let formula represent the number of individuals newly infected during period t. Note that while new infections appear to be modeled as a deterministic function of the population state, randomness will be injected into the infection series via stochastic evolution of the transmission rate. A production process can be written in general terms as formula, where formula is the incidence count data for a particular population at time t, formula is a family of probability distributions indexed by θ, and θ is a collection of relevant production parameters such as the infection fatality rate, testing rate, or a distribution of lag times from event detection to reporting. The types of quantities that must be included in θ will depend on whether the incidence data formula are cases, deaths, or some other type of incidence event.

To define the process formula, we follow the reasoning in Johndrow et al. (2020). Suppose at first that formula is Poisson, and the mean is a weighted linear function of recent infection counts, so formula where formula is a weighted sum over recent new infections. Specifically, formula such that formula is the probability that an infected individual will contribute to the reported incidence event count m intervals after the interval during which they were infected. We assume that there is a maximum production lag, M, so that formula for formula, and that formula.

The production parameters θ and corresponding weights formula will not be known exactly, nor do we hope to estimate them within our methodology. Therefore, ultimately we will not employ a model for formula but rather formula, where formula is an estimate of the production weights which we obtain from clinical or administrative data, formula, prior to our analysis. We can obtain the latter distribution by marginalizing formula over conditional uncertainty formula, leading to an overdispersed count distribution for formula. Specifically, it is convenient to model our uncertainty by supposing that formula, where formula is gamma distributed with mean 1 and variance formula. Then, we find the simple expression for the observation distribution formula. Example specifications of formula based on clinical data are given in the Supporting information and applied in Section 4.

In our model, the dispersion parameter is typically assumed to be a fixed value, formula for all t, and δ will be learned from the data. However, epidemic incidence data might feature “corrupted” intervals I for which we have very low confidence in our estimate formula or otherwise suspect poor data quality. In such instances, to avoid over-fitting it may be reasonable to set a hyperparameter formula and assert formula for any formula. We frequently let formula, which loosely corresponds to assigning a 95% probability to the fact that formula is within an order of magnitude of the most appropriate values.

The combination of the flexibly specified approximate acquisition probabilities formula, the corrupt-data flag I, and the corrupt-data over-dispersion parameter formula were meant to accommodate some of the irregularities of surveillance data observed during the COVID-19 pandemic. Notably, the realized lag times from an infection to a case report, hospitalization, or death, appeared to depend on uncertain biological characteristics as well as technical hurdles. In particular, for case detection, the most appropriate acquisition probabilities formula likely evolved substantially over geographic regions as well as short- and long-term time horizons due to test availability, symptomatic/asymptomatic test policy, and pervasive, observable intra-week reporting probabilities. In Section 4, we will show that a calibrated application of our strategy achieves our modeling objectives, imputing appropriate (large) uncertainty while still enabling us to extract some meaningful inferences from large data, while simulation studies described in Section 3 suggest the validity of our inferences is not substantially degraded by poor choices of formula.

2.4 Single-region Statistical Model

We have already outlined the rationale for our model in Sections 2.2 and 2.3. In this section, we formally describe our model for a single region, temporarily eliminating the need for the spatial indices formula described in Section 2.1. We will extend the model to accommodate panel data from multiple regions in the subsequent section. Implementations of both the single-region model described in this section and the multi-region model described in the next section are provided in our open source R package smesir (2022). A detailed development of our algorithm for posterior computation, building on techniques developed in the spatial statistics literature, is provided in Section A of the Supporting information.

For our observation model, we assume that the sequential event counts formula are jointly independent given the trajectory of new infections formula and that formula given formula is independent of formula for all t. Then,

(1)

where, here and afterward, for conciseness, we will no longer explicitly condition on quantities which will not be inferred from the data—in the above, formula, I, and formula—because those values are assumed to be fixed in advance as part of the model, just as the negative binomial family is fixed in advance.

Our state-evolution model extends the SIR model by assuming a time-varying transmission rate, and also introducing direct transfers from the susceptible to the removed compartment via vaccination. We denote the proportion of the population which is unvaccinated as of time t with formula, while the proportion of the population which is newly vaccinated during interval t is designated as formula. We assume that vaccines are administered independently of whether a person has previously been infected, but that previously vaccinated individuals are not vaccinated again. We also assume that vaccines confer immunity which is indistinguishable from prior infection. This leads us to the following set of state evolution equations:

(2)

For readability we have omitted min/max operators which are necessary to enforce, for example, that the implied new infections do not exceed the size of the susceptible compartment. We set the initial condition of the state to formula, where o is defined as in Section 2.1 and ι is the unknown initial infectious population which accounts for the detected cases at time o. We assume formula for formula. Note that this model treats the system as though it were closed. Implicitly we rely on a moderately large population size and sufficiently short observation time so that births, non-disease related deaths, immigration, and emigration do not contribute substantially to the evolution of the system. For modeling purposes, disease-related deaths can be conflated with acquired immunity so long as neither group is expected to transmit disease.

We note that this state-evolution model is inappropriate for endemic infectious diseases, for which immunity decay may become a significant dynamic. However, we contend that it is a reasonable model for many early-stage epidemics, including the COVID-19 outbreak through the end of 2021, during which time full vaccination and prior infection seemingly conferred robust, medium-term immunity to the extant viral strains. We discuss the need for model extensions and rigorous comparative model evaluation in Section 5.

We adopt a semiparametric log-linear mixed-effects model for formula to capture covariate-explained and unexplained temporal heterogeneity, formula, where α is an intercept term, formula is a vector of coefficients, and formula is a vector of temporal random effects. Specifically, we suppose that the random effect formula is the value of an unknown smooth function formula evaluated at time t, and model the function as a draw from a GP formula such that formula is the squared exponential kernel with lengthscale parameter ℓ.

As discussed in Melikechi et al. (2022), models which feature both the transmission rate and recovery rate as free parameters are not always practically identifiable. Therefore, we assume that the population size N, recovery rate γ, and estimated production weights formula are available to the user or can be reasonably estimated via some external procedure. We demonstrate elicitation of these parameters from clinical data in an analysis of U.S. COVID-19 data in Section 4. For computational convenience, we also require the lengthscale parameter of the random effect (ℓ) to be specified in advance based on intuition for how rapidly temporal dependence in the random effect formula should decay.

Taking a Bayesian approach to inference, we construct a joint prior as a product of independent marginals. By default, we adopt somewhat informative priors consistent with properties of observed epidemics: formula, formula, formula, formula, and formula. Our software allows specification of more diffuse priors, if desired.

2.5 Multi-region Statistical Model

Our model accommodating panel data from multiple regions are a direct hierarchical extension of the single-region model proposed in Section 2.4. Specifically, for formula, we have local population size formula, initial infectious population formula, outbreak times formula, vaccination proportion time series formula, and transmission rate models formula that collectively determine the region's state evolution: formula for formula, then formula when formula, and finally according to the system of equations in display (2) for formula. The differences across regions in the parameters of the log-linear model for formula can be interpreted as capturing differences due to population density, degree of compliance with public health mandates, or other unmeasured behaviors. Similarly, we assume that local hyperparameters formula are available and model observed regional event counts formula as conditionally independent across regions given local infection counts formula and dispersion parameters formula.

Our hierarchical prior structure reflects our belief that knowing the region-specific intercept and effects for the transmission rate in one region will provide information about the analogous parameters for the other regions, in the sense that they will be scattered around central, “global” parameter values. Hence, for formula, we have formula, formula and formula. We place weakly informative priors on the hyperparameters of this hierarchy, specifically formula, formula, and formula. Priors on the initial infectious populations and dispersion parameters are independent across regions formula and formula.

3 Simulation Study

To evaluate performance of our model and computational approach in recovering covariate effects, we conducted an in-depth case study of a simulated dataset as well as a broad simulation meta-analysis. In the case study, we generate and analyze a single, simulated, multi-region epidemic dataset. In the meta-analysis, we randomly generate 1,000 sets of model parameters and estimate the empirical coverage rate and concentration of our 95% posterior credible intervals. The results of our investigation of coverage rates when data are generated under the model are presented in Table 1, while details about our simulation studies and extended results, including estimates of empirical coverage rates when hyperparameters are misspecified, are available in the Supporting information, Section B. Overall, these studies demonstrate satisfying coverage properties of Bayesian credible intervals when data are simulated from the model, and show that performance is robust to errors in specification of the recovery rate γ and the production weights formula.

Table 1

Coverage rates of quantile-based 95% credible intervals for each parameter.

formula23450
formula989897979797
formula979997979897
formula929491929393
formula949393949393
formula9795979694
formula9595959593
formula97
formula94
formula23450
formula989897979797
formula979997979897
formula929491929393
formula949393949393
formula9795979694
formula9595959593
formula97
formula94

Note: Coverage rates are computed over 1,000 simulated datasets, each with formula, formula, formula, and a distinct (random) set of parameters. Column headings specify the pertinent region (0 for global parameters). Row labels specify the parameter.

Table 1

Coverage rates of quantile-based 95% credible intervals for each parameter.

formula23450
formula989897979797
formula979997979897
formula929491929393
formula949393949393
formula9795979694
formula9595959593
formula97
formula94
formula23450
formula989897979797
formula979997979897
formula929491929393
formula949393949393
formula9795979694
formula9595959593
formula97
formula94

Note: Coverage rates are computed over 1,000 simulated datasets, each with formula, formula, formula, and a distinct (random) set of parameters. Column headings specify the pertinent region (0 for global parameters). Row labels specify the parameter.

4 Application

To illustrate the use of the smesir package in explaining and forecasting the transmission rate, we analyze surveillance data from the COVID-19 epidemic in the 50 U.S. States. In particular, we use data from a repository we maintain (U.S.A. COVID-19 Data, 2022), which has been cleaned and processed to smooth out recording errors and artifacts due to within-week reporting cycles and other idiosyncrasies. Complete details, along with references to raw data sources and reproducible code for recreating the data files are also made available in that repository. Since the repository is updated periodically as new data are published, to ensure that the following analysis is reproducible we have included an exact copy of the data analyzed in our software package, smesir.

We select reasonable values for epidemiological parameters by considering side information from demographic reports and clinical studies. The full elicitation of our epidemic parameters and the production weights formula is presented in the Supporting information, Section C.2, and a sensitivity analysis wherein we examine the impact of these choices on our conclusions by repeating our study under various plausible alternative specifications is presented in Supporting information, Section C.4.

Motivated by the widely publicized challenges in data collection throughout the pandemic (Dean & Yang, 2021), coupled with the fact that uncertain demographic composition of the recently infected population could have a large effect on the production process (e.g., the average age of the newly infected population would have a massive impact on the realized infection fatality rate) we set formula and let formula, the complete set of observation times. We fix a lengthscale of 8 weeks in the squared-exponential covariance function of our temporal random effect, on the basis that large, unexplained swings in behavior seem unlikely to occur within time frames less than a month or so. Otherwise, we use the default prior specification as described in Section 2.5.

We generated samples from the posterior distribution of our parameters using the smesir package with default settings. The convergence diagnostics formula and effective sample size (Gelman et al., 2013), which are computed automatically within the smesir package, provided no evidence that the Markov chains failed to converge.

4.1 Effect Estimation

To demonstrate the estimation of covariate effects in practice we consider the impact that the emergence of the Delta variant had on the state-level transmission rates.

One example of how we can use our model to probe the relationship between the transmission rate and a covariate is by first modeling the log transmission rate with only the intercepts formula and GP temporal random effects formula. We use the posterior means as point estimates formula and plot them against the corresponding covariate values to visualize the marginal relationship. This approach may be of exploratory value when the relationship between the transmission rate and the covariate of interest is possibly not log-linear. We construct such a plot for our analysis of the Delta variant effect in Figure 1, excluding the many points where the prevalence covariate was exactly 0 or 1 for clarity.

Scatter plot of non-parametric estimates of the instantaneous local transmission rate  vs. the local prevalence of the Delta variant as a percent of new cases during the same time period. A loess line is superimposed to highlight the apparent nonlinear trend. This figure appears in color in the electronic version of this paper.
Figure 1

Scatter plot of non-parametric estimates of the instantaneous local transmission rate formula vs. the local prevalence of the Delta variant as a percent of new cases during the same time period. A loess line is superimposed to highlight the apparent nonlinear trend. This figure appears in color in the electronic version of this paper.

In Figure 1, we see that the log transmission rate seems to increase super-linearly with delta prevalence, and so we fit our full semiparametric model with squared prevalence as a covariate. In Figure 2, we visualize how our posterior distribution of the Delta variant's “global” effect ϕ0 on transmission rate concentrates as the amount of data observed increases. To do this, we fit our model twice—including the first 6 and then 19 weeks after the Delta variant first appeared in the United States. We summarize the final posterior distribution for all 50 of the local coefficients formula in Figure F of the Supporting information.

This collection of superimposed histograms illustrates the progressive posterior concentration for the effect of the Delta variant prevalence on transmission rate. Specifically, this coefficient is associated with the squared proportion of Delta variant infections among recent new COVID-19 cases. The three histograms show the N(0, 10) prior distribution (with upper and lower tails cropped out for clarity), the posterior distribution after week 77 (6 weeks after the Delta variant first appeared in the United States), and the posterior distribution after week 90. This figure appears in color in the electronic version of this paper.
Figure 2

This collection of superimposed histograms illustrates the progressive posterior concentration for the effect of the Delta variant prevalence on transmission rate. Specifically, this coefficient is associated with the squared proportion of Delta variant infections among recent new COVID-19 cases. The three histograms show the N(0, 10) prior distribution (with upper and lower tails cropped out for clarity), the posterior distribution after week 77 (6 weeks after the Delta variant first appeared in the United States), and the posterior distribution after week 90. This figure appears in color in the electronic version of this paper.

4.2 Forecasting

To demonstrate the forecasting performance of our method in a context where relevant covariates and vaccination data are both available, we again focus on the period just before and immediately after the appearance of the Delta variant in the United States. Since alternative methods do not admit vaccination data, we cannot compare our methodology to related approaches for this particular example. Instead, in Section D of the Supporting information we include an additional forecast analysis which highlights the strength of our forecast uncertainty quantification relative to a state-of-the-art alternative.

In the late spring and early summer of 2021, as vaccination rates increased and, in many places, social distancing measures were still in effect, new cases and deaths from COVID-19 were rapidly declining in the United States. Indeed, from these incidence data alone it appeared that, locally, the COVID-19 epidemic was coming to a close. However, with the easing of public health interventions and the emergence of the new, more transmissible Delta variant, cases, and eventually deaths, surged again as the summer ended. The sudden turn of events allows us to ask whether our method can detect a change in the underlying dynamics of disease transmission before a trend reversal is visually evident in the incidence data. To see this, we compare the forecast distributions of the model fit separately to the first 74 and 77 weeks of data—the former model is fit without an effect for the Delta variant prevalence as the covariate had been essentially zero up to that time.

To summarize our results, we sum the forecasts from all 50 states to obtain a forecast distribution for the entire U.S. population. The 90-week time series of weekly death totals for the combined 50 States, plus the forecast intervals for the model fit to the first 74 and 77 weeks of data, are plotted in Figure 3. While the 95% forecast intervals after 74 and 77 weeks both accommodate the surge in deaths associated with the Delta variant, we can make a more detailed comparison by identifying the quantile of the forecast distribution which most closely matches the observed data. Indeed, the actual course of death counts observed in late summer was regarded as being much more plausible after 77 weeks than after 74 weeks (see the comparison of the pointwise event forecast quantiles in Figure 3), despite the fact that incidence of deaths had declined in that 3-week period. Evidently, the stalling decline in deaths served as evidence that the transmission rate was actually increasing.

Comparison of forecast distributions after the first 74 and 77 weeks of the U.S. COVID-19 epidemic. Parallel pointwise forecast quantiles are labeled with their associated probability level. Note that, while both forecast distributions admit the possibility of a surge in deaths, the eventually observed surge in deaths is seen as much less surprising at 77 weeks, despite the fact that weekly deaths fell between weeks 74 and 77. This suggests that our model is not only sensitive to the current counts of deaths but also to the larger context of evolving rates of change. This figure appears in color in the electronic version of this paper.
Figure 3

Comparison of forecast distributions after the first 74 and 77 weeks of the U.S. COVID-19 epidemic. Parallel pointwise forecast quantiles are labeled with their associated probability level. Note that, while both forecast distributions admit the possibility of a surge in deaths, the eventually observed surge in deaths is seen as much less surprising at 77 weeks, despite the fact that weekly deaths fell between weeks 74 and 77. This suggests that our model is not only sensitive to the current counts of deaths but also to the larger context of evolving rates of change. This figure appears in color in the electronic version of this paper.

5 Discussion

It is possible that sequential algorithms for posterior computation may be suitable for fitting our model. Such tools would be appealing as the length of the incidence data history increases, since they obviate the need to completely retrain the model as new observations are obtained. Sequential algorithms were developed in both Shaman and Karspeck (2012) and Dukic et al. (2012). An additional strength of some sequential Monte Carlo algorithms is their ability to be parallelized (Durham & Geweke, 2014).

In this paper, we have described an innovative methodology for analyzing and forecasting multi-region epidemic incidence data, and contextualized that contribution within the rapidly evolving literature. Our software tool should facilitate retrospective analysis of the international COVID-19 pandemic as well as forecasting in the early stages of future infectious disease outbreaks. There are several areas for future work. First, the spatial literature we relied on to design our posterior computation algorithm includes other techniques which we have not yet explored in this context, such as random-projection algorithms for matrix inversion (Banerjee et al., 2013; Guan & Haran, 2018) and additional reparameterizations (Christensen et al., 2006) which might enable robust sampling of more parameters including the GP lengthscale and a larger number of linear coefficients.

Another direction for future work is expansion of the state evolution model itself. Minimally, it seems essential that immunity decay be introduced in order to express the full course of COVID-19 transmission dynamics. Another direction would be to introduce flows of transmission between regions, legitimizing the application of our model to higher spatial resolutions. Some work in this direction can be found in von Neumann and Burks (1966), White et al. (2007), and Zhou et al. (2020). Beyond that, we would like to rigorously elucidate how and when additional complexity in the state space model is warranted. In at least one instance, Roda et al. (2020) found a simple SIR model outperformed a four-compartment Susceptible, Infectious, Exposed, Removed (SEIR) model when evaluated using the Akaike information criterion (AIC).

Data Availability Statement

The data used in this paper are available in the R package smesir (2022). These data were derived from several resources available in the public domain. For details, see U.S.A. COVID-19 Data (2022).

Acknowledgments

The authors would like to thank the editor, associate editor, and referee for their insightful comments, which motivated substantial improvements in the quality of the manuscript. The research was funded by NIH grant R01-ES028804 from the National Institutes of Environmental Health Sciences.

References

Andersson
,
H.
&
Britton
,
T.
(
2000
)
Stochastic epidemic models and their statistical analysis, vol. 151. Lecture Notes in Statistics
.
New York
:
Springer-Verlag
.

Banerjee
,
A.
,
Dunson
,
D.B.
&
Tokdar
,
S.T.
(
2013
)
Efficient Gaussian process regression for large datasets
.
Biometrika
,
100
,
75
89
.

Carpenter
,
B.
,
Gelman
,
A.
,
Hoffman
,
M.D.
,
Lee
,
D.
,
Goodrich
,
B.
,
Betancourt
,
M.
, et al. (
2017
)
Stan: a probabilistic programming language
.
Journal of Statistical Software
,
76
,
1
32
.

Christensen
,
O.F.
,
Roberts
,
G.O.
&
Sköld
,
M.
(
2006
)
Robust Markov chain Monte Carlo methods for spatial generalized linear mixed models
.
Journal of Computational and Graphical Statistics
,
15
,
1
17
.

Cori
,
A.
,
Ferguson
,
N.M.
,
Fraser
,
C.
&
Cauchemez
,
S.
(
2013
)
A new framework and software to estimate time-varying reproduction numbers during epidemics
.
American Journal of Epidemiology
,
178
,
1505
1512
.

Cox
,
D.R.
,
Gudmundsson
,
G.
,
Lindgren
,
G.
,
Bondesson
,
L.
,
Harsaae
,
E.
,
Laake
,
P.
, et al. (
1981
)
Statistical analysis of time series: Some recent developments [with discussion and reply]
.
Scandinavian Journal of Statistics
,
8
,
93
115
.

Dean
,
N.
&
Yang
,
Y.
(
2021
)
Discussion of “Regression models for understanding COVID-19 epidemic dynamics with incomplete data”
.
Journal of the American Statistical Association
,
116
,
1587
1590
.

Diggle
,
P.J.
,
Tawn
,
J.A.
&
Moyeed
,
R.A.
(
1998
)
Model-based geostatistics
.
Journal of the Royal Statistical Society: Series C (Applied Statistics)
,
47
,
299
350
.

Dukic
,
V.
,
Lopes
,
H.F.
&
Polson
,
N.G.
(
2012
)
Tracking epidemics with Google flu trends data and a state-space SEIR model
.
Journal of the American Statistical Association
,
107
,
1410
1426
.

Durham
,
G.
&
Geweke
,
J.
(
2014
)
Adaptive sequential posterior simulators for massively parallel computing environments
.
Advances in Econometrics
,
34
,
1
44
.

Gelman
,
A.
,
Carlin
,
J.
,
Stern
,
H.
,
Dunson
,
D.
,
Vehtari
,
A.
&
Rubin
,
D.
(
2013
)
Bayesian data analysis
, 3rd edition,
Chapman and Hall
.

Guan
,
Y.
&
Haran
,
M.
(
2018
)
A computationally efficient projection-based approach for spatial generalized linear mixed models
.
Journal of Computational and Graphical Statistics
,
27
,
701
714
.

Haario
,
H.
,
Saksman
,
E.
&
Tamminen
,
J.
(
2001
)
An adaptive Metropolis algorithm
.
Bernoulli
,
7
,
223
242
.

Hamer
,
W.
(
1906
)
Epidemic disease in England
.
The Lancet
,
1
,
733
739
.

Hespanha
,
J.P.
,
Chinchilla
,
R.
,
Costa
,
R.R.
,
Erdal
,
M.K.
&
Yang
,
G.
(
2021
)
Forecasting COVID-19 cases based on a parameter-varying stochastic SIR model
.
Annual Reviews in Control
,
51
,
460
476
.

Hethcote
,
H.W.
(
2000
)
The mathematics of infectious diseases
.
SIAM Review
,
42
,
599
653
.

IHME COVID-19 Forecasting Team
, (
2021
)
Modeling COVID-19 scenarios for the United States
.
Nature Medicine
,
27
,
94
105
.

Johndrow
,
J.
,
Ball
,
P.
,
Gargiulo
,
M.
&
Lum
,
K.
(
2020
)
Estimating the number of SARS-CoV-2 infections and the impact of mitigation policies in the United States
.
Harvard Data Science Review
. https://hdsr.mitpress.mit.edu/pub/9421kmzi
[Accessed 1st October 2022].

Karako
,
K.
,
Song
,
P.
,
Chen
,
Y.
,
Tang
,
W.
&
Kokudo
,
N.
(
2021
)
Overview of the characteristics of and responses to the three waves of COVID-19 in Japan during 2020–2021
.
BioScience Trends
,
15
,
1
8
.

Keller
,
J.P.
,
Zhou
,
T.
,
Kaplan
,
A.
,
Anderson
,
G.B.
&
Zhou
,
W.
(
2022
)
Tracking the transmission dynamics of COVID-19 with a time-varying coefficient state-space model
.
Statistics in Medicine
,
41
,
2745
2767
.

Kermack
,
W.O.
&
McKendrick
,
A.G.
(
1927
)
A contribution to the mathematical theory of epidemics
.
Proceedings of the Royal Society of London: Series A
,
115
,
700
721
.

Melikechi
,
O.
,
Young
,
A.L.
,
Tang
,
T.
,
Bowman
,
T.
,
Dunson
,
D.
&
Johndrow
,
J.
(
2022
)
Limits of epidemic prediction using SIR models
.
Journal of Mathematical Biology
,
85
,
1
29
.

Miasojedow
,
B.
,
Moulines
,
E.
&
Vihola
,
M.
(
2013
)
An adaptive parallel tempering algorithm
.
Journal of Computational and Graphical Statistics
,
22
,
649
664
.

Monod
,
M.
,
Blenkinsop
,
A.
,
Xi
,
X.
,
Hebert
,
D.
,
Bershan
,
S.
,
Tietze
,
S.
, et al. (
2021
)
Age groups that sustain resurging COVID-19 epidemics in the United States
.
Science
,
371
, eabe8372.

Muecksch
,
F.
,
Wise
,
H.
,
Batchelor
,
B.
,
Squires
,
M.
,
Semple
,
E.
,
Richardson
,
C.
, et al. (
2021
)
Longitudinal serological analysis and neutralizing antibody levels in coronavirus disease 2019 convalescent patients
.
The Journal of Infectious Diseases
,
223
,
389
398
.

Orsted
,
I.
,
Molvadgaard
,
M.
,
Nielsen
,
H.L.
&
Nielsen
,
H.
(
2013
)
The first, second and third wave of pandemic influenza A (H1N1)pdm09 in North Denmark Region 2009–2011: a population-based study of hospitalizations
.
Influenza and Other Respiratory Viruses
,
7
,
776
782
.

Osthus
,
D.
,
Hickmann
,
K.S.
,
Caragea
,
P.C.
,
Higdon
,
D.
&
Del Valle
,
S.Y.
(
2017
)
Forecasting seasonal influenza with a state-space SIR model
.
The Annals of Applied Statistics
,
11
,
202
.

Osthus
,
D.
&
Moran
,
K.R.
(
2021
)
Multiscale influenza forecasting
.
Nature Communications
,
12
, 2991.

Our World in Data
(
2022
)
COVID-19 dataset
.
Available from:
https://github.com/owid/covid-19-data
[Accessed 22nd March 2022].

Patterson
,
K.D.
&
Pyle
,
G.F.
(
1991
)
The geography and mortality of the 1918 influenza pandemic
.
Bulletin of the History of Medicine
,
65
,
4
21
.

Quick
,
C.
,
Dey
,
R.
&
Lin
,
X.
(
2021
)
Regression models for understanding COVID-19 epidemic dynamics with incomplete data
.
Journal of the American Statistical Association
,
116
,
1561
1577
.

Ray
,
D.
,
Salvatore
,
M.
,
Bhattacharyya
,
R.
,
Wang
,
L.
,
Du
,
J.
,
Mohammed
,
S.
,
Purkayastha
,
S.
,
Halder
,
A.
,
Rix
,
A.
,
Barker
,
D.
,
Kleinsasser
,
M.
,
Zhou
,
Y.
,
Bose
,
D.
,
Song
,
P.
,
Banerjee
,
M.
,
Baladandayuthapani
,
V.
,
Ghosh
,
P.
, &
Mukherjee
,
B.
(
2020
)
Predictions, role of interventions and effects of a historic national lockdown in India's response to the COVID-19 pandemic: data science call to arms
.
Harvard Data Science Review
(
Special Issue 1
).
Available from:
https://hdsr.mitpress.mit.edu/pub/r1qq01kw
[Accessed 1st October 2022].

Roda
,
W.C.
,
Varughese
,
M.B.
,
Han
,
D.
&
Li
,
M.Y.
(
2020
)
Why is it difficult to accurately predict the COVID-19 epidemic?
Infectious Disease Modelling
,
5
,
271
281
.

Shaman
,
J.
&
Karspeck
,
A.
(
2012
)
Forecasting seasonal outbreaks of influenza
.
Proceedings of the National Academy of Sciences
,
109
,
20425
20430
.

smesir
(
2022
)
Github repository
.
Available from:
https://github.com/davidbuch/smesir
[Accessed 1st December 2022].

Tang
,
L.
,
Zhou
,
Y.
,
Wang
,
L.
,
Purkayastha
,
S.
,
Zhang
,
L.
,
He
,
J.
, et al. (
2020
)
A review of multi-compartment infectious disease models
.
International Statistical Review
,
88
,
462
513
.

The New York Times
(
2021
)
Coronavirus (Covid-19) Data in the United States
.
Available from:
https://github.com/nytimes/covid-19-data
[Accessed 22nd March 2022].

Thompson
,
R.
,
Stockwin
,
J.
,
van Gaalen
,
R.D.
,
Polonsky
,
J.
,
Kamvar
,
Z.
,
Demarsh
,
P.
, et al. (
2019
)
Improved inference of time-varying reproduction numbers during infectious disease outbreaks
.
Epidemics
,
29
, 100356.

U.S.A. COVID-19 Data
(
2022
)
GitHub repository
.
Available from:
https://github.com/davidbuch/usa_covid_data
[Accessed 1st December 2022].

Vasudevan
,
V.
,
Gnanasekaran
,
A.
,
Sankar
,
V.
,
Vasudevan
,
S.A.
&
Zou
,
J.
(
2021
)
Disparity in the quality of Covid-19 data reporting across India
.
BMC Public Health
,
21
,
1
12
.

von Neumann
,
J.
&
Burks
,
A.W.
(
1966
)
Theory of self-reproducing automata
.
Urbana and London
:
University of Illinois Press
.

Wallinga
,
J.
&
Teunis
,
P.
(
2004
)
Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures
.
American Journal of Epidemiology
,
160
,
509
516
.

Wang
,
L.
,
Zhou
,
Y.
,
He
,
J.
,
Zhu
,
B.
,
Wang
,
F.
,
Tang
,
L.
, et al. (
2020
)
An epidemiological forecast model and software assessing interventions on the COVID-19 epidemic in China
.
Journal of Data Science
,
18
,
409
432
.

White
,
S.H.
,
Del Rey
,
A.M.
&
Sánchez
,
G.R.
(
2007
)
Modeling epidemics using cellular automata
.
Applied Mathematics and Computation
,
186
,
193
202
.

Woodard
,
D.B.
,
Schmidler
,
S.C.
&
Huber
,
M.
(
2009
)
Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions
.
The Annals of Applied Probability
,
19
,
617
640
.

Zhou
,
T.
&
Ji
,
Y.
(
2020
)
Semiparametric Bayesian inference for the transmission dynamics of COVID-19 with a state-space model
.
Contemporary Clinical Trials
,
97
, 106146.

Zhou
,
Y.
,
Wang
,
L.
,
Zhang
,
L.
,
Shi
,
L.
,
Yang
,
K.
,
He
,
J.
,
Bangyao
,
Z.
,
Overton
,
W.
,
Purkayastha
,
S.
, &
Song
,
P.
(
2020
)
A spatiotemporal epidemiological prediction model to inform county-level COVID-19 risk in the United States
.
Harvard Data Science Review
.
Available from:
https://hdsr.mitpress.mit.edu/pub/qqg19a0r
[Accessed 1st October 2022].

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data