This tutorial is partly based on this nowcasting example from Sam Abbott and Sebastian Funk.

Outline

  • In this tutorial we make use of EpiNow2, a toolset for real-time analysis of infectious disease dynamics, for estimating reporting delays and using them to conduct a nowcast.
  • We first explore the HHS hospitalization data and estimate the reporting delay distribution using EpiNow2.
  • We then draw from the literature to define a plausible incubation period and reporting delay
  • We then show how the these estimates may be used in EpiNow2 to perform nowcasts, forecasts and estimation of reproduction numbers and growth rates.
  • Finally, we summarize strengths and weaknesses of this approach and highlight other options and areas for future work.

For more about the EpiNow2 package see the the package documentation. An alternative approach to the problems discussed in this tutorial is contained in the epinowcast package. This package is still under active development but aims to address the limitations EpiNow2 with a view to eventually replacing it.

Getting started

Using this tutorial

This tutorial is written in R Markdown and can be run in RStudio. See the tutorial.Rmd file for the code used to generate this tutorial, this is also the best place to start if you want to work through the tutorial yourself.

We also provide a standalone HTML version of this tutorial that can be viewed in any web browser. This version is static and cannot be run interactively.

You may find it useful to refer back to the slides from the workshop as you work through the tutorial.

Load required libraries

We first load the packages required for this tutorial. These can be installed or staged

renv::restore()

The first run of renv::restore() will take some time: It is a command that will automatically install the correct versions of all packages needed to run this analysis. Ideally, we recommend running it sometime before you want to work through the rest of the tutorial. Once you have all the necessary packages installed, subsequent calls to renv::restore() should be quick.

library("EpiNow2") # for estimating reporting delays and nowcasting
library("dplyr") # for manipulating data
library("ggplot2") # for plotting data
library("knitr") # for displaying data
library("tidyr") # for manipulating data
library("purrr") # for manipulating data
library("covidcast") # for downloading HHS hospitalization data
library("here") # for file paths

Setting the number of cores

EpiNow2 can use multiple cores to speed up the analysis - up to the number of MCMC chains used for model fitting (which here is 4). Here we set the number of cores to use to the number of cores available on your machine or 4 whichever is fewer.

options(mc.cores = min(4, parallel::detectCores()))

Specify states of interest

To start, we want to examine trends from several states. Here, we focus on the data from New York, Colorado, Ohio, Virginia, and North Carolina.

states_of_interest <- c("ny", "co", "oh", "va", "nc")

If you’re interested in looking at your state or adding some different states for comparison, feel free to add an abbreviated state name to the vector above in tutorial.Rmd.

Specify the time period of interest

We will focus on the period from the 1st of December 2021 to January 1, 2022 in this tutorial. In this period, the Omicron variant led to a large increase in cases in the US. It was a particular challenging time to model due to Omicron’s increase in transmissibility, it’s ability to evade immunity, changes in the infection to hospitalization ratio, and the holiday season.

time_period <- c("2021-12-01", "2022-02-01")

Load the data

We use the covidcast package to read in HHS data on confirmed Covid-19 hospital admissions. The primary source for these data is the HHS state-level COVID-19 hospitalization time series, which can be found on HealthData.gov. The covidcast package provides a convenient way to read in a subset of the full data set.

We could download the most recent version of this data directly. If you’re interested in how to do that, check out the hidden code chunk below. It pulls the most up-to-date version of the data and stores it in a data frame called covid_hospitalizations.

Don’t run this command yourself if you don’t have a covidcast API key key loaded into R using options(covidcast.auth = "<your-api-key>")!. Instead go to the next step.

# Add 28 days to the end of the time period to ensure we have all the data
# we need
end_period <- as.Date(time_period[2]) + 28

# Define a range of as-of dates
as_of_dates <- seq(as.Date(time_period[1]), end_date, by = "day")

covid_hospitalizations <- covidcast_signal(
      data_source = "hhs",
      signal = "confirmed_admissions_covid_1d",
      start_day = time_period[1],
      end_day = end_period,
      geo_type = "state",
      geo_value = states_of_interest
    ) |>
      as_tibble() |>
      filter(issue == max(issue)) |>
      select(geo_value, time_value, value) |>
      rename(state = geo_value, date = time_value, confirm = value)

# Save the data as a .rds file in the project folder
saveRDS(covid_hospitalizations, here("data", "covid_hospitalizations.rds"))

When evaluating methods, we’re often interested in how they would have performed on the data available at the time rather than the data in its final, fully-reported state. This data format is equivalent to having a linelist with columns for the date of hospitalization and the date of report.

💡 We use the terms report date and vintage to indicate the as-of date (on what date was this snapshot of the data observed). We use event date or simply date to indicate where a data point falls along the x-axis of the epidemic time series.

If you’re interested in downloading the vintages from winter 2021-2022 yourself, check out the hidden code chunk below (once you’ve done this you’ll be able to rerender the tutorial using your updated data if you wish).

Don’t run this command yourself if you don’t have a covidcast API key key loaded R using options(covidcast.auth = "<your-api-key>")!. Instead go to the next step.

# Add 28 days to the end of the time period to ensure we have all the data
# we need
end_period <- as.Date(time_period[2]) + 28

# Define a range of as-of dates
as_of_dates <- seq(as.Date(time_period[1]), end_date, by = "day")

# Read in the data for each as-of date
covid_hospitalizations_by_vintage <- as_of_dates |>
  purrr::map_df(\(x) (
    covidcast_signal(
      data_source = "hhs",
      signal = "confirmed_admissions_covid_1d",
      start_day = time_period[1],
      end_day = end_period,
      as_of = x,
      geo_type = "state",
      geo_value = states_of_interest
    ) |>
      mutate(report_date = x) |> # Add the date of report
      select(report_date, geo_value, time_value, value) |> # keep these columns
      rename(state = geo_value, # rename the columns for readability
             date = time_value,
             confirm = value) |>
      arrange(date, report_date)
  )) |>
  as_tibble()

# Save the data as a .rds file in the project folder
saveRDS(
  covid_hospitalizations_by_vintage,
  here("data", "covid_hospitalizations_by_vintage.rds")
)

Rather make everyone get an API key and query the covidcast API themselves, we instead use the data we have already downloaded.

# Read in the data, which are saved as a .rds file in the project folder
covid_hospitalizations <- readRDS(
  here("data", "covid_hospitalizations.rds")
) |>
  as_tibble()

glimpse(covid_hospitalizations)
## Rows: 315
## Columns: 3
## $ state   <chr> "co", "nc", "ny", "oh", "va", "co", "nc", "ny", "oh", "va", "c…
## $ date    <date> 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2…
## $ confirm <dbl> 220, 128, 443, 576, 113, 216, 169, 490, 620, 137, 208, 157, 44…
covid_hospitalizations_by_vintage <- readRDS(
  here("data", "covid_hospitalizations_by_vintage.rds")
) |>
  as_tibble()

glimpse(covid_hospitalizations_by_vintage)
## Rows: 18,265
## Columns: 4
## $ report_date <date> 2021-12-03, 2021-12-03, 2021-12-03, 2021-12-03, 2021-12-0…
## $ state       <chr> "co", "nc", "ny", "oh", "va", "co", "nc", "ny", "oh", "va"…
## $ date        <date> 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-0…
## $ confirm     <dbl> 235, 140, 456, 574, 107, 240, 141, 439, 556, 114, 240, 141…

Before we proceed, we check that the data we have downloaded contains the states we’re interested in.

if (!all(states_of_interest %in% covid_hospitalizations$state) ||
    !all(states_of_interest %in% covid_hospitalizations_by_vintage$state)) {
  stop(
    "Not all states of interest are present in the data. ",
    "You may need to update the `states_of_interest` variable or ",
    "download the data again using the code in the previous chunks."
  )
}

We also check that it covers the time period we’re interested in.

dates <- unique(covid_hospitalizations$date)
if (!all(map_lgl(time_period, \(x)(any(dates %in% x))))) {
  stop(
    "The data does not cover the time period of interest. ",
    "You may need to update the `time_period` variable or ",
    "download the data again using the code in the previous chunks."
  )
}

Data exploration

Visualize hospitalizations

We start by visualizing the currently reported hospitalizations for this time period. We’re treating these reports as finalized because it’s unlikely there will be substantial revisions for hospitalizations from so long ago.

covid_hospitalizations |>
  ggplot() +
  aes(x = date, y = confirm) +
  geom_col(alpha = 0.6) +
  theme_bw() +
  labs(
    x = "Date of hospitalization",
    y = "Incident hospitalizations",
    title = "Hospitalizations in select states for the winter 2021-2022 wave"
  ) +
  facet_wrap(vars(state), ncol = 1, scales = "free_y")

As expected we see a large increase in hospitalizations in all states during the winter 2021-2022 wave. However, we can also see that there is a large amount of variation in the number of hospitalizations reported on each date and state by state.

Visualize hospitalizations by date of report

While it is interesting to look back retrospectively, it is important to remember that the data we have now is not the same as the data that was available at the time. This can occur for a number of reasons; for example, there may be a delay between the date of hospitalization and the date of report or there may be changes in how hospitalizations are measured which leads to a retrospective change in the number of hospitalizations.

To begin to unpack these changes we can look at the number of hospitalizations as it was reported at the time. If the changes are only caused by reporting delays, we should see that that the number of hospitalizations reported on each date increases over time as more data becomes available. If reporting patterns are more complex than this, hospitalizations could increase or decrease over time. For example if hospitalizations can be recategorized to a different day, then the number of hospitalizations reported for a given day may decrease as well as increase over time.

covid_hospitalizations |>
  ggplot() +
  aes(x = date, y = confirm) +
  geom_point(alpha = 0.6) +
  geom_line(
    data = covid_hospitalizations_by_vintage,
    aes(x = date, y = confirm, col = report_date, group = report_date),
    alpha = 0.8
  ) +
  scale_y_log10() +
  theme_bw() +
  labs(
    x = "Date of hospitalization",
    y = "Hospitalizations",
    title = "Hospitalizations in select states for the winter 2021-2022 wave"
  ) +
  guides(col = guide_colorbar(title = "Date of report", barwidth = 15)) +
  theme(legend.position = "bottom") +
  facet_wrap(vars(state), ncol = 1, scales = "free_y")

You can see in this plot that the data as observed in real time is sometimes right-truncated, meaning that observations at the end of the time series are incomplete or subject to revision. (The light blue line and points represent final counts. The short darker lines represent preliminary counts that will later be revised.) In most epidemiological data, delays in reporting lead to a temporary undercount of what will eventually be reported. If left uncorrected when using real-time data sets, this truncation can lead to underestimation of the effective reproduction number, inaccurate forecasts, and potentially mislead reports to policy makers using these metrics.

However, in this dataset it appears that hospitalizations are also very commonly corrected down. This pattern indicates that there is a more complex reporting mechanism at play. One potential mechanism for this is that it appears that the value for the previous day is often used as a stand-in for the most recent data. This issue is relatively easy to correct for once it has been identified but other mechanism that lead to counts being corrected down are more difficult to account for with currently available methods. The development of new methods to interpret these patterns will likely depend on the interaction between those collecting the data and those developing new methods.

Visualize the reporting delay

We can also plot the distribution of reporting delays to look at how much delay is typical. For this plot, we calculate the delay between the date of hospitalization and the date of report for each hospitalization.

covid_hospitalizations_reporting_cdf <-
  covid_hospitalizations_by_vintage |>
  filter(date >= as.Date(time_period[1])) |>
  group_by(date, state) |>
  group_modify(
    ~ mutate(.x,
      diff = confirm - lag(confirm, default = 0),
      final_reported = .x |>
        filter(report_date == max(report_date)) |>
        pull(confirm)
    )
  ) |>
  ungroup() |>
  mutate(
    delay = as.numeric(report_date - date),
    cdf = confirm / final_reported
  )

glimpse(covid_hospitalizations_reporting_cdf)
## Rows: 18,265
## Columns: 8
## $ date           <date> 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-1…
## $ state          <chr> "co", "co", "co", "co", "co", "co", "co", "co", "co", "…
## $ report_date    <date> 2021-12-03, 2021-12-04, 2021-12-05, 2021-12-06, 2021-1…
## $ confirm        <dbl> 235, 240, 220, 220, 220, 220, 220, 220, 220, 220, 220, …
## $ diff           <dbl> 235, 5, -20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ final_reported <dbl> 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, …
## $ delay          <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ cdf            <dbl> 1.068182, 1.090909, 1.000000, 1.000000, 1.000000, 1.000…
covid_hospitalizations_reporting_cdf |>
  filter(delay <= 10) |>
  ggplot() +
  aes(x = delay, y = cdf, group = date) +
  geom_step(alpha = 0.4) +
  theme_bw() +
  labs(
    x = "Reporting delay (days)",
    y = "Hospitalizations reported relative to the final count",
    title = "Reporting delay in select states for the winter 2021-2022 wave"
  ) +
  facet_wrap(vars(state), ncol = 1, scales = "free_y")

We see that the reporting delay is somewhat unstable with a large amount of under/over reporting on the first report date and then very little change after that. This suggests that the reporting delay is not a simple delay but rather a more complex process. One major factor is likely the, previously noted, data processing practice of using the last value carried forward for the most recent date. This is the kind of assumption that needs to be carefully communicated to those using the data as it can lead to misinterpretation of the data as well as impacting nowcasts based on the data.

Choosing a state and target date to nowcast

Define the state to nowcast

For the rest of this tutorial we focus on Ohio. You can repeat the analysis for other states by changing the following variable to the abbreviated state name you are interested in (e.g. “ca” for California).

nowcast_state <- "oh"

Note: nowcast_state must be one of the states in states of interest above.

# Check the data contains the state of interest
if (!(nowcast_state %in% states_of_interest)) {
  stop("nowcast_state must be one of the states in states_of_interest")
}

Define the target date to nowcast for

We also define a target date of interest. Here we use the 14th of January 2022, but you can change this to any date between the 31st of December 2021 and the 14th of January 2022 (or other dates if you have updated the time_period variable and the data snapshot).

target_date <- as.Date("2022-01-14")
dates <- unique(covid_hospitalizations$date)
if (!any(dates %in% target_date)) {
  stop("target_date must be one of the dates in time_period")
}

Estimating the reporting delay

Using recorded data vintages for Ohio

We use EpiNow2 to estimate the distribution of reporting delays for Ohio. The model in EpiNow2 allow us to correct for right truncation in the data when we estimate the reproduction number. We restrict our dataset to include only what was available between the 31st of December 2021 and the 14th of January 2022. Later on we will generate a nowcast for the 14th of January 2022, so we only want to use data that is relevant to this date. Unfortunately, this model cannot account for over-reporting and so we will first have to remove it from the data.

est_df <- covid_hospitalizations_reporting_cdf |>
  filter(state == nowcast_state) |>
  filter(date >= as.Date(time_period[1]) + 13) |>
  filter(report_date >= target_date - 14) |>
  filter(report_date <= target_date) |>
  group_by(date) |>
  mutate(confirm = max(confirm, dplyr::lag(confirm, default = 0))) |>
  ungroup() |>
  # Over the new year reporting was unusually delayed.
  # This leads to problems for the model
  filter(report_date != as.Date("2022-01-05"))

💡 The anomalous reporting delay around the new year is a good example of one of the limitations of nowcasting we discussed in the slide on Limitations because the new year was also a critical period for the spread of Omicron. Current methods for nowcasting struggle to account for this kind of anomaly meaning that the nowcast may be biased when it is most needed.

truncation_est <- est_df |>
  select(report_date, date, confirm) |>
  group_split(report_date) |>
  map(~select(., -report_date)) |>
  estimate_truncation(
    truncation = dist_spec(mean = 0, sd = 0, mean_sd = 1, sd_sd = 1, max = 10),
    chains = 4, iter = 2000,
    control = list(adapt_delta = 0.99, max_treedepth = 15),
    verbose = interactive()
  )

# Assign the truncation delay
truncation_dist <- truncation_est$dist
truncation_dist
## 
##   Uncertain lognormal distribution with (untruncated) logmean -1.7 (SD 0.5) and logSD 0.61 (SD 0.16)

This model may throw warnings about divergent transitions. This is because the model is trying to estimate the truncation distribution for the data but there is very little truncation in the data. This is good news as it means we can use the data as observed to estimate the reproduction number. However, it does mean that estimate_truncation() won’t work well (hence the warnings you may see when running the code) as it is designed to estimate the truncation distribution for data where truncation is present.

A good way to check if the data might be unsuitable for estimate_truncation() is to look at the distribution of reporting delays. As a first pass we can simply look at the proportion of data that is finally reported is reported on the first available report date. If this is high then there is very little truncation in the data and estimate_truncation() will not work well. We also check the mean and standard deviation of the reporting delay to make sure that we are not actually seeing very extreme truncation (i.e., days with zero reported cases).

est_df |>
    filter(date >= target_date - 14) |>
    mutate(cdf = ifelse(cdf > 1, 1, cdf)) |>
    group_by(date) |>
    filter(report_date == min(report_date)) |>
    ungroup() |>
    summarise(
      `Mean proportion reported by the first report` = 
        paste0(round(mean(cdf) * 100, 1), "%"),
      `Standard deviation of the proportion reported by the first report` =
        paste0(round(sd(cdf) * 100, 1), "%)"),
      `Mean delay` = paste0(round(mean(delay), 1), " days"),
      `Standard deviation of the delay` = paste0(round(sd(delay), 1), " days")
    ) |>
    pivot_longer(
      everything(), names_to = "Summary statistic", values_to = "Value"
    ) |>
    kable()
Summary statistic Value
Mean proportion reported by the first report 96.4%
Standard deviation of the proportion reported by the first report 4.7%)
Mean delay 2.1 days
Standard deviation of the delay 0.3 days

As we expected we see a very high proportion of the data is reported on the first available report date. This means that there is likely very little truncation in the data and so estimate_truncation() will not work well as there is so little truncation for it to model. We also see that the mean of the reporting delay is around 2 with a low standard deviation. This is as expected as the HHS has a fixed two day reporting delay for all states.

Using synthetic vintages based on the latest data for Ohio

As we saw above, there is very little truncation in the data for Ohio for this time period. Because of this, we will instead use a synthetic data set to show how to use estimate_truncation() to estimate the distribution of reporting delays using the latest data from Ohio (as of the 14th of January 2022).

We first define a synthetic delay,

synth_truncation <- dist_spec(
  mean = 1.6, mean_sd = 0.025, sd = 0.5, sd_sd = 0.01, max = 15,
  distribution = "lognormal", fixed = FALSE
)
synth_truncation
## 
##   Uncertain lognormal distribution with (untruncated) logmean 1.6 (SD 0.025) and logSD 0.5 (SD 0.01)

Which we can then plot to see what it looks like,

plot(synth_truncation)

Now we use this to generate a synthetic data set by combining it with the latest data from Ohio (as of the 14th of January 2022).

simulate_truncation <- function(index, cases, dist) {
  set.seed(index)
  cmf <- cumsum(
    dlnorm(
      1:(dist$max + 1),
      rnorm(1, dist$mean_mean, dist$mean_sd),
      rnorm(1, dist$sd_mean, dist$sd_sd)
    )
  )
  cmf <- cmf / cmf[dist$max + 1]
  cmf <- rev(cmf)[-1]
  if (nrow(cases) < index + length(cmf)) {
    stop(
      "Not enough data to construct a synthetic data set with this ",
      "truncation distribution and index."
    )
  }
  untrunc_cases <- cases |>
    slice_head(n = nrow(cases) - index - length(cmf))

  trunc_cases <- cases |>
    slice_head(n = nrow(cases) - index) |>
    slice_tail(n = length(cmf)) |>
    mutate(confirm = as.integer(confirm * cmf))

  cases <- bind_rows(
    untrunc_cases,
    trunc_cases
  )
  return(cases)
}
synth_truncated_hosp <- map(c(20, 15, 10, 5, 4, 2, 1, 0),
  simulate_truncation,
  cases = covid_hospitalizations |>
    filter(state == nowcast_state) |>
    select(-state),
  dist = synth_truncation
)

We can now use estimate_truncation() to estimate the distribution of reporting delays for this synthetic data set.

syn_trunc_est <- estimate_truncation(
  synth_truncated_hosp,
  truncation = dist_spec(mean = 0, sd = 0, mean_sd = 1, sd_sd = 1, max = 15),
  chains = 4, iter = 2000,
  control = list(adapt_delta = 0.99, max_treedepth = 15),
  verbose = interactive()
)

Plotting the model predictions compared to synthetic data we can see that the model does a good job of reconstructing the data that will be observed in this synthetic example (we can see this as the earliest data points have been observed fully and so can be compared to the model predictions directly).

plot(syn_trunc_est)

💡 The black dots in this plot represent the data observed in real-time, whilst the bars are the latest available data. You should see that the data available at the time underestimated what was eventually reported and that this gets worse closer to date of the most recent data. This effect is referred to as right truncation. Another example of this was shown in the slides (Inputs needed for nowcasting). Having the information on previous reporting patterns allows us to fit a predictive model to this truncation, instead of making assumptions about the reporting delay distribution which is what we have to do when we don’t have this information. Sometimes assumptions can be wrong which may bias results..

Finally we can extract and plot the estimated reporting distribution,

est_synth_trunc_dist <- syn_trunc_est$dist

est_synth_trunc_dist
## 
##   Uncertain lognormal distribution with (untruncated) logmean 1.6 (SD 0.082) and logSD 0.62 (SD 0.041)
plot(est_synth_trunc_dist)

As expected for this synthetic example, the estimated distribution is very similar to the true distribution used to generate the data.

Getting ready to nowcast

Generation time estimate

The generation time is the time between infection of an individual and infection of their infector. In order to estimate the effective reproduction number with the renewal equation we need an estimate of the generation time. With this quantity, we relate the number of infections on day \(t\) to the number of infections on day \(t - \tau\) where \(\tau\) is the index of the generation time distribution. Mathematically this is

\[ I_t = R_t \sum_{tau = 1}^T I_{t - \tau} g(\tau), \]

where \(T\) is the maximum length of the generation time and \(g(\tau)\) is the probability of generation time \(\tau\).

💡 See the Estimating Rt from a time series of counts slide from the talk for a more visual discussion of the role of the generation time in the renewal equation.

Rather than estimating the generation time here, we instead use an estimate from the literature. Specifically we use an estimate from Ganyani et al. (2020) which is based on a few hundred cases of COVID-19 in China. For a real-world analysis we recommend thinking carefully about which generation time distribution to use. For example, the Ganyani et al. (2020) estimate is based on a sample of cases from China and so may not be representative of the generation time in other settings.

generation_time <- get_generation_time(
  disease = "SARS-CoV-2", source = "ganyani",
  max = 10, fixed = TRUE
)
generation_time
## 
##   Fixed distribution with PMF [0.18 0.2 0.17 0.13 0.1 0.074 0.054 0.039 0.028 0.02]

We can now visualize this distribution.

plot(generation_time)

The dark bars show a histogram of the generation time probabilities by day. The step plot shows the cumulative probability.

Delays from infection to hospitalization

The delay between infection and hospitalization can be decomposed into two distributions: the incubation period and the delay from symptom onset to hospitalization. We can estimate these distributions using EpiNow2 but for this tutorial we will use estimates from the literature.

Incubation period

The incubation period is the time between infection and symptom onset. Here we use an estimate from Lauer et al. (2020) which is based on a few hundred cases of COVID-19 in China. For a real-world analysis we recommend thinking carefully about which incubation period distribution, just as we did for the generation time.

incubation_period <- get_incubation_period(
  disease = "SARS-CoV-2", source = "lauer", fixed = TRUE,
  max = 15
)
incubation_period
## 
##   Fixed distribution with PMF [5.3e-05 0.013 0.093 0.18 0.2 0.17 0.12 0.083 0.053 0.033 0.02 0.012 0.0074 0.0046 0.0028]

We can now visualize this distribution.

plot(incubation_period)

Delay from symptom onset to hospitalization

The delay from symptom onset to hospitalization is the time between symptom onset and hospital admission. This time typically depends on the severity of symptoms, the robustness of the health system and the behavior of the individual. Here we use a toy estimate but ideally data would be available to estimate this quantity.

## Delay from symptom onset to report
reporting_delay <- dist_spec(
  mean = convert_to_logmean(3, 1),
  sd = convert_to_logsd(3, 1),
  max = 10
)
reporting_delay
## 
##   Fixed distribution with PMF [0.00064 0.14 0.43 0.29 0.11 0.03 0.008 0.0021 0.00053 0.00014]

We can again visualize this distribution.

plot(reporting_delay)

Convolving the delay from infection to hospitalization

As the incubation period and reporting delay can be represented as probability mass functions (i.e., vectors of probabilities) we can convolve them to find the distribution of delays from infection to hospitalization. (Convolution is the mathematical operation used to add random variables together. Essentially, we are adding the incubation and reporting delay distributions. Normally you wouldn’t be able to add two distributions together with a plus sign, but EpiNow2 provides special functionality that lets us do this.) Calculating this convolution outside of the model helps reduce the computational burden as we do not need to account for multiple delays. It is also useful as it allows us to understand the combined effect of the incubation period and reporting delay on the delay from infection to hospitalisation.

inf_to_hospitalization <- incubation_period + reporting_delay

We can now plot this convolved distribution.

plot(inf_to_hospitalization)

Putting it all together into a nowcast

Now we have all the components we need to construct a nowcast. First we construct the data set of hospitalizations we wish to nowcast using data as available on the 14th of January 2022.

oh_hosp_14th <- covid_hospitalizations_by_vintage |>
  filter(state == nowcast_state) |>
  filter(report_date == target_date) |>
  select(date, confirm)

glimpse(oh_hosp_14th)
## Rows: 43
## Columns: 2
## $ date    <date> 2021-12-01, 2021-12-02, 2021-12-03, 2021-12-04, 2021-12-05, 2…
## $ confirm <dbl> 576, 620, 571, 532, 562, 594, 680, 687, 667, 634, 615, 566, 60…

In order to evaluate our model we will use the most recently reported data up to the 28th of January 2022 (as we are forecasting for two weeks).

oh_hosp_28th_retro <- covid_hospitalizations |>
  filter(state == nowcast_state) |>
  filter(date <= target_date + 14) |>
  select(date, confirm)

We then use the estimate_infections() function contained in EpiNow2 on this data set to obtain a nowcast, forecast and reproduction number estimate. This model uses a renewal equation based method to estimate the reproduction number and then convolves this with the incubation period and reporting delay we defined earlier to obtain a nowcast. It generates a forecast extrapolating the reproduction number estimate into the future. This is just an example model as there are many other ways to estimate the reproduction number and many other ways to extrapolate it into the future. We recommend thinking carefully about which model to use for your analysis.

💡 We haven’t adjusted for truncation as there was little evidence of it in the weeks directly before this date.

rt_estimates <- estimate_infections(
  reported_cases = oh_hosp_14th,
  # Our generation time estimate is first preprocessed into a format the model
  # understands
  generation_time = generation_time_opts(generation_time),
  # Similarly our delay from infection to hospitalisation is also preprocessed.
  delays = delay_opts(inf_to_hospitalization),
  rt = rt_opts(
    # Here we specify a prior for the initial value of the reproduction number
    # We set this to be near 1 as we expect the epidemic to be growing slowly
    # at the start of the period of interest
    prior = list(mean = 1, sd = 0.1),
    # This indicates the period of the random walk we wish to use (7 days).
    rw = 7
  ),
  # Here we have turned off the default Gaussian process prior in favor of 
  # the random walk specified in rt_opts.
  # This speeds up the code.
  gp = NULL,
  # These options control the MCMC sampler used to estimate the posterior
  # This uses the No-U-Turn sampler with a target acceptance rate of 99%
  # and 2000 samples with 500 warmup iterations.
  stan = stan_opts(
    control = list(adapt_delta = 0.99),
    samples = 2000, warmup = 500
  ),
  # These options tell the model to use a Poisson error structure, and to
  # adjust for day-of-week effects
  obs = obs_opts(
    family = "poisson", week_effect = TRUE
  ),
  # We set a 14 day forecast horizon
  horizon = 14,
  # Should verbose output be printed? This is useful for interactive use.
  verbose = interactive()
)

⚠ The Omicron wave was a very different to previous waves of COVID-19 and is very challenging to model due to changes in infectiousness, severity and testing. We recommend using caution when interpreting these results.

Visualising the resulting nowcast

A simple summary table

We can use the summary() function to obtain a simple summary table of the nowcast and forecast. This can be useful for quickly checking the results of the model and for sharing with others.

summary(rt_estimates) |>
  kable()
measure estimate
New confirmed cases by infection date 753 (545 – 1013)
Expected change in daily cases Likely decreasing
Effective reproduction no. 0.9 (0.72 – 1.1)
Rate of growth -0.023 (-0.074 – 0.024)
Doubling/halving time (days) -30 (28 – -9.3)

Effective reproduction number estimates

Using the output of estimate_infections() we can visualise the effective reproduction number estimates using a call to plot() (this has a range of other plotting options which can be explored using ?EpiNow2:::plot.estimate_infections). Because we specified a weekly random walk for \(R_t\), our estimates show weekly steps.

plot(rt_estimates, type = "R")

Estimated latent hospitalizations by date of infection

We can also plot the estimated latent hospitalizations by date of infection. This is useful as it allows us to get a sense for the number of infections after accounting for delays from infection to hospitalization.

💡 This plot shows hospitalizations by date of infection not an estimate of infections by date of infection. This is because we are using hospitalizations as a proxy for infections without adjusting for the infection to hospitalization ratio. We could do this by specifying a value for scale in obs_opts() if we had a good estimate of this ratio.

plot(rt_estimates, type = "infections")

💡 As this plot is by date of infection we can use it to get a better sense of how the epidemic has changed over time. This could be useful if we were looking at the role of interventions as discussed in the slides (Estimates must be accurate, timely, and account for uncertainty) where it is critical to link intervention timing to changes in the epidemic.

Predicted hospitalizations

Another useful plot is the predicted hospitalisations. This can be obtained using the plot_estimates() function. This function takes a data frame of estimates and a data frame of reported cases and plots the estimates alongside the reported cases.

Whilst the forecast is useful it is important to note that it is based on the assumption that the reproduction number remains constant at the last estimated value (though other options are supported in the package). This is unlikely to always be true in practice and so the forecast should be communicated with this in mind.

rt_estimates |>
  pluck("summarised") |>
  filter(variable %in% "reported_cases") |>
  plot_estimates(
    reported = oh_hosp_28th_retro
  )

We see that the forecast performs relatively well when compared to more recent data.

💡 You should be able to see that the model can capture the day-of-the-week periodicity in the data that we discussed in the slides (Problem: Day of week effects). This is important as it makes us less likely to overestimate the reproduction number on weekends and underestimate it on weekdays, for example.

Summary

  • We have shown how to use EpiNow2 to estimate the effective reproduction number and forecast hospitalisations.
  • We have also explored how to use EpiNow2 to estimate the delay from infection to hospitalisation.
  • Finally, we have discussed some limitations of the approach and how to communicate these.

Strengths

  • The approach provides estimates for reproduction numbers, growth rates, and forecasts future hospitalizations using available data and accounting for real-world data issues.
  • The tutorial provides detailed guidance on how to estimate reporting delays which can be difficult using traditional methods due to right truncation.
  • EpiNow2 uses a renewal equation based method to estimate the reproduction number and convolves this with the incubation period and reporting delay to obtain a nowcast and forecast future hospitalizations. This approach has been widely used in the literature and is relatively easy to understand.

Limitations

  • The estimates may be affected by changes in hospitalization measures that lead to retrospective changes in the data.
  • The approach requires data on hospitalizations, but does not adjust for infection to hospitalization ratio, meaning that the results should be interpreted as estimates of hospitalizations rather than infections.

Getting more from this tutorial

Here are some ideas for how you can get more from this tutorial:

  • Vary the states being visualised, the state being used for modelling, or the time period (All of these options require you to download new data).
  • Explore the various different options available in estimate_infections() for specifying the model structure, priors, and MCMC sampler options (see ?estimate_infections for more details).
  • Try varying or removing the delay adjustment in estimate_infections() (see ?estimate_infections and the Delays from infection to hospitalization section in this tutorial for more details). What happens to the estimates?
  • Similarly, try updating the generation time distribution (see ?estimate_infections). What happens to the estimates?
  • Try exploring the synthetic data example for estimate_truncation(). How does altering the synthetic parameters impact the simulated degree of right truncation? How does the model perform when the degree of right truncation is varied?

Other resources

If you are interested in finding additional resources for estimating the effective reproduction number or learning about nowcasting in R, explore the following:

EpiNow2 resources

  • Documentation: The documentation for the EpiNow2 package. This is the package we will be using in this tutorial. It is designed to be easy to use, robust to a wide range of contexts, and flexible.
  • CDC Mpox Technical reports: For a recent example of EpiNow2 in use, see the CDC’s technical reports on Mpox. These reports use EpiNow2 to estimate the effective reproduction number and forecast future cases.
  • Reflections on two years estimating the effective reproduction number: This blog post reflects on the development of EpiNow2 and the challenges of estimating the effective reproduction number in real-time at scale.
  • Nowcasting example: This is a repository that uses simulated data to demonstrate how to nowcast using both EpiNow2 and Epinowcast.
  • Description of the first global outbreak of mpox: an analysis of global surveillance data: A global description of the 2022–23 multi-country mpox outbreak. This paper uses EpiNow2 to estimate the effective reproduction number with adjustments for known delays and right truncation.
  • Tutorial Q and A: If you have any questions about the tutorial, please post them here. We will try to answer them as quickly as possible.

Other packages

  • epinowcast: This package has been designed as the successor to EpiNow2 and is currently under development. It is designed to be more general and even more flexible than EpiNow2.
  • epidemia: This is another flexible package for estimating the effective reproduction number and forecasting. It is designed to be more flexible than EpiNow2 and epinowcast but is potentially more difficult to use. It also generally has less functionality for dealing with delays than EpiNow2 and epinowcast.
  • EpiEstim: This is a more mature package for estimating the effective reproduction number. It exploits a mathematically relationship to fit the renewal equation very quickly but is not currently able to handle reporting delays or to produce forecasts which the use of supporting packages.

Papers

 

Produced by Sam Abbott and the Center for forecast and outbreak analytics