COVID-19 Prediction Models, Explained with Pictures

“Throughout the course of the outbreak, the model predicts 2,777 total deaths in Georgia.”

“The day when Florida is projected to have its most coronavirus deaths and will be using peak hospital resources is about two weeks earlier than originally calculated, data shows…. Now the date is April 21.”

You’ve likely seen news stories such as these, referencing “model predictions” for your state. In this column I’ll talk about the model most commonly used, and why the predictions and their accuracy change as more data accumulate. No formulas, just pictures (I promise)!

IHME Model Predictions for Arizona

Figure 1 shows the predictions for total number of deaths from the model developed by the Institute for Health Metrics and Evaluation (IHME) at the University of Washington for Arizona, using the data released on April 13. The model assumes that Arizona will implement three of four social distancing strategies through the end of May, and that after the end of May there will be a robust program of testing, contact tracing, and other measures to control the spread of the disease.

The solid blue line in Figure 1 shows the predicted number of total (cumulative) deaths at each date from January 3 to August 4. The shaded area is the uncertainty interval for the prediction (akin to a 95% confidence interval) for April 13 and later. As you can see, the shaded area is pretty large; the predicted total number of deaths on August 4 (blue line) is 1,005 but the range extends from 208 to 3,618.

Figure 1. Projected number of deaths in Arizona through August 4 from the IHME model (as updated on April 13, 2020). Data source: https://covid19.healthdata.org/united-states-of-america. April 13 is the date where the blue shading starts in the graph.

Types of Models

Lots of models predicting the trajectory of the disease exist (see Brian Resnick’s excellent nontechnical summary of these at Vox.com).

Epidemiologic models (one highly cited example is from Imperial College) estimate the number of persons with the disease (and the number of deaths from the disease) at different time points as functions of the infection rate (how many other persons will be infected by someone with the disease), the case fatality rate (what percentage of persons infected by the disease die from it), the incubation period, the amount of time an infected person is contagious, the sizes of the susceptible and infected populations at an initial time point, and other factors. For such a model to make accurate predictions, both the model and the data must be accurate. The form of the mathematical model must capture the main features of disease transmission in the population (because the dynamics are so complicated, the models usually are too). And you need accurate estimates of the model parameters such as infection rate and incubation period (often these parameters differ for different areas or subpopulations). If the model fails to capture the mechanism for disease transmission, or the parameter estimates are inaccurate, you can end up with poor predictions.
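
To make those ingredients concrete, here is a minimal sketch of one of the simplest textbook epidemiologic models, a discrete-time SIR (susceptible, infected, removed) model. It is not the Imperial College model, which is far more detailed. Every name and number in it (simulate_sir, the transmission and removal rates, the population size, the case fatality rate) is an assumption I chose for illustration, not an estimate for COVID-19.

```python
def simulate_sir(population, initial_infected, transmission_rate, removal_rate, days):
    """Step a basic SIR model forward one day at a time.

    transmission_rate: new infections caused per infected person per day
                       when everyone else is susceptible (hypothetical value).
    removal_rate:      1 / (number of days a person stays contagious).
    Returns a list of (susceptible, infected, removed) tuples, one per day.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    removed = 0.0
    history = [(susceptible, infected, removed)]
    for _ in range(days):
        # Infections need contact between an infected and a susceptible person.
        new_infections = transmission_rate * infected * susceptible / population
        new_removals = removal_rate * infected
        susceptible -= new_infections
        infected += new_infections - new_removals
        removed += new_removals
        history.append((susceptible, infected, removed))
    return history


# Illustrative inputs only; none of these are COVID-19 estimates.
POPULATION = 1_000_000
CASE_FATALITY_RATE = 0.01

for rate in (0.15, 0.30):
    history = simulate_sir(POPULATION, 10, rate, 0.10, 365)
    peak_infected = max(infected for _, infected, _ in history)
    deaths = CASE_FATALITY_RATE * history[-1][2]
    print(f"transmission rate {rate:.2f}: "
          f"peak infected {peak_infected:,.0f}, deaths {deaths:,.0f}")
```

Changing only the transmission rate changes both the height of the peak and the final death count, which is why inaccurate parameter estimates can lead to poor predictions from this kind of model.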

Statistical models, such as the IHME model, do not attempt to describe the mechanisms for disease transmission in the population. Instead, these models fit curves to data from around the world, and use those data and curves to predict the trajectory of the disease. As more data accumulate, and as more countries pass the period of peak infection, the predictions from this sort of model are expected to improve. You don’t have to estimate disease-related parameters such as infection rate or case fatality rate for this type of model — these are implicitly captured in the curve-fitting. But statistical models also have assumptions, as I’ll discuss in the last section. The form of the curve must be flexible enough to describe the actual trajectory of the disease, and, as with an epidemiological model, you need accurate data on the number of infections and deaths.

Form of the IHME Model

First, a disclosure. The following material is my explication of the IHME model from the article and technical appendix published online on March 30. Anyone commenting on the model should refer to the original article for the mathematical details.

The IHME model assumes that for each locality, the number of daily deaths starts at zero, increases to a peak, and then decreases back to zero. Many types of curves do this, but the IHME researchers stated that the curve forms shown in Figure 2 fit the data better than the alternatives (these curves also have the advantages of being well known to statisticians and, importantly, of being supported by many computer programs for fitting).

Figure 2. Function form used to estimate (a) daily number of deaths and (b) total (cumulative) number of deaths for each location.

Figure 2 shows (a) the basic form of the curve predicting the number of daily deaths and (b) the form of the curve predicting the cumulative number of deaths, which for each time point t is calculated as the area under the top curve to the left of t. The top curve has the same functional form as the density curve of the normal distribution, but here it is not used as a probability distribution — it’s used because this type of shape provides a good fit to the pattern of daily deaths over time. The solid circle in the bottom graph marks the inflection point of the curve (the point where the curve changes from a smile to a frown), and corresponds to the time of peak daily deaths.
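
For readers who would rather see the two curves as code than as pictures, here is a minimal sketch, assuming the bell-shaped form just described. The function names, the "spread" parameter, and the numbers are my own choices for illustration; they are not the IHME parameterization or fitted values.

```python
import math


def daily_deaths(day, total, peak_day, spread):
    """Bell-shaped daily curve whose total area equals `total`."""
    z = (day - peak_day) / spread
    return total * math.exp(-0.5 * z * z) / (spread * math.sqrt(2 * math.pi))


def cumulative_deaths(day, total, peak_day, spread):
    """Area under the daily curve to the left of `day`."""
    z = (day - peak_day) / (spread * math.sqrt(2))
    return 0.5 * total * (1 + math.erf(z))


# Hypothetical values chosen only to draw the shapes.
TOTAL, PEAK_DAY, SPREAD = 1000, 60, 12

for day in (30, 60, 90):
    print(day,
          round(daily_deaths(day, TOTAL, PEAK_DAY, SPREAD), 1),
          round(cumulative_deaths(day, TOTAL, PEAK_DAY, SPREAD), 1))
```

At the peak day the cumulative curve has reached exactly half of its final total, which is the inflection point marked by the solid circle in the bottom graph.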

The curve for Arizona shown in Figure 1 is for total number of deaths, and it has the same basic shape as shown in Figure 2(b). It starts slowly, then has a rapid increase until the inflection point (predicted to be April 30), and then starts to flatten as social distancing and other mitigation measures take effect.

One advantage of this type of curve is that it can assume a wide variety of different shapes, depending on the parameters. The IHME model relies on three parameters in fitting the curve in Figure 2(b) for different locations (these in turn can depend on other covariates, such as the time at which social distancing was implemented):

  • the time at which the peak number of daily deaths occurs (the location of the inflection point on the horizontal axis)

  • the total number of deaths on August 4

  • the growth rate, or steepness of the curve

Figure 3 shows eight of the infinitely many possible curves that can be drawn using the same basic form shown in Figure 2(b). Each of the eight curves combines one of two times at which the peak number of daily deaths occurs (earlier or later), one of two levels of total number of deaths on August 4 (lower or higher), and either slower or faster growth. Of course, infinitely many more combinations are possible, which makes this basic function extremely flexible for modeling projections for different localities.

Figure 3. Eight of the infinitely many possible curves that can be drawn using the basic IHME model.
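
The eight combinations in Figure 3 can be generated with a short sketch like the one below. The two levels I use for each parameter, the 215-day horizon, and the treatment of the final plateau as the total on the last day are all hypothetical choices; the values behind the actual figure are not given here.

```python
import itertools
import math


def cumulative_deaths(day, peak_day, total, growth):
    """S-shaped cumulative curve with its inflection point at `peak_day`."""
    return 0.5 * total * (1 + math.erf(growth * (day - peak_day)))


# Two hypothetical levels for each of the three parameters.
peak_days = {"earlier": 45, "later": 75}
totals = {"lower": 500, "higher": 2000}
growth_rates = {"slower": 0.04, "faster": 0.08}

curves = {}
for (peak_name, peak), (total_name, total), (growth_name, growth) in itertools.product(
        peak_days.items(), totals.items(), growth_rates.items()):
    # One value per day over an assumed 215-day horizon (days 0 through 214).
    curves[(peak_name, total_name, growth_name)] = [
        cumulative_deaths(day, peak, total, growth) for day in range(215)
    ]

for name, curve in curves.items():
    print(name, "day 60:", round(curve[60]), "final day:", round(curve[-1]))
```

Two levels for each of three parameters give 2 x 2 x 2 = 8 curves; letting each parameter take any value gives the infinitely many curves mentioned above.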

Estimating the Curve for a Locality

The IHME team assembled a database of information on cases and deaths from around the world. Many states (such as Arizona, in Figure 1) have very little data on which to base the whole curve. The red points in Figure 4 show the data for Arizona on the actual cumulative number of deaths through April 12, drawn on the same scale as the predicted curve in Figure 1. As you can see, those data contain too little information to pin down the full curve; you can imagine that many different curves of the type in Figure 3 might fit the available data for Arizona pretty well. In particular, Arizona is still in the “rapid growth” stage of the number of deaths, and has not yet reached the inflection point (peak number of daily deaths) of the curve.

Figure 4. Data for the total number of deaths in Arizona, through April 12. There is little information on which to fit the whole curve, so model predictions for Arizona rely heavily on information from other localities.

The IHME model estimates the three parameters for each locality using a combination of (1) the data from that locality and (2) estimates of the parameters using data from all the 142 localities in the database. (Statisticians call this a mixed effects model.) For localities with a lot of data (that is, those that are past the peak in daily deaths), the fitted curve relies more heavily on (1) and less on (2). For localities such as Arizona, where there is not as much information about the curve from the locality’s data alone, the predicted curve relies less on (1) and more heavily on (2). Thus, the predicted number of deaths for Arizona is based heavily on experiences in other localities that have more data. The shaded area in Figure 1, representing the uncertainty, is large because at the time the model predictions were calculated, there was a lot of uncertainty about when Arizona would reach its peak daily deaths, what the growth rate would be, and what the total number of deaths would be.
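
The idea of leaning on (1) or (2) depending on how much a locality's own data can tell you can be sketched with a textbook precision-weighted average. This is only an illustration of the borrowing idea, not the IHME fitting procedure; the function name and every number below are hypothetical.

```python
def pooled_estimate(local_estimate, local_variance, overall_estimate, overall_variance):
    """Blend a locality's own estimate with the all-localities estimate.

    Each estimate is weighted by its precision (1 / variance), so the
    noisier the local estimate, the more the result borrows from elsewhere.
    Returns (blended_estimate, weight_on_local_data).
    """
    local_precision = 1.0 / local_variance
    overall_precision = 1.0 / overall_variance
    local_weight = local_precision / (local_precision + overall_precision)
    blended = local_weight * local_estimate + (1 - local_weight) * overall_estimate
    return blended, local_weight


# Hypothetical "day of peak deaths": local data say 120, other localities say 100.
sparse = pooled_estimate(120, 400, 100, 25)  # little local data, very noisy
rich = pooled_estimate(120, 4, 100, 25)      # lots of local data, precise

print("sparse-data locality:", sparse)
print("data-rich locality:  ", rich)
```

The sparse-data locality ends up close to the all-localities value, while the data-rich locality stays close to its own estimate, which mirrors the way the Arizona curve leans on other places until more Arizona data arrive.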

As more data accumulate for Arizona, its curve will rely more heavily on the Arizona-specific data and won’t need to borrow as much from the experiences of other localities. Also, the parameters estimated from other localities, and from Arizona, will be more precise as more data accumulate. Thus, as more data accumulate, the predictions from the model are expected to be more accurate, and the area of the shaded region in the graph (representing the uncertainty about the prediction) is expected to decrease. In addition, as more data accumulate, the modelers may be able to add more features to the model to provide a better fit to the data.

Assumptions for the Model (or, What Might Go Wrong in the Predictions)

One consideration to keep in mind is that the uncertainty estimates for the predictions are based on assumptions about the model and the data. If any of these assumptions are violated, then the predictions from the model will not be as accurate as the shaded regions indicate.

Here are some of the assumptions that might affect the accuracy of the predictions (but not be reflected in the uncertainty estimates):

  • Curves of the general form in Figure 2 will accurately describe the daily number of deaths, and total number of deaths, for each location. The authors of the IHME paper state that the curve in Figure 2(b) fits the data on total number of deaths better than a logistic function, but it’s possible that neither curve will end up describing the cumulative number of deaths well for localities that are early in the disease trajectory. In particular, the curve in Figure 2(a) assumes that the number of daily deaths goes down (after social distancing takes effect or as population saturation occurs) at the same rate that it went up, but it’s possible that the decline in deaths might be faster or slower than the incline.

    The curves also do not consider a possible second wave of infections.

  • Localities will maintain (or in some cases, implement) the social distancing practices assumed by the model. For example, the projections for South Dakota assume that the state will implement 3 of the 4 social distancing practices on April 19, but as of April 13 South Dakota had implemented only one of these.

    One would expect to see another increase in number of deaths in localities that relax social distancing or reopen the economy too soon. In that case, the predictions would turn out to be inaccurate because the behavior assumed in the predictions changed.

    In addition, for states such as Arizona that do not yet have much data, the projections are based primarily on experiences in other places around the country and around the world. The uncertainty limits for the Arizona predictions are based, in part, on the variability in trajectories from places that have had cases for a longer period of time. But since there are currently relatively few places in the world that are on the downslope for daily deaths, that variability may be underestimating the diversity of experiences one might see in other parts of the world, and thus the shaded blue area in Figure 1 may be too narrow.

  • Data on deaths are accurate. But there is evidence that in many areas, the official counts of deaths ascribed to COVID-19 are lower than the true number of persons who have died from the disease. The official statistics often count only cases in which the decedent died in a hospital and was tested for the virus, excluding people who died at home and/or who were not tested. On April 14, for example, New York City added more than 3,700 people to the death toll for the city because health officials started including people who were presumed to have died from the virus but were not tested for it.
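
The symmetry assumption in the first bullet above can be made concrete with a small sketch. The two-sided curve below is a hypothetical alternative that I made up to show what a slower decline would look like; it is not part of the IHME model, and the numbers are for illustration only.

```python
import math


def two_sided_daily_deaths(day, peak_height, peak_day, rise_spread, fall_spread):
    """Bell-shaped curve that may rise and fall at different rates.

    With rise_spread == fall_spread this is the symmetric shape of
    Figure 2(a); a larger fall_spread gives a slower decline after the peak.
    """
    spread = rise_spread if day < peak_day else fall_spread
    z = (day - peak_day) / spread
    return peak_height * math.exp(-0.5 * z * z)


PEAK_HEIGHT, PEAK_DAY = 40, 60  # hypothetical values

for label, fall_spread in (("symmetric", 12), ("slower decline", 24)):
    before = two_sided_daily_deaths(PEAK_DAY - 10, PEAK_HEIGHT, PEAK_DAY, 12, fall_spread)
    after = two_sided_daily_deaths(PEAK_DAY + 10, PEAK_HEIGHT, PEAK_DAY, 12, fall_spread)
    print(f"{label}: 10 days before peak {before:.1f}, 10 days after peak {after:.1f}")
```

In the symmetric case the daily deaths 10 days before and 10 days after the peak are equal; in the slower-decline case the number after the peak is higher, so a symmetric curve fit to the same data would predict too few total deaths.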

The fact that predictions from the model change as more data accumulate is a good feature of the model, showing its adaptability to new situations. The predicted total number of deaths in the United States dropped from 81,000 in the model fit in the article (using data through March 25) to 69,000 when the model was refit using data through April 13. This is not because the model is deficient, but because the more recent data show the effects of social distancing and other behavior changes that were implemented in the meantime.

If you are standing on the railroad tracks with a train barreling at you, the models from physics predict that in two minutes you are going to be flattened. If you move off of the tracks and escape, that doesn’t mean that the models were wrong, but that the data fed into the models have changed, leading to a different prediction. The same thing would be expected for any type of model predicting outcomes from COVID-19. If behavior changes, we would expect the model predictions to change too. This is not evidence that the models are wrong (as some have claimed), but that the predictions improve as more information accrues.

Copyright (c) 2020 Sharon L. Lohr



