Answering Reader Questions about Coronavirus Data
My previous post on data about coronavirus (COVID-19) spurred some reader questions. I’ve consolidated these by topic, and will do my best to answer them.
The evening news reported that the case fatality rate for flu is 0.1 percent. Where does that number come from, and is it calculated the same way as the rate for COVID-19?
The case fatality rate (CFR) for a disease is usually expressed as the percentage of persons who die from the disease among all persons who have been diagnosed with the disease in a specific time period. But there are a lot of ways of estimating the numerator and denominator for a CFR, and different methods for estimation as well as different data source inputs can lead to different estimates.
The number of deaths attributable to influenza, and the number of people who have had symptomatic cases of influenza, are estimated using data from a variety of sources. The method used by the U.S. Centers for Disease Control and Prevention (CDC) to estimate the number of deaths and cases of influenza is given here and here. It’s a multi-step process. In step 1, they collect data on flu hospitalizations from a judgment sample of surveillance areas. The CDC says the approximately 70 counties in the sample are geographically diverse (they’re located in California, Colorado, Connecticut, Georgia, Maryland, Michigan, Minnesota, New Mexico, New York, Ohio, Oregon, Tennessee, and Utah) and cover about 9% of the U.S. population. But the states and counties are not randomly selected, so the sample is not guaranteed by mathematical theory to be representative of the U.S. as a whole (a probability sample, which is randomly selected, would be guaranteed to be representative). The CDC’s FluView website displays the hospitalization rates for the counties in the surveillance areas, which are calculated by dividing the number of laboratory-confirmed influenza hospitalizations by the population of the counties. But the numbers in FluView underestimate the incidence of influenza in hospitals, for two reasons. First, testing for flu is done at the discretion of a patient’s medical team, so some people in the hospital who have influenza are not tested for it, sometimes because their illness is attributed to another cause. Second, a test does not always detect influenza in a person who has it. The CDC compensates for this by multiplying the FluView rates for different age groups by factors that account for the estimated proportion of people who are tested and the estimated proportion of tests (of people with the disease) that are positive.
The estimated number of hospitalizations in each age group serves as the basis for the estimated number of deaths and the estimated number of symptomatic cases. For symptomatic cases, the number of hospitalizations in each age group is multiplied by an estimated ratio (obtained from health surveys and other sources) of symptomatic cases in the community to hospitalizations. For deaths, the number of hospitalizations in each age group in the surveillance areas is multiplied by the ratio of deaths to hospitalizations, with a further adjustment to account for the estimated deaths that occur out of hospital. These rates are extrapolated from the sample to the U.S. population, under the assumption that data from the surveillance areas can represent the counties and states not in the sample.
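The chain of multipliers described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the CDC's actual model: the function names are made up for this example, every input value is hypothetical, and the assumption that the testing adjustment is a simple division by (proportion tested x test sensitivity) is a simplification.

```python
# Illustrative sketch of a multiplier-style burden estimate for ONE age group.
# All numbers are HYPOTHETICAL; they are not CDC estimates.

def adjust_rate(observed_rate, prop_tested, test_sensitivity):
    """Scale an observed lab-confirmed hospitalization rate upward to account
    for patients who were never tested and for tests that missed true cases.
    (Assumed form of the adjustment, for illustration only.)"""
    return observed_rate / (prop_tested * test_sensitivity)

def estimate_burden(observed_rate_per_100k, prop_tested, test_sensitivity,
                    population, cases_per_hospitalization,
                    deaths_per_hospitalization, out_of_hospital_factor):
    """Turn an observed hospitalization rate into estimated hospitalizations,
    symptomatic cases, and deaths."""
    adjusted_rate = adjust_rate(observed_rate_per_100k, prop_tested, test_sensitivity)
    hospitalizations = adjusted_rate * population / 100_000
    cases = hospitalizations * cases_per_hospitalization
    # The extra factor inflates in-hospital deaths to cover deaths elsewhere.
    deaths = hospitalizations * deaths_per_hospitalization * out_of_hospital_factor
    return {
        "hospitalizations": hospitalizations,
        "symptomatic_cases": cases,
        "deaths": deaths,
    }

# Hypothetical inputs: 50 confirmed hospitalizations per 100,000 people,
# half of flu patients tested, tests catch 80% of true cases.
example = estimate_burden(
    observed_rate_per_100k=50,
    prop_tested=0.5,
    test_sensitivity=0.8,
    population=1_000_000,
    cases_per_hospitalization=100,
    deaths_per_hospitalization=0.05,
    out_of_hospital_factor=1.2,
)
for quantity, value in example.items():
    print(f"{quantity}: {value:,.0f}")
```

Because each step multiplies the one before it, an error in any single factor (say, the proportion of patients tested) carries through to the estimated cases and deaths, which is one reason the final numbers come with wide ranges.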
There are a lot of assumptions and a lot of uncertainty in making these estimates, and the CDC gives a range for the estimated numbers of flu cases, hospitalizations, and deaths to reflect that uncertainty. The CFR for influenza varies from year to year and from country to country. It also depends on what quantity is used as the denominator. Early estimates of CFR for the 2009 pandemic influenza A/H1N1 from studies across the world ranged from 0.001% to 0.01% when the denominator was estimated number of infected persons (determined by serology studies); from 0% to 1.2% when the denominator was estimated number of symptomatic cases; and from 0.11% to 13.5% when the denominator was estimated number of laboratory-confirmed cases.
So far (as of March 9) in the 2019-2020 flu season, the CDC estimates that there have been 34 million symptomatic cases of flu and 20,000 deaths, giving an estimated CFR for this season of approximately 0.06%. For 2018-2019, there were an estimated 35.5 million cases and 34,000 deaths, giving an estimated CFR (using estimated symptomatic cases as the denominator) of 0.1%. To reiterate, these estimates depend heavily on the assumptions used to calculate numerator and denominator, and should be accompanied by a measure of uncertainty such as a confidence interval (the CDC web pages do not give enough information to calculate a confidence interval for the CFR although they provide separate uncertainty intervals for numerator and denominator).
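For readers who want to check the arithmetic, here is a minimal sketch that reproduces the two flu CFRs from the CDC point estimates quoted above. The function name `cfr_percent` is just an illustrative choice, and the calculation ignores the uncertainty in both numerator and denominator.

```python
def cfr_percent(deaths, cases):
    """Case fatality rate as a percentage: 100 * deaths / cases."""
    return 100 * deaths / cases

# CDC point estimates quoted in the text (symptomatic cases as denominator).
cfr_2019_20 = cfr_percent(deaths=20_000, cases=34_000_000)
cfr_2018_19 = cfr_percent(deaths=34_000, cases=35_500_000)

print(f"2019-2020 season (as of March 9): {cfr_2019_20:.2f}%")  # 0.06%
print(f"2018-2019 season: {cfr_2018_19:.2f}%")                  # 0.10%
```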
By contrast, since COVID-19 is so new, there isn’t enough information at present to be able to obtain a good estimate for any quantity that would be used in the denominator of a CFR. There is also uncertainty about the number of deaths, since some deaths caused by COVID-19 might be attributed to another cause. The CFRs reported right now simply divide the number of known deaths to date by the number of laboratory-confirmed cases to date. Using the World Health Organization statistics for March 9, that figure is 100 x 3809/109577 = 3.5%.
But this statistic, 3.5%, is not a reliable measure of the risk a person faces of dying if he or she contracts COVID-19, and many news stories are repeating it without saying where it comes from and what it means. At this point, not enough is known to be able to have a good estimate of CFR for COVID-19. Ideally, one would use number of persons who have contracted the disease as the denominator for the CFR, but that typically requires testing a random sample of the population for antibodies to the pathogen, and no reliable estimate of the total number of COVID-19 cases is available at this time. The number of laboratory-confirmed cases for COVID-19 underestimates the total number of infections, but the statistic is used for the denominator because it is available.
The CFR being reported for COVID-19 uses a different denominator than is typically used for influenza. For most CFRs given for influenza, the denominator is symptomatic cases, and this is based on modeling the number of cases in the general population (including estimates for people who are hospitalized but not tested, outpatients, and those who do not seek medical treatment). For COVID-19, the denominator is laboratory-confirmed cases. In general, the number of people in each possible denominator for a CFR has the following ordering:
lab-confirmed cases < symptomatic cases < people infected.
The number of symptomatic cases for flu is expected to be higher than the number of laboratory-confirmed cases because many people who get the flu never get tested or seek medical treatment. Similarly, some infected persons never exhibit symptoms, so the number of people infected is expected to be larger than the number of symptomatic cases. This means that a CFR with laboratory-confirmed cases as the denominator will often be larger than a CFR with symptomatic cases as the denominator, which in turn will be larger than a CFR with the number of people infected as the denominator (this pattern was seen above in the CFR ranges for the 2009 H1N1 flu). Thus, the influenza CFRs and the COVID-19 CFRs are measuring different things.
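A small sketch may make the effect of the denominator concrete. The counts below are hypothetical, chosen only so that they respect the ordering lab-confirmed < symptomatic < infected; they do not describe any real outbreak.

```python
# HYPOTHETICAL counts for a single outbreak with a fixed number of deaths.
deaths = 100
denominators = {
    "lab-confirmed cases": 5_000,
    "symptomatic cases": 50_000,
    "people infected": 80_000,
}

# Same numerator, three different denominators, three different "CFRs".
cfrs = {name: 100 * deaths / count for name, count in denominators.items()}

for name, value in cfrs.items():
    print(f"CFR using {name}: {value:.3f}%")
```

With the same 100 deaths, the rate falls from 2% to 0.2% to 0.125% as the denominator grows, so two CFRs can be compared only if they use the same kind of denominator.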
The number used in the denominator of the CFR being reported for COVID-19, laboratory-confirmed cases, is not a good measure of how many people have the disease. No one knows how many people have been symptomatic but have not had their cases confirmed by a laboratory, or how many people have been infected but have not displayed symptoms. Often, the earliest lab-confirmed CFRs reported for a disease are higher than later ones simply because testing in the early stages concentrates on the more severe cases, but it’s unknown right now whether this will be true for COVID-19.
The number of laboratory-confirmed cases depends heavily on how many tests have been performed, and the number of tests performed has varied greatly from country to country. In general, one would expect more testing to uncover more of the mild cases, which would result in a lower lab-confirmed CFR. South Korea, for example, has tested nearly 200,000 people as of March 9, with 7,478 confirmed cases and 53 deaths for a lab-confirmed CFR of 0.7%. The ratio of laboratory-confirmed cases to total number of infected persons is likely to be higher in countries with extensive testing programs than in other countries. The CFR of 3.5% is calculated using all cases in all countries, even though testing practices and the medical care available to patients vary widely from country to country.
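The contrast between the worldwide figure and the South Korean figure can be reproduced directly from the March 9 counts quoted in this post. The sketch below is only the division itself; the dictionary layout and names are illustrative.

```python
# Lab-confirmed counts as of March 9, as quoted in the text.
reports = {
    "Worldwide (WHO)": {"confirmed": 109_577, "deaths": 3_809},
    "South Korea": {"confirmed": 7_478, "deaths": 53},
}

# Lab-confirmed CFR = 100 * known deaths / laboratory-confirmed cases.
lab_cfr = {
    place: 100 * counts["deaths"] / counts["confirmed"]
    for place, counts in reports.items()
}

for place, value in lab_cfr.items():
    print(f"{place}: {value:.1f}%")
# Worldwide (WHO): 3.5%
# South Korea: 0.7%
```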
In the U.S., however, few people have been tested as of March 9 and it is likely that the number of cases is much higher than currently reported. Which brings us to the next question …
I went to your link for the CDC data and didn’t see the number of tests that have been done in the table. Where did you get the statistic that only 459 tests had been performed as of February 28?
When I wrote my February 28 post, the CDC website I gave in the link contained a table on the number of cases, deaths, and tests that had been performed. On March 2, the CDC changed the information posted on the site and removed the number of tests. Today (March 9) there is a footnote midway down the page that says “As of March 8, 2020 1,707 patients had been tested at CDC. This does not include testing being done at state and local public health laboratories, which began this week.”
If one includes state health departments and laboratories, certainly more than 1,707 patients have been tested in the United States. But I have not found an official government webpage that gives the number of nationwide tests. Many states do not post information on number of tests on their websites, so assembling the statistics is laborious. Robinson Meyer and Alexis Madrigal of The Atlantic contacted the public health departments of all 50 states plus the District of Columbia; as of March 6, they could only verify that 1,895 people had been tested altogether [on March 9, the authors updated this figure to 4,384 people tested]. These authors also point out that the number of cases on the CDC website is lower than the number listed by Johns Hopkins University.
I do not know why the CDC removed the statistics on testing from the website, but it would be of great help to researchers worldwide to have them. The CDC would be in a better position than journalists or independent researchers to gather the information from state health departments and laboratories.
I’m over 80 years of age. Do the statistics on CFR for people in my age group mean I have a 15% chance of dying if I get COVID-19?
Not necessarily. That statistic came from a study in China reporting 1,408 confirmed cases and 208 deaths among persons age 80+. The value of 14.8% (rounded to 15% in newspaper reports) was calculated as 100 x 208 / 1408. But there are several reasons why that statistic does not necessarily mean you have a 15% chance of dying if you get COVID-19. First, as stated earlier, the denominator for the CFR in any age group is highly uncertain at this point. It’s unknown how many people in any age group actually have the disease but have not been tested. The lab-confirmed CFR in China has gone down over time, possibly because patients with severe cases were overrepresented among those tested early on, and possibly because knowledge about medical treatment improves as more is learned about the disease.
Second, results from this one study do not necessarily apply in other situations and locations. For example, the most recent Global Adult Tobacco Survey indicated that more than half of Chinese men (compared with 2% of women) use tobacco products; perhaps smoking, or factors unique to Hubei province, are associated with higher or lower fatality risk. The study also indicated that persons with pre-existing respiratory conditions or diabetes had higher fatality rates, but the authors did not have enough data to look at mortality differences for persons with these conditions separately by age group.
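To see how much the uncertain denominator matters, here is a rough sketch that starts from the study's counts for persons age 80+ and asks what the rate would be if the true number of cases were some multiple of the confirmed number. The multipliers are purely hypothetical (no one knows the real value), the function name is made up for this illustration, and the sketch assumes no deaths were missed, which may not be true.

```python
# Counts for persons age 80+ from the study quoted in the text.
confirmed_cases = 1_408
deaths = 208

def cfr_if_undercounted(deaths, confirmed, cases_per_confirmed):
    """CFR (%) if the true case count were `cases_per_confirmed` times the
    lab-confirmed count. Assumes the death count itself is complete."""
    return 100 * deaths / (confirmed * cases_per_confirmed)

# HYPOTHETICAL multipliers: 1 means every case was confirmed by a laboratory.
adjusted = {
    factor: cfr_if_undercounted(deaths, confirmed_cases, factor)
    for factor in (1, 2, 5)
}

for factor, value in adjusted.items():
    print(f"true cases = {factor} x confirmed: {value:.1f}%")
# true cases = 1 x confirmed: 14.8%
# true cases = 2 x confirmed: 7.4%
# true cases = 5 x confirmed: 3.0%
```

None of these rows is an estimate of anyone's actual risk; the point is only that the 14.8% figure moves a great deal under modest changes to an unknown denominator.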
Two people of the same age can have very different risk of getting or succumbing to a disease. The American College of Cardiology Cardiovascular Disease Risk estimator, for example, asks about cholesterol levels, blood pressure, and smoking status, and the estimated risk of getting cardiovascular disease in the next 10 years depends heavily on these three variables. Something similar is likely true for COVID-19.
Postscript added March 14, 2020
More than two months after the initial reports of COVID-19 in China, the U.S. still does not have reliable estimates of the number of people infected. The CDC has now removed even the footnote giving the number of people tested in its labs from the website on number of cases, and the number of cases on that website undoubtedly severely underestimates the true number of people infected. The website now states: “Now that states are testing and reporting their own results, CDC’s numbers are not representative of all testing being done nationwide.” True, but the CDC could gather and report the data from the states on this page.
I said in my first post on COVID-19 that “data collection and analysis will be the key to managing COVID-19” and that is still true. But the data are needed now.
Copyright (c) 2020 Sharon L. Lohr