Over the past few months, we have seen many coronavirus statistics. We have also seen many statistical errors, which have made some of us skeptical. My complaint is that not enough of us have been skeptical. From the point of view of someone who uses statistics as part of his work, these errors are elementary. Allow me to name a few of them and show how they can deceive people.
Choose which attribute to measure (deaths versus cases). The COVID-19 pandemic is an experiment with seven billion guinea pigs: us. At the beginning of the experiment, both cases and deaths were counted. A case was counted when a doctor carefully examined the patient and judged the illness to be COVID-19. A death was counted when a doctor followed the patient until the patient’s unfortunate demise and again judged whether COVID-19 was the cause. The latter decision is likely more accurate, so the death count is the more reliable statistic. Counting cases is still useful because dividing deaths by cases gives an estimate of the probability of death given infection. In the first few months, this estimate varied a great deal, which means that at least one of the two counts was poorly measured. Since deaths are the more reliable count, the inaccurate statistic is almost certainly the number of cases.
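To make that reasoning concrete, here is a minimal Python sketch. The function name `case_fatality_ratio` and all of the numbers are hypothetical, chosen only for illustration; they are not real COVID-19 figures. It holds the death count fixed and shows how much the estimated probability of death moves when only the case count changes.

```python
def case_fatality_ratio(deaths: int, cases: int) -> float:
    """Estimate the probability of death given infection as deaths / cases."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    return deaths / cases


# Hypothetical numbers, for illustration only (not real data).
deaths = 100  # assumed to be counted reliably

# The same outbreak, with cases counted under two different methods.
case_counts = {
    "doctor's examination only": 1_000,
    "widespread testing": 5_000,
}

ratios = {
    label: case_fatality_ratio(deaths, cases)
    for label, cases in case_counts.items()
}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.1%}")
```

With the same 100 deaths, the estimate falls from 10% to 2% purely because more cases were found, which is why a widely varying ratio points at the case count.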
Measure the attribute the same way throughout the experiment (avoid changing the way “cases” is defined). Initially, cases were defined by a doctor’s examination. Later, a test for the virus was used. Later still, a test for the virus or for antibodies was used. Currently, the patient and anyone with whom he had recent contact are called “cases.” The CDC is responsible for much of this. For example, the CDC has recently been asking states to change the way they define cases. If that does not sound confusing enough, only some states are making this change, so the statistics will not be comparable from state to state.
Look for problems with data integrity. There have been accusations of people fudging the data. For example, the CDC has been accused of changing old statistics.
Normalize the data (take into account different population sizes). Often, COVID-19 statistics are displayed state by state. Does such a display say that New York and California have a lot of COVID-19 deaths, or just a lot of people? It turns out that as of 17 July 2020, New York had 167 COVID-19 deaths per 100,000 people, and California had just 19.
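The normalization itself is a single division. Here is a minimal Python sketch, assuming you have a death count and a population for each region. The helper `deaths_per_100k` and the two regions are hypothetical, and the figures are made up for illustration rather than taken from any state.

```python
def deaths_per_100k(deaths: int, population: int) -> float:
    """Return deaths per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths * 100_000 / population


# Hypothetical figures, for illustration only (not real state data).
regions = {
    "Region A": {"deaths": 1_000, "population": 2_000_000},
    "Region B": {"deaths": 3_000, "population": 30_000_000},
}

rates = {
    name: deaths_per_100k(r["deaths"], r["population"])
    for name, r in regions.items()
}

# Sort from highest to lowest rate so the comparison is easy to read.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.1f} deaths per 100,000")
```

In this made-up example, Region B has three times as many deaths as Region A, but only one fifth of the deaths per 100,000 people. The raw counts and the normalized rates tell opposite stories.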