On Damn Lies & Statistics

Quantization is the process of mapping input values from a large (or continuous) set to output values in a smaller (more countable) set. “Rounding” and “truncation” are typical quantization processes that you hear about. Sorting values into percentiles, quartiles, or quintiles is another. When we work on voter models, for instance, our clients are usually looking for some meaningful way to sort between different blocs of voters, and we land on some process to essentially separate them into distinct piles. What that process yields is a quantization of the data. Though it is not spoken of as such, terms like “Prime Voters” and “Double Prime Voters” are actually someone or other’s quantizations of a voter file.
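To make the idea concrete, here is a minimal sketch of quantizing a voter file into equal-sized piles. The `quantize` helper and the 20 scores are hypothetical, invented purely for illustration; they are not drawn from any real voter model.

```python
def quantize(scores, n_bins):
    """Assign each score to one of n_bins equal-sized rank buckets (0 = lowest)."""
    # Rank the positions by score, then split the ranking into n_bins piles.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(scores)
    return bins


# Hypothetical propensity scores for 20 voters (illustrative values only).
scores = list(range(5, 101, 5))  # 5, 10, ..., 100

quartiles = quantize(scores, 4)
quintiles = quantize(scores, 5)

# The "top pile" depends entirely on how many piles we chose to make.
top_quartile = [s for s, b in zip(scores, quartiles) if b == 3]
top_quintile = [s for s, b in zip(scores, quintiles) if b == 4]

print("Top quartile:", top_quartile)
print("Top quintile:", top_quintile)
```

In this toy file the voter scored 80 makes the top quartile but falls out of the top quintile. Nothing about that voter changed, only the number of piles.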

There is not necessarily a right or wrong answer to how you quantize. It depends on what you need to do. Perhaps you are right to identify the top fifth of voters through some statistical means, but it may turn out that you could have carried the top quarter of voters instead. Even going with quintiles over quartiles, for instance, is a meaningful decision that will “bias” your output. In operative politics, there isn’t a specific reason to use a specific statistical technique when you are pursuing real world outcomes outside of a lab.

And that is what this post is really about: how quantization can be used to manipulate a presentation of data. What follows are 3 line charts that were all made using the exact same dataset. For all intents and purposes we are looking at some kind of data that is tallied for each day of the week, and we are seeing Monday through Sunday, categorically, on each of these matched charts. As you peruse them, do you notice how the first chart elicits the most dramatic effect, whereas the third one suggests something relatively muted? Do you know what varies between each of these 3 charts?

Yep, it’s the quantization. But in this case we are not mapping a collection of values down to a smaller set; we are doing something else: manipulating the y-scale of the charts. While each chart is the same width, reflecting 7 days of data, the actual height of the number line shown to you here varies. The apparent slope of any line or curve on a chart like this is ultimately determined by how much height you supply.

In short, to make a curve more dramatic, remove headroom from the chart so that the top of the axis skims over the highest value you’re charting. On the other hand, to flatten or mute something you’re measuring, simply make the y-axis taller to inflate the space. That is all that is happening as we move from one chart to the next here: we are manipulating the y-scale. This is one of the most frequent ways to lie with statistics. When comparing the campaign finance of various campaigns with our visual work, we make a strong point of examining them on identical y-axes.
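The effect of headroom can be put in numbers. The sketch below assumes seven hypothetical daily values (invented for illustration) and three assumed y-axis ceilings, and computes what fraction of the chart's height the line actually occupies under each one. The function name `vertical_span_fraction` is my own.

```python
def vertical_span_fraction(values, y_min, y_max):
    """Fraction of the chart's height covered by the data's full swing."""
    return (max(values) - min(values)) / (y_max - y_min)


# Hypothetical Monday-through-Sunday tallies (illustrative values only).
values = [0.1, 0.3, 0.2, 0.6, 0.4, 0.5, 0.2]

# Three assumed y-axis ceilings: tight, roomy, and very roomy.
ceilings = [0.6, 1.2, 3.0]
fractions = [vertical_span_fraction(values, 0.0, top) for top in ceilings]

for top, frac in zip(ceilings, fractions):
    print(f"y-axis 0 to {top}: line spans {frac:.0%} of the chart height")
```

With the axis capped at the highest value, the same swing fills about 83% of the chart; at five times that ceiling it fills about 17%. The data is identical in every case.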

The values displayed on the charts above range from 0.1 to 0.6, by the way. And it does not matter if these represent vast deltas or infinitesimal ones; the units we’re in don’t even matter here. How you are led to perceive the output, that is what matters. There are no y-axis labels on the charts above, but usually you are shown just one chart at a time and you accept the y-scale, even when it is shown to you. When making an inference from data it is very important to know whether you are able to form an opinion from the actual facts and figures, or whether you are just being shown a line that somebody massaged into a desired format. This is especially important when you are shown two lines on a graph and meant to think they correlate, when in actual fact they do not even share a y-scale with each other and may not be compared as such.

Now let’s repeat this exercise with 3 maps. All 3 maps here feature the same data too: we are tallying all individual contributions to candidates for House, Senate, and President that were filed in 2020 and itemized under Schedule A. These maps detail an expenditure of just under $19,000,000,000.00. That’s 19 billion. So we’re dealing with very large figures, and we’re also trying to paint in a useful amount of contrast, since these maps are meant to show you where that money is coming from.

So what you’re seeing here are 3 alternative ways to quantize this data. Essentially, we are shading each State on this map as an expression of some ratio. Watch what happens on this map as we slide that ratio across 3 settings and you will realize that how you want to use this map is very important.
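The post does not spell out the exact ratio behind the shading, so the following is only a sketch under one assumption: each state's shade is its total divided by a ceiling, where the ceiling is some fraction (`ratio`) of the largest state's total, and anything above the ceiling is fully saturated. The region names and dollar amounts are placeholders, not real contribution figures.

```python
def shade(amount, largest, ratio):
    """Return a shade from 0.0 (blank) to 1.0 (fully saturated).

    Assumed scheme: the ceiling is `ratio` times the largest total,
    and any amount at or above the ceiling is clamped to 1.0.
    """
    ceiling = largest * ratio
    return min(amount / ceiling, 1.0)


# Placeholder regions and totals (illustrative values only).
totals = {"A": 1000, "B": 400, "C": 100, "D": 10}
largest = max(totals.values())

# Slide the ratio across three settings and recompute every shade.
maps = {}
for ratio in (1.0, 0.25, 0.01):
    maps[ratio] = {name: shade(amt, largest, ratio) for name, amt in totals.items()}
    print(ratio, {name: round(s, 2) for name, s in maps[ratio].items()})
```

Under this assumption, a ratio of 1.0 lights up only the largest region, 0.25 brings the second-tier region to full saturation alongside it, and 0.01 saturates everything, which leaves a map with no contrast at all.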

Imagine which of these 3 maps you might want to use if your advocacy is centered on the importance of Illinois. It might be that second map, where Illinois pops into the picture. You’d publish that map with your press release. Map #1 might be better if you are interested in highlighting outcomes for NY, CA, TX, and FL. Map #3 is entirely useless, and I put it there because I just wanted to show you that there definitely can be too much of a good thing with quantization. Pull hard on the ratio in the other direction and soon you’re lighting up CA and NY only, or you get a map with no shading at all. There isn’t necessarily a right answer, but in consultation there is usually one answer that is best.