Case Study: How India Eats

When I saw this image, I knew it had to be the first case study.  Why? Because I would clearly bite off more than I could handle, and I would learn a whole lot in the process.  To begin, Huffington Post India ran the article Vegetarian India A Myth? Survey Shows Over 70% Indians Eat Non-Veg, Telangana Tops List, which featured this map:

How India Eats – Original

It took me several read-throughs to understand what it was trying to say.  Probably about four, which is about three too many.  Perfect for modification.

Now, let’s break down the image according to our analysis standard.

Overview

The visualization aims to show which regions in India are vegetarian and which are non-vegetarian.  The data is pulled from census data published by the Government of India.  The image represents the percentage of vegetarians with green circles and non-vegetarians with red circles.  These circles are then placed on top of each other, with the dominant (majority) color on the outside.  The circles are then placed in the approximate center of the Indian state they represent.

Explanation of data

First, let’s look at the data.  The data source is the Government of India’s census site.  Specifically, this data is from the Sample Registration System Baseline Survey 2014, which was published in 2016.  The data set is a sample of 8858 individuals across India’s 36 states and territories.  The data about eating habits is one of many tables in the complete dataset.

Reviewing the dataset raises a few concerns about the data and how it is used:

  • The dataset never actually states how these individuals were polled and does not indicate if there was a potential selection bias.
  • Table 5.2, which is the basis of the dataset, breaks the rate of vegetarianism down by male and female.  The graphic appears to use a simple mean of those two numbers.  However, the dataset never states what the breakdown of males and females is overall, or within individual states.  In some states, the difference is as much as five percentage points.
  • With 8858 individuals polled, the overall sampling error is relatively small, about 1%.  When we look at the number of individuals sampled in the large states (which the map uses), the smallest sample is 212 individuals, while the largest is 662.  This places the sampling error for the individual states somewhere between 4% and 7%, which is not insignificant.
  • Finally, the dataset lacks any context for the actual population of each state.  For example, Uttar Pradesh, which is the most populous state in India, has a population twice that of West Bengal.  However, in the survey, West Bengal is represented by 555 individuals, while Uttar Pradesh is represented by 500.  This matters because West Bengal is less than 2% vegetarian, while Uttar Pradesh is about 47% vegetarian.  I seriously doubt that the sample adequately represents the population.
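The sampling-error figures in the list above can be roughly reproduced with the textbook margin-of-error formula for a proportion.  This is only a sketch: it assumes a simple random sample, a 95% confidence level (z = 1.96), and the worst case p = 0.5, none of which the dataset confirms.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion.

    Assumes a simple random sample of size n; p=0.5 is the worst case
    and z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(p * (1 - p) / n)


# Sample sizes quoted in the text above.
samples = [
    ("all respondents", 8858),
    ("smallest large-state sample", 212),
    ("largest large-state sample", 662),
]

for label, n in samples:
    print(f"{label:>28}: n={n:5d}  +/- {margin_of_error(n):.1%}")
```

Under those assumptions the three values come out to about 1.0%, 6.7%, and 3.8%, which matches the "about 1%" and "between 4% and 7%" ranges quoted above.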

The data is one-dimensional and binary.  Each respondent is either vegetarian or not, so each state reduces to a single percentage.  There are a number of different ways we could represent this, which this graphic did not use.

It should be noted, the data visualization likely had good intentions.  The creator of the graphic was probably told to make a map, so they made a map, utilizing the data as best as possible.

Recommendation 1:  The data visualization should account for the fact that different states have different populations which are not currently represented.
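To see why the population weighting matters, here is a small two-state illustration.  The survey counts and vegetarian shares are the approximate figures quoted above; the populations are rounded 2011 census values in millions, which I am adding as an assumption.

```python
# Approximate inputs: "sampled" and "veg_share" come from the text above
# (veg_share is rounded: "less than 2%" and "about 47%").
# "population_m" is an assumption: rounded 2011 census totals in millions.
states = {
    "West Bengal": {"sampled": 555, "population_m": 91.3, "veg_share": 0.02},
    "Uttar Pradesh": {"sampled": 500, "population_m": 199.8, "veg_share": 0.47},
}


def weighted_share(rows, weight_key):
    """Weighted average of veg_share using the given column as the weight."""
    total_weight = sum(row[weight_key] for row in rows.values())
    weighted_sum = sum(row[weight_key] * row["veg_share"] for row in rows.values())
    return weighted_sum / total_weight


by_sample = weighted_share(states, "sampled")
by_population = weighted_share(states, "population_m")

print(f"weighted by survey count: {by_sample:.1%}")
print(f"weighted by population:   {by_population:.1%}")
```

With these rough inputs, the combined vegetarian share is about 23% when weighted by survey count and about 33% when weighted by population, a gap of roughly ten percentage points from the weighting alone.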

Explanation of visualization techniques

Each time I look at this graphic, I’m actually a little bit more confused.  It starts off straightforward by placing a circle at the approximate center of each of the major Indian states (21 in total).  Then, there is a circle representing the percentage of vegetarianism and one for non-vegetarianism.  This is where it gets weird: whichever circle is over 50% begins to engulf the other one.  This seems fairly straightforward when you look at states with an overwhelming majority, as seen in Assam (AS) (non-vegetarian)

and Rajasthan (RJ) (vegetarian)

This technique produces some odd and confusing images.  Take the circle representing Madhya Pradesh (MP)

Almost the entire circle is red, which would make you think this state is overwhelmingly non-vegetarian.  Upon closer inspection, it is slightly in favor of vegetarianism!  Look at these two circles for Madhya Pradesh and Uttar Pradesh (UP):

Despite being only 3.5 percentage points apart, they look completely different.  What would the circle have looked like if the state was exactly 50-50?  The graphic system breaks down at these important middle values.  The labels are absolutely necessary to get the correct meaning from a circle at the microscopic level, and no amount of labeling will prevent the graphic from misleading at the macroscopic level.
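The flip can be made concrete with a tiny sketch.  I don't know the exact rule the original designer used, so the function below is my assumption of it: the majority color goes on the outside and the minority is drawn as the inner circle.  The two input values are illustrative, chosen to be 3.5 points apart on either side of 50%; they are not taken from the dataset.

```python
def nested_circle_encoding(veg_pct):
    """Return (outer_color, inner_color, inner_pct) for one state.

    Assumed rule (not confirmed by the source graphic): the majority
    color is the outer circle, the minority share is the inner circle.
    A 50-50 tie has no natural answer; here it arbitrarily falls to red.
    """
    nonveg_pct = 100.0 - veg_pct
    if veg_pct > nonveg_pct:
        return ("green", "red", nonveg_pct)
    return ("red", "green", veg_pct)


# Illustrative values only, 3.5 percentage points apart.
for veg_pct in (50.6, 47.1):
    outer, inner, inner_pct = nested_circle_encoding(veg_pct)
    print(f"{veg_pct:.1f}% vegetarian -> {outer} outside, "
          f"{inner} inside at {inner_pct:.1f}%")
```

Two nearly identical inputs swap both colors, so the first circle reads as mostly red and the second as mostly green.  A single continuous encoding (one color, one scale) would not have this discontinuity at 50%.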

Recommendation 2: Use a visualization method that does not have the ambiguity caused by the overlapping circles.

This would also fail Tufte’s data-ink standards because it labels both vegetarians and non-vegetarians.  If there are only two possibilities, you only need to display the value of one, because we then know the other.  Because of that, Tufte would likely say that this graphic displays twice as much data as it needs to.

Recommendation 3: Display only the pertinent half of the data.

Effectiveness of the visualization

So, despite the flaws stated above, is the visualization effective? Well, sorta.  If you believe that the dataset accurately determined that India is about 70% non-vegetarian (this is questionable, but for now, let’s assume that’s correct), this graphic does appear to achieve its objective.  Visually, at a glance, it does appear to favor non-vegetarianism by about two-thirds.  So, the macroscopic level seems to reflect the bottom line of the dataset.  However, when you consider the flaws, one cannot help but suspect this is more luck than anything.

What is not clear is how it represents the meat preference of the population centers.  Maps are notorious for distorting population preferences because they falsely equate area with population.  This graphic takes this a step further by using outer circles of the same size for each state.  If the maximum size of each circle were relative to the state’s population, we could actually account for this.

It does show that there seems to be a preference for non-vegetarianism the closer you get to the Indian coastline.  This is not impacted by the map area fallacy mentioned above, because we don’t care about the population of the region.

Recommendation 4: When accounting for the population centers (as in Recommendation 1) attempt to retain the geographic preference for non-vegetarianism.

Integrity of the visualization

In terms of the integrity of the visualization, it does a fair job of trying to represent the data from the census report (which we’ve already established as having its own issues).

Since there was not a composite number for vegetarianism by state, the creator averaged the male and female rates.  Given a lack of other options, this was not a terrible assumption.  While it introduced a little bit of error (the two rates differ by as much as five percentage points in some states), that would only be marginally perceived in a graphic.  While the purist in me hates it, I can live with it.
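To get a feel for how much the simple average can be off, here is a quick sketch comparing it with a mean weighted by the male share of respondents.  The rates and the 52% male share below are hypothetical; the dataset does not report the male/female split.

```python
def combined_rate(male_rate, female_rate, male_share=0.5):
    """Combine two group rates into one, weighting by the male share.

    male_share=0.5 reproduces the simple mean the graphic appears to use.
    """
    return male_share * male_rate + (1.0 - male_share) * female_rate


# Hypothetical state with the largest gap mentioned above (five points).
male_rate, female_rate = 30.0, 35.0

simple = combined_rate(male_rate, female_rate)
weighted = combined_rate(male_rate, female_rate, male_share=0.52)  # assumed split

print(f"simple mean:   {simple:.2f}%")
print(f"weighted mean: {weighted:.2f}%")
print(f"difference:    {abs(simple - weighted):.2f} percentage points")
```

For these made-up numbers the two answers differ by about a tenth of a percentage point, which is well below the 4% to 7% sampling error for individual states.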

The biggest bias it has is the land-area-equals-population fallacy that many map representations suffer from.  While I know that India is an immensely populated country, I have no feel for how that population is distributed across the country, and the graphic does not help with that.  Previous recommendations have already stated that relative population needs to be accounted for in any improvement.

Design

The graphic chooses red for non-vegetarians (think meat) and green for vegetarians (for plants).  While those are somewhat logical choices given the topic, they are terrible for color-blind individuals, and likely will not print very well.

Recommendation 5: Select colors that can be universally understood.

The idea of using just the outline of the country is a nice approach.  However, the yellow background color is too close to the gray used for the outline of India.  Mentally, you have to pause to determine where the country outline is.  It distracts from the actual image.

Recommendation 6: If a map is used again, increase the contrast between the outline of India and the background color.
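One way to put a number on "too close" is the contrast ratio defined in the WCAG accessibility guidelines, which compares the relative luminance of two sRGB colors.  The sketch below uses that formula; the two RGB values are my own stand-ins for a pale yellow background and a light gray outline, not colors sampled from the original graphic.

```python
def _linearize(channel):
    """Convert one 8-bit sRGB channel to linear light."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) tuple with 0-255 channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """Contrast ratio from 1.0 (identical) to 21.0 (black on white)."""
    lum_a = relative_luminance(color_a)
    lum_b = relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


# Hypothetical stand-in colors, not measured from the original image.
pale_yellow = (255, 247, 200)
light_gray = (200, 200, 200)
white = (255, 255, 255)
mid_gray = (110, 110, 110)

print(f"yellow background vs light gray outline: "
      f"{contrast_ratio(pale_yellow, light_gray):.2f}")
print(f"white background vs mid gray outline:    "
      f"{contrast_ratio(white, mid_gray):.2f}")
```

The ratio runs from 1 (no contrast) to 21 (black on white), so a higher number for the outline against the background means less mental effort to find the country's edge.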

The labeling is a bit inconsistent.  Understandably, this is driven by the fact that some states are much bigger geographically than others.  This leads to labels being placed in different positions.  For example, in this snippet, three different styles of labeling the circles can be seen:

In one case, the state initial is in the center; in another, the inner percentage is; and in another, nothing is.  These subtle differences cause the user to expend additional thought to understand what is going on.

Recommendation 7:  Labels need to be applied consistently to all data points.

How India Eats – Fixed

After the analysis, we have seven different recommendations on how this graphic can be improved.  To review, those are:

  1. The data visualization should account for the fact that different states have different populations which are not currently represented.
  2. Use a visualization method that does not have the ambiguity caused by the overlapping circles.
  3. Display only the pertinent half of the data.
  4. When accounting for the population centers (as in Recommendation 1) attempt to retain the geographic preference for non-vegetarianism.
  5. Select colors that can be universally understood.
  6. If a map is used again, increase the contrast between the outline of India and the background color.
  7. Labels need to be applied consistently to all data points.

The immediate issue is what to do with the dataset.  We know that the methods for the data collection are not known, and that there could be unknown biases.  However, there is no additional data pertaining to the percentages of vegetarianism.  I know that the errors range between four and seven percent for individual states.  While this makes me a bit queasy to say and do, we will use the percentages from the graphic as-is.  This does introduce some error to the graphic, which I will attempt to account for.  However, if it cannot be done, we should still be able to pull out overall trends.

Since we did like the map idea, as it does show a geographic trend, and showing trends is a good thing, I sketched a map of India with the circles on top.  Except, this time, the shade of the circles would represent only the vegetarian percentage, while the area would represent the population.

Sketch 1 – Circles represent population, and shade represents percent vegetarian.

If you look really hard at this, you can see some really faint circles.  That’s because a few areas have very small percentages of vegetarians.  This sketch does achieve the objective of showing that the more vegetarian populations are toward the northwest of India.  I think I can do better.
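The encoding used in Sketch 1 can be sketched in a few lines.  The key detail is that area, not radius, should scale with population, so the radius grows with the square root.  The function names, the 40-unit maximum radius, and the use of opacity for the shade are my own illustration choices; the populations are rounded figures in millions and are an assumption.

```python
import math


def radius_for_population(population, max_population, max_radius=40.0):
    """Radius whose circle AREA is proportional to population."""
    return max_radius * math.sqrt(population / max_population)


def opacity_for_share(veg_pct):
    """Map 0-100% vegetarian to an opacity between 0.0 and 1.0."""
    return max(0.0, min(1.0, veg_pct / 100.0))


# Rounded populations in millions (assumed), shares approximated from the text.
largest = 199.8
for name, population, veg_pct in [
    ("Uttar Pradesh", 199.8, 47.0),
    ("West Bengal", 91.3, 2.0),
]:
    radius = radius_for_population(population, largest)
    print(f"{name:>14}: radius={radius:5.1f}  opacity={opacity_for_share(veg_pct):.2f}")
```

A state with a 2% vegetarian share ends up at 2% opacity under this mapping, which is why some circles in Sketch 1 are nearly invisible.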

In the next sketch, I added a gradient, since that would ensure that I can show the range of vegetarianism in the various regions.  For fun, I also attached labels to it:

Sketch 2 – Color gradient, where yellow is non-vegetarian and blue is vegetarian. Labels added for fun.

I think Sketch 2 is getting better: you can see the trend of vegetarianism to the northwest, and non-vegetarianism to the southeast.  Also, I realize that placing the names of the individual states does not actually add much information, since the map already shows it.  Since this map was presented to an Indian audience (it was Huffington Post India), it’s safe to assume they generally know where their states are located.  So, adding those labels would be a waste of data-ink.  The color gradient did work out nicely, though, with our color-blind friendly coloring.
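A two-color gradient like the one in Sketch 2 boils down to linear interpolation between two endpoint colors.  Here is a minimal sketch; the specific yellow and blue RGB values are placeholders I picked, not the ones used in the sketch.

```python
def lerp_color(start_rgb, end_rgb, t):
    """Linearly interpolate between two RGB colors; t is clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return tuple(round(s + (e - s) * t) for s, e in zip(start_rgb, end_rgb))


# Placeholder endpoint colors (assumed): yellow = non-vegetarian, blue = vegetarian.
YELLOW = (255, 221, 0)
BLUE = (0, 68, 170)


def color_for_veg_share(veg_pct):
    """Color for a state given its vegetarian percentage (0-100)."""
    return lerp_color(YELLOW, BLUE, veg_pct / 100.0)


for veg_pct in (0, 25, 50, 75, 100):
    print(f"{veg_pct:3d}% vegetarian -> RGB {color_for_veg_share(veg_pct)}")
```

Yellow-to-blue is a reasonable pairing because the two ends differ in lightness as well as hue, so the scale still reads for the most common forms of color blindness and in grayscale.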

I got to thinking that I could put the gradient on the individual states.  However, that would just do everything we hate about map graphics, where we commonly equate area with population.  Looking at the map above, it’s clear that population and area differ greatly across these many states.

I did think that there was something interesting about the geographic position, namely the latitude and the longitude.  The map itself was not that relevant.  Going back to the basic Tufte principles, we should strive for the minimal amount of data-ink.  After doing some complex geo-mapping and overlaying of a gradient, I went back to basics and attempted to graph these as scatterplots.

Sketch 3 – Simple Scatterplot of North Latitude Versus Percentage Vegetarian

That’s a beautiful graph!  It’s simple, and we can see that as you go farther north (increasing degrees north latitude) the population generally trends toward vegetarianism, although there is a wide swath of possibilities there.  There is not a ton of extra ink, just enough to be descriptive.  I did think about putting in a trend line, but I thought that would be extraneous, as your eye naturally tells you there is a trend.

Now, let’s see what this graph looks like for longitude.

Sketch 4 – Simple Scatterplot of Degrees East Longitude Versus Percentage Vegetarian

This one is even better! The trend is much more visibly significant.  As you go farther east, the likelihood of vegetarianism goes down.  It’s simple and it tells the data’s story fairly well.
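If you want a number to back up what the eye sees in Sketches 3 and 4, a Pearson correlation coefficient is one option.  This is a plain-Python sketch, not something I used to make the graphs, and the sample values below are made-up placeholders rather than the survey data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / math.sqrt(var_x * var_y)


# PLACEHOLDER values for illustration only; substitute the real
# state longitudes and vegetarian percentages from the dataset.
longitude_deg = [72.0, 77.0, 82.0, 88.0, 93.0]
veg_pct = [60.0, 45.0, 30.0, 10.0, 5.0]

print(f"r = {pearson_r(longitude_deg, veg_pct):.2f}")
```

A value near -1 would confirm that vegetarianism falls as you move east, while a value nearer 0 would match the "wide swath of possibilities" seen in the latitude graph.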

In Sketches 3 and 4, I left off the names of the individual states.  It is not necessarily obvious that this is the correct course of action.  It depends on your goal.  If your goal is to show how vegetarianism trends across the country, then you don’t need the state names.  The geography does the work for you.  If you want the user to be able to see the states (because everyone wants to be able to identify what the story is in their state), then you might want something a little more like Sketch 5, where moving the mouse over a particular point produces information about that Indian state.

Sketch 5 – Scatterplot of Longitude versus Vegetarianism with Mouseover Info

By taking advantage of interactivity, we do not have to pollute our graph with extra data-ink, and we are able to provide the information, if desired, to the reader.  We wouldn’t want that much information on every circle, especially since a few of the circles are close to each other.  This allows us to keep our graphic clean, while providing the ability to go in depth.  This was something that Tufte did not have the advantage of in his earlier writing.

Another thought: should this be a 3-D graphic? From all of these graphs, we know that the trends vary as you go north-south and as you go east-west.  A three-dimensional graph could be interesting to show this trend; however, we know many three-dimensional graphs fail at this miserably.  I would prefer to stick with these two separate graphs.

Conclusion

So, let’s see how I did against the seven recommendations for improvement:

  1. The data visualization should account for the fact that different states have different populations which are not currently represented.
    • Achieved!  We used different areas to represent population.
  2. Use a visualization method that does not have the ambiguity caused by the overlapping circles.
    • Achieved! We chose to represent only the percentage of vegetarians, because India is mainly known for being a vegetarian country.  The percentage non-vegetarian is implied, as this was a binary choice.
  3. Display only the pertinent half of the data.
    • Achieved! We used the vegetarian half.  We could just as well have used the non-vegetarian half, and would have had the same results.  I should also note that I ultimately made the same assumption/compromise the original creator did and averaged the male and female rates, because while it introduced a little bit of error into our graph, it’s not significant enough to matter, since the male and female values are fairly close.
  4. When accounting for the population centers (as in Recommendation 1) attempt to retain the geographic preference for non-vegetarianism.
    • Achieved! We used latitude and longitude on our two scatterplot x-axes.  Understanding the relationship between geography and vegetarianism is retained.
  5. Select colors that can be universally understood.
    • Achieved! By going to the scatterplot, we only needed one color.  So, whether these charts are printed in black and white or viewed by a color-blind individual, the points should still be visible in the blue that was used.
  6. If a map is used again, increase the contrast between the outline of India and the background color.
    • Achieved! Our final graphs did not use the map at all.  In Sketches 1 and 2, when we did use the map, we went with a white background and a simple grey outline for the map of India, allowing the focus to be on the circles.  I would highly recommend doing that again for future maps.
  7. Labels need to be applied consistently to all data points.
    • Achieved! The only labels I went with are on the axes and title.

My big takeaway from this is that while making big fancy graphics is tempting and a lot of fun, the graphics that tell the best story might be the simplest ones.

For future improvement, while this graphic is nice, clean, and simple, it lacks a Wow! factor.  This graph is not likely to be shared, which is one of the goals of online content.  Right now, the goal of this project is to make better visualizations.  I’ll focus on making better visualizations that are also Wow!-worthy in the future.

If you have any comments or questions, please feel free to place them below.  I’d love to discuss this more with you.  If you have a suggestion for another case study, please let me know!

Thanks!
Derrick

References:

  1. http://www.huffingtonpost.in/2016/06/14/how-india-eats_n_10434374.html
  2. http://www.censusindia.gov.in/vital_statistics/BASELINE%20TABLES08082016.pdf
  3. https://en.wikipedia.org/wiki/List_of_states_and_union_territories_of_India_by_population