Case Study: U.S. Tourism

Tourism in the United States is a $1.5 trillion industry. I came across this 2009 infographic from The Traveler’s Zone that attempted to visualize where U.S. tourism was coming from.

What is going on here?  Source: TheTravelerszone.com

The top half of the graphic hurts my head.  There appears to be some scale used for the larger countries, but the smaller countries all seem to be about the same scale.  I’m not sure exactly what the design is.  Actually, it reminds me of Candyland.

Yep, that’s probably what they were going for. Source: FanPop

I guess that means the U.S. is the castle at the end?  The top half of the infographic needs a major re-do.

The bottom half of the graphic is not that bad, with one exception. The first bar in the bar graph is truncated.

Huh?

Travel from Canada is over 9 million, but they stopped it at 7 million. This probably happened because they ran out of space, and/or didn’t want to make the rest of the bars look too short.  Otherwise, it’s not that bad of a bar graph.  I wouldn’t classify it as troubled, just slightly lost.  For the purposes of this case study, I’m going to focus on just the top-half of the infographic, which is in dire need of help.

Let’s give it some TLC

Alright, on to the analysis and rehabilitation.

Overview

The data visualization is an infographic, designed to draw attention to the website. The visualization is about the top 20 tourist-generating countries for travel to the United States of America. The dataset is twenty countries, with one value for each. The value represents the number of people from that country who visited the United States. The data is visualized in two different ways. One is a path reminiscent of Candyland that has each of the twenty countries, along with the number of visitors from that country. Around the path, the data set is also presented in text with the name of the country, the visitors from that country, and the country’s flag.

Explanation of data

The data set is generated from visitor travel data made publicly available by the U.S. Department of Commerce. The data is based on the immigration form that everyone completes when entering the country. The original data set is a credible source. It is one-dimensional data: there is a country and the associated number of visitors. Fairly straightforward.

There is an issue with how the data was then used to create the visualization. The data visualization does not indicate the time frame. Is this a month’s total? A yearly total? I bet you never would have guessed it is the total from January 2009 through September 2009. That is an odd time frame, so if you are going to use it, then it needs to be indicated.

Recommendation 1: Indicate the time-frame used, or if possible, use a more intuitive timeframe.

Explanation of visualization techniques

I love board games. Candyland was a great one when I was a kid. One thing to note on the Candyland board is that every square is more or less the same size. On this visualization, the squares (err, rectangles) are not all sized the same.

Scale? We don’t need no stinking scale!

The spaces for Canada and Mexico are larger than most, and then the United Kingdom on the turn appears to be the largest of them all. Germany, Japan, and France all appear to be the same size, despite very different values. This just leads to confusion, as the large rectangles suggest there is a scale, while the other ones are a more consistent size.

Recommendation 2: Use a scale that makes sense for representing the number of arrivals for each country.

The graphic also fails to label the values in the boxes. The title does say “Tourist Generating,” and the text next to the flags says arrivals, but the part that readers are most likely to read (the Candyland path) has no label. Perhaps it’s just my reading, but I think that could cause confusion, because we think of things like income generating, not tourist generating. Could someone think those are revenue numbers, and that revenue is the metric of interest? Possibly.

Recommendation 3: If the title is not clear, properly label the values.

Lastly, the graphic places a rather large data table of names around the path. If you’ve already put the names and visitor values on the Candyland path, why do you need the table? The flags make for nice eye candy, but do not do anything to actually enhance the visualization.

Recommendation 4: State the data once, and one time only.

Another thing to note with the unnecessary flags is that the creator did not pay close attention to which flags to use.  First, the flag for Japan was:

Flag of the Japan Maritime Self-Defense Force, not of the Country of Japan

This appears to be the flag of the Japan Maritime Self-Defense Force:

And, not the flag of the country of Japan:

Then, when I looked at the data visualization, it portrayed the data as being from only  Hong Kong:

Data was not from only Hong Kong

As it turns out, upon reviewing the actual data set, it was for Hong Kong and the People’s Republic of China (PRC). Since Hong Kong has been part of China since 1997, the PRC flag would be more appropriate:

Overall, those are unforced errors that the visualization creator could have avoided.

Effectiveness of the visualization

If the goal of the visualization is to let the reader know that Canada and Mexico are the countries that provide the most visitors to the United States, then yes, it does achieve that. Otherwise, it is not a terribly effective visualization. The Candyland path is actually a bit difficult to follow, and the reader loses interest moving along it. The data table is interesting in that it’s legible, but it’s unnecessary. Overall, this visualization needs some simplification.

Recommendation 5:  Simplify the overall design.

While this is our objective for most of our work, it needs special emphasis here. It’s a non-intuitive piece of information that needs to become easily digestible for the reader.

Integrity of the visualization

The data is definitely distorted as there is not a consistent scale across the Candyland path.  This could lead people to think that the level of tourism is more or less the same for the top twenty countries except for those couple at the top.  This is probably the highest priority for a revised version of this data visualization.  Just about any properly applied data visualization technique should be able to correct for this distortion.
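To make the idea of a consistent scale concrete, here is a minimal Python sketch that sizes each space in proportion to its arrivals. The arrival figures are the January to September 2009 values from the data table in this case study; the function name and the 600-pixel maximum are assumptions made only for illustration, not anything taken from the original graphic.

```python
# Jan-Sep 2009 arrivals for a few of the countries on the path
# (values copied from the data table in this case study).
arrivals = {
    "Canada": 14_043_658,
    "Mexico": 4_295_124,
    "United Kingdom": 2_905_909,
    "Japan": 2_169_716,
    "Germany": 1_263_344,
    "France": 930_265,
}


def proportional_widths(values, max_width=600):
    """Scale every value linearly so the largest one gets max_width pixels.

    max_width is an arbitrary, hypothetical canvas size.
    """
    largest = max(values.values())
    return {name: round(max_width * v / largest) for name, v in values.items()}


widths = proportional_widths(arrivals)
for name, width in widths.items():
    print(f"{name:<15} {width:>4} px")
```

With one shared linear scale, Japan, Germany, and France no longer come out the same size, which is the distortion the Candyland path introduced.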

Design

The design is a bit engaging because the Candyland path makes the reader wonder what is going on here. Once it has the interest of the reader, it loses it rather quickly with the color. The dark background makes it difficult to read or comprehend all of the numbers that are presented in the table. Further, the bright colors on the dark background are a little harsh on the eyes. I think a softer palette on a light background would do much better.

Recommendation 6: Use a lighter background and softer palette of colors.

Interesting

This data visualization is definitely meant for mass media, as it is not focused towards any particular part of the tourism industry. Tourism is definitely an interesting thing, as most people have traveled somewhere in their lives. It is in the “fun facts about travel” category. It is the kind of thing that is meant to drive traffic to The Traveler’s Zone. Overall the graphic will draw interest, but will lose it due to its confusing nature.

On the rehabilitated graphic, I will definitely need to focus on keeping the visualization interesting and playful for the mass media.

U.S. Tourism – Fixed

Let’s review all of the recommendations:

  1. Indicate the time-frame used, or if possible, use a more intuitive timeframe.
  2. Use a scale that makes sense for representing the number of arrivals for each country.
  3. If the title is not clear, properly label the values.
  4. State the data once, and one time only.
  5. Simplify the overall design.
  6. Use a lighter background and softer palette of colors.

One of the first decisions I made was to use the data for all of 2009 instead of the first nine months of 2009. There’s no reason to truncate that data after nine months. It does change our data set a bit:

Old Data – Jan – Sep 2009               New Data – All of 2009
COUNTRY OF RESIDENCE        ARRIVALS    COUNTRY OF RESIDENCE        ARRIVALS
CANADA                    14,043,658    CANADA                    17,958,000
MEXICO                     4,295,124    MEXICO                     6,023,225
UNITED KINGDOM             2,905,909    UNITED KINGDOM             3,899,167
JAPAN                      2,169,716    JAPAN                      2,918,268
GERMANY                    1,263,344    GERMANY                    1,686,825
FRANCE                       930,265    FRANCE                     1,204,490
BRAZIL                       613,347    BRAZIL                       892,611
KOREA, SOUTH                 560,405    ITALY                        753,310
ITALY                        558,594    KOREA, SOUTH                 743,846
AUSTRALIA                    526,441    AUSTRALIA                    723,576
PRC & HONG KONG              488,484    PRC & HONG KONG              640,840
SPAIN                        451,530    SPAIN                        596,766
INDIA                        447,079    INDIA                        549,474
NETHERLANDS                  416,870    NETHERLANDS                  547,790
VENEZUELA                    344,802    VENEZUELA                    507,185
IRELAND                      296,771    COLOMBIA                     424,526
COLOMBIA                     288,439    IRELAND                      411,203
ARGENTINA                    271,737    ARGENTINA                    356,428
SWITZERLAND                  262,595    SWITZERLAND                  355,727
ISRAEL                       237,818    SWEDEN                       324,417
Total Top 20              31,372,928    Total Top 20              41,517,674
All Worldwide Arrivals    35,990,071    All Worldwide Arrivals    47,737,409

Overall, it does not make that much of a difference in terms of rank or relative value.  A handful of countries swapped places, and Israel was replaced by Sweden as the 20th country.  The most important part is that we now have a consistent, logical data set.
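The rank comparison can be checked mechanically rather than by eye. The following is a small Python sketch, with the country order copied from the two columns of the table above; the function name `rank_changes` is hypothetical and only serves this illustration.

```python
# Country order copied from the two columns of the table above.
old_rank = [
    "CANADA", "MEXICO", "UNITED KINGDOM", "JAPAN", "GERMANY", "FRANCE",
    "BRAZIL", "KOREA, SOUTH", "ITALY", "AUSTRALIA", "PRC & HONG KONG",
    "SPAIN", "INDIA", "NETHERLANDS", "VENEZUELA", "IRELAND", "COLOMBIA",
    "ARGENTINA", "SWITZERLAND", "ISRAEL",
]
new_rank = [
    "CANADA", "MEXICO", "UNITED KINGDOM", "JAPAN", "GERMANY", "FRANCE",
    "BRAZIL", "ITALY", "KOREA, SOUTH", "AUSTRALIA", "PRC & HONG KONG",
    "SPAIN", "INDIA", "NETHERLANDS", "VENEZUELA", "COLOMBIA", "IRELAND",
    "ARGENTINA", "SWITZERLAND", "SWEDEN",
]


def rank_changes(old, new):
    """Return (moved, dropped, entered) between two ranked lists.

    moved maps a country to its (old position, new position), 1-based.
    """
    old_pos = {country: i + 1 for i, country in enumerate(old)}
    new_pos = {country: i + 1 for i, country in enumerate(new)}
    moved = {
        country: (old_pos[country], new_pos[country])
        for country in old
        if country in new_pos and old_pos[country] != new_pos[country]
    }
    dropped = [country for country in old if country not in new_pos]
    entered = [country for country in new if country not in old_pos]
    return moved, dropped, entered


moved, dropped, entered = rank_changes(old_rank, new_rank)
print("Moved:", moved)
print("Dropped:", dropped)
print("Entered:", entered)
```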

Next, I played around with the data to get a feel for it. So, I plotted a trusty bar chart, because bar charts don’t lie:

Sketch 1 – Bar chart goodness

The first thing that strikes me in the bar chart in Sketch 1 is that Canada is by far in first, followed by Mexico and the United Kingdom in a distant second and third. This scale will make it difficult to show both the smaller values in detail and the larger values.

Next, I thought a waffle chart would work, with a total of 1000 squares. This attempt failed miserably. It was so bad, I actually stopped in the middle of making it, because I could see this would get me absolutely nowhere.

Sketch 2 – This was just headed for disaster, so I stopped.

Aside from a terrible choice in colors by me, I did not see this as anything that would have any clarity, or something that would be interesting.  I got confused making it!  Time to rethink this visualization.
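A rough calculation hints at why a 1000-square waffle chart struggles with this data. The Python sketch below uses full-year 2009 figures from the table above. The allocation rule (each square is an equal share of all worldwide arrivals) and the function name are assumptions for illustration, not necessarily how the abandoned sketch was built.

```python
# Full-year 2009 arrivals for a few countries, from the table above.
arrivals_2009 = {
    "CANADA": 17_958_000,
    "MEXICO": 6_023_225,
    "UNITED KINGDOM": 3_899_167,
    "SWITZERLAND": 355_727,
    "SWEDEN": 324_417,
}
WORLDWIDE_2009 = 47_737_409  # "All Worldwide Arrivals" row of the table


def waffle_squares(values, total, squares=1000):
    """Assign each country its rounded share of the available squares.

    Assumes every square represents total / squares visitors.
    """
    return {name: round(squares * v / total) for name, v in values.items()}


squares = waffle_squares(arrivals_2009, WORLDWIDE_2009)
for name, count in squares.items():
    print(f"{name:<15} {count:>4} squares")
```

Under that assumption, Canada takes up more than a third of the grid while the smallest countries get only a handful of squares each, which makes them hard to tell apart.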

Then, I thought back to some of my favorite visualizations that I’ve seen.  I began to wonder if I could do something like the Avengers network graph.

Awesome looking Avengers network graph.

My thought was to group the countries by region, and then each region would flow into the United States. I thought I might be able to modify the Avengers code, or at least use it to guide me through what I needed to do in my own code. When I looked at the underlying code, I realized this was too many steps beyond my current skill. It uses d3, along with Bootstrap and some custom JavaScript. This is something that I will aspire to be able to create in the future, but right now, it’s just a dream.

The reason I pointed that failed attempt out is because it gave me the idea to break the data into geographic regions.  First, I sketched it out, with old fashioned pen and paper.

Sketch 2 – Sometimes, you need to do it by hand.

This gave me a feel for what it might look like.  It’s not pretty, but it was enough for me to commit to using this approach.  Next, using Gephi, I created a network diagram of the top 20, and I added supplemental data points to account for the rest of the world.
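For readers who want to try something similar, Gephi’s spreadsheet importer reads edge tables with Source and Target columns and an optional Weight column. The Python sketch below builds such an edge list, with each country pointing directly at the United States. It is a simplification under stated assumptions: the function name is hypothetical, only three countries are shown, and it does not reproduce the supplemental rest-of-world nodes used in the actual diagram.

```python
import csv
import io


def edge_list_csv(arrivals, target="United States"):
    """Build a Source,Target,Weight edge list as CSV text.

    Each country becomes one edge into the target node, weighted by
    its number of arrivals.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for country, visitors in arrivals.items():
        writer.writerow([country, target, visitors])
    return buffer.getvalue()


# Full-year 2009 arrivals for the three largest sources (from the table).
sample = {
    "Canada": 17_958_000,
    "Mexico": 6_023_225,
    "United Kingdom": 3_899_167,
}
print(edge_list_csv(sample))
```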

Sketch 3 – Country of Origin for Visitors to the United States in 2009

I like this. The countries are placed in familiar locations based on a Mercator map, but there is no actual map. The thickness of the line indicates the relative number of tourists. Canada, which is by far Number 1, has a thick line, while “Rest of Oceania” is fairly faint. It needs a little bit more description (like a title and legend), but I think this is heading in the right direction.

I prefer to keep my text to a minimum, and let legends and titles give a lot of the context.  With that in mind, here is the final version of the network graph.

Sketch 4 – Final – Titles and a Legend Help

The differences are subtle, namely a title that gives the worldwide number and makes it clear we are talking about where people came from. I added a legend by labeling the lines that indicated the most and the fewest visitors. This creates a mental bound that the reader can use to judge the relative flow of people. Ultimately, I decided this was best because the reader does not care about the exact number, but the relative scale. At a glance, the reader can see that Canada’s line is the thickest, with Mexico, the UK, and Japan as the next big contributors.

I knew it would be a challenge to use a visualization technique that showed the great range of values that the data set had. In my handmade sketch, I had the area of the node represent the volume. On the final graph in Gephi, one of the decisions I had to make was whether the nodes would be scaled to the number of visitors from that country, or whether that information would be encoded into the edges connecting the nodes. Ultimately, I chose to go with the edges, because proportional nodes would have resulted in barely visible nodes for the low-value countries.
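The trade-off between the two encodings can be sketched numerically. The Python below is only an illustration under assumptions: the width range, the maximum radius, and both function names are hypothetical, and this is not a description of how Gephi computes sizes internally.

```python
import math


def edge_width(value, vmin, vmax, min_width=1.0, max_width=30.0):
    """Linearly interpolate a line width between min_width and max_width."""
    return min_width + (max_width - min_width) * (value - vmin) / (vmax - vmin)


def node_radius(value, vmax, max_radius=30.0):
    """Radius of a circle whose AREA is proportional to value."""
    return max_radius * math.sqrt(value / vmax)


# Largest and smallest of the top 20 for all of 2009 (from the table).
canada, sweden = 17_958_000, 324_417

print("Canada edge width:", edge_width(canada, sweden, canada))
print("Sweden edge width:", edge_width(sweden, sweden, canada))
print("Canada node radius:", round(node_radius(canada, canada), 2))
print("Sweden node radius:", round(node_radius(sweden, canada), 2))
```

In this sketch the edge encoding has a floor (the minimum width), so the smallest country still gets a visible line, while an area-proportional node for the same country shrinks to a small fraction of the largest node.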

Conclusion

Now, let’s see how the final image did against the recommendations:

  1. Indicate the time-frame used, or if possible, use a more intuitive timeframe.
    • Achieved! I used the data for all of 2009, and clearly indicated it in the title.  It should not be an issue for anyone to understand the timeframe.
  2. Use a scale that makes sense for representing the number of arrivals for each country.
    • Achieved! (Barely!) This was the hardest part.  There were three orders of magnitude that needed to be properly visualized in this one-dimensional data set.  I think the thick edge lines work, but only barely.  If this was an interactive graphic, highlighting the path would definitely be an improvement to this.
  3. If the title is not clear, properly label the values.
    • Achieved! The title makes it clear we are talking about where the people come from.  Further, with proportional edge lines, I only really needed to label the minimum and maximum cases.
  4. State the data once, and one time only.
    • Achieved! I got rid of the flags, text, and values that reiterated information that people were likely not that concerned with.
  5. Simplify the overall design.
    • Achieved! There is only one visualization of the data, instead of two, and there is a minimal amount of text.  The title gives all of the context that is needed to understand what is going on in the data visualization.
  6. Use a lighter background and softer palette of colors.
    • Achieved! The white background was easy.  The colors are not actually that much different than the ones used in the original.  Because of the white background, they are much easier on the eye.

The thing I spent a lot of time on was trying to make something interesting and playful. I like to think this data visualization is interesting because it seems familiar to the reader (a map they have seen thousands of times), has alluring colors, and appears to be active. It’s an abstract map, if you will.

Another interesting thing to note is that a dark background should be used sparingly. It worked really well for the Solar Eclipse Case Study, but that graphic only had two non-gray colors. The original visualization had 12 different colors. On a black background, they look terrible. On mine, I used 11 colors, and they looked much better on a white background.

Overall, this one was a challenge.  What I wanted to create was limited by my technical skills.  Ultimately, I went back to my skill set, and figured out how to make something interesting and informative.

Thank you for your time, and please leave any comments below. I would love to hear from you.

References

  1. https://www.statista.com/topics/1987/travel-and-tourism-industry-in-the-us/
  2. http://www.thetravelerszone.com/travel-information/top-20-tourist-generating-countries-to-the-us-2009/
  3. http://www.fanpop.com/clubs/candy-land/images/35166496/title/candyland-board-photo
  4. http://bl.ocks.org/nbremer/864b11eb83aac3a1f6a2

Case Study: Solar Eclipse

Remember back to the summer and the mass hysteria (well-deserved hysteria, but hysteria nonetheless) of the solar eclipse? All sorts of fun graphics were floating around. As part of #MakeoverMonday during the summer, Eva Murray at Tri My Data and Andy Kriebel at Viz Wiz tried their hand at making over this map:

Lots and Lots of Solar Eclipses.

This map has a lot going on, so it’s clear why Kriebel and Murray took a crack at it.  I picked this one for a case study because I really like the approaches that Murray and Kriebel took.  I didn’t try to outdo them (because that wasn’t going to happen).  I wanted to see if I could find something that could complement their efforts.

Murray created this captivating visualization focused on the duration of solar eclipses related to latitude:

Eva Murray did an awesome job with this solar eclipse makeover!

It’s pretty clear that as you get closer to the equator, the eclipses last longer, with the longest eclipses almost exclusively located at the equator.

I recommend you check out her write-up at Tri My Data.  At Viz Wiz, Andy Kriebel created this simplification of latitude versus eclipse duration by century.

Andy Kriebel shows where the eclipses are likely to occur.

Kriebel’s visualization shows another dimension of the data.  While the longer eclipses tend towards the equator, you are more likely to see an eclipse (of any duration) somewhere around 44-degrees North or South Latitude.

Now that we have seen what can be done, let’s go back to our original visualization and critique it.  Then, I’ll take my own cut at it.

Overview

The original data visualization’s objective is to allow users to interact with eclipses from the past twenty years. It focuses on the path of eclipses from 2001 through 2020. The paths of the eclipses are plotted on a Mercator projection of the earth. It is an interactive data visualization that allows the user to get details about a specific eclipse or explore an area to find out when eclipses have come, or might be coming, to the region.

Explanation of data

The eclipse data set is a subset of an extremely large eclipse data set originally pulled from NASA. It has nearly 12,000 data points of eclipses going back 5000 years. There are all sorts of fun features in the data set. There are some geographic features like latitude, longitude, and sun altitude. There is a time-date feature. Of most interest are the quantitative features that describe the characteristics of the eclipses. These include the eclipse magnitude, duration, and gamma.

The data is highly credible as it originated from NASA data.  If we can’t trust NASA’s calculations of eclipses, we can’t trust anything.

The original visualization used almost all of the features. It does use only a 20-year window of that data. For my version, I want to reduce the number of features. While the original map attempts to leverage every dimension, I think taking a step back and simplifying to focus on one or two features, like Kriebel and Murray did, would result in a potentially more impactful graphic.

Recommendation 1: Explore a different feature that was not utilized by the previous versions.

Explanation of visualization techniques

The visualization is a plot of solar eclipse swaths on a Mercator map of the world. In fact, it’s an interactive visualization of these eclipses from 2001 until 2020. Built on top of the Google Maps API, it allows users to zoom in and out of different regions and get an unlimited number of macroscopic and microscopic views of the paths of the eclipses. Both annular and total eclipses are included on the map, and designated with different colors.

Overall, not a bad approach, aside from the standard issues with Mercator maps.  As we will discuss, the design itself becomes more problematic.

Effectiveness of the visualization

Interactive data visualizations are a great way for people to have a personal experience with data visualization. When people look at a map like this, more than likely, the first thing they want to do is see where they live and ask “When’s the eclipse coming to me?” By making this an interactive map, it becomes highly effective at captivating the reader.

That action assumes the reader is not overwhelmed by the map. My first reaction when I looked at that map was to think that there’s a lot going on. That is the point where I would worry about the map losing people’s interest before it ever has it.

There are a number of different ways to visualize the eclipse data, particularly since there are many different features. As I mentioned earlier, one of my objectives is to use a different feature and try to create something that complements some of the other enhanced visualizations, so no additional recommendation is needed. It will be done.

Integrity of the visualization

The biggest issue with Mercator projections is that they warp the regions near the poles, adding extra size where there is none. The following graphic was taken from “Elements of map projection with applications to map and chart construction,” written in 1921 by Charles H. Deetz and Oscar S. Adams. We’ve known for a very long time that Mercator is not that great a map to use.

This is why we care about map projections!

The point of the original data visualization is to show where eclipses occur. We know from our other visualizations that eclipses tend closer to the equator. Because of that, I’m okay with the use of a Mercator projection. It does introduce some distortion and error into our visualization near the poles. In return, we have a map that, despite its flaws, is familiar to most people, and accurately portrays most of the data, since our data tends towards the equator. As we learned earlier, data visualizations are about trade-offs. For my new data visualization, I may not use Mercator, but I do not fault the use of it on the original. It made sense.
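The size of that distortion can be quantified. On a Mercator projection, lengths at a given latitude are stretched by the secant of that latitude, and areas by its square. The short Python sketch below computes both factors; the function names are illustrative only.

```python
import math


def mercator_linear_scale(latitude_deg):
    """Linear stretch of the Mercator projection at a latitude: sec(latitude)."""
    return 1.0 / math.cos(math.radians(latitude_deg))


def mercator_area_scale(latitude_deg):
    """Area stretch at a latitude: the square of the linear stretch."""
    return mercator_linear_scale(latitude_deg) ** 2


for lat in (0, 30, 44, 60, 75):
    print(
        f"{lat:>2} degrees: lengths x{mercator_linear_scale(lat):.2f}, "
        f"areas x{mercator_area_scale(lat):.2f}"
    )
```

At the equator there is no stretch at all, and by 60 degrees lengths are doubled and areas quadrupled, so the trade-off depends on how far from the equator the data sits.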

Design

The eclipse map hits you like a freight train. “Bam! Everything you wanted to know about the last twenty years of eclipses all at once!” It’s a bit overwhelming. The paths of the eclipses are not easy to pick out since there are so many of them, and they are densely packed near the equator.

Recommendation 2: Create a simplified design that does not overwhelm.

When I look at the original visualization a little more, I think it used opacity nicely to show the regions of the total eclipse within the context of the total swath.  The map really becomes overwhelming when we can’t see the paths behind all of the labels for each of the paths.

That’s a lot of labels. Has anyone seen Northern Africa?

Really, that’s a lot of paths.  When you look at the visualization, the thing you see first are all of the label tags, and not the paths of the eclipses.

Recommendation 3: Reduce the use of overly burdensome labels and put more focus on the data itself.

Interesting

I struggle with determining if it’s interesting.  At first glance, it is a regular Mercator map, with lines and labels on it, meant for a mass audience.  Only with the use of a mouse, does it get interesting to the reader.  On the whole, I would not classify this as interesting because most of the audience would be lost before they even found the interesting part.

The timing of the map should be taken into consideration.  The total eclipse of 2017 in the United States received a ton of coverage.  At that time, people were consuming all sorts of eclipse-related content.  It was inherently interesting, which allows people to overlook design and dig in to the map.  Without eclipse fever forcing the interaction, the visualization has to be more interesting now.

Recommendation 4: Make it more interesting; something that might attract a casual reader to stop and review the data visualization instead.

Solar Eclipse – Rethought

Let’s review the recommendations so far:

  1. Explore a different feature that was not utilized by the previous versions.
  2. Create a simplified design that does not overwhelm.
  3. Reduce the use of overly burdensome labels and put more focus on the data itself.
  4. Make it more interesting; something that might attract a casual reader to stop and review the data visualization instead.

First thing I did was look at the data set in order to get a feel for the data. One of the things that I thought was interesting was eclipse magnitude. This is the ratio of how big the moon appears in the sky relative to the sun. Only when the eclipse magnitude is greater than one do you have a total eclipse. If it is less than one, you have an annular eclipse. An annular eclipse is when a ring of the sun appears around the moon as the moon passes completely in front of the sun.
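The magnitude rule above is simple enough to write down as a tiny Python sketch. It follows only the rule as stated in the text, so it covers eclipses where the moon passes fully in front of the sun; partial and hybrid eclipses are outside its scope. The function name and the "borderline" label for a magnitude of exactly one are assumptions, since the text does not say how that case is classified.

```python
def eclipse_type(magnitude):
    """Classify an eclipse from its magnitude, using the rule in the text.

    magnitude is the apparent size of the moon relative to the sun.
    """
    if magnitude > 1.0:
        return "total"      # moon appears larger than the sun
    if magnitude < 1.0:
        return "annular"    # moon appears smaller, leaving a ring
    return "borderline"     # exactly 1.0: not covered by the stated rule


# Hypothetical magnitudes, chosen only to exercise each branch.
for m in (0.95, 1.0, 1.05):
    print(m, eclipse_type(m))
```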

So, I plotted eclipse magnitude against duration, and got the following plot.

Sketch 1  — Eclipse Magnitude vs Eclipse Duration — Pretty interesting, huh?

When I saw Sketch 1, I was pretty captivated. I haven’t seen anything like it before, and it drew me in right away. What’s going on here exactly? When you look at Magnitude = 1, you see the duration is just a few seconds. That makes sense because the moon is just barely bigger than the sun. So, as the moon continues to progress through its orbit, it will move out of view relatively quickly, ending the total eclipse.

As you move to the right, where the moon appears larger than the sun, it blocks the sun longer.  And similarly, as you move to the left, where the moon gets smaller relative to the sun, the annular ring appears for a longer period of time.

Just for fun, I plotted the eclipse duration on a logarithmic scale.

Sketch 2 – Eclipse Duration on Log Scale – Interesting, but not necessary

Sketch 2 is pretty interesting to look at, and tells the same story. Since a logarithmic scale is not that intuitive for mass media, I decided to follow the path I started with Sketch 1.
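For anyone unfamiliar with what a logarithmic axis does to the durations, here is a minimal Python sketch. The durations are hypothetical values in seconds, chosen only to show how the scale compresses large values; they are not taken from the eclipse data set, and the function name is illustrative.

```python
import math


def log10_duration(seconds):
    """Return the base-10 logarithm of a duration in seconds."""
    if seconds <= 0:
        # A log scale cannot represent zero or negative durations.
        raise ValueError("duration must be positive for a log scale")
    return math.log10(seconds)


# Hypothetical durations: each step of 10x adds the same distance on the axis.
for seconds in (1, 10, 100, 600):
    print(f"{seconds:>4} s -> {log10_duration(seconds):.2f}")
```

Equal distances on the axis now mean equal ratios rather than equal differences, which is exactly the part that is hard to explain to a mass-media audience.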

While Sketch 1 is accurate, simple, and precise, it’s a boring scatterplot.  We are not any better off than the original visualization.  In fact, I would say we are worse off because most people would just look right past it.  Sketch 1 would be my basis for the visualization, but it would need to be improved.

After playing around with some colors, I created this final visualization using d3 and SVG:

Sketch 3 – Eclipse Final

It’s really cool!  I’m still amazed I came up with that!  There are a few different things that I had never done before in SVG that I did here.

First, since we are talking about eclipses and dark skies, I first used a black background. That was too stark, so I went with a much richer charcoal black. Each eclipse is represented with a lunar gray circle, which gets larger the longer the duration of the eclipse. The data points are 80% translucent since there are thousands of them. I do have to admit that Murray’s data visualization inspired me to use that scaling for the circles. Then, I added a legend on the bottom to explain the size of data points. It’s just a little bit of text and then
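The circle encoding can be illustrated with a short sketch. The final graphic was built with d3; the Python below is a stand-in that only emits the equivalent SVG markup. Every parameter here is an assumption: the gray hex color, the maximum radius, the maximum duration, and the reading of "80% translucent" as a fill opacity of 0.2.

```python
def eclipse_circle(cx, cy, duration_s, max_duration_s=750.0,
                   max_radius=12.0, opacity=0.2):
    """Return an SVG <circle> whose radius grows with eclipse duration.

    cx and cy are the already-scaled plot coordinates. The low fill
    opacity lets thousands of overlapping points build up density.
    """
    radius = max_radius * duration_s / max_duration_s
    return (
        f'<circle cx="{cx:.1f}" cy="{cy:.1f}" r="{radius:.2f}" '
        f'fill="#b0b0b0" fill-opacity="{opacity}" />'
    )


# Hypothetical point: plotted at (10, 20) with a 375-second duration.
print(eclipse_circle(10, 20, 375))
```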

Next, since eclipse magnitude is not necessarily intuitive or that well understood for many people, I thought some explanatory text and graphics in the middle would help.  Not too much text, two simple images that most can conceptualize and understand.

One interesting thing that I discovered is that doing text wrap in SVG is not that easy. You actually need to create a JavaScript function to do it. Since I was only using a minimal amount of text, I just manually created the word wrap by creating multiple text elements. If I was using SVG to create a poster or something with more extensive text, I would likely go the function route.
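As a rough idea of what the function route could look like, here is a minimal sketch. It is written in Python rather than JavaScript and simply generates one SVG text element per line. It wraps on a character count, which is only an approximation because it knows nothing about the rendered width of the font; the function name, the 30-character limit, and the line height are all assumptions.

```python
import textwrap
from xml.sax.saxutils import escape


def wrapped_svg_text(text, x, y, max_chars=30, line_height=18):
    """Break text into lines and emit one SVG <text> element per line.

    Each line is placed line_height pixels below the previous one.
    """
    lines = textwrap.wrap(text, width=max_chars)
    return [
        f'<text x="{x}" y="{y + i * line_height}">{escape(line)}</text>'
        for i, line in enumerate(lines)
    ]


# Hypothetical caption text, used only to demonstrate the wrapping.
for element in wrapped_svg_text(
    "An annular eclipse leaves a ring of the sun visible around the moon.",
    x=40, y=100,
):
    print(element)
```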

Lastly, I thought about making the title a little more interesting. When I first saw the plot in Sketch 1, it reminded me of wings. Then, the annular ring reminded me of a halo. With those pieces of imagery in mind, I came up with the title for the data visualization: “On Angel’s Wings? Eclipse Magnitude Lets You See the Heavens.” While a little metaphorical, the title is a lot more interesting than “Eclipse Magnitude vs. Eclipse Duration.”

Conclusion

How did my new version do against the recommendations:

  1. Explore a different feature that was not utilized by the previous versions.
    • Achieved! The magnitude of the eclipse was not featured, and it made for an interesting visualization.
  2. Create a simplified design that does not overwhelm.
    • Achieved! It’s a simple scatterplot, with a little bit of explanatory text and imagery.  There is a lot being explained with a minimal amount of items.
  3. Reduce the use of overly burdensome labels and put more focus on the data itself.
    • Achieved! Despite having thousands of data points, there is a minimal amount of labeling needed.
  4. Make it more interesting; something that might attract a casual reader to stop and review the data visualization instead.
    • Achieved!  The shape of the plot is unlike anything I’ve seen before, which should draw some interest right away.  The descriptive text also provides some education. The coloring is clean and consistent.

This visualization and the one in the prior case study on Baby Boomers have really challenged me to make sure that, in addition to making the graphics accurate and simple, I also make them interesting.

While my recent visualizations may have issues of their own, overall they are improving. They are accurate and are also interesting items that people will want to read and understand. People understanding the information and then being able to use it in their own lives is our ultimate goal with data visualization.

Thank you for your time, and please leave any comments below. I would love to hear from you.

References:

  1. http://moonblink.info/Eclipse/lists/solcat
  2. https://trimydata.com/2017/08/20/mm-week34/
  3. http://www.vizwiz.com/2017/08/solar-eclipses.html
  4. http://www.makeovermonday.co.uk/data/
  5. https://books.google.com/books/about/Elements_of_Map_Projection_with_Applicat.html?id=0QnlAAAAMAAJ&printsec=frontcover&source=kp_read_button#v=onepage&q&f=false
  6. http://geoawesomeness.com/amazing-image-1921-will-explain-essence-map-projections/

Case Study: How Baby Boomers Describe Themselves

If you search for bad infographics, you’ll regularly come up with this one from Beyond.com (now Nexxt.com) about Baby Boomers and HR professionals:

Something isn’t right with those percentages

The reason why is two-fold. First, the percentages add up to over 100%, indicating multiple answers were possible. Second, the colored areas do not seem proportional to the percentages. There was likely a well-meaning designer behind this visualization, because the rest of the infographic is rather pleasant.

Alright, let’s jump into the full rubric now:

Overview

The point of the visualization is to highlight the differences between how Baby Boomers perceive themselves and how Human Resources professionals view Baby Boomers. This is measured on five different indicators, which are contrasted side-by-side. The technique for representing this is two generic stick-figure silhouettes, each colored to represent the percentages for the five different attributes. At the bottom, there is also a lot of text explaining what Baby Boomers should do to compensate for these differences in perception.

Explanation of data

The dataset is rather straightforward. There are two sets of five percentages indicating whether Boomers are perceived as Leaders, Willing to Learn, Tech-Savvy, People-Savvy, or Creative. These two sets are contrasted with each other. Because we have five attributes, we have five dimensions to represent, with just two values for each of them.

It is relatively safe to assume that the 6,361 individuals surveyed were allowed to select more than one attribute.  This is not noted, which leads to some misleading numbers.  The left side of the image sums to 243%, while the right side sums to 162%.  So, even if we ignore the fact that the percentages are over 100%, the two sides do not add up to the same total, which compounds the problematic comparison further.  We need to display the values in a much clearer way.

Recommendation 1: Since each individual attribute is capable of going to 100%, the visualization should reflect this potential maximum.

Explanation of visualization techniques

This data visualization essentially attempts to use a modified pie chart.  On a graphic like this, we are trained to equate colored area with proportional values.  It appears the creator wanted to visualize things as parts of a whole, and that everything would equal 100%.  Looking only at the left side of the image, we could think this was an innocent mistake made by treating the percentages as integers (thus overflowing the 100%).  On the right side, we see that this is not the case.

17% > 44%. Wait, what?

This is where the graphic completely loses the reader.  The purple, representing 44%, is significantly smaller than the aqua 17%.  Either there was a typo with the numbers, a change in values, or someone just wanted to make boomers look a lot more creative than they actually were.

Recommendation 2: Since we have multiple percentages, we should not use a single pie-chart like method for visualizing the data.

Another theory is that someone just wasn’t paying attention to the numbers.  There’s a similar visualization that this site generated on millennials, and it is just as misleading.

Only slightly better

Seemingly, sometimes, pretty colors just win.

Effectiveness of the visualization

This visualization struggles greatly to meet its objective, since the coloring and percentages are misleading.  The reader is only able to glean meaning by comparing the numerical values for each category.  The only thing going for this visualization is that at least they used the same colors on each side of the graphics to represent the five dimensions.

The reality of this visualization is that almost any other technique would have done a better job of accurately representing the data.  The previous recommendation already stated going away from the pie-chart means of representing the data, so no need to repeat that recommendation.

Integrity of the visualization

As I mentioned earlier, the representations of the numbers and the height are completely skewed.  The right side of the visualization only represents two-thirds as many percent (as much as that makes sense to say) as the left side.  The height of each segment within the stick figure is out of proportion within each stick figure and across each one.  For example, look at the head of each stick-figure Boomer.

40% > 55%. Huh?

The left side, at 40%, occupies the whole head, while on the right side, at 55%, it doesn’t even cover the whole head.  As soon as someone begins looking at the image, their sense of reality gets distorted.

Recommendation 3:   All numbers need to be on the same scale and proportion for both sides of the visualization.
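The mismatch in the heads can be checked with the numbers already quoted. The following is a minimal sketch, assuming only the two totals (243% and 162%) and the two head values (40% and 55%) from the text; the function names are made up for illustration. It computes how much of the figure each head segment should fill under a parts-of-a-whole reading and under an independent 0-100% reading.

```python
# Totals and head-segment percentages quoted in the text above.
LEFT_TOTAL, RIGHT_TOTAL = 243, 162   # sum of the five percentages on each side
LEFT_HEAD, RIGHT_HEAD = 40, 55       # percentage drawn in each figure's head


def share_of_figure(percent, total):
    """Fraction of the figure a segment fills if the figure is read as one whole."""
    return percent / total


def share_of_full_scale(percent):
    """Fraction of a 0-100% scale a value fills when each trait stands alone."""
    return percent / 100


left_pie = share_of_figure(LEFT_HEAD, LEFT_TOTAL)
right_pie = share_of_figure(RIGHT_HEAD, RIGHT_TOTAL)
print(f"parts-of-a-whole reading: left {left_pie:.1%}, right {right_pie:.1%}")
print(f"independent 0-100% reading: left {share_of_full_scale(LEFT_HEAD):.0%}, "
      f"right {share_of_full_scale(RIGHT_HEAD):.0%}")
print(f"right total / left total = {RIGHT_TOTAL / LEFT_TOTAL:.2f}")
```

Under either reading the 55% segment should be the larger one, which is the opposite of what the stick figures show.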

Design

Overall, the design of the infographic is actually not that bad.  The color palette could run into some issues with black and white printing or color-blindness.  In its full color, it is captivating and draws attention to the center graphic.  We only run into problems with the graphic when we stop to read it.

There is something of  value at the bottom of the chart.  It’s the text that provides tips to Baby Boomers on how to overcome the stereotypes.  That is something of real value.  These conclusions are based on the differences in the five dimensions.  This is useful information.

These help! That’s good stuff.

I actually like that text enough that I’m not even going to bother trying to change it.  Those are clear direct actions that someone can take.

If the reader gives up trying to divine meaning from the images, and decides to read the text, there might actually be something useful here.

Interesting

This is definitely meant for a mass audience–Baby Boomers.  There is also a specific niche market that would find it interesting–hiring managers and HR professionals.  This graphic does have the challenge of being interesting to both audiences.

Because the design (prior to reading) is done well, it should be able to draw attention to it for both of these audiences.  The text is the only thing that saves it for either audience.

Overall, it is an interesting comparison because it addresses a real problem faced by hiring managers, HR, and Baby Boomers.  The problems faced by Baby Boomers are of interest to all of these groups.

How Baby Boomers Describe Themselves – Fixed

First, let me recap the recommendations from the analysis section:

  1. Since each individual attribute is capable of going to 100%, the visualization should reflect this potential maximum.
  2. Since we have multiple percentages, we should not use a single pie-chart like method for visualizing the data.
  3. All numbers need to be on the same scale and proportion for both sides of the visualization.

This was actually the smallest set of recommendations to date out of all the case studies.  This is mainly attributed to the fact that other than the stick figures, the rest of the visualization was actually well done, and even the color scheme wasn’t that bad.

First, I took a simple look at the attributes on a couple of bar graphs to stoke some ideas.

Sketch 1 – A clean horizontal bar chart. Not bad.

This isn’t that bad to begin with.  The biggest strength this has is that you are able to clearly see the differences between each of the traits as the two values are plotted right next to each other.

Sketch 2 – This is actually pretty good.  Go bar charts!

The vertical bars in Sketch 2 actually work much better than the horizontal ones in Sketch 1.  If this was a technical journal, I would be done at this point.  Either bar chart in Sketch 1 or 2 would be more than sufficient.  However, the audience on this data visualization was mass media, as Baby Boomers would be targeted.  So, since it’s mass media, it has to be more interesting than a simple bar chart.

Those two bar charts did give me a good feel for the data.  I’ve been looking for an opportunity to use a waffle chart lately, so I thought since we were talking percentages, this would be a good time to experiment with the waffle chart.

Final Sketch – Waffle Chart for the win! I really like how this one came out.

YOWZA! That’s a fine looking data visualization.  You can clearly see the side-by-side comparison of where HR professionals and Baby Boomers differ since it is center-aligned.  The colors work in black-and-white and for color-blind readers.  The amount of data-ink is minimized.

Most importantly, it’s interesting to look at, and you could drop it right into the middle of that infographic.

In designing it, I was not sure about how to make sure it was clear how the waffle chart was to be read, although I do think they are common enough that most people would understand them intuitively.  I wrote one line under the title as a descriptor (perhaps too small) saying:

The colored portion of each 10×10 square represents the percent of Boomers who embody that trait according to the corresponding group

I’m not sure if that is the best text, but it’s descriptive enough.  I would gladly welcome suggestions on how that could be refined.  When I researched how best to label the waffle chart, I discovered that most waffle charts actually tend to put the actual percentage inside the waffle grid.  You can see in these many examples from a Google Image search for Waffle Charts:

Why the percentages? Are the waffles not doing their job?

Using the percentages confused me.  That makes it seem like the entire waffle chart is unnecessary, or the label is unnecessary.  By including the actual percentage instead of having it gleaned from the grid, all we are doing effectively is simply adding background art and making our data table much bigger than it needed to be.  If it’s for making it clear for a mass audience, then perhaps it’s worth it.  Otherwise, I would leave it off, as it is stating something already indicated and deducible from the waffle chart.
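The reading rule for a waffle chart (each filled cell in a 10×10 grid stands for one percentage point) can be sketched in a few lines. This is only an illustration of the idea, not the tool used to draw the final sketch; the function name `waffle_rows` and the fill characters are assumptions, and 40 is one of the percentages quoted earlier.

```python
def waffle_rows(percent, size=10, filled="#", empty="."):
    """Return the rows of a size x size waffle grid with `percent` of cells filled.

    Cells fill row by row from the top left, so with size=10 each filled
    cell stands for one percentage point.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    total_cells = size * size
    filled_cells = round(percent / 100 * total_cells)
    cells = [filled] * filled_cells + [empty] * (total_cells - filled_cells)
    return ["".join(cells[r * size:(r + 1) * size]) for r in range(size)]


for row in waffle_rows(40):
    print(row)
```

Because every grid has the same 100 cells, two waffles placed side by side are automatically on the same scale, and a reader can count cells to recover the value without a printed label.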

As an aside, one of my grammatical pet peeves (yes, there are multiple ones) is with lists.  Each list item should be the same part of speech (noun, adjective, verb), and you do not mix parts of speech.  The original version had a noun (leader) mixed with adjectives (creative, tech-savvy, etc).  I fixed that in this final version by changing leader to leadership.  Details matter.

Conclusion

Let’s see how I did with the recommendations:

  1. Since each individual attribute is capable of going to 100%, the visualization should reflect this potential maximum.
    • Achieved! The waffle chart and bar chart sketches all allow us to see that each item could go as high as 100%.
  2. Since we have multiple percentages, we should not use a single pie-chart like method for visualizing the data.
    • Achieved! This was perhaps the easiest thing.  In fact, I went with 10 different pies (square, waffle-shaped pies) that a
  3. All numbers need to be on the same scale and proportion for both sides of the visualization.
    • Achieved! Each square on each waffle chart is exactly the same amount as every other square.  You can now properly visually compare the area of the Boomers against the HR Professionals.

Overall, I’m really happy with the way this one came out.  It’s much more visually interesting than the other visualizations that I’ve done.  I feel like I have definitely internalized a lot of the lessons learned from the prior cases.  It is definitely possible to make something accurate and interesting.

Right now, post your thoughts and any questions in comments.  Thanks!

References

  1. https://www.nexxt.com/articles/infographic-shows-what-hr-pros-think-of-millennials-12625-article.html
  2. https://about.nexxt.com/infographics/how-veteran-hr-professionals-really-feel-about-job-seekers-from-the-millennial-generation/
  3. https://www.google.com/search?q=waffle+chart&tbm=isch
  4. https://vizfix.com/lessons-learned-so-far/

Case Study: The Bitcoin Economy

For this next case study, I came across an interesting visualization attempting to put the Bitcoin economy into perspective.  Bitcoin has been on a tear lately, which has to make most people wonder how big a deal it is.  We found a visualization that tried to do just that:

The Bitcoin Economy in Perspective (from HowMuch.net)

There are quite a few things here that just make me cringe.  So, let’s get into our analysis, and then figure out how we might want to improve it.

Note: According to Coin Desk, the price of a bitcoin was approximately $2800 on June 21, 2017, when this graphic was published.  As of publication, it is nearing $10000.  For consistency with all of the other numbers, we will keep using the June number.

Overview

The goal of this data visualization is to show how small Bitcoin is relative to the monetary supply of the world.  The data used are a mix of meaningful, and not so meaningful, monetary markers.  The data are visualized with circles representing the relative size of each marker.  There are several factors that impact the ability of this visualization to present a clear, unbiased image.

Explanation of data

The creator of the graphic uses a mix of monetary markers to represent the relative strength of Bitcoin.  The first issue that I came across is that the total value of Bitcoin is on one end of the values.  We only find out the value of things worth more than Bitcoin.  We don’t learn the things it has already surpassed.  This presents a bias in the data intended to show that Bitcoin is still relatively insignificant.

Recommendation 1: Supplement the revised chart with similar monetary markers that are below the value of Bitcoin.

The monetary markers used are, for the most part, fairly meaningful.  The creator used the total value of stocks, money, physical money, gold, and U.S. currency.  These are things that Bitcoin is often compared to, so these are good comparisons.  The next set of comparison markers, companies and wealthy individuals, is not as meaningful.  Perhaps these are comparisons that some people would understand.  Larry Page is worth as much as all Bitcoin.  Great! Not sure if it adds anything of real value.  It sends the message that Bitcoin isn’t worth as much as a person.  Again, it would be more interesting to see the names of individuals and companies that Bitcoin was already worth more than, and that might add some context for comparison to individual wealth.  Overall, I think the dataset is credible, perhaps not the most relevant, and might be skewed to paint a particular picture.

This is one-dimensional data.  Simply, the value of X is Y.  One thing to note is that the complete dataset ranges from $41B to $83.6T.  That’s three orders of magnitude that are not adequately represented.  To properly show that in a graphic, we would likely need to use a logarithmic scale.

Recommendation 2: Use a logarithmic scale to represent the monetary scale.
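The "three orders of magnitude" figure is quick to verify. A minimal sketch, using only the $41B and $83.6T endpoints quoted above (the variable names are illustrative), that also lists the powers of ten a logarithmic axis would need to cover the range:

```python
import math

SMALLEST = 41e9     # Bitcoin, $41B
LARGEST = 83.6e12   # top of the dataset, $83.6T

ratio = LARGEST / SMALLEST
orders_of_magnitude = math.log10(ratio)

# Powers of ten a log axis needs in order to bracket the whole range.
first_tick = math.floor(math.log10(SMALLEST))
last_tick = math.ceil(math.log10(LARGEST))
ticks = [10 ** exponent for exponent in range(first_tick, last_tick + 1)]

print(f"largest / smallest = {ratio:,.0f}x")
print(f"orders of magnitude spanned = {orders_of_magnitude:.2f}")
print("log-axis ticks:", ", ".join(f"${tick:,}" for tick in ticks))
```

The largest marker is roughly two thousand times the smallest, so on a linear axis the small end of the dataset would be almost invisible.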

An interesting thing to note is the url for the article suggests that the title was originally “World’s Money in Perspective,” and was later changed to “The Bitcoin Economy, in Perspective.”  Bitcoin being the hot new thing, a copy writer may have thought that would garner it more attention.

Explanation of visualization techniques

The monetary values are visualized along an axis with the most valuable marker, “All Money,” on the left, and with Bitcoin, the least valuable, on the far right.  Each marker’s total value is represented by the area of a circle.  Actually, I think it’s area, but I’m not positive.  This is the first issue I ran into: trying to determine whether these circles are sized in proportion to their values.  Look at the image below of the right side of the visualization with some area approximations.

What’s going on the right side?

The right side of the visualization is important because it is where we will anchor our understanding of the value of Bitcoin in the graphic.  What I did here was measure the radius in pixels, and find the approximate area in pixels.  You would expect the area for Bitcoin to be about one-tenth the area of Amazon, but it is only about one-fifth.  Something is not quite right here.

Then, I figured it out with some back-of-the-envelope (okay, I used Excel) calculations.   I have two theories as to how these circles were portioned.  The first one is that you have to take a look at the shading.  It has that lighting effect to make it look three-dimensional.  As in, those are individual spheres, not circles.  The value of each monetary marker is represented by the volume of the sphere.  If this is the case, the graphic falls well short, and over-complicates the graphic.  Why make it three-dimensional when one will do?  Use area if you must and make it two-dimensional.

The other theory I have is that the area represents the logarithmic value of the monetary marker.  In this case, the graphic recognizes that the data goes across three orders of magnitude and must do something to account for it.  If this is the case, the biggest issue is that it’s not indicated anywhere! How are we supposed to know this?  It’s not a terrible method.  Although, in this case we are still using two dimensions to represent one-dimensional data.

I can’t know for certain which method was employed, since I’m limited to measuring the radius from the screen.  The one thing I am certain of is that either method over-complicates a rather straightforward dataset.
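The candidate encodings can at least be compared on paper. The sketch below is a back-of-the-envelope check, not a reconstruction of the designer's method. It assumes a marker worth ten times Bitcoin (taken from the "one-tenth" expectation above), uses the $41B Bitcoin value, and assumes the log theory means the area tracks log10 of the dollar value. For each encoding it predicts how many times larger the bigger circle would be drawn.

```python
import math

BITCOIN_VALUE = 41e9   # from the article
VALUE_RATIO = 10       # assumption: the comparison marker is worth 10x Bitcoin
other_value = BITCOIN_VALUE * VALUE_RATIO

# Predicted drawn-area ratio (larger circle / Bitcoin circle) per encoding.
# 1. Area proportional to value.
area_encoding = VALUE_RATIO
# 2. Sphere volume proportional to value: radius ~ value^(1/3),
#    so the drawn silhouette's area ~ value^(2/3).
sphere_encoding = VALUE_RATIO ** (2 / 3)
# 3. Area proportional to log10 of the dollar value.
log_encoding = math.log10(other_value) / math.log10(BITCOIN_VALUE)

print(f"area ~ value:        {area_encoding:.2f}x")
print(f"volume ~ value:      {sphere_encoding:.2f}x")
print(f"area ~ log10(value): {log_encoding:.2f}x")
```

Under these assumptions the sphere reading predicts a Bitcoin circle of a little more than one-fifth the area, which is the closest of the three to the on-screen measurement. That is suggestive at best, since the measurement itself is an approximation.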

Recommendation 3: Since it is a one-dimensional dataset, represent it in one dimension.

Effectiveness of the visualization

An important question to ask is whether this visualization achieved its objective.  As I understand from reading the article, the goal of this visualization is to show that despite Bitcoin’s amazing growth, it is still just a tiny amount of the world monetary universe.  It manages this despite the questionable techniques used in the design of the circles.

One of the things this does that is a bit unusual is that it put the largest value on the left, and the smallest one on the right.  I found this particularly odd because the smallest one is the data point of interest.  If you were trying to tell a story about things bigger than Bitcoin, it would be more intuitive to start on the left, and then go to the right.  I think this flip-flopping of large to small causes a delay in understanding the visualization.  Because the eye sees those large circles first, a reader could think that Bitcoin has amassed a large amount of relative value, and is not as insignificant as the others.

Recommendation 4: Present the information in a more intuitive way, which is likely to be left to right, or top to bottom.

Integrity of the visualization

As I mentioned earlier, understanding the relative size of Bitcoin is a bit distorted, since we do not have any monetary markers below the value of Bitcoin for comparison.  I’ve already recommended including some monetary markers below the value of Bitcoin.  This will allow the data to give a more complete picture of how big Bitcoin already is and how far it has to go before it is a major player.

Another thing that is a bit misleading is the labeling.  Each label is placed slightly below the one to its right, implying a hierarchy and value.

These labels are adding meaning that’s not there!

The big issue is that despite the fact that Larry Page and Bitcoin are the same value, the lower placement of the Larry Page label implies that Larry Page is worth more than Bitcoin.   The main reason they are oriented this way is because it looks nice.  It doesn’t add one bit of actual information in the way it is arranged, and it implies information that is not there.

Recommendation 5: Display the labels so that they do not add any implied value to the chart.

I should also again note that the circles do not represent anything intuitive.  Either the circles represent the log10 scale of the value of the monetary markers, or the circles are actually spheres that represent the monetary markers’ values.  It should definitely be clear what the actual value of the monetary marker is.  The only reason we know the value of each marker is that it’s written out.  That’s a waste of good data-ink.

Recommendation 6: Since it is one-dimensional data, we should be able to utilize a graphing method that makes labeling the values unnecessary.

Design

Despite all of the issues mentioned earlier, I don’t think it’s poorly designed.  I like the choice of pink as the primary color because it’s distinctive, and draws your eye to the visualization.

Another good feature of the design is that it is aligned along a single axis.  However, it mars the effectiveness of this single axis, by then using the area of the circle (or volume of a sphere!).  This causes some awkwardness when on the left side of the visualization the circles overlap, and on the right side they have plenty of space.  As Recommendation 3 states, we need to pick a single axis approach.

The little icons used with the labels are all a different style.  Some are logos, some are headshots, and some are clipart-style representations.  These are not a cohesive visual strategy.  In fact, with the labels there, there is no actual need for the imagery.  It’s extraneous data-ink.  We could probably eliminate those images altogether.

Recommendation 7: Remove the images (or the text labels).  Having both of them is redundant.

The Bitcoin Economy – Fixed

In total, there were seven different recommendations on enhancing this visualization to be more truthful and effective.  Let’s review:

  1. Supplement the revised chart with similar monetary markers that are below the value of Bitcoin.
  2. Use a logarithmic scale to represent the monetary scale.
  3. Since it is a one-dimensional dataset, represent it in one dimension.
  4. Present the information in a more intuitive way, which is likely to be left to right, or top to bottom.
  5. Display the labels so that they do not add any implied value to the chart.
  6. Since it is one-dimensional data, we should be able to utilize a graphing method that makes labeling the values unnecessary.
  7. Remove the images (or the text labels).  Having both of them is redundant.

Before I decided on what format the visualization should take, I sought out new data.  Let’s start with the people that are used as monetary markers.  Larry Page and Bill Gates are well known, but not nearly as well known as, say, Tiger Woods and Michael Jordan.  According to Celebrity Net Worth, Michael Jordan is worth $1.5 billion, while Tiger Woods is worth $740 million.  These are well-known athletes that everyone knows to be crazy rich.  The fact that Bitcoin has blown by them in value does say that Bitcoin is at least somewhat significant.

Another data point that we’ll add to the set is the Gross Domestic Product of a couple of countries.  Bitcoin’s value of $41B is higher than roughly half of the countries-by-GDP listing.  Some of the highlights are it being larger than first-world Iceland’s GDP of $23B and oil-rich Bahrain’s GDP of $32B.

Lastly, we’ll add the market cap of Twitter ($17B).  Bitcoin is more than double the value of a relatively common stock.  It’s a bit of a cherry-picked number, as there is nothing special about Twitter other than it being well-known.  The reason to add this point is to give additional depth to understanding the significance of Bitcoin’s value.

We have one monetary indicator to add – the amount of interest the United States pays each month to service its debt.  For October 2018, that number was $24B.  In other words, all of the world’s bitcoin could service the interest on the U.S. Debt for about one-and-a-half months.
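The comparisons above all reduce to one division each. As a quick sketch using only the dollar figures quoted in this section (the dictionary keys are shorthand labels), this prints how many times larger Bitcoin's $41B is than each added marker:

```python
BITCOIN = 41e9  # total Bitcoin value used throughout, $41B

# Added markers and their dollar values as quoted in the text.
MARKERS = {
    "Tiger Woods": 740e6,
    "Michael Jordan": 1.5e9,
    "Twitter market cap": 17e9,
    "Iceland GDP": 23e9,
    "U.S. monthly interest": 24e9,
    "Bahrain GDP": 32e9,
}

multiples = {name: BITCOIN / value for name, value in MARKERS.items()}
for name, multiple in sorted(multiples.items(), key=lambda item: item[1]):
    print(f"{name:<22} Bitcoin is {multiple:5.1f}x")
```

For the interest figure, the multiple reads directly as the number of months of debt service that all Bitcoin would cover.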

Just for fun, since this was a relatively simple dataset (17 data points, including my six additional ones) I decided to limit myself to using Excel.  I know what you’re saying — Excel is guilty of many visual atrocities past and present.  My thought is that creating a clean data visualization of such a simple dataset should be easy to do with just about any tool.

Before I show you what worked, I’ll throw out some of the options that I looked at before finalizing my visualization.  In this first one, I thought I would see if I could make the circles work:

Sketch 1 – Circles? No circles.

That didn’t work, and Sketch 1 was the circle version that looked the best.  Perhaps if it was multi-dimensional and the big circles didn’t overlap, we would entertain it for more than the two seconds I entertained it.  I should note, the black and white coloring does look nice and simple.  It does have that going for it.

Next up was our good friend the donut chart, also known as the pie chart’s hipper, younger brother.

Sketch 2 – Donut chart? No….

I’ll understand if you need a moment to collect your thoughts after seeing that.  It’s a bit overwhelming, I know.  Aside from the fact that you can’t see the Bitcoin value, you can’t see much of anything else.  This is why people hate Excel.

How about we try something in three-dimensions?  It would be counter-intuitive to go three dimensional, but sometimes the opposite of what you think would be true is true.

Sketch 3 – Three-dimensional conical bar chart.

Nope, that didn’t work.  And, I rescind my comment about donut charts being the reason people hate Excel.  That’s because Sketch 3 is the reason why people hate Excel.  Has anyone ever seen a good visualization using this technique?  Aside from the inability to determine dimension, to pick out the actual value, and to identify the labels, it’s just terrible to look at.  It looks like something that might be used in a circus side-show.  Next.

Alas, progress was made.  One of the methods that has intrigued me since I learned about it is the radar chart.  It has an interesting take on representing one-dimensional data.  Looking at the results in Sketch 4 below,  I can see why.

Sketch 4 – Radar Chart – I can live with this.

Sketch 4 isn’t that bad.  The labels are easy to pick out and so are the values of each individual monetary marker.  I would be fine with using this one if I had to.  My only complaint is that it does make the visualization a little more complicated than it needs to be.  It’s 17 data points; we should be able to do something simpler.

Which brought me to the decision I settled on: the bar chart.  Good ol’ bar chart.  Simplifying data for centuries.

Sketch 5 – I like this one!

This came out relatively nicely.  It’s clean.  The colors are easy on the eyes (I do have an affinity for orange, so easy on my eyes?), and I used a complementary color of blue to self-identify the Bitcoin bar as the bar of interest.  One of the things I did not like was that the labels are really small when they are horizontal.  Perhaps that’s the reason why the original dataset had 11 points.   In Sketch 6 below, you can see it with labels at a 45-degree angle.  Honestly, either Sketch 5 or 6 work for me.

Sketch 6 – Or this one

One thing that I should note is that I put these on a logarithmic scale.  That’s the only way you can make it so you can see the relative size of everything.  Otherwise, almost everything but the last four data points would be essentially zero.  You do risk that individuals won’t notice the scale or won’t know how to read it.  Using the log scale is a reasonable trade-off to make here.
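To see why the linear scale collapses, here is a rough text-only sketch (not the Excel chart itself). It uses only the markers whose dollar values are quoted in the text, plus the $83.6T top of the range as "Largest marker". The `floor` baseline of $100M for the log axis is an assumption, and the bar lengths on a log axis depend on that choice.

```python
import math

# Dollar values quoted in the text; "Largest marker" is the $83.6T top of the range.
MARKERS = {
    "Tiger Woods": 740e6,
    "Michael Jordan": 1.5e9,
    "Twitter market cap": 17e9,
    "Iceland GDP": 23e9,
    "U.S. monthly interest": 24e9,
    "Bahrain GDP": 32e9,
    "Bitcoin": 41e9,
    "Largest marker": 83.6e12,
}


def bar_lengths(values, width=40, log=False, floor=1e8):
    """Scale each value to a bar length in characters.

    With log=True the axis runs from `floor` (an assumed baseline) to the
    largest value on a log10 scale; otherwise it runs linearly from zero.
    """
    top = max(values.values())
    lengths = {}
    for name, value in values.items():
        if log:
            fraction = math.log10(value / floor) / math.log10(top / floor)
        else:
            fraction = value / top
        lengths[name] = round(width * fraction)
    return lengths


linear = bar_lengths(MARKERS)
logarithmic = bar_lengths(MARKERS, log=True)
for name in MARKERS:
    print(f"{name:<22} linear |{'#' * linear[name]:<40}| "
          f"log |{'#' * logarithmic[name]:<40}|")
```

On the linear axis every bar except the largest rounds to zero characters, while the log axis keeps all of them visible and in the right order.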

I did take a look at using the original data, and I kind of like that one a little bit better.

Sketch 7 – With Bitcoin back on one end.

In Sketch 7, I moved Bitcoin to the left.  Even on the logarithmic scale, you can see the huge difference in the orders of magnitude between Bitcoin and the rest of the monetary market.  It also allows the labels to get large enough that they are much more legible while horizontal.

Conclusion

So, let’s see how I did against the seven recommendations:

  1. Supplement the revised chart with similar monetary markers that are below the value of Bitcoin.
    • Achieved! I found six other data points that met the same spirit of the original data points.  In Sketch 7, I also found that by putting it on the right scale and visualization, the supplemental data points may not be as useful.
  2. Use a logarithmic scale to represent the monetary scale.
    • Achieved! That was the only way this data set looked good once visualized.
  3. Since it is a one-dimensional dataset, represent it in one dimension.
    • Achieved! Found that a simple bar chart or radar chart does a fairly good job of creating an effective data visualization.
  4. Present the information in a more intuitive way, which is likely to be left to right, or top to bottom.
    • Achieved! Bar charts are classic, simple, and easy to understand.  When you learn about charts, bar charts are probably the first thing you learned.
  5. Display the labels so that they do not add any implied value to the chart.
    • Achieved! The labels are all at the same level.  No implied value imparted.
  6. Since it is one-dimensional data, we should be able to utilize a graphing method that makes labeling the values unnecessary.
    • Achieved! There’s a clean logarithmic scale that allows you to determine each value upon look-up.  We don’t need to include the values.
  7. Remove the images (or the text labels).  Having both of them is redundant.
    • Achieved! No extra images! Yeah! We didn’t waste ink on images that still need explanation.

Overall, score another win for simplicity.  Using Excel was an interesting experience, because it gave me a lot of different options to explore in a very short amount of time.   I easily could have selected something that made for a sub-par visualization.   By keeping my objectives in mind, I’m pretty happy with the results.

Please let me know your thoughts in the comments!

References:

  1. https://howmuch.net/articles/worlds-money-in-perspective
  2. https://www.coindesk.com/price/
  3. https://www.celebritynetworth.com/list/top-50-richest-athletes/
  4. https://finance.yahoo.com/quote/TWTR/
  5. https://www.treasurydirect.gov/govt/reports/ir/ir_expense.htm

Case Study: How India Eats

When I saw this image, I knew it had to be the first case study.  Why? Because I would clearly bite off more than I could handle, and I would learn a whole lot in the process.  To begin, Huffington Post India ran the article Vegetarian India A Myth? Survey Shows Over 70% Indians Eat Non-Veg, Telangana Tops List, which featured this map:

How India Eats – Original

It took me several read-throughs to understand what it was trying to say.  Probably about four, which is about three too many.  Perfect for modification.

Now, let’s break down the image according to our analysis standard.

Overview

The visualization aims to show which regions in India are vegetarian and which are non-vegetarian.  The data is pulled from census data by the Government of India.  The image represents the percentages of vegetarians with green circles and non-vegetarians with red circles.  These circles are then placed on top of each other, with the dominant (majority) color on the outside.  The circles are then placed in the approximate center of the Indian state they represent.

Explanation of data

First, let’s look at the data.  The data source is the government of India’s census site.  Specifically, this data is from the Sample Registration System Baseline Survey 2014, which was published in 2016.  The data set is a sample of 8858 individuals across India’s 36 states and territories.  The data about eating habits is one of many tables in the complete dataset.

Reviewing the dataset brings about a few different concerns about the dataset and how it is utilized:

  • The dataset never actually states how these individuals were polled and does not indicate if there was a potential selection bias.
  • Table 5.2, which is the basis of the dataset, breaks the rate of vegetarianism down by male and female.  The graphic appears to use a simple mean of those two numbers.  However, the dataset never states what the breakdown of males and females is overall, or within individual states.  In some states, the difference is as much as five percentage points.
  • With 8858 individuals polled, we have a relatively small sample error of about 1%.  When we look at the number of individuals used in the large states (which the map utilizes), the smallest sample number is 212 individuals, while the largest is 662.  This places the sample error for the individual states somewhere between 4% and 7%, which are not insignificant numbers.
  • The last thing is that the dataset lacks any context for the actual population of the state.  For example, Uttar Pradesh, which is the most populous state in India, has a population twice that of West Bengal.  However, in the survey, West Bengal had 555 individuals, while Uttar Pradesh is represented by 500.  This matters because West Bengal is less than 2% vegetarian, while Uttar Pradesh is about 47% vegetarian.  I seriously doubt that the dataset adequately represents the population.
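The sample-error figures in the list above can be reproduced with the usual margin-of-error formula for a proportion. This is a minimal sketch under assumptions the dataset does not confirm: simple random sampling, a 95% confidence level (z = 1.96), and the worst-case proportion p = 0.5. The function name is illustrative.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)


SAMPLES = [
    ("all of India", 8858),
    ("smallest large state", 212),
    ("largest large state", 662),
]

for label, n in SAMPLES:
    print(f"{label:<22} n={n:<5} margin = {margin_of_error(n):.1%}")
```

If the survey was clustered or otherwise not a simple random sample, the true margins would likely be wider than these.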

The data is one-dimensional data and is binary.  The individual states are either vegetarian or they are not.  There are a number of different ways we could represent this, none of which this graphic used.

It should be noted, the data visualization likely had good intentions.  The creator of the graphic was probably told to make a map, so they made a map, utilizing the data as best as possible.

Recommendation 1:  The data visualization should account for the fact that different states have different populations which are not currently represented.
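A small worked example shows how much the weighting can matter. The sketch below uses the two states discussed above, with the approximate figures from the text: "less than 2%" is rounded to 2%, "about 47%" is taken as 47%, and the populations are entered only as the stated 1:2 ratio. The field names are hypothetical.

```python
# Approximate figures from the text; populations are relative (UP is ~2x West Bengal).
STATES = {
    "West Bengal": {"sampled": 555, "veg_rate": 0.02, "relative_population": 1},
    "Uttar Pradesh": {"sampled": 500, "veg_rate": 0.47, "relative_population": 2},
}


def weighted_rate(states, weight_key):
    """Average the vegetarian rate across states, weighting by `weight_key`."""
    total_weight = sum(state[weight_key] for state in states.values())
    weighted_sum = sum(state["veg_rate"] * state[weight_key] for state in states.values())
    return weighted_sum / total_weight


by_sample = weighted_rate(STATES, "sampled")
by_population = weighted_rate(STATES, "relative_population")
print(f"weighted by survey sample size: {by_sample:.1%}")
print(f"weighted by population:         {by_population:.1%}")
```

For just these two states, weighting by population instead of by the number of people surveyed moves the combined vegetarian rate by several percentage points.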

Explanation of visualization techniques

Each time I look at this graphic, I’m actually a little bit more confused.  It starts off straightforward by placing a circle at the approximate center of each of the major Indian states (21 in total).  Then, there is a circle representing the percentage of vegetarianism and one for non-vegetarianism.  This is where it gets weird: whichever circle is over 50% begins to engulf the other one.  This seems fairly straightforward when you look at states with an overwhelming majority, as seen in Assam (AS) (non-vegetarian)

and Rajasthan (RJ) (vegetarian)

This technique causes some odd and confusing images.  Take the circle representing Madhya Pradesh (MP):

Almost the entire circle is red, which would make you think this state is overwhelmingly non-vegetarian.  Upon closer inspection, it is slightly in favor of vegetarianism!  Look at these two circles for Madhya Pradesh and Uttar Pradesh (UP):

Despite being only 3.5% different, they look completely different.  What would the circle have looked like if the state was exactly 50-50?  The graphic system breaks down at these important middle values.  The labels are absolutely necessary to get the correct meaning from each circle at the microscopic level, and no amount of labeling will prevent the graphic from misleading at the macroscopic level.

Recommendation 2: Use a visualization method that does not have the ambiguity caused by the overlapping circles.

This would also fail Tufte’s data-ink standards because it labels both vegetarians and non-vegetarians.  If there are only two possibilities, you only need to display the value of one, because we then know the other.  In fact, because of that, Tufte would likely say that this graphic displays twice as much data as it needs to.

Recommendation 3: Display only the pertinent half of the data.

Effectiveness of the visualization

So, despite the flaws stated above, is the visualization effective?  Well, sorta.  If you believe that the dataset accurately determined that India is about 70% non-vegetarian (this is questionable, but for now, let’s assume that’s correct), this graphic does appear to achieve its objective.  Visually, at a glance, it does appear to favor non-vegetarianism by about two-thirds.  So, the macroscopic level seems to reflect the bottom line of the dataset.  However, when you consider the flaws, one cannot help but suspect this is more luck than anything.

What is not clear is how it represents the meat preference of the population centers.  Maps are notorious for distorting population preferences because they falsely equate area with population.  This graphic takes this a step further by using outer circles of the same size for each state.  If the maximum size of each circle were scaled to the state’s population, we could actually account for this.

It does show that there seems to be a preference for non-vegetarianism the closer you get to the Indian coastline.  This is not impacted by the map area fallacy mentioned above, because we don’t care about the population of the region.

Recommendation 4: When accounting for the population centers (as in Recommendation 1) attempt to retain the geographic preference for non-vegetarianism.

Integrity of the visualization

In terms of the integrity of the visualization, it does a fair job of trying to represent the data from the census report (which we’ve already established as having its own issues).

Since there was not a composite number of vegetarianism by state, the creator averaged the male and female rates.  Given a lack of other options, this was not a terrible assumption.  While it introduced a little bit of error (as much as 5%), that would only be marginally perceived in a graphic.  While the purist in me hates it, I can live with it.
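To get a feel for how much the simple average can drift from a properly weighted one, here is a small sketch.  Every number in it is hypothetical: I assume a state where the male and female rates sit five points apart and where 52% of the population is male.  None of these values come from the dataset.

```python
def simple_mean(male_rate, female_rate):
    """Unweighted average of the two rates, as the graphic appears to use."""
    return (male_rate + female_rate) / 2

def weighted_mean(male_rate, female_rate, male_share):
    """Average weighted by the male share of the population."""
    return male_rate * male_share + female_rate * (1 - male_share)

# HYPOTHETICAL values for illustration only (not from the dataset).
male_rate, female_rate = 0.30, 0.35  # rates five points apart
male_share = 0.52                    # assumed share of males

simple = simple_mean(male_rate, female_rate)
weighted = weighted_mean(male_rate, female_rate, male_share)

print(f"simple mean:   {simple:.1%}")
print(f"weighted mean: {weighted:.1%}")
print(f"difference:    {abs(simple - weighted) * 100:.2f} percentage points")
```

With these made-up inputs the two averages differ by about a tenth of a percentage point.  The real gap depends on the actual male/female split in each state, which the dataset never gives us.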

The biggest bias that it has is the land-area-equals-population fallacy that many map representations suffer from.  While I know that India is an immensely populated country, I have no feel for how that population is distributed across the country, and the graphic does not help with that.  Previous recommendations have already stated that relative population needs to be accounted for in any improvement.

Design

The graphic chooses red for non-vegetarians (think meat) and green for vegetarians (for plants).  While those are somewhat logical choices given the topic, they are terrible for color-blind individuals, and likely will not print very well.

Recommendation 5: Select colors that can be universally understood.

The idea of using just the outline of the country is a nice approach.  However, the yellow background color is too close to the gray used for the outline of India.  Mentally, you have to pause to determine where the country outline is.  It distracts from the actual image.

Recommendation 6: If a map is used again, increase the contrast between the outline of India and the background color.

The labeling is a bit inconsistent.  Understandably, this is driven by the fact that some states are much bigger geographically than others.  This leads to labels being placed in different positions.  For example, in this snippet, three different styles of labeling the circles can be seen:

In one case, the state initial is in the center, in another, the inner percentage is, and in another, nothing is.  These subtle differences cause the user to expend additional thought to understand what is going on.

Recommendation 7:  Labels need to be applied consistently to all data points.

How India Eats – Fixed

After the analysis, we have seven different recommendations on how this graphic can be improved.  To review, those are:

  1. The data visualization should account for the fact that different states have different populations which are not currently represented.
  2. Use a visualization method that does not have the ambiguity caused by the overlapping circles.
  3. Display only the pertinent half of the data.
  4. When accounting for the population centers (as in Recommendation 1) attempt to retain the geographic preference for non-vegetarianism.
  5. Select colors that can be universally understood.
  6. If a map is used again, increase the contrast between the outline of India and the background color.
  7. Labels need to be applied consistently to all data points.

The immediate issue is: what do we do with the dataset?  We know that the methods for the data collection are not known, and that there could be unknown biases.  However, there is no additional data pertaining to the percentages of vegetarianism.  I know that the errors range between four and seven percent for individual states.  While this makes me a bit queasy to say and do, we will use the percentages from the graphic as-is.  This does introduce some error to the graphic, which I will attempt to account for.  However, if it cannot be done, we should still be able to pull out overall trends.

Since we did like the map idea, as it does show a geographic trend, and showing trends is a good thing, I sketched a map of India with the circles on top.  Except, this time, the shade of the circles would represent only the vegetarian percentage, while the area would represent the population.

Sketch 1 – Circles represent population, and shade represents percent vegetarian.

If you look really hard at this, you can see some really faint circles.  That’s because a few areas have very small percentages of vegetarians.  This sketch does achieve the objective of showing that the more vegetarian population is concentrated toward the northwest of India.  I think I can do better.
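One detail worth spelling out for circles like those in Sketch 1: if the area is supposed to represent population, the radius has to scale with the square root of the population, not the population itself.  Here is a minimal sketch of that scaling.  The function name and the maximum radius of 20 units are my own assumptions, and the 1:2 population ratio is the West Bengal to Uttar Pradesh comparison from earlier.

```python
import math

def radius_for_population(population, max_population, max_radius):
    """Radius that makes circle AREA proportional to population.

    The most populous state gets max_radius; every other state is
    scaled by the square root of its population ratio.
    """
    return max_radius * math.sqrt(population / max_population)

# Relative populations only: West Bengal = 1, Uttar Pradesh = 2.
max_radius = 20.0  # assumed drawing units
r_up = radius_for_population(2, 2, max_radius)
r_wb = radius_for_population(1, 2, max_radius)

area_ratio = (math.pi * r_up ** 2) / (math.pi * r_wb ** 2)
print(f"Uttar Pradesh radius: {r_up:.1f}")
print(f"West Bengal radius:   {r_wb:.1f}")
print(f"area ratio:           {area_ratio:.1f}")
```

Scaling the radius directly by population would give the larger state four times the area instead of twice, which exaggerates the difference.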

In the next sketch, I added a gradient, since that would ensure that I can show the range of vegetarianism in the various regions.  For fun, I also attached labels to it:

Sketch 2 – Color gradient, where yellow is non-vegetarian and blue is vegetarian. Labels added for fun.

I think Sketch 2 is getting better.  You can see the trend of vegetarianism to the northwest, and non-vegetarianism to the southeast.  Also, I realize that placing the names of the individual states does not actually add much information, since the map already shows it.  Since this map was presented to an Indian audience (it was Huffington Post India), it’s safe to assume they generally know where their states are located.  So, adding those labels would be a waste of data-ink.  The color gradient did work out nicely, though, with our color-blind friendly coloring.
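For anyone who wants to build a gradient like the one in Sketch 2, here is a rough sketch of one way to map a vegetarian percentage onto a yellow-to-blue scale with straight linear interpolation in RGB.  The exact RGB endpoints and the function names are my assumptions; this is not necessarily how the sketch itself was colored.

```python
# Assumed endpoint colors for the gradient.
YELLOW = (255, 255, 0)  # 0% vegetarian
BLUE = (0, 0, 255)      # 100% vegetarian

def lerp_color(start_rgb, end_rgb, t):
    """Linearly interpolate between two RGB colors; t is clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return tuple(round(s + (e - s) * t) for s, e in zip(start_rgb, end_rgb))

def to_hex(rgb):
    """Format an (r, g, b) tuple as a hex color string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)

# Approximate vegetarian rates mentioned earlier in the post.
for state, pct in [("West Bengal", 0.02), ("Uttar Pradesh", 0.47)]:
    print(state, to_hex(lerp_color(YELLOW, BLUE, pct)))
```

Yellow and blue are a reasonable pairing here because they stay distinguishable under the common red-green forms of color blindness.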

I got to thinking that I could attempt to put the gradient on the individual states.  However, that would just do everything we hate about map graphics, where we commonly equate area with population.  By looking at the map above, it’s clear that population and area differ greatly across these many states.

I did think that there was something interesting about the geographic position, namely the latitude and the longitude.  The map itself was not that relevant.  Going back to the basic Tufte principles, we should strive for the minimal amount of data-ink.  After doing some complex geo-mapping and overlaying of a gradient, I went back to basics and attempted to graph the data as scatterplots.

Sketch 3 – Simple Scatterplot of North Latitude Versus Percentage Vegetarian

That’s a beautiful graph!  It’s simple, and we can see that as you go farther north (increasing degrees north latitude), the population generally trends toward vegetarianism, although there is a wide swath of possibilities there.  There is not a ton of extra ink, just enough to be descriptive.  I did think about putting in a trend-line, but I thought that would be extraneous, as your eye naturally tells you there is a trend.

Now, let’s see what this graph looks like for longitude.

Sketch 4 – Simple Scatterplot of Degrees East Longitude Versus Percentage Vegetarian

This one is even better!  The trend is much more clearly visible.  As you go farther east, the likelihood of vegetarianism goes down.  It’s simple and it tells the data’s story fairly well.
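If you wanted to put a number on "how visible" each trend is instead of eyeballing it, a correlation coefficient is one option.  Here is a minimal sketch of Pearson's r.  The data points in it are placeholders I made up so the code runs; they are not the values plotted in Sketches 3 and 4, so you would swap in the real state coordinates and percentages.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# PLACEHOLDER values for illustration only (not the survey's numbers):
# degrees east longitude and fraction vegetarian for six made-up states.
longitude = [72.0, 76.0, 80.0, 84.0, 88.0, 92.0]
pct_vegetarian = [0.60, 0.55, 0.35, 0.30, 0.10, 0.05]

r = pearson_r(longitude, pct_vegetarian)
print(f"r = {r:.2f}")
```

A value near -1 would mean vegetarianism falls off steadily as you move east, while a value near 0 would mean no linear trend.  Running the same function on latitude and on longitude would give a quick way to compare the two scatterplots.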

In Sketches 3 and 4, I left off the names of the individual states.  It is not necessarily obvious that this is the correct course of action.  It depends on your goal.  If your goal is to show how vegetarianism trends across the country, then you don’t need the state names.  The geography does the work for you.  If you want the user to be able to see the states (because everyone wants to be able to identify what the story is in their state), then you might want something a little more like Sketch 5, where moving the mouse over a particular point brings up information about that Indian state.

Sketch 5 – Scatterplot of Longitude versus Vegetarianism with Mouseover Info

By taking advantage of interactivity, we do not have to pollute our graph with extra data-ink, and we are able to provide the information, if desired, to the reader.  We wouldn’t want that much information on every circle, especially since a few of the circles are close to each other.  This allows us to keep our graphic clean, while providing the ability to go in depth.  This was something that Tufte did not have the advantage of in his earlier writing.

Another thought: should this be a 3-D graphic?  From all of these graphs, we know that the trends vary as you go north-south and as you go east-west.  A three-dimensional graph could be interesting to show this trend; however, we know many three-dimensional graphs fail at this miserably.  I would prefer to stick with these two separate graphs.

Conclusion

So, let’s see how I did against the seven recommendations for improvement:

  1. The data visualization should account for the fact that different states have different populations which are not currently represented.
    • Achieved!  We used different areas to represent population.
  2. Use a visualization method that does not have the ambiguity caused by the overlapping circles.
    • Achieved! We chose to only represent the percentage of vegetarians, because India is mainly known for being a vegetarian country.  The percentage non-vegetarian is implied, as this was a binary choice.
  3. Display only the pertinent half of the data.
    • Achieved! We used the vegetarian half.  We could just as well have used the non-vegetarian half, and would have had the same results.  I should also note that I ultimately made the same assumption/compromise the original creator did and averaged the male and female rates, because while it introduced a little bit of error into our graph, it’s not significant enough to matter, since the male and female values are fairly close.
  4. When accounting for the population centers (as in Recommendation 1) attempt to retain the geographic preference for non-vegetarianism.
    • Achieved! We used latitude and longitude on our two scatterplot x-axes.  Understanding the relationship between geography and vegetarianism is retained.
  5. Select colors that can be universally understood.
    • Achieved! By going to the scatterplot, we only needed one color.  So, whether these charts are printed in black and white or viewed by a color-blind individual, the data should still be readable in the blue that was utilized.
  6. If a map is used again, increase the contrast between the outline of India and the background color.
    • Achieved! Our final graphs did not use the map at all.  In Sketches 1 and 2, when we did use the map, we went with a white background and a simple grey outline for the map of India, allowing the focus to be on the circles.  I would highly recommend doing that again for future maps.
  7. Labels need to be applied consistently to all data points.
    • Achieved! The only labels I went with are on the axes and title.

My big takeaway from this is that while making big fancy graphics is tempting and a lot of fun, the graphics that tell the best story might be the simplest ones.

For future improvement, while this graphic is nice, clean, and simple, it lacks a Wow! factor.  This graph is not likely to be shared, which is one of the goals of online content.  Right now, the goal of this project is to make better visualizations.  I’ll focus on making better visualizations that are also Wow!-worthy in the future.

If you have any comments or questions, please feel free to place them below.  I’d love to discuss this more with you.  If you have a suggestion for another case study, please let me know!

Thanks!
Derrick

References:

  1. http://www.huffingtonpost.in/2016/06/14/how-india-eats_n_10434374.html
  2. http://www.censusindia.gov.in/vital_statistics/BASELINE%20TABLES08082016.pdf
  3. https://en.wikipedia.org/wiki/List_of_states_and_union_territories_of_India_by_population