According to the World Economic Forum, the world produces 2.5 quintillion bytes of data every day, and 90% of all data has been created in the last two years. With so much data, it’s become increasingly difficult to manage and make sense of it all. It would be impossible for any single person to wade through data line by line, spot distinct patterns, and make observations. Data proliferation can be managed as part of the data science process, which includes data visualization.
Data visualization can provide insight that traditional descriptive statistics cannot. A perfect example of this is Anscombe’s Quartet, created by Francis Anscombe in 1973. The illustration includes four different datasets with almost identical variance, mean, correlation between X and Y coordinates, and linear regression lines. However, the patterns are clearly different when plotted on a graph. Below, you can see a linear regression model would apply to graphs one and three, but a polynomial regression model would be ideal for graph two. This illustration highlights why it’s important to visualize data and not just rely on descriptive statistics.
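The near-identical summary statistics can be checked directly. The sketch below uses the published quartet values and only Python's standard library; the helper name `summarize` is ours, chosen for illustration.

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return mean and sample variance of x and y, Pearson r, and the least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx, "var_x": variance(xs),
        "mean_y": my, "var_y": variance(ys),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }

summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

All four rows print nearly the same numbers, which is exactly why the differences only show up once the points are plotted.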
Faster Decision Making
Companies that can gather and quickly act on their data will be more competitive in the marketplace because they can make informed decisions sooner than the competition. Speed is key, and data visualization aids in the understanding of vast quantities of data by applying visual representations to the data. This visualization layer typically sits on top of a data warehouse or data lake and allows users to discover and explore data in a self-service manner. Not only does this spur creativity, but it also reduces the need for IT to allocate resources to continually build new models.
For example, say a marketing analyst who works across 20 different ad platforms and internal systems needs to quickly understand the effectiveness of marketing campaigns. A manual way to do this would be to go to each system, pull a report, combine the data, and then analyze it in Excel. The analyst would then need to look at a swarm of metrics and attributes and would have difficulty drawing conclusions. However, modern business intelligence (BI) platforms will automatically connect the data sources and layer on data visualizations so the analyst can slice and dice the data with ease and quickly come to conclusions about marketing performance.
Let’s say you’re a retailer and you want to compare sales of jackets to sales of socks over the course of the previous year. There’s more than one way to present the data, and tables are one of the most common. Here’s what this would look like:
The table above does an excellent job of showing precise values if this information is needed. However, it’s difficult to instantaneously see trends and the story the data tells.
Now here’s the data in a line graph visualization:
From the visualization, it becomes immediately obvious that sales of socks remain constant, with small spikes in December and June. On the other hand, sales of jackets are more seasonal, and reach their low point in July. They then rise and peak in December before decreasing monthly until right before fall. You could get this same story from looking at the table, but it would take you much longer. Imagine trying to make sense of a table with thousands of data points.
To understand the science behind data visualization, we must first discuss how humans gather and process information. In collaboration with Amos Tversky, Daniel Kahneman did extensive research on how we form thoughts, and concluded that we use one of two methods:
System I describes thought processing that is fast, automatic, and unconscious. We use this method quite frequently in our everyday lives and can accomplish the following:
System II describes thought that is slow, logical, infrequent, and calculating, and includes:
With these two systems of thinking defined, Kahneman explains why humans struggle to think in terms of statistics. He asserts that System I thinking relies on heuristics and biases to handle the volume of stimuli we encounter daily. An example of heuristics at work is a judge who sees a case only in terms of historical cases, despite nuances and differences unique to the new case. Further, he defined the following biases:
Anchoring refers to a tendency to be swayed by irrelevant numbers. For example, skilled negotiators exploit this bias by opening with a lower price (the anchor) than they expect to end up with and then settling slightly above the anchor.
Availability refers to the fact that the frequency at which events occur in our mind is not an accurate reflection of their actual probability. This is a mental shortcut: we assume that events that are easily remembered are more likely to occur.
Substitution refers to our tendency to substitute difficult questions with simpler ones. This bias is also famously called the conjunction fallacy or “Linda Problem.” This example asks the question:
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.
Which is more probable?
1) Linda is a bank teller
2) Linda is a bank teller and is active in the feminist movement
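Most participants in the study chose option two, even though this violates the laws of probability: two events occurring together can never be more probable than either event alone, so option two cannot be more likely than option one. In their minds, option two was more representative of Linda, so they used the substitution principle to answer the question.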
Optimism and loss aversion
Kahneman believed that this may be the most significant bias we have. Optimism and loss aversion give us the illusion of control because we tend to deal only with the possibility of known outcomes that have been observed. We often don’t consider known unknowns or completely unforeseen outcomes. Our neglect of this complexity explains why we use a small sample size to make strong assumptions about future outcomes.
Framing refers to the context in which choices are presented. For example, more subjects were inclined to opt for a surgery if it was framed by a 90% survival rate as opposed to a 10% mortality rate.
The sunk-cost bias is often seen in the investing world when people continue to invest in an under-performing asset with poor prospects instead of getting out of the investment and into an asset with a more favorable outlook.
With Systems I and II, along with biases and heuristics, in mind, we should seek to ensure that data is presented in a way that correctly communicates to our System I thought process. This allows our System II thought process to analyze data accurately. Our unconscious System I can process about 11 million pieces of information per second, whereas our conscious mind can process only about 40.
We must also look at how each system utilizes our senses to take in information. According to Tor Nørretranders’ “The User Illusion”, the visual sense processes the most information in both systems:
Since our subconscious system processes more information through vision, data visualization is a perfect solution to communicate patterns and insights from data sets. When someone sees a visualization of data, it takes less than 500 milliseconds for the eye and the brain to process what are called pre-attentive visual properties of an image. In Information Visualization: Perception for Design, Colin Ware defines four pre-attentive visual properties:
These four components make up the composition of each data visualization and should be carefully considered for presentation.
Line charts are among the most basic and commonly used visualizations. They show a change in one or more variables over time.
When to use: You need to show how a variable changes over time.
A variation of line charts, area charts display multiple values in a time series.
When to use: You need to show cumulative changes in multiple variables over time.
Bar charts are like line charts, but they use bars to represent each data point.
When to use: Bar charts are best used when you need to compare multiple variables in a single timeframe or a single variable in a time series.
Population pyramids are stacked bar graphs that depict the complex social narrative of a population.
When to use: You need to show the distribution of a population.
Pie charts show the parts of a whole in the form of a pie.
When to use: You want to see parts of a whole on a percentage basis. However, many experts recommend using other formats instead because the human eye takes longer to process data in this format. Many argue that a bar chart or line graph makes more sense.
Tree maps are a way to display hierarchical data in a nested format. The size of each rectangle is proportional to its category’s percentage of the whole.
When to use: These are most useful when you want to compare parts of a whole and have many categories.
These compare an expected value vs. the actual value for a given variable.
When to use: You need to compare expected and actual values for a single variable. The above example shows the number of items sold per category vs. the expected number. You can easily see that sweaters underperformed expectations more than any other category, while dresses and shorts overperformed.
Scatter plots show the correlation between two variables by plotting each data point as a dot against an X and a Y axis.
When to use: You want to see the correlation between two variables.
Histograms plot the number of times an event occurs within a given data set and present the counts in a bar graph format.
When to use: You want to find the frequency distribution of a given dataset. For example, you wish to see the relative likelihood of selling 300 items in a day given historical performance.
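The counting behind a histogram can be sketched in a few lines. This is a minimal illustration, not a charting implementation: the function name `histogram`, the bin width of 50, and the `daily_sales` figures are all assumptions made up for the example.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall into each bin of width `bin_width`.

    Returns {bin_start: count}, ordered by bin_start.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical daily unit sales, invented for illustration only.
daily_sales = [212, 305, 298, 340, 275, 310, 190, 330, 301, 260, 315, 289]

bins = histogram(daily_sales, bin_width=50)
for start, count in bins.items():
    # One '#' per day that landed in this bin: a text-mode bar.
    print(f"{start}-{start + 49}: {'#' * count}")

# Share of days on which 300 or more items were sold.
share_300_plus = sum(c for start, c in bins.items() if start >= 300) / len(daily_sales)
print(share_300_plus)
```

The share of historical days in the 300-and-up bins gives a rough read on how likely such a day is, under the assumption that future days resemble past ones.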
Box plots are non-parametric visualizations that display a measure of dispersion. The box represents the middle 50% of data points (the second and third quarters of the data, between the first and third quartiles), and the line within the box represents the median. The two lines extending outside the box are called whiskers; they cover the lowest and highest quarters of the data and extend to the minimum and maximum values.
When to use: You want to see the distribution of one or more datasets. These are used instead of histograms when space needs to be minimized.
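The five numbers a box plot draws can be computed with Python's standard library. The sketch below assumes the "inclusive" quartile method of `statistics.quantiles`; other tools use slightly different quartile conventions, and many plotting libraries cap the whiskers and draw outliers separately instead of extending to the minimum and maximum. The sample data is hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical daily unit sales, invented for illustration only.
daily_sales = [289, 190, 305, 260, 340, 212, 310, 275, 301]

low, q1, median, q3, high = five_number_summary(daily_sales)
print(f"whisker: {low} to {q1}")                # lowest quarter of the data
print(f"box:     {q1} to {q3} (median {median})")  # middle 50%
print(f"whisker: {q3} to {high}")               # highest quarter of the data
```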
Bubble charts are like scatter plots but add more functionality because the size and/or color of each bubble represents additional data.
When to use: You have three variables to compare.
A heat map is a graphical representation of data in which each individual value is contained within a matrix. The shades represent a quantity as defined by the legend.
When to use: These are useful when you want to analyze a variable across a matrix of data, such as a timeframe of days and hours. The different shades allow you to quickly discern the extremes. The above example shows users of a website by day and hour during a week.
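A day-by-hour heat map starts from a matrix of counts. As a rough sketch, the code below buckets visit timestamps into a 7x24 grid and maps each count to a shade; the names `visits_matrix` and `shade`, the shade characters, and the timestamps are all hypothetical.

```python
from datetime import datetime

SHADES = " .:*#"  # lightest to darkest, an arbitrary text-mode palette

def visits_matrix(timestamps):
    """Return a 7x24 grid where grid[weekday][hour] is the number of visits."""
    grid = [[0] * 24 for _ in range(7)]  # weekday 0 is Monday
    for ts in timestamps:
        grid[ts.weekday()][ts.hour] += 1
    return grid

def shade(count, peak):
    """Map a count to a shade character, scaled against the busiest cell."""
    if peak == 0:
        return SHADES[0]
    return SHADES[round(count / peak * (len(SHADES) - 1))]

# Hypothetical visit timestamps, invented for illustration only.
visits = [
    datetime(2024, 1, 1, 9, 15),   # Monday
    datetime(2024, 1, 1, 9, 40),   # Monday
    datetime(2024, 1, 2, 14, 5),   # Tuesday
    datetime(2024, 1, 6, 20, 30),  # Saturday
    datetime(2024, 1, 8, 9, 5),    # Monday of the following week
]

grid = visits_matrix(visits)
peak = max(max(row) for row in grid)
for row in grid:
    print("".join(shade(count, peak) for count in row))
```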
Choropleth visualizations are a variation of heat maps where the shading is applied to a geographic map.
When to use: You need to compare a dataset by geographic region.
The Sankey diagram is a type of flow diagram in which the width of the arrows is displayed proportionally to the quantity of the flow.
When to use: You need to visualize the flow of a quantity. The example above is a famous depiction of Napoleon’s army as it invaded Russia during a cold winter. The army begins as a large mass but dwindles as it moves towards Moscow and then retreats.
These display complex relationships between entities. They show how each entity is connected to the others to form a network.
When to use: You need to compare the relationships within a network. These are especially useful for large networks. The above shows the network of flight paths for Southwest Airlines.
Data visualization is used in many disciplines and impacts how we see the world daily. It’s increasingly important to be able to react and make decisions quickly in both business and public services. We compiled a few examples of how data visualization is commonly used below.
According to research by the media agency Magna, half of all global advertising dollars will be spent online by 2020. Because of this, marketers need to stay on top of how their web properties are creating revenue along with their sources of web traffic. Visualizations can be used to easily see how traffic has trended over time as a result of marketing efforts.
Finance professionals need to track the performance of their investment choices to make decisions to buy or sell a given asset. Candlestick charts show how the price has changed over time, and the finance professional can use them to spot trends. The top of each candlestick’s wick marks the highest price within a period of time and the bottom marks the lowest, while the body spans the opening and closing prices. In the example, the green candlesticks show when the price went up and the red ones show when it went down. The visualization can communicate the change in price more easily than a grid of data points.
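Each candlestick condenses one period's prices into four numbers and a color. A minimal sketch, assuming the prices for a period arrive in chronological order; the function name `to_candle` and the sample prices are hypothetical.

```python
def to_candle(prices):
    """Summarize one period's chronologically ordered prices as a candlestick."""
    open_, close = prices[0], prices[-1]
    return {
        "open": open_,
        "high": max(prices),   # top of the candlestick
        "low": min(prices),    # bottom of the candlestick
        "close": close,
        # Assumption: a flat period (close == open) is colored green.
        "color": "green" if close >= open_ else "red",
    }

# Hypothetical intraday prices for two trading days.
day_one = to_candle([10.0, 10.8, 9.7, 10.5])
day_two = to_candle([10.5, 10.6, 9.9, 10.1])
print(day_one)
print(day_two)
```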
The most recognized visualization in politics is a geographic map which shows the party each district or state voted for.
Shipping companies use visualization software to understand global shipping routes.
Healthcare professionals use choropleth visualizations to see important health data. The map below shows the mortality rate of heart disease by county in the U.S.