The Chicago Cubs took home the World Series title last night by knocking off the Cleveland Indians in dramatic fashion—ending a 108-year drought and the “Curse of the Billy Goat” in the process. As a lifelong baseball fan I’ve been following the series closely, but as someone who works in the data analytics space I couldn’t help but wonder if I could use data to tell the story of the series.
Long story short, I came into the office this morning and built the dashboard below that looks at win probability statistics to measure the ebb and flow of the series. You can check it out here, or keep reading below to learn more about how I put it together.
I started digging around looking for data on the series, and came across an interesting dataset on Fangraphs (a great site for all things baseball). The dataset is a play log that captures information on every play in every game (here’s the game log for Game 7), and has some pretty amazing information in it including written descriptions of every play and some advanced win probability statistics like Win Expectancy (WE) and Win Probability Added (WPA).
Information for each game was in a different table, so I manually copied and pasted the information into Excel and then brought it into MicroStrategy Desktop to analyze. I had an idea of what I wanted to do, but quickly realized that nothing in the data recorded the order of the plays within a given game (i.e. 'play 1' versus 'play 2'), so there would be no way for me to visualize the data in chronological order. So I went back to Excel and quickly enriched the dataset with two additional columns (Game and Play).
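I did this step by hand in Excel, but the same enrichment can be sketched in a few lines of Python. This is only an illustration: the function name `add_game_and_play` and the sample WE values are made up, and it assumes each game's plays are already listed in the order they happened.

```python
def add_game_and_play(games):
    """Combine per-game play lists into one table with Game and Play columns.

    `games` is a list of games; each game is a list of play dicts that are
    assumed to already be in chronological order.
    """
    rows = []
    for game_number, plays in enumerate(games, start=1):
        for play_number, play in enumerate(plays, start=1):
            # Prepend the two new columns to the existing play data.
            rows.append({"Game": game_number, "Play": play_number, **play})
    return rows


# Made-up sample data, not actual values from the series.
sample_games = [
    [{"WE": 0.52}, {"WE": 0.48}],  # game 1: two plays
    [{"WE": 0.55}],                # game 2: one play
]

combined = add_game_and_play(sample_games)
```

With Game and Play in place, every row can be sorted and plotted in chronological order within its game.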
Now I was able to create a line graph that broke the data down by game and plotted the plays along the horizontal axis in the order they occurred. That gave me a bird's-eye view of WE over time for each game and a good sense of the ebb and flow of the series.
But that got me thinking: could I make a visualization that shows the plays with the biggest impact on a team's chances of winning? At first this seemed tricky because the values for WPA are both positive and negative, but then I realized that I could create a derived metric that takes the absolute value of WPA.
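Outside of MicroStrategy, the idea behind that derived metric looks roughly like the sketch below. The column name "Abs WPA", the function name, and the sample numbers are all my own placeholders, not anything from the dashboard itself.

```python
def add_abs_wpa(plays):
    """Return copies of the plays with an extra 'Abs WPA' column.

    Taking the absolute value puts a big swing in either direction
    on the same scale, regardless of its sign.
    """
    return [{**play, "Abs WPA": abs(play["WPA"])} for play in plays]


# Made-up sample data, not actual values from the series.
sample_plays = [
    {"Game": 1, "Play": 1, "WPA": -0.25},
    {"Game": 1, "Play": 2, "WPA": 0.5},
]

with_abs = add_abs_wpa(sample_plays)
```

A play with a WPA of -0.25 and one with a WPA of 0.25 now both count as a shift of 0.25.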
I then created a second derived metric to rank these values and used it to add thresholds to the visualization I created, making it easy to spot the plays that caused the 10 biggest shifts in Win Expectancy throughout the World Series.
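The ranking step can be sketched the same way. Again, this is an illustration rather than what MicroStrategy does under the hood: `rank_by_abs_wpa` and `top_plays` are hypothetical names, the sample WPA values are invented, and the threshold is modeled as a simple "Rank <= n" filter.

```python
def rank_by_abs_wpa(plays):
    """Rank plays by the size of their WPA swing, largest first (rank 1)."""
    ordered = sorted(plays, key=lambda play: abs(play["WPA"]), reverse=True)
    return [{**play, "Rank": rank} for rank, play in enumerate(ordered, start=1)]


def top_plays(plays, n=10):
    """Keep only the plays whose rank falls within the top n."""
    return [play for play in rank_by_abs_wpa(plays) if play["Rank"] <= n]


# Made-up sample data, not actual values from the series.
sample_plays = [
    {"Game": 7, "Play": 1, "WPA": -0.02},
    {"Game": 7, "Play": 2, "WPA": 0.40},
    {"Game": 7, "Play": 3, "WPA": -0.25},
    {"Game": 7, "Play": 4, "WPA": 0.10},
]

biggest_two = top_plays(sample_plays, n=2)
```

Here `biggest_two` holds the 0.40 and -0.25 plays, since those are the two largest swings in either direction.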
If you're interested in more of this type of data-driven sports analysis, I urge you to check out FiveThirtyEight and Fangraphs. If you're looking for raw data to analyze, Sports-Reference.com is a great resource for baseball and other sports like football, basketball, hockey, and soccer.
Want to quickly create your own dashboards and visualizations? Download MicroStrategy Desktop today to get started!