Predictive Modeling: The Only Guide You'll Need


Predictive modeling is useful because it can provide data-driven insight into a wide range of questions and allows users to create forecasts. To maintain a competitive advantage, it is critical to have insight into future events and outcomes that challenge key assumptions.

Analytics professionals often use data from the following sources to feed predictive models:

  • Transaction data
  • CRM data
  • Customer service data
  • Survey or polling data
  • Digital marketing and advertising data
  • Economic data
  • Demographic data
  • Machine-generated data (for example, telemetric data or data from sensors)
  • Geographical data
  • Web traffic data

Analytics leaders must align predictive modeling initiatives with an organization’s strategic goals. For example, a computer chip manufacturer might set a strategic priority to produce chips with the greatest number of transistors in the industry by 2025. To support that goal, analytics professionals could construct a predictive model that forecasts the number of transistors per chip, feeding it product, geography, sales, and other related trend data. Additional sources could include data about the most transistor-dense chips, commercial demand for computing power, and strategic partnerships between chip manufacturers and hardware manufacturers. Once initiatives are in motion, analytics professionals can perform backward-looking analyses to assess the accuracy of predictive models and the success of the initiatives.
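A backward-looking accuracy check can be as simple as comparing past forecasts with what actually happened. The sketch below is illustrative only: the figures are invented, and mean absolute error (MAE) and mean absolute percentage error (MAPE) are two common metric choices, not ones prescribed here.

```python
def mean_absolute_error(actual, forecast):
    """Average size of the forecast miss, in the same units as the data."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mean_absolute_percentage_error(actual, forecast):
    """Average miss as a percentage of the actual value (actuals must be non-zero)."""
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical quarterly figures, invented for illustration.
actual = [100.0, 120.0, 150.0, 200.0]
forecast = [110.0, 115.0, 160.0, 180.0]

mae = mean_absolute_error(actual, forecast)
mape = mean_absolute_percentage_error(actual, forecast)
print(f"MAE: {mae:.2f}, MAPE: {mape:.2f}%")
```

Tracking a metric like this over time gives a simple signal of whether a model is still accurate enough to act on.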

Analysts must organize data to align with a model so computers can create forecasts and outputs for hypothesis tests. BI tools provide insights in the form of dashboards, visualizations, and reports. A process should be put in place to ensure continued improvement. Important things to consider when integrating predictive models into business practices include:

  • Benchmark analysis
  • Data-gathering
  • Data-cleansing
  • Analysis
  • Evaluating goals and KPIs
  • Creating action plans based on analysis
  • Executing on plans
  • Streamlining processes

Predictive Modeling and Data Analytics

Of the four types of data analytics, predictive modeling is most closely related to the predictive analytics category. The four types of data analytics are:

Descriptive Analytics

Descriptive analytics describes the data. For example, a software-as-a-service (SaaS) company sold 2,000 licenses in Q2 and 1,000 licenses in Q1. Descriptive analytics answers the question of how many licenses were sold in Q1 vs. Q2.

Diagnostic Analytics

Diagnostic analytics is the why behind descriptive analytics. To use the previous example, diagnostic analytics takes data a step further. A data analyst can drill down into quarterly software license sales, examine sales and marketing efforts within each region, and compare them against sales growth. They could also see if a sales increase was a result of high-performing salespeople or rising interest within a certain industry.
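The difference between the descriptive and diagnostic steps can be sketched in a few lines of Python. Only the quarterly totals (1,000 and 2,000 licenses) come from the example above; the regional breakdown and region names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical license sales records: (quarter, region, licenses sold).
# The regional split is invented; only the quarterly totals match the text.
sales = [
    ("Q1", "North", 400), ("Q1", "South", 350), ("Q1", "West", 250),
    ("Q2", "North", 450), ("Q2", "South", 400), ("Q2", "West", 1150),
]

# Descriptive step: total licenses sold per quarter.
totals = defaultdict(int)
for quarter, _region, licenses in sales:
    totals[quarter] += licenses

# Diagnostic step: drill down to see which region drove the change.
by_region = defaultdict(dict)
for quarter, region, licenses in sales:
    by_region[region][quarter] = licenses

growth = {region: q["Q2"] - q["Q1"] for region, q in by_region.items()}
top_region = max(growth, key=growth.get)

print(dict(totals))
print(growth)
print(f"Largest contributor to growth: {top_region}")
```

In this made-up data, almost all of the growth comes from a single region, which is the kind of finding that would prompt a closer look at that region's sales and marketing activity.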

Predictive Analytics

Predictive analytics uses techniques such as machine learning and data mining to predict what might happen next. It cannot predict the future with certainty, but it can look at existing data and determine a likely outcome. Data analysts can build predictive models once they have enough data to support reliable predictions. Predictive analytics differs from data mining because the latter focuses on discovery of the hidden relationships between variables, whereas the former applies a model to determine likely outcomes. A SaaS company could model historical sales data against marketing expenditures across each region to create a prediction model for future revenue based on marketing spend.
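As a minimal sketch of the revenue example, the code below fits a straight line to revenue against marketing spend using ordinary least squares and then predicts revenue for a planned budget. The numbers are invented, and a single-predictor linear model is an assumption made to keep the example short; a real model would likely use more variables and more data.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical history: marketing spend and revenue, in thousands of dollars.
spend = [10.0, 20.0, 30.0, 40.0, 50.0]
revenue = [65.0, 95.0, 125.0, 160.0, 180.0]

intercept, slope = fit_line(spend, revenue)

# Apply the fitted model to a planned (hypothetical) budget.
planned_spend = 60.0
predicted_revenue = intercept + slope * planned_spend

print(f"revenue ~ {intercept:.1f} + {slope:.2f} * spend")
print(f"Predicted revenue at spend {planned_spend:.0f}: {predicted_revenue:.1f}")
```

The prediction is only as good as the assumption that the historical relationship between spend and revenue continues to hold.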

Prescriptive Analytics

Prescriptive analytics takes the final step and offers a recommendation based on a predicted outcome. Once a predictive model is in place, it can recommend actions based on historical data, external data sources, and machine learning algorithms.
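One way to picture the prescriptive step is as a search over possible actions, scored by a predictive model. In the sketch below, the function `predicted_revenue` is a hypothetical stand-in for a fitted model (its square-root shape is an assumption chosen to show diminishing returns), and the candidate budgets are invented.

```python
import math


def predicted_revenue(spend):
    """Stand-in for a fitted predictive model (hypothetical, diminishing returns)."""
    return 40.0 * math.sqrt(spend)


def recommend_spend(candidates):
    """Prescriptive step: pick the budget with the highest predicted profit."""
    return max(candidates, key=lambda s: predicted_revenue(s) - s)


# Hypothetical budget options, in thousands of dollars.
candidate_budgets = [0, 100, 200, 400, 600, 900]
best = recommend_spend(candidate_budgets)

for budget in candidate_budgets:
    profit = predicted_revenue(budget) - budget
    print(f"spend {budget:>4}: predicted profit {profit:7.1f}")
print(f"Recommended spend: {best}")
```

The recommendation is not the largest budget: past a certain point, the model predicts that extra spend costs more than the revenue it brings in.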


What are the Types of Predictive Models?

Broadly speaking, predictive models fall into two camps: parametric and non-parametric. Although these terms might seem like technical jargon, the essential difference is that parametric models make more assumptions and more specific assumptions about the characteristics of the population used in creating the model. Specifically, some of the different types of predictive models are:

  • Ordinary Least Squares
  • Generalized Linear Models (GLM)
  • Logistic Regression
  • Random Forests
  • Decision Trees
  • Neural Networks
  • Multivariate Adaptive Regression Splines (MARS)

Each of these types has a particular use and answers a specific question or uses a certain type of dataset. Despite the methodological and mathematical differences among the model types, the overall goal of each is similar: to predict future or unknown outcomes based on data about past outcomes.
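The parametric versus non-parametric distinction can be illustrated with a small contrast. The sketch below fits a straight line (a parametric model, which assumes a linear relationship) and a k-nearest-neighbors average (a common non-parametric method, used here as a stand-in because it is short to write; it is not one of the models listed above). The data is invented and deliberately curved, so the straight-line assumption is wrong for it.

```python
def fit_line(x, y):
    """Parametric: assume y = intercept + slope * x and estimate both numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def knn_predict(x_train, y_train, x_new, k=2):
    """Non-parametric: average the outcomes of the k closest training points."""
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical curved data: y grows with the square of x.
x_train = list(range(10))
y_train = [x ** 2 for x in x_train]

x_new = 4.5
true_value = x_new ** 2

intercept, slope = fit_line(x_train, y_train)
linear_pred = intercept + slope * x_new
knn_pred = knn_predict(x_train, y_train, x_new, k=2)

print(f"true: {true_value}, linear: {linear_pred:.2f}, k-NN: {knn_pred:.2f}")
```

Here the non-parametric method tracks the curve more closely, but that is a property of this particular data. When the parametric assumptions do hold, parametric models typically need less data and are easier to interpret.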

What are the Benefits of Predictive Modeling?

At its core, predictive modeling significantly reduces the cost required for companies to forecast business outcomes, environmental factors, competitive intelligence, and market conditions. Here are a few of the ways that the use of predictive modeling can provide value:

  • Demand forecasting
  • Workforce planning and churn analysis
  • Forecasting of external factors
  • Analysis of competitors
  • Fleet or equipment maintenance
  • Modeling credit or other financial risks

What are the Biggest Challenges of Predictive Modeling?

Predictive models and technologies promise huge benefits, but that doesn’t mean these benefits come seamlessly. In fact, predictive modeling presents a number of challenges in practice. These challenges include:

  • Sufficiently large and comprehensive datasets
  • Adaptability of models to new problems
  • Data organization and hygiene
  • Data privacy and security

The Future of Predictive Modeling

The future of predictive modeling is, undoubtedly, closely tied to artificial intelligence. As computing power continues to increase, data collection rises exponentially, and new technologies and methods are born, computers will bear the brunt of the load when it comes to creating models. The global management consulting firm McKinsey and Co. recently studied future trends, some of which are detailed below.

Technological Advancements

Partially due to recent advancements in computing power and data quantities, predictive modeling technologies have produced regular newsworthy breakthroughs. Predictive algorithms are becoming extremely sophisticated in many fields, notably computer vision, complex games, and natural language.

Changes in Work

With more intelligent computers, the work of predictive modeling professionals, much like with other occupations, will change to adapt to newly available predictive technology. People who work in predictive modeling will not likely become obsolete, but their roles will shift in a way that complements new predictive technological features and abilities, and they will need to acquire new skills to excel in these new roles.

Risk Mitigation

Advances in predictive technology are extremely promising in terms of commercial and scientific value creation, but they do require risk mitigation as well. Some of these risks center on data privacy and security. With exponential increases in data volume, the importance of protecting data from hackers and mitigating other privacy concerns increases as well. Additionally, researchers point out the risk of hardwiring overt and unconscious societal biases into predictive models and algorithms, an issue that will be of great importance to policymakers and big technology companies.

The Limitations of Predictive Modeling

Despite its numerous high-value benefits, predictive modeling certainly has its limitations. Unless certain conditions are met, predictive modeling may not provide the entirety of its potential value. In fact, if these conditions are not met, predictive models may not provide any value over legacy methods or conventional wisdom. It is important to consider these limitations to capture the maximum amount of value from predictive modeling initiatives. According to McKinsey and Co., which recently analyzed use cases, value creation, and limitations, here are some of the challenges:

Data Labeling

Especially in machine learning, in which a computer constructs the predictive model, data must be labeled and categorized appropriately. This process can be imprecise, full of errors, and generally a colossal undertaking. However, it is a necessary component of constructing a model, and, if proper classification and labeling cannot be completed, any predictive model produced will suffer from poor performance and issues associated with improper categorization.

Obtaining Massive Training Datasets

In order for statistical methods to be consistently successful at predicting outcomes, a basic tenet needs to be met: sufficient sample size. If a predictive modeling professional doesn’t have sufficient amounts of data to construct the model, the model produced will be unduly influenced by noise in the data that is used. Of course, relatively small datasets tend to exhibit more variation or, in other words, more noise. Currently, the number of records required to reach sufficiently high model performance ranges from the thousands to the millions. In addition to size, the data used must be representative of the target population. If the sample size is large enough, the data should have a wide variety of records, including unique or odd cases, to refine the model.
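The link between sample size and noise can be seen in a short simulation. The sketch below repeatedly draws samples from the same made-up population (a normal distribution with mean 50 and standard deviation 10, both assumptions) and measures how much the estimated average jumps around from sample to sample.

```python
import random
import statistics


def spread_of_estimates(sample_size, trials=200, seed=0):
    """Standard deviation of the sample mean across repeated samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(trials):
        sample = [rng.gauss(50.0, 10.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


small = spread_of_estimates(10)
large = spread_of_estimates(1000)

print(f"Spread of the estimate with   10 records: {small:.2f}")
print(f"Spread of the estimate with 1000 records: {large:.2f}")
```

Estimates built on the small samples vary far more than those built on the large ones, which is the noise problem described above in miniature.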

The Explainability Problem

As more complex and esoteric models and methodologies become available, it will often be a great challenge to untangle models to determine why a certain decision or prediction was made. As models take in more data records or more variables, the factors that could explain predictions become murky, a significant limitation in some fields. In industries or use cases that require explainability, such as environments that have significant legal or regulatory consequences, the need to document processes and decisions can hinder the use of complex models. This limitation will likely drive demand for new methodologies that can handle huge data volumes and complexities while also remaining transparent in decision making.

Generalizability of Learning

Generalizability refers to the ability of the model to be generalized from one use case to another. Unlike humans, models tend to struggle with generalizability, also known as external validity. In general, when a model is constructed for a particular case, it should not be used for a different case. Although methods like transfer learning, an approach that attempts to remedy this very issue, are in development, generalizability remains a significant limitation of predictive modeling.
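A simplified way to see this limitation is to build a model under one set of conditions and apply it under another. In the sketch below, a straight line is fitted to invented data from a narrow range of inputs and then evaluated on a very different range. This is only one narrow form of the generalizability problem (extrapolating beyond the training data), chosen because it is easy to demonstrate.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def outcome(x):
    """Hypothetical true relationship, unknown to the model."""
    return x ** 2


# Original use case: inputs between 0 and 4.
x_original = [0, 1, 2, 3, 4]
intercept, slope = fit_line(x_original, [outcome(x) for x in x_original])


def predict(x):
    return intercept + slope * x


# New use case: inputs the model never saw.
x_new = [8, 9, 10]

in_range_error = mean_absolute_error(
    [outcome(x) for x in x_original], [predict(x) for x in x_original])
out_of_range_error = mean_absolute_error(
    [outcome(x) for x in x_new], [predict(x) for x in x_new])

print(f"Error on the original use case: {in_range_error:.2f}")
print(f"Error on the new use case:      {out_of_range_error:.2f}")
```

The model looks reasonable on the case it was built for and fails badly on the new one, even though nothing about the model itself changed.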

Bias in Data and Algorithms

Though it’s more of an ethical or philosophical issue than a technical one, some argue that researchers and professionals creating predictive models must be careful when choosing which data to use and which to exclude. Because historical biases can be engrained at the lowest level of data, great care must be taken when attempting to address these biases, or their repercussions could be perpetuated into the future by predictive models.

Predictive Modeling Tools

Apache Hadoop

Recognized in the technology industry for its distinctive yellow elephant logo, Apache Hadoop, commonly referred to as Hadoop, is a collection of open source software utilities that are designed to help a network of computers work together on tasks that involve massive quantities of data. Hadoop mainly functions as a storage and processing utility. The processing utility is a MapReduce programming model. Hadoop can also refer to a number of additional software packages in the Apache Hadoop ecosystem. These packages include:

  • Apache Pig
  • Apache Hive
  • Apache Phoenix
  • Apache HBase
  • Apache Spark
  • Apache Zookeeper
  • Cloudera Impala
  • Apache Flume
  • Apache Sqoop
  • Apache Oozie
  • Apache Storm

Hadoop has become extremely useful and important in the field of predictive modeling, especially for models or problems that require big data storage. Predictive modeling professionals with skills or expertise in the Hadoop ecosystem, especially MapReduce and packages like Apache Hive, can command a salary premium for those skills.
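The MapReduce programming model mentioned above can be sketched in plain Python. This is not Hadoop's API and runs in a single process; it only shows the shape of the idea using the classic word-count example, with function names and sample documents chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input; in a cluster, each document could sit on a different machine.
documents = ["big data needs big storage", "data drives models"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(pairs))
print(word_counts)
```

Because each document is mapped independently and each key is reduced independently, the same logic can be spread across many machines, which is what makes the model suitable for very large datasets.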


R

R is an open-source programming language for statistical computing and graphics. Analysts will require technical skills to work efficiently with this tool. It includes capabilities such as linear regression, nonlinear modeling, and time-series tests. Use cases include:

  • Employee churn analysis (e.g., how does age affect churn?)
  • Data cleansing
  • Data organization
  • Predictive analysis (e.g., is an employee who has a history of moving to new jobs likely to repeat that?)
  • Correlation studies


Python

Python is a high-level, general-purpose programming language. While R was built specifically for statistics, Python exceeds R when it comes to data mining, imaging, and data flow capabilities. It’s more versatile than R and more commonly used with other programs. Python is generally easier to learn than R and is best used for task automation.


MicroStrategy

MicroStrategy is an enterprise analytics and mobility platform that includes R, Python, and Google Analytics integration. It has 60+ data source connectors, so analysts can gain insights by blending disparate data. This data can be output into data visualizations and dashboard reports to gain insights quickly, and can be easily shared throughout the organization. MicroStrategy also includes advanced analytics capabilities, including predictive analytics, with over 300 native analytics functions and support for open source and third-party statistical programs.

Careers in Predictive Modeling

Predictive modeling is a field poised for high growth in the coming years due to the explosion of data, technological advances, and its proven ability to add value. In fact, in 2017, IBM forecasted that demand for data science and analytics professionals would grow by 15% by the year 2020.

While many companies know they need to apply predictive modeling to their businesses, there is currently a shortage of candidates with the appropriate skillsets. Because of this, businesses have offered substantial salaries to qualified applicants in order to lure them away from competitors or other jobs. While the number of qualified candidates is increasing, the demand for such professionals is growing at a significant rate.


Some common job titles include:

  • Data Scientist
  • Statistician
  • Predictive Analytics Analyst
  • Senior Predictive Analytics Analyst
  • Predictive Modeler
  • Analytics Developer
  • Data Analyst
  • Business Analyst
  • Predictive Analytics Consultant
  • Business Intelligence Manager
  • Forecasting Analyst
  • Advanced Analytics Manager


Commonly required skills include:

  • Machine learning
  • Python programming
  • R programming
  • SQL programming
  • Stata programming
  • MATLAB programming
  • Hadoop
  • Communication skills

How Much Do Predictive Modeling Professionals Make?

Salaries vary depending on a candidate’s background and the company’s need, but data science skills translate into higher salaries. Some of the skills that pull higher salaries are MapReduce, Apache Hive, and Apache Hadoop.