Retailers use data mining to better understand their customers. It allows them to segment market groups more precisely and offer customized promotions to different consumers.
Banks deploy data mining models to predict a borrower’s ability to take on and repay debt. Using a variety of demographic and personal information, these models automatically select an interest rate based on the level of risk assigned to the client. Applicants with better credit scores generally receive lower interest rates since the model uses this score as a factor in its assessment.
Financial institutions implement data mining models to automatically detect and stop fraudulent transactions. This form of computer forensics happens behind the scenes with each transaction and sometimes without the consumer knowing about it. By tracking spending habits, these models will flag aberrant transactions and instantly withhold payments until customers verify the purchase. Data mining algorithms can work autonomously to protect consumers from fraudulent transactions through an email or text notification to confirm a purchase.
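The flagging step described above can be sketched in a few lines. This is only a minimal illustration, assuming a simple z-score rule over a customer's past purchase amounts; the function name `is_aberrant`, the threshold of 3 standard deviations, and the sample amounts are hypothetical and are not taken from any real fraud system.

```python
from statistics import mean, stdev

def is_aberrant(history, amount, threshold=3.0):
    """Return True when `amount` lies more than `threshold` standard
    deviations from the mean of the customer's past purchase amounts."""
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return amount != mu  # every past purchase was identical
    return abs(amount - mu) / sigma > threshold

# Hypothetical spending history for one customer
past_purchases = [42.0, 38.5, 51.0, 45.25, 40.0, 47.75]

print(is_aberrant(past_purchases, 44.0))   # a typical purchase
print(is_aberrant(past_purchases, 900.0))  # far outside the usual range
```

A production system would likely weigh many more signals (merchant, location, time of day), but the principle of comparing a new transaction against learned spending habits is the same.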
Healthcare professionals use statistical models to predict a patient’s likelihood of developing different health conditions based on risk factors. Demographic, family, and genetic data can be modeled to help patients make changes to prevent or mitigate the onset of negative health conditions. These models were recently deployed in developing countries to help diagnose and prioritize patients before doctors arrived on-site to administer treatment.
Data mining is also used to combat an influx of email spam and malware. Systems can analyze the common characteristics of millions of malicious messages to inform the development of security software. Beyond detection, this specialized software can go a step further and remove these messages before they even reach the user’s inbox.
Recommendation systems are now widely used among online retailers. Predictive consumer behavior modeling is now a core focus of many organizations and viewed as essential to compete. Companies like Amazon and Macy’s built their own proprietary data mining models to forecast demand and enhance the customer experience across all touchpoints. Netflix famously offered a one-million-dollar prize for an algorithm that would significantly increase the accuracy of its recommendation system. The winning model improved recommendation accuracy by just over 10%.
Sentiment analysis from social media data is a common application of data mining that utilizes a technique called text mining. This is a method used to gain an understanding of how an aggregate group of people feel towards a topic. Text mining involves using an input from social media channels or another form of public content to gain key insights as a result of statistical pattern recognition. Taken a step further, natural language processing (NLP) techniques can be used to find the contextual meaning behind the human language used.
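As a rough sketch of the idea, the snippet below scores short posts against a tiny word list and averages the result to estimate how a group feels about a topic. The `POSITIVE` and `NEGATIVE` lexicons and the sample posts are made up for illustration; real text mining relies on far larger vocabularies and on statistical or NLP models rather than simple word counts.

```python
import re

# Hypothetical, deliberately tiny sentiment lexicons
POSITIVE = {"love", "great", "good", "happy", "excellent"}
NEGATIVE = {"hate", "bad", "terrible", "slow", "broken"}

def sentiment_score(text):
    """Count positive words minus negative words in one piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def aggregate_sentiment(posts):
    """Average the per-post scores to summarize the group's sentiment."""
    scores = [sentiment_score(p) for p in posts]
    return sum(scores) / len(scores) if scores else 0.0

posts = [
    "I love the new app, great update!",
    "Checkout is slow and the search is broken.",
    "Good service, happy customer.",
]

print([sentiment_score(p) for p in posts])
print(aggregate_sentiment(posts))
```

A word-count approach like this ignores negation and sarcasm, which is where the contextual NLP techniques mentioned above come in.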
Qualitative research can be structured and then analyzed using text mining techniques to make sense of large sets of unstructured data. An in-depth look at how this has been used to study child welfare was published by researchers at Berkeley.
The first step is establishing what the goals of the project are and how data mining can help you reach them. A plan should be developed at this stage to include timelines, actions, and role assignments.
Data is collected from all applicable data sources in this step. Data visualization tools are often used in this stage to explore the properties of the data to ensure it will help achieve the business goals.
Data is then cleansed, and missing values are filled in to ensure it is ready to be mined. Data processing can take enormous amounts of time depending on the amount of data analyzed and the number of data sources. Therefore, distributed systems are used in modern database management systems (DBMS) to improve the speed of the data mining process rather than burden a single system. They’re also more secure than having all of an organization’s data in a single data warehouse. It’s important to include failsafe measures in the data manipulation stage so data is not permanently lost.
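One common way to handle gaps during this preparation step is mean imputation. The sketch below assumes records stored as dictionaries with `None` marking a missing value; the `impute_missing` helper and the sample records are hypothetical. It returns new records instead of modifying the originals, which is one simple way to honor the failsafe point above.

```python
from statistics import mean

def impute_missing(rows, column):
    """Return copies of `rows` where None in `column` is replaced by the
    mean of the observed values. The input rows are left untouched."""
    observed = [r[column] for r in rows if r[column] is not None]
    if not observed:
        raise ValueError(f"no observed values in column {column!r}")
    fill = mean(observed)
    return [
        {**r, column: fill if r[column] is None else r[column]}
        for r in rows
    ]

# Hypothetical customer records with one missing income
records = [
    {"customer": "a", "income": 52000},
    {"customer": "b", "income": None},
    {"customer": "c", "income": 61000},
    {"customer": "d", "income": 49000},
]

cleaned = impute_missing(records, "income")
print(cleaned[1])
print(records[1])  # the raw record still shows the gap
```

Mean imputation is only one option; depending on the data, dropping incomplete rows or using a median may be more appropriate.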
Mathematical models are then used to find patterns in the data using sophisticated data tools.
The findings are evaluated and compared to business objectives to determine if they should be deployed across the organization.
In the final stage, the data mining findings are shared across everyday business operations. An enterprise business intelligence platform can be used to provide a single source of the truth for self-service data discovery.
Data mining allows organizations to continually analyze data and automate both routine and critical decisions without the delay of human judgment. Banks can instantly detect fraudulent transactions, request verification, and even secure personal information to protect customers against identity theft. Deployed within a firm’s operational algorithms, these models can collect, analyze, and act on data independently to streamline decision making and enhance the daily processes of an organization.
Accurate Prediction and Forecasting
Planning is a critical process within every organization. Data mining facilitates planning and provides managers with reliable forecasts based on past trends and current conditions. Macy’s implements demand forecasting models to predict the demand for each clothing category at each store and route the appropriate inventory to efficiently meet the market’s needs.
Data mining allows for more efficient use and allocation of resources. Organizations can plan and make automated decisions with accurate forecasts that will result in maximum cost reduction. Delta embedded RFID chips in passengers’ checked baggage and deployed data mining models to identify holes in its process and reduce the number of bags mishandled. This process improvement increases passenger satisfaction and decreases the cost of searching for and re-routing lost baggage.
Firms deploy data mining models from customer data to uncover key characteristics and differences among their customers. Data mining can be used to create personas and personalize each touchpoint to improve overall customer experience. In 2017, Disney invested over one billion dollars to create and implement “Magic Bands.” These bands have a symbiotic relationship with consumers, working to increase their overall experience at the resort while simultaneously collecting data on their activities for Disney to analyze to further enhance their customer experience.
The challenges of big data are prolific and penetrate every field that collects, stores, and analyzes data. Big data is characterized by four major challenges: volume, variety, veracity, and velocity. The goal of data mining is to address these challenges and unlock the data’s value.
Volume describes the challenge of storing and processing the enormous quantity of data collected by organizations. This enormous amount of data presents two major challenges: first, it is more difficult to find the correct data, and second, it slows down the processing speed of data mining tools.
Variety encompasses the many different types of data collected and stored. Data mining tools must be equipped to simultaneously process a wide array of data formats. Failing to focus an analysis on both structured and unstructured data inhibits the value added by data mining.
Velocity details the increasing speed at which new data is created, collected, and stored. While volume refers to increasing storage requirements and variety refers to the increasing types of data, velocity is the challenge associated with the rapidly increasing rate of data generation.
Finally, veracity acknowledges that not all data is equally accurate. Data can be messy, incomplete, improperly collected, and even biased. As with anything, the quicker data is collected, the more errors will manifest within the data. The challenge of veracity is to balance the quantity of data with its quality.
Over-fitting occurs when a model explains the natural errors within the sample instead of the underlying trends of the population. Over-fitted models are often overly complex and utilize an excess of independent variables to generate a prediction. Therefore, the risk of over-fitting is heightened by the increase in volume and variety of data. Too few variables make the model irrelevant, whereas too many variables restrict the model to the known sample data. The challenge is to moderate the number of variables used in data mining models and balance their predictive power with accuracy.
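The effect can be seen in a small experiment. In the sketch below, a "memorizing" model that simply returns the outcome of the closest training point stands in for an overly complex model, and a straight-line fit stands in for a simpler one. This is an illustrative assumption, not a standard benchmark: the data is synthetic, with a true trend of y = 2x and alternating noise of ±0.5 in the training sample only.

```python
def fit_line(points):
    """Least-squares line through (x, y) points; returns a predictor."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def memorize(points):
    """Predictor that returns the y of the nearest training x, so it
    reproduces the training noise exactly."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def mse(model, points):
    """Mean squared error of `model` on (x, y) points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

# Synthetic sample: trend y = 2x with alternating +/- 0.5 noise
train = [(1, 2.5), (2, 3.5), (3, 6.5), (4, 7.5), (5, 10.5), (6, 11.5)]
# New observations that follow the underlying trend
test = [(1.4, 2.8), (2.4, 4.8), (3.4, 6.8), (4.4, 8.8), (5.4, 10.8)]

line = fit_line(train)
memorizer = memorize(train)

print("train error:", mse(line, train), mse(memorizer, train))
print("test error: ", mse(line, test), mse(memorizer, test))
```

The memorizer is perfect on the sample it was built from yet clearly worse on new observations, while the simpler line gives up some training accuracy and generalizes better.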
Cost of Scale
As data velocity continues to increase data’s volume and variety, firms must scale these models and apply them across the entire organization. Unlocking the full benefits of data mining with these models requires significant investment in computing infrastructure and processing power. To reach scale, organizations must purchase and maintain powerful computers, servers, and software designed to handle the firm’s large quantity and variety of data.
Privacy and Security
The increased storage requirement of data has forced many firms to turn toward cloud computing and storage. While the cloud has empowered many modern advances in data mining, the nature of the service creates significant privacy and security threats. Organizations must protect their data from malicious figures to maintain the trust of their partners and customers.
With data privacy comes the need for organizations to develop internal rules and constraints on the use and implementation of a customer’s data. Data mining is a powerful tool that provides businesses with compelling insights into their consumers. However, at what point do these insights infringe on an individual’s privacy? Organizations must weigh this relationship with their customers, develop policies to benefit consumers, and communicate these policies to the consumers to maintain a trustworthy relationship.
The goal of supervised learning is prediction or classification. The easiest way to conceptualize this process is to look for a single output variable. A process is considered supervised learning if the goal of the model is to predict the value of an observation. One example is spam filters, which use supervised learning to classify incoming emails as unwanted content and automatically remove these messages from your inbox.
Common analytical models used in supervised data mining approaches are:
Linear regressions predict the value of a continuous variable using one or more independent inputs. Realtors use linear regressions to predict the value of a house based on square footage, bed-to-bath ratio, year built, and zip code.
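A minimal sketch of the realtor example, assuming a single input (square footage) and made-up prices that happen to lie on a straight line. The `fit_line` helper uses the standard least-squares formulas for one predictor; a real model would include the other inputs mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical listings: square footage and sale price
square_feet = [1000, 1500, 2000, 2500]
prices = [200_000, 275_000, 350_000, 425_000]

slope, intercept = fit_line(square_feet, prices)

def predict_price(sqft):
    """Predicted price for a house of the given size."""
    return slope * sqft + intercept

print(slope, intercept)
print(predict_price(1800))
```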
Logistic regressions predict the probability of a categorical variable using one or more independent inputs. Banks use logistic regressions to predict the probability that a loan applicant will default based on credit score, household income, age, and other personal factors.
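The loan example can be sketched with a one-variable logistic regression trained by plain gradient descent. Everything specific here is an assumption for illustration: the credit scores and outcomes are invented, the `scale` helper centers scores at 670 only to keep the arithmetic well behaved, and the learning rate and epoch count are arbitrary choices.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit weight w and bias b by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def scale(score):
    """Center and shrink a credit score (hypothetical scaling)."""
    return (score - 670) / 100

# Hypothetical applicants: credit score and whether they defaulted (1 = yes)
scores = [580, 600, 620, 640, 700, 720, 740, 760]
defaulted = [1, 1, 1, 0, 1, 0, 0, 0]

w, b = fit_logistic([scale(s) for s in scores], defaulted)

def default_probability(score):
    """Estimated probability that an applicant with this score defaults."""
    return sigmoid(w * scale(score) + b)

print(default_probability(590))
print(default_probability(750))
```

The fitted weight comes out negative, so the estimated default probability falls as the credit score rises.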
Time series models are forecasting tools which use time as the primary independent variable. Retailers, such as Macy’s, deploy time series models to predict the demand for products as a function of time and use the forecast to accurately plan and stock stores with the required level of inventory.
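One of the simplest forecasting techniques of this kind is simple exponential smoothing, sketched below. The weekly demand figures and the smoothing factor `alpha` are hypothetical, and this basic form ignores trend and seasonality, which a retailer's production model would need to capture.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """One-step-ahead forecast: each new observation pulls the running
    level toward itself by a fraction `alpha`."""
    if not series:
        raise ValueError("series must not be empty")
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical weekly unit demand for one product
weekly_demand = [100, 110, 120, 130]

print(exponential_smoothing_forecast(weekly_demand))             # 121.25
print(exponential_smoothing_forecast(weekly_demand, alpha=1.0))  # 130.0
```

A higher `alpha` reacts faster to recent changes, while a lower one smooths out short-term noise.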
Classification or Regression Trees
Classification and regression trees are a predictive modeling technique that can be used to predict the value of both categorical and continuous target variables. Based on the data, the model will create sets of binary rules to split and group the highest proportion of similar target variables together. Following those rules, the group that a new observation falls into will become its predicted value.
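To make the binary rule idea concrete, the sketch below finds the single best split on one variable, which is the first step a tree takes before repeating the process within each group. It assumes Gini impurity as the splitting criterion (a common choice, though not the only one); the income figures and loan decisions are invented.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try a threshold between each pair of adjacent distinct values and
    keep the one giving the lowest weighted impurity."""
    best_threshold, best_impurity = None, float("inf")
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left)
                    + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

# Hypothetical applicants: income in thousands and the loan decision
incomes = [25, 32, 41, 58, 63, 77]
decisions = ["deny", "deny", "deny", "approve", "approve", "approve"]

threshold, impurity = best_split(incomes, decisions)

def predict(income):
    """Predict the majority decision of the group the applicant falls into."""
    side = [d for v, d in zip(incomes, decisions)
            if (v <= threshold) == (income <= threshold)]
    return max(sorted(set(side)), key=side.count)

print(threshold, impurity)
print(predict(30), predict(70))
```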
A neural network is an analytical model inspired by the structure of the brain, its neurons, and their connections. These models were originally created in the 1940s but have only recently gained popularity with statisticians and data scientists. Each node in a neural network takes inputs and, based on their magnitude, will “fire” or “not fire” depending on its threshold requirement. This signal, or lack thereof, is then combined with the other “fired” signals in the hidden layers of the network, where the process repeats itself until an output is created. Since one of the benefits of neural networks is a near-instant output, self-driving cars are deploying these models to accurately and efficiently process data to autonomously make critical decisions.
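The "fire" or "not fire" behavior can be illustrated with a tiny network of threshold units. In this sketch the weights and thresholds are set by hand (an assumption made purely for illustration; real networks learn them from data) so that two hidden nodes and one output node together compute the exclusive-or of two binary inputs.

```python
def fires(inputs, weights, threshold):
    """A node fires (returns 1) when the weighted sum of its inputs
    reaches its threshold, and stays silent (returns 0) otherwise."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

def xor_network(a, b):
    """Two hidden nodes feed one output node."""
    h_or = fires([a, b], [1, 1], threshold=1)    # fires if either input is on
    h_and = fires([a, b], [1, 1], threshold=2)   # fires only if both are on
    # Output fires when h_or fired but h_and did not
    return fires([h_or, h_and], [1, -1], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```

No single threshold node can compute this function on its own, which is why the hidden layer matters.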
The k-nearest neighbor method is used to categorize a new observation based on past observations. Unlike the previous methods, k-nearest neighbor is data-driven, not model-driven. This method makes no underlying assumptions about the data, nor does it employ complex processes to interpret its inputs. The basic idea of the k-nearest neighbor model is that it classifies a new observation by identifying its k closest neighbors and assigning it the majority’s value. Many recommender systems nest this method to identify and classify similar content which will later be pulled by the greater algorithm.
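The method is short enough to sketch in full. The example below assumes Euclidean distance and a simple majority vote; the two-dimensional points and the labels "A" and "B" are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label `query` with the majority label among its k closest
    training points, measured by Euclidean distance."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical past observations: (features, label)
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]

print(knn_classify(train, (2, 2)))
print(knn_classify(train, (6, 5)))
```

Because there is no fitted model, all of the work happens at prediction time, when the stored observations are searched for the closest matches.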
Unsupervised tasks focus on understanding and describing data to reveal underlying patterns within it. Recommendation systems employ unsupervised learning to track user patterns and provide them with personalized recommendations to enhance their customer experience.
Common analytical models used in unsupervised data mining approaches are:
Clustering models group similar data together. They are best employed with complex data sets describing a single entity. One example is lookalike modeling, to group similarities between segments, identify clusters, and target new groups who look like an existing group.
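A common clustering technique is k-means, sketched below in one dimension. The spending figures and the starting centers are hypothetical, and k-means is used here only as one example of a clustering model; the text above does not name a specific algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning each value to its nearest center and
    moving each center to the mean of the values assigned to it."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Keep the old center when a cluster ends up empty
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer
monthly_spend = [12, 15, 14, 90, 95, 88]

centers, clusters = kmeans_1d(monthly_spend, centers=[0, 100])
print(centers)
print(clusters)
```

The two resulting groups could then be treated as low-spend and high-spend segments for lookalike targeting.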
Association analysis is also known as market basket analysis and is used to identify items that frequently occur together. Supermarkets commonly use this tool to identify paired products and spread them out in the store to encourage customers to pass by more merchandise and increase their purchases.
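Two standard measures in market basket analysis are support (how often items appear together) and confidence (how often one item appears given another). The sketch below computes both for a handful of invented baskets; the helper names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_support(baskets):
    """Fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c / len(baskets) for pair, c in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the share that also
    contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

# Hypothetical supermarket transactions
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]

support = pair_support(baskets)
print(support[("bread", "butter")])
print(confidence(baskets, "bread", "butter"))
print(confidence(baskets, "milk", "bread"))
```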
Principal Component Analysis
Principal component analysis is used to illustrate hidden correlations between input variables and create new variables, called principal components, which capture the same information contained in the original data, but with fewer variables. By reducing the number of variables used to convey the same level of information, analysts can increase the utility and accuracy of supervised data mining models.
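For two variables, the first principal component can be computed by hand from the covariance matrix, as sketched below. The function name and the sample points are hypothetical, and the closed-form eigenvalue formula used here only applies to the 2x2 case; larger problems would use a linear algebra library.

```python
import math

def first_principal_component(points):
    """Return the unit direction of greatest variance for 2-D points and
    the share of total variance that direction explains."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    trace = sxx + syy
    if trace == 0:
        raise ValueError("points have no variance")
    det = sxx * syy - sxy ** 2
    # Largest eigenvalue of the covariance matrix [[sxx, sxy], [sxy, syy]]
    lam = trace / 2 + math.sqrt(max(trace ** 2 / 4 - det, 0.0))
    if sxy != 0:
        vx, vy = lam - syy, sxy      # matching eigenvector
    elif sxx >= syy:
        vx, vy = 1.0, 0.0            # uncorrelated: axis with more variance
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam / trace

# Two perfectly correlated variables: one component carries everything
points = [(1, 1), (2, 2), (3, 3), (4, 4)]
direction, explained = first_principal_component(points)
print(direction, explained)
```

When the explained share is close to 1, the two original variables can be replaced by a single component with little loss of information.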
Supervised and Unsupervised Approaches in Practice
While you can use each approach independently, it is quite common to use both during an analysis. Each approach has unique advantages, and together they increase the robustness, stability, and overall utility of data mining models. Supervised models can benefit from nesting variables derived from unsupervised methods. For example, a cluster variable within a regression model allows analysts to eliminate redundant variables from the model and improve its accuracy. Because unsupervised approaches reveal the underlying relationships within data, analysts should use the insights from unsupervised learning to springboard their supervised analysis.
Similar to the way that SQL evolved to become the preeminent language for databases, users are beginning to demand a standardization among data mining. This push allows users to conveniently interact with many different mining platforms while only learning one standard language. While developers are hesitant to make this change, as more users continue to support it, we can expect a standard language to be developed within the next few years.
With its proven success in the business world, data mining is being implemented in scientific and academic research. Psychologists now use association analysis to track and identify broader patterns in human behavior to support their research. Economists similarly employ forecasting algorithms to predict future market changes based on present-day variables.
Complex Data Objects
As data mining expands to influence other departments and fields, new methods are being developed to analyze increasingly varied and complex data. Google experimented with a visual search tool, whereby users can conduct a search using a picture as input in place of text. Data mining tools can no longer just accommodate text and numbers; they must have the capacity to process and analyze a variety of complex data types.
Increased Computing Speed
As data size, complexity, and variety increase, data mining tools require faster computers and more efficient methods of analyzing data. Each new observation adds an extra computation cycle to an analysis. As the quantity of data increases exponentially, so does the number of cycles needed to process the data. Statistical techniques, such as clustering, were built to efficiently handle a few thousand observations with a dozen variables. However, with organizations collecting millions of new observations with hundreds of variables, the calculations can become too complex for many computers to handle. As the size of data continues to grow, faster computers and more efficient methods are needed to match the required computing power for analysis.
With the expansion of the internet, uncovering patterns and trends in usage is of great value to organizations. Web mining uses the same techniques as data mining and applies them directly on the internet. The three major types of web mining are content mining, structure mining, and usage mining. Online retailers, such as Amazon, use web mining to understand how customers navigate their webpage. These insights allow Amazon to restructure its platform to improve customer experience and increase purchases.
The proliferation of web content was the catalyst for the World Wide Web Consortium (W3C) to introduce standards for the Semantic Web. This provides a standardized method to use common data formats and exchange protocols on the web. This makes data more easily shared, reused, and applied across regions and systems. This standardization makes it easier to mine large quantities of data for analysis.
RapidMiner is open source software written in Java. RapidMiner is one of the best platforms to conduct predictive analyses and offers integrated environments for deep learning, text mining, and machine learning. The platform can utilize either on-premise or cloud-based servers and has been implemented across a diverse array of organizations. RapidMiner offers a great balance of custom coding features and a user-friendly interface, which allow the platform to be leveraged most effectively by those with a solid foundation in coding and data mining.
Orange is an open source component-based software written in Python. Orange boasts painless data pre-processing features and is one of the best platforms for basic data mining analyses. Orange takes a user-oriented approach to data mining with a unique and user-friendly interface. However, one of the major drawbacks is its limited set of external data connectors. Orange is perfect for organizations looking for user-friendly data mining and who use on-premise storage.
Developed by the Apache Software Foundation, Mahout is an open source platform which focuses on the unsupervised learning process. The software excels at creating machine learning algorithms for clustering, classification, and collaborative filtering. Mahout is geared toward individuals with more advanced backgrounds. The program allows mathematicians, statisticians, and data scientists to create, test, and implement their own algorithms. While Mahout does include several turn-key algorithms, such as a recommender, which organizations can deploy with minimal effort, the larger platform does require a more specialized background to leverage its full capabilities.
MicroStrategy is business intelligence and data analytics software that complements all data mining models. With a wide array of native gateways and drivers, the platform can connect to any enterprise resource and analyze its data. MicroStrategy excels at transforming complex data into accessible visualizations to be distributed across an organization. The software can track and analyze the performance of all data mining models in real time and clearly display these insights for decision-makers. Pairing MicroStrategy with a data mining tool enables users to create advanced data mining models, deploy them across the organization, and make decisions from its insights and performance in the market.