Unit 3
PREDICTIVE & PRESCRIPTIVE ANALYTICS
What is data mining?
Data mining is the process of searching and analyzing large volumes of raw data in order
to identify patterns and extract useful information. Companies use data mining software
to learn more about their customers; it can help them develop more effective
marketing strategies, increase sales, and decrease costs.
Data Mining Applications
Healthcare:
•Disease prediction and early diagnosis (e.g., cancer prediction)
•Patient clustering for personalized treatment
•Drug discovery
•Electronic Health Record (EHR) analysis
Telecommunications:
•Network optimization
•Customer churn prediction
•Fraud detection
•Service quality management
Marketing and Customer Relationship Management (CRM):
•Customer segmentation and targeting
•Predictive modeling for customer churn
•Market basket analysis (e.g., Amazon recommending products)
•Customer lifetime value prediction
Retail and E-commerce:
•Sales forecasting
•Inventory management
•Pricing strategy optimization
•Personalized recommendations
Strategy for Data Mining: CRISP-DM
CRISP-DM (Cross-Industry Standard Process for Data Mining) is a popular and well-established methodology
used to guide the data mining process. It provides a structured, iterative framework for planning and executing data
mining projects. CRISP-DM is flexible and can be applied to any industry or domain.
•Business Understanding: Determine business objectives, assess the situation, and create a project plan.
•Data Understanding: Collect, describe, explore, and verify data quality.
•Data Preparation: Select, clean, construct, integrate, and format data.
•Modeling: Select a modeling technique, generate a test design, build the model, and assess the model.
•Evaluation: Evaluate results, review the process, and determine next steps.
•Deployment: Plan deployment, plan monitoring and maintenance, produce a final report, and review the project.
Stages in CRISP-DM
1.Business Understanding:
1. The first step focuses on understanding the business objectives and requirements. It includes identifying
the problem the organization is facing and formulating data mining goals to address that problem.
2. Example activities: defining the project objectives, assessing the situation, defining success criteria.
2.Data Understanding:
1. Once the business problem is understood, the next step involves collecting data from various sources and
gaining a deep understanding of the data.
2. Example activities: data collection, data description, exploring data, identifying data quality issues.
3.Data Preparation:
1. This stage involves transforming raw data into a format that can be used for modeling. It is typically the
most time-consuming step and includes data cleaning, integration, and transformation.
2. Example activities: data selection, data cleaning, constructing new attributes, data transformation.
4.Modeling:
1. In this phase, various modeling techniques are selected and applied to the data. Models are fine-tuned to
improve accuracy and performance.
2. Example activities: selecting modeling techniques, designing test cases, building models, tuning parameters.
5.Evaluation:
1. After models are built, they are evaluated against the business objectives. This step determines if the model
meets the desired success criteria and if it is ready for deployment.
2. Example activities: evaluating model performance, reviewing the process, determining the next steps.
6.Deployment:
1. The final stage is deploying the model to a production environment, making it available for business use. This
may involve generating reports, automating predictions, or integrating the model into a decision-making
system.
2. Example activities: deployment planning, monitoring, maintenance, creating final reports.
Life Cycle of a Data Mining Project
The data mining life cycle follows an iterative and cyclic pattern, closely aligning with the CRISP-DM methodology.
It can be broken down into several phases:
1.Problem Identification: Establishing the problem or business challenge.
2.Data Collection and Exploration: Gathering data and performing an exploratory analysis to understand patterns,
correlations, and anomalies.
3.Data Preparation: Cleaning and pre-processing the data for the modeling phase.
4.Model Development: Selecting the appropriate data mining techniques and developing predictive, descriptive, or
classification models.
5.Model Testing and Validation: Ensuring the model is accurate and reliable through testing on unseen data.
6.Model Deployment: Implementing the model in a real-world scenario, enabling decision-making based on data
insights.
7.Model Monitoring and Maintenance: Ensuring the model continues to perform accurately over time and making
adjustments if necessary.
Data-Mining Successes
Data mining has yielded many successful applications across various industries. Below are a few notable examples:
1.Targeted Marketing and CRM: Companies like Amazon and Netflix use data mining to recommend products and
movies to their users, leading to higher customer satisfaction and increased sales. By analyzing customer purchase
history and browsing patterns, Amazon can suggest products that users are more likely to buy; this recommendation
system has increased sales and customer retention.
2.Fraud Detection: Banks and financial institutions use data mining techniques to detect and prevent fraudulent
transactions. For example, Visa uses data mining algorithms to analyze transactional data in real time to flag
potential fraud, saving millions of dollars in losses.
3.Healthcare: Data mining is used to predict patient outcomes, improve treatment plans, and assist in diagnosing
diseases like cancer. For example, IBM Watson Health uses data mining to assist doctors in diagnosing rare
diseases by analyzing large datasets of medical literature, patient histories, and treatment outcomes.
Data-Mining Failures
Despite many successes, data mining has its pitfalls and notable failures, often due to poor implementation or
ethical concerns.
1.Overfitting: This occurs when a model is too complex and performs exceptionally well on training data but
poorly on new, unseen data. It fails to generalize, which can result in inaccurate predictions in real-world
applications (a short sketch after this list illustrates the effect).
2.Data Quality Issues: Poor quality data can lead to biased or incorrect models. For example, missing or
inaccurate data points could cause a model to make flawed predictions.
3.Misalignment with Business Goals: Even if the model is technically sound, it can fail if it doesn't align with
the organization's business objectives. An example could be focusing too much on precision without considering
how predictions impact actual business outcomes.
4.Google Flu Trends:
1. Failure Story: Google Flu Trends was a data-mining project designed to predict flu outbreaks based on
search queries. Initially promising, it eventually failed because it overestimated flu prevalence by
relying on search trends that weren't always indicative of actual flu activity. This failure highlighted the
importance of combining data-driven insights with real-world data validation.
5.Facebook’s Emotion Manipulation Study:
1. Failure Story: In 2012, Facebook conducted an experiment to manipulate users' emotions by altering
the content they saw in their news feeds. The study sparked outrage for unethical experimentation on
users without consent. This highlighted the potential ethical failures in data mining projects.
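To make the overfitting failure above concrete, here is a minimal scikit-learn sketch on synthetic data with purely random labels (so there is no real pattern to learn): an unconstrained decision tree scores perfectly on its training set but only around chance on unseen data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with purely random labels: there is no real signal to learn
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# An unconstrained decision tree can memorize the training set exactly
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

print(f"Training accuracy: {tree.score(X_train, y_train):.2f}")  # ~1.00
print(f"Test accuracy:     {tree.score(X_test, y_test):.2f}")    # ~0.50, i.e., chance
```

The large gap between training and test accuracy is the signature of overfitting: the model has memorized noise rather than learned a generalizable pattern.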
Skills Needed for Data Mining
To succeed in data mining, professionals need a combination of technical, analytical, and domain-specific skills:
1.Statistical and Mathematical Skills:
1. Knowledge of algorithms: Understanding key data mining algorithms (e.g., decision trees, clustering,
regression) and their statistical foundations.
2. Probability and statistics: Essential for analyzing data, making predictions, and validating models.
2.Programming and Software Skills:
1. Languages: Proficiency in programming languages such as Python, R, Java, or SQL is essential for
data manipulation and implementing algorithms.
2. Tools: Familiarity with data mining and machine learning tools such as Scikit-learn, TensorFlow,
Weka, SAS, or KNIME.
3.Database and Data Management Skills:
1. SQL/NoSQL databases: Ability to work with databases to extract and manipulate large datasets.
2. Big Data technologies: Experience with tools like Hadoop, Spark, or NoSQL databases (e.g.,
MongoDB, Cassandra) for working with large, unstructured datasets.
4.Machine Learning and AI Knowledge:
1. Supervised and unsupervised learning: Understanding various machine learning paradigms, including
clustering, classification, and regression.
2. Deep learning: Experience with neural networks and frameworks like PyTorch and Keras can be crucial for
advanced data mining tasks.
5.Domain Knowledge:
1. Business acumen: Understanding the business context is essential for translating data insights into actionable
business strategies.
2. Specific industry knowledge: For instance, knowledge of healthcare, finance, or retail can help apply data
mining insights more effectively.
6.Critical Thinking and Problem-Solving:
1. Analytical mindset: Data miners must be able to think critically to identify patterns, trends, and anomalies in
data, and to develop strategies based on insights.
7.Communication Skills:
1. Data storytelling: The ability to convey insights effectively to non-technical stakeholders through visualization
tools (e.g., Tableau, Power BI) and clear reporting.
Chi-Square Test Overview
A Chi-Square Test is a statistical method used to determine whether there is a significant association between two
categorical variables. It compares the observed frequencies (the actual data) to the expected frequencies (what we
would expect to see if there were no relationship between the variables). There are two main types of Chi-Square tests:
1.Chi-Square Test of Independence: This test checks if two categorical variables are independent of each other.
2.Chi-Square Goodness-of-Fit Test: This test checks how well the observed data matches an expected distribution.
Chi-Square Statistic
The Chi-Square Statistic measures how much the observed data deviate from the expected data under the null
hypothesis (which assumes no relationship between the variables). It is calculated using the following formula:
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
Where:
•Oᵢ = the observed frequency for category i
•Eᵢ = the expected frequency for category i
The larger the Chi-Square statistic, the greater the discrepancy between observed and expected frequencies, potentially
leading to rejecting the null hypothesis.
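As a quick illustration, the following minimal Python sketch (the observed and expected counts are made up for the example) computes the statistic directly from the formula above:

```python
import numpy as np

# Hypothetical observed and expected frequencies for four categories
observed = np.array([50, 30, 15, 5])
expected = np.array([40, 35, 18, 7])

# Chi-square statistic: sum over categories of (O_i - E_i)^2 / E_i
chi_square = np.sum((observed - expected) ** 2 / expected)
print(f"Chi-square statistic: {chi_square:.3f}")
```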
Chi-Square P-Values
Once the Chi-Square statistic is calculated, it is compared against a Chi-Square Distribution to determine the p-value.
The p-value helps in deciding whether to reject the null hypothesis:
•A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, suggesting that the variables are
related.
•A large p-value (> 0.05) suggests that the data do not provide sufficient evidence of a relationship between the
variables.
Chi-Square Distribution & Chi Distribution
The Chi-Square Distribution is a right-skewed distribution used in hypothesis testing, especially with the Chi-Square test.
It is defined by its degrees of freedom (df), which are typically determined by the number of categories or variables in
the test. (The related Chi distribution describes the positive square root of a Chi-Square-distributed variable.)
For example, in a test of independence on a contingency table with r rows and c columns, the degrees of freedom are
calculated as:
df = (r − 1) × (c − 1)
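Putting the statistic, the p-value, and the degrees of freedom together: the sketch below uses SciPy's chi2_contingency on a small hypothetical 2×3 contingency table (the counts are invented for illustration).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table: rows = customer segment, columns = preferred channel
table = np.array([[30, 20, 10],
                  [25, 35, 20]])

chi2, p_value, dof, expected = chi2_contingency(table)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")  # (2 - 1) * (3 - 1) = 2

# Decision at the 0.05 significance level
if p_value <= 0.05:
    print("Reject the null hypothesis: the variables appear to be related.")
else:
    print("Fail to reject the null hypothesis: no evidence of an association.")
```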
Data Mining Project Framework
Data mining involves extracting valuable patterns from large datasets. A typical data mining project follows these steps:
1.Business Understanding: Define the business problem or goal clearly. For example, predicting customer churn, fraud
detection, or optimizing marketing strategies.
2.Data Understanding: Collect and explore the data, identifying potential variables, relationships, and data quality issues.
3.Data Preparation: Clean, transform, and preprocess data, such as handling missing values, encoding categorical variables,
and normalizing features (see the sketch after this list).
4.Modeling: Apply appropriate data mining techniques (e.g., classification, clustering, regression).
5.Evaluation: Assess model performance, ensuring that it meets the business objectives and provides valuable insights.
6.Deployment: Use the model in the real-world application, such as integrating it into a business process or making
automated decisions.
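As a rough illustration of step 3 (the column names and values below are hypothetical), a pandas/scikit-learn sketch for handling missing values, encoding a categorical variable, and normalizing numeric features might look like this:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical customer dataset; column names are for illustration only
df = pd.DataFrame({
    "age": [25, 32, None, 45],
    "plan": ["basic", "premium", "basic", "premium"],
    "monthly_spend": [20.0, 55.0, 18.0, None],
})

# Handle missing values: fill numeric gaps with the column median
df["age"] = df["age"].fillna(df["age"].median())
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# Encode the categorical variable as one-hot columns
df = pd.get_dummies(df, columns=["plan"])

# Normalize the numeric features to zero mean and unit variance
scaler = StandardScaler()
df[["age", "monthly_spend"]] = scaler.fit_transform(df[["age", "monthly_spend"]])

print(df)
```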
Business Case: Historical Data Example
Suppose a retail company wants to predict customer churn using historical customer data. The company has past records
of customers, their demographics, purchase history, complaints, and service interactions. The goal is to build a predictive
model to classify which customers are likely to churn and focus marketing efforts on retaining them.
1.Business Understanding: The problem is to reduce customer churn by identifying potential churners early.
2.Data Understanding: The dataset contains customer demographics (age, income), service details (number of
complaints, purchase frequency), and the target variable (churn or not).
3.Data Preparation: The dataset needs to be cleaned by removing duplicates, handling missing values, and encoding the
target variable (yes/no) as 1/0.
4.Modeling: Logistic regression or decision tree models can be used to predict customer churn.
5.Evaluation: The model's accuracy, precision, recall, and F1-score are evaluated to determine if it is effective.
6.Deployment: The model is deployed to flag customers likely to churn, so the marketing team can take action.
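A minimal scikit-learn sketch of steps 4 and 5, using a tiny synthetic stand-in for the company's historical data (the feature names and values are hypothetical, so the resulting model quality is not meaningful):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Tiny synthetic stand-in for the historical customer dataset
df = pd.DataFrame({
    "age":                [25, 40, 33, 51, 29, 60, 45, 38],
    "income":             [30000, 70000, 52000, 90000, 41000, 65000, 80000, 48000],
    "num_complaints":     [3, 0, 1, 0, 4, 2, 0, 5],
    "purchase_frequency": [2, 12, 8, 15, 1, 6, 10, 2],
    "churn":              [1, 0, 0, 0, 1, 0, 0, 1],  # target already encoded as 1/0
})

X = df.drop(columns="churn")
y = df["churn"]

# Hold out unseen data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit a logistic regression model to predict churn
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate with accuracy, precision, recall, and F1-score
print(classification_report(y_test, model.predict(X_test), zero_division=0))
```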
Data-Mining Project in IBM SPSS Modeler
IBM SPSS Modeler is a powerful tool for conducting data-mining projects, offering a visual interface and automated
workflows for building models. To conduct a data-mining project in SPSS Modeler:
1.Set Up the Project: Load the dataset into SPSS Modeler by connecting to your data source (e.g., CSV, database).
2.Prepare the Data: Use nodes such as the Type node to set the roles of variables. Specify which variables are inputs
(features) and which is the target (outcome).
3.Build the Model: Use various algorithms such as Decision Trees, Neural Networks, or Logistic Regression. Apply
appropriate models depending on the problem.
4.Evaluate the Model: Test the model's accuracy using a test set and validate using metrics such as accuracy or ROC
curves.
Build the Model: Set Roles in a Type Node
In IBM SPSS Modeler, the Type node is critical for defining the role of each field in the data. You can specify fields
as:
•Input: Features used to predict the target.
•Target: The variable you are predicting (e.g., customer churn).
•ID: Identifier fields (e.g., customer ID), not used in modeling.
•None: Fields that are ignored for the analysis.
By setting the roles appropriately, you ensure that SPSS applies the correct operations to each field during modeling.
Handy Operations: Filter Fields and Sort Records
Two helpful operations in SPSS Modeler are:
1.Filter Fields: The Filter node lets you remove irrelevant or redundant fields that could negatively impact model
performance. For instance, if some features are too correlated or irrelevant to the target variable, filtering them out helps.
2.Sort Records: The Sort node organizes data records by a specified field, such as sorting customers by the number of
purchases or the date of their last interaction. Sorting can help identify patterns or trends and can improve the clarity of
the analysis.
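These are visual nodes in SPSS Modeler rather than code, but for readers who work in Python, a rough pandas analogue of the two operations (with hypothetical column names) is:

```python
import pandas as pd

# Hypothetical customer records
df = pd.DataFrame({
    "customer_id":   [101, 102, 103],
    "num_purchases": [5, 12, 3],
    "last_contact":  pd.to_datetime(["2024-01-05", "2024-03-20", "2023-11-30"]),
    "redundant_col": ["x", "y", "z"],
})

# "Filter Fields": drop irrelevant or redundant columns
df = df.drop(columns=["redundant_col"])

# "Sort Records": order customers by number of purchases, descending
df = df.sort_values(by="num_purchases", ascending=False)

print(df)
```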