Unit 3
PREDICTIVE & PRESCRIPTIVE ANALYTICS
What is data mining?
Data mining is the process of searching and analyzing large volumes of raw data in order
to identify patterns and extract useful information. Companies use data mining software
to learn more about their customers; it can help them develop more effective
marketing strategies, increase sales, and decrease costs.
Data Mining Applications
Healthcare:
•Disease prediction and early diagnosis (e.g., cancer prediction)
•Patient clustering for personalized treatment
•Drug discovery
•Electronic Health Record (EHR) analysis
Telecommunications:
•Network optimization
•Customer churn prediction
•Fraud detection
•Service quality management
Marketing and Customer Relationship Management (CRM):
•Customer segmentation and targeting
•Predictive modeling for customer churn
•Market basket analysis (e.g., Amazon recommending products)
•Customer lifetime value prediction
Retail and E-commerce:
•Sales forecasting
•Inventory management
•Pricing strategy optimization
•Personalized recommendations
Strategy for Data Mining: CRISP-DM
CRISP-DM (Cross-Industry Standard Process for Data Mining) is a popular and well-established methodology
used to guide the data mining process. It provides a structured, iterative framework for planning and executing data
mining projects. CRISP-DM is flexible and can be applied to any industry or domain.
•Business Understanding: Determine business objectives, assess the situation, and create a project plan.
•Data Understanding: Collect, describe, explore, and verify data quality.
•Data Preparation: Select, clean, construct, integrate, and format data.
•Modeling: Select a modeling technique, generate a test design, build the model, and assess the model.
•Evaluation: Evaluate results, review the process, and determine next steps.
•Deployment: Plan deployment, plan monitoring and maintenance, produce a final report, and review the project.
Stages in CRISP-DM
1.Business Understanding:
1. The first step focuses on understanding the business objectives and requirements. It includes identifying
the problem the organization is facing and formulating data mining goals to address that problem.
2. Example activities: defining the project objectives, assessing the situation, defining success criteria.
2.Data Understanding:
1. Once the business problem is understood, the next step involves collecting data from various sources and
gaining a deep understanding of the data.
2. Example activities: data collection, data description, exploring data, identifying data quality issues.
3.Data Preparation:
1. This stage involves transforming raw data into a format that can be used for modeling. It is typically the
most time-consuming step and includes data cleaning, integration, and transformation.
2. Example activities: data selection, data cleaning, constructing new attributes, data transformation.
4.Modeling:
1. In this phase, various modeling techniques are selected and applied to the data. Models are fine-tuned to
improve accuracy and performance.
2. Example activities: selecting modeling techniques, designing test cases, building models, tuning parameters.
5.Evaluation:
1. After models are built, they are evaluated against the business objectives. This step determines if the model
meets the desired success criteria and if it is ready for deployment.
2. Example activities: evaluating model performance, reviewing the process, determining the next steps.
6.Deployment:
1. The final stage is deploying the model to a production environment, making it available for business use. This
may involve generating reports, automating predictions, or integrating the model into a decision-making
system.
2. Example activities: deployment planning, monitoring, maintenance, creating final reports.
Life Cycle of a Data Mining Project
The data mining life cycle follows an iterative and cyclic pattern, closely aligning with the CRISP-DM methodology.
It can be broken down into several phases:
1.Problem Identification: Establishing the problem or business challenge.
2.Data Collection and Exploration: Gathering data and performing an exploratory analysis to understand patterns,
correlations, and anomalies.
3.Data Preparation: Cleaning and pre-processing the data for the modeling phase.
4.Model Development: Selecting the appropriate data mining techniques and developing predictive, descriptive, or
classification models.
5.Model Testing and Validation: Ensuring the model is accurate and reliable through testing on unseen data.
6.Model Deployment: Implementing the model in a real-world scenario, enabling decision-making based on data
insights.
7.Model Monitoring and Maintenance: Ensuring the model continues to perform accurately over time and making
adjustments if necessary.
Data-Mining Successes
Data mining has yielded many successful applications across various industries. Below are a few notable examples:
1.Targeted Marketing and CRM: Companies like Amazon and Netflix use data mining to recommend products and
movies to their users, leading to higher customer satisfaction and increased sales. By analyzing customer purchase
history and browsing patterns, Amazon can suggest products that users are more likely to buy; this recommendation
system has increased sales and customer retention.
2.Fraud Detection: Banks and financial institutions use data mining techniques to detect and prevent fraudulent
transactions. For example, Visa uses data mining algorithms to analyze transactional data in real time to flag
potential fraud, saving millions of dollars in losses.
3.Healthcare: Data mining is used to predict patient outcomes, improve treatment plans, and assist in diagnosing
diseases like cancer. For example, IBM Watson Health uses data mining to assist doctors in diagnosing rare
diseases by analyzing large datasets of medical literature, patient histories, and treatment outcomes.
Data-Mining Failures
Despite many successes, data mining has its pitfalls and notable failures, often due to poor implementation or
ethical concerns.
1.Overfitting: This occurs when a model is too complex and performs exceptionally well on training data but
poorly on new, unseen data. It fails to generalize, which can result in inaccurate predictions in real-world
applications (a short sketch after this list illustrates the effect).
2.Data Quality Issues: Poor quality data can lead to biased or incorrect models. For example, missing or
inaccurate data points could cause a model to make flawed predictions.
3.Misalignment with Business Goals: Even if the model is technically sound, it can fail if it doesn't align with
the organization's business objectives. An example could be focusing too much on precision without considering
how predictions impact actual business outcomes.
4.Google Flu Trends:
1. Failure Story: Google Flu Trends was a data-mining project designed to predict flu outbreaks based on
search queries. Initially promising, it eventually failed because it overestimated flu prevalence by
relying on search trends that weren't always indicative of actual flu activity. This failure highlighted the
importance of combining data-driven insights with real-world data validation.
5.Facebook’s Emotion Manipulation Study:
1. Failure Story: In 2012, Facebook conducted an experiment to manipulate users' emotions by altering
the content they saw in their news feeds. The study sparked outrage for unethical experimentation on
users without consent. This highlighted the potential ethical failures in data mining projects.
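To make the overfitting failure above concrete, here is a minimal scikit-learn sketch on synthetic data with purely random labels (so there is no real pattern to learn): an unconstrained decision tree scores perfectly on its training set but only around chance on unseen data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with purely random labels: there is no real signal to learn
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# An unconstrained decision tree can memorize the training set exactly
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

print(f"Training accuracy: {tree.score(X_train, y_train):.2f}")  # ~1.00
print(f"Test accuracy:     {tree.score(X_test, y_test):.2f}")    # ~0.50, i.e., chance
```

The large gap between training and test accuracy is the signature of overfitting: the model has memorized noise rather than learned a generalizable pattern.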
Skills Needed for Data Mining
To succeed in data mining, professionals need a combination of technical, analytical, and domain-specific skills:
1.Statistical and Mathematical Skills:
1. Knowledge of algorithms: Understanding key data mining algorithms (e.g., decision trees, clustering,
regression) and their statistical foundations.
2. Probability and statistics: Essential for analyzing data, making predictions, and validating models.
2.Programming and Software Skills:
1. Languages: Proficiency in programming languages such as Python, R, Java, or SQL is essential for
data manipulation and implementing algorithms.
2. Tools: Familiarity with data mining and machine learning tools such as Scikit-learn, TensorFlow,
Weka, SAS, or KNIME.
3.Database and Data Management Skills:
1. SQL/NoSQL databases: Ability to work with databases to extract and manipulate large datasets.
2. Big Data technologies: Experience with tools like Hadoop, Spark, or NoSQL databases (e.g.,
MongoDB, Cassandra) for working with large, unstructured datasets.
4.Machine Learning and AI Knowledge:
1. Supervised and unsupervised learning: Understanding various machine learning paradigms, including
clustering, classification, and regression.
2. Deep learning: Experience with neural networks and frameworks like PyTorch and Keras can be crucial for
advanced data mining tasks.
5.Domain Knowledge:
1. Business acumen: Understanding the business context is essential for translating data insights into actionable
business strategies.
2. Specific industry knowledge: For instance, knowledge of healthcare, finance, or retail can help apply data
mining insights more effectively.
6.Critical Thinking and Problem-Solving:
1. Analytical mindset: Data miners must be able to think critically to identify patterns, trends, and anomalies in
data, and to develop strategies based on insights.
7.Communication Skills:
1. Data storytelling: The ability to convey insights effectively to non-technical stakeholders through visualization
tools (e.g., Tableau, Power BI) and clear reporting.
Chi-Square Test Overview
A Chi-Square Test is a statistical method used to determine whether there is a significant association between two
categorical variables. It compares the observed frequencies (the actual data) to the expected frequencies (what we
would expect to see if there were no relationship between the variables). There are two main types of Chi-Square tests:
1.Chi-Square Test of Independence: This test checks if two categorical variables are independent of each other.
2.Chi-Square Goodness-of-Fit Test: This test checks how well the observed data matches an expected distribution.
Chi-Square Statistic
The Chi-Square Statistic measures how much the observed data deviate from the expected data under the null
hypothesis (which assumes no relationship between the variables). It is calculated using the following formula:
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
Where:
•Oᵢ = the observed frequency for category i
•Eᵢ = the expected frequency for category i
The larger the Chi-Square statistic, the greater the discrepancy between observed and expected frequencies, potentially
leading to rejecting the null hypothesis.
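As a quick illustration, the following minimal Python sketch (the observed and expected counts are made up for the example) computes the statistic directly from the formula above:

```python
import numpy as np

# Hypothetical observed and expected frequencies for four categories
observed = np.array([50, 30, 15, 5])
expected = np.array([40, 35, 18, 7])

# Chi-square statistic: sum over categories of (O_i - E_i)^2 / E_i
chi_square = np.sum((observed - expected) ** 2 / expected)
print(f"Chi-square statistic: {chi_square:.3f}")
```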
Chi-Square P-Values
Once the Chi-Square statistic is calculated, it is compared against a Chi-Square Distribution to determine the p-value.
The p-value helps in deciding whether to reject the null hypothesis:
•A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, suggesting that the variables are
related.
•A large p-value (> 0.05) suggests that the data do not provide sufficient evidence of a relationship between the
variables.
Chi-Square Distribution & Chi Distribution
The Chi-Square Distribution is a right-skewed distribution used in hypothesis testing, especially with the Chi-Square test.
It is defined by its degrees of freedom (df), which are typically determined by the number of categories or variables in
the test. (The related Chi distribution describes the positive square root of a Chi-Square-distributed variable.)
For example, in a test of independence on a contingency table with r rows and c columns, the degrees of freedom are
calculated as:
df = (r − 1) × (c − 1)
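Putting the statistic, the p-value, and the degrees of freedom together: the sketch below uses SciPy's chi2_contingency on a small hypothetical 2×3 contingency table (the counts are invented for illustration).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table: rows = customer segment, columns = preferred channel
table = np.array([[30, 20, 10],
                  [25, 35, 20]])

chi2, p_value, dof, expected = chi2_contingency(table)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")  # (2 - 1) * (3 - 1) = 2

# Decision at the 0.05 significance level
if p_value <= 0.05:
    print("Reject the null hypothesis: the variables appear to be related.")
else:
    print("Fail to reject the null hypothesis: no evidence of an association.")
```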
Data Mining Project Framework
Data mining involves extracting valuable patterns from large datasets. A typical data mining project follows these steps:
1.Business Understanding: Define the business problem or goal clearly. For example, predicting customer churn, fraud
detection, or optimizing marketing strategies.
2.Data Understanding: Collect and explore the data, identifying potential variables, relationships, and data quality issues.
3.Data Preparation: Clean, transform, and preprocess data, such as handling missing values, encoding categorical variables,
and normalizing features (see the sketch after this list).
4.Modeling: Apply appropriate data mining techniques (e.g., classification, clustering, regression).
5.Evaluation: Assess model performance, ensuring that it meets the business objectives and provides valuable insights.
6.Deployment: Use the model in the real-world application, such as integrating it into a business process or making
automated decisions.
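As a rough illustration of step 3 (the column names and values below are hypothetical), a pandas/scikit-learn sketch for handling missing values, encoding a categorical variable, and normalizing numeric features might look like this:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical customer dataset; column names are for illustration only
df = pd.DataFrame({
    "age": [25, 32, None, 45],
    "plan": ["basic", "premium", "basic", "premium"],
    "monthly_spend": [20.0, 55.0, 18.0, None],
})

# Handle missing values: fill numeric gaps with the column median
df["age"] = df["age"].fillna(df["age"].median())
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# Encode the categorical variable as one-hot columns
df = pd.get_dummies(df, columns=["plan"])

# Normalize the numeric features to zero mean and unit variance
scaler = StandardScaler()
df[["age", "monthly_spend"]] = scaler.fit_transform(df[["age", "monthly_spend"]])

print(df)
```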
Business Case: Historical Data Example
Suppose a retail company wants to predict customer churn using historical customer data. The company has past records
of customers, their demographics, purchase history, complaints, and service interactions. The goal is to build a predictive
model to classify which customers are likely to churn and focus marketing efforts on retaining them.
1.Business Understanding: The problem is to reduce customer churn by identifying potential churners early.
2.Data Understanding: The dataset contains customer demographics (age, income), service details (number of
complaints, purchase frequency), and the target variable (churn or not).
3.Data Preparation: The dataset needs to be cleaned by removing duplicates, handling missing values, and encoding the
target variable (yes/no) as 1/0.
4.Modeling: Logistic regression or decision tree models can be used to predict customer churn.
5.Evaluation: The model's accuracy, precision, recall, and F1-score are evaluated to determine if it is effective.
6.Deployment: The model is deployed to flag customers likely to churn, so the marketing team can take action.
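A minimal scikit-learn sketch of steps 4 and 5, using a tiny synthetic stand-in for the company's historical data (the feature names and values are hypothetical, so the resulting model quality is not meaningful):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Tiny synthetic stand-in for the historical customer dataset
df = pd.DataFrame({
    "age":                [25, 40, 33, 51, 29, 60, 45, 38],
    "income":             [30000, 70000, 52000, 90000, 41000, 65000, 80000, 48000],
    "num_complaints":     [3, 0, 1, 0, 4, 2, 0, 5],
    "purchase_frequency": [2, 12, 8, 15, 1, 6, 10, 2],
    "churn":              [1, 0, 0, 0, 1, 0, 0, 1],  # target already encoded as 1/0
})

X = df.drop(columns="churn")
y = df["churn"]

# Hold out unseen data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit a logistic regression model to predict churn
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate with accuracy, precision, recall, and F1-score
print(classification_report(y_test, model.predict(X_test), zero_division=0))
```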
Data-Mining Project in IBM SPSS Modeler
IBM SPSS Modeler is a powerful tool for conducting data-mining projects, offering a visual interface and automated
workflows for building models. To conduct a data-mining project in SPSS Modeler:
1.Set Up the Project: Load the dataset into SPSS Modeler by connecting to your data source (e.g., CSV, database).
2.Prepare the Data: Use nodes such as the Type node to set the roles of variables. Specify which variables are inputs
(features) and which is the target (outcome).
3.Build the Model: Use various algorithms such as Decision Trees, Neural Networks, or Logistic Regression. Apply
appropriate models depending on the problem.
4.Evaluate the Model: Test the model's accuracy using a test set and validate using metrics such as accuracy or ROC
curves.
Build the Model: Set Roles in a Type Node
In IBM SPSS Modeler, the Type node is critical for defining the role of each field in the data. You can specify fields
as:
•Input: Features used to predict the target.
•Target: The variable you are predicting (e.g., customer churn).
•ID: Identifier fields (e.g., customer ID), not used in modeling.
•None: Fields that are ignored for the analysis.
By setting the roles appropriately, you ensure that SPSS applies the correct operations to each field during modeling.
Handy Operations: Filter Fields and Sort Records
Two helpful operations in SPSS Modeler are:
1.Filter Fields: The Filter node lets you remove irrelevant or redundant fields that could negatively impact model
performance. For instance, if some features are too correlated or irrelevant to the target variable, filtering them out helps.
2.Sort Records: The Sort node organizes data records by a specified field, such as sorting customers by the number of
purchases or the date of their last interaction. Sorting can help identify patterns or trends and can improve the clarity of
the analysis.
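These are visual nodes in SPSS Modeler rather than code, but for readers who work in Python, a rough pandas analogue of the two operations (with hypothetical column names) is:

```python
import pandas as pd

# Hypothetical customer records
df = pd.DataFrame({
    "customer_id":   [101, 102, 103],
    "num_purchases": [5, 12, 3],
    "last_contact":  pd.to_datetime(["2024-01-05", "2024-03-20", "2023-11-30"]),
    "redundant_col": ["x", "y", "z"],
})

# "Filter Fields": drop irrelevant or redundant columns
df = df.drop(columns=["redundant_col"])

# "Sort Records": order customers by number of purchases, descending
df = df.sort_values(by="num_purchases", ascending=False)

print(df)
```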