
UNIT 3

CLASSIFICATION AND PREDICTION


1. Issues Regarding Classification and Prediction
Classification and prediction are major tasks in data mining, where the goal is to
build models that can classify data or predict unknown values. However, several
challenges affect their accuracy and performance.

🔹 1. Data Quality Issues


 Incomplete Data: Missing values can mislead model training.
 Noisy Data: Errors or outliers in data reduce model accuracy.
 Inconsistent Data: Different formats or units across datasets can confuse
the classifier.

🔹 2. Overfitting and Underfitting


 Overfitting: Model performs well on training data but poorly on new data.
 Underfitting: Model is too simple to capture the data patterns.

🔹 3. Irrelevant or Redundant Features


 Too many unnecessary attributes can reduce accuracy.
 It makes the model more complex and harder to interpret.

🔹 4. Imbalanced Classes
 If one class has much more data than others, the model may ignore the
minority class.
 Common in fraud detection or disease prediction.

🔹 5. Lack of Sufficient Training Data


 Small datasets may not represent the real-world scenario properly.
 Leads to poor generalization and weak predictions.

🔹 6. Difficulty in Selecting the Right Model


 Different algorithms work better for different types of data.
 Choosing the wrong one can lead to poor performance.

🔹 7. Scalability and Efficiency


 When data is very large (like in data warehouses), classification algorithms
can become slow and resource-heavy.

🔹 8. Real-time Prediction Issues


 In some cases, predictions must be made instantly (e.g., fraud detection).
 Not all models are optimized for real-time decision-making.

🔹 9. Data Integration Problems


 When data comes from multiple sources, differences in schema or format
may affect prediction quality.

Conclusion:
Effective classification and prediction require clean, balanced, and well-
structured data along with the right model and enough training examples.
Overcoming these issues is crucial for building reliable and useful data mining
systems.

2. Classification by Decision Tree – Introduction


🔹 What is Classification?
 Classification is a data mining technique used to predict the class or
category of a given data point.
 It uses a model trained on historical (labeled) data to classify new data.

🔹1. What is a Decision Tree?


 A Decision Tree is a tree-shaped structure used for classification.
 It breaks the dataset down into smaller and smaller subsets while the
corresponding tree is incrementally developed.
 The final result is a tree where:
o Internal nodes represent tests or decisions on attributes
o Branches represent outcomes of those tests
o Leaf nodes represent class labels (final decision)

🔹2. Why Use Decision Trees for Classification?


 Easy to understand and interpret (like a flowchart)
 Handles both numerical and categorical data
 No need for domain knowledge
 Works well for large datasets
🔹 Basic Example:
Suppose we want to classify whether a person will play tennis or not based on
weather conditions:

Outlook    Temperature  Humidity  Wind    Play Tennis

Sunny      Hot          High      Weak    No
Overcast   Mild         High      Strong  Yes

A simple decision tree for this might be:


                [Outlook]
               /    |    \
          Sunny  Overcast  Rain
           /        |        \
    [Humidity]     Yes      [Wind]
      /    \               /     \
    High  Normal        Weak   Strong
     No     Yes          Yes      No

🔹 3. How Decision Trees Work


1. Start with the full dataset.
2. Choose the best attribute to split data (based on measures like Information
Gain or Gini Index).
3. Create branches for each value of the attribute.
4. Repeat recursively for each subset of data until:
o All records belong to the same class, or
o No more attributes are left.
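
To make these steps concrete, here is a minimal sketch in Python, assuming
scikit-learn and pandas are installed; the toy weather rows and the ordinal
encoding are illustrative, not part of the original example:

# Sketch: decision tree on toy weather data (scikit-learn assumed)
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":    ["Sunny", "Overcast", "Rain",   "Sunny"],
    "Humidity":   ["High",  "High",     "Normal", "Normal"],
    "Wind":       ["Weak",  "Strong",   "Weak",   "Weak"],
    "PlayTennis": ["No",    "Yes",      "Yes",    "Yes"],
})

X = OrdinalEncoder().fit_transform(data[["Outlook", "Humidity", "Wind"]])
y = data["PlayTennis"]

# criterion="entropy" splits by Information Gain (ID3/C4.5 style);
# the default "gini" corresponds to CART's Gini Index.
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["Outlook", "Humidity", "Wind"]))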
🔹 4. Key Algorithms Used
 ID3 (Iterative Dichotomiser 3): Uses Information Gain.
 C4.5: Improvement over ID3, handles missing values and pruning.
 CART (Classification and Regression Tree): Uses Gini Index.

🔹 5. Advantages
 Easy to understand and interpret.
 Can handle both numerical and categorical data.
 Performs well on large datasets.

🔹 6. Disadvantages
 Can overfit the data if the tree is too deep.
 Sensitive to small changes in data.
 Pruning is required to avoid complexity.

🔹 7. Applications
 Customer segmentation
 Medical diagnosis
 Fraud detection
 Credit risk analysis
Conclusion
Classification using decision trees is a powerful and popular method in data
mining.
It provides a clear and visual model that can be used for decision-making,
especially in areas like:
 Customer classification
 Fraud detection
 Medical diagnosis
 Loan approval

3. BAYESIAN CLASSIFICATION
🔹 Introduction
Bayesian Classification is a statistical approach to classification based on Bayes’
Theorem.
It is widely used in machine learning for predicting the class of a given data point
based on probability.

🔹 What is Bayes' Theorem?


Bayes' Theorem gives a way to calculate posterior probability using prior
knowledge.
P(H∣X) = P(X∣H) ⋅ P(H) / P(X)
Where:
 P(H∣X): Probability of hypothesis H given the data X (posterior)
 P(H): Probability of H being true (prior)
 P(X∣H): Probability of data X given hypothesis H (likelihood)
 P(X): Probability of data X (evidence)

🔹 Bayesian Classifier:
A Bayesian Classifier uses this formula to predict the most probable class for a
given input.
It assumes a probabilistic model of the data and calculates the probability for each
class.

🔹 Naïve Bayes Classifier


One of the most popular Bayesian classifiers is the Naïve Bayes Classifier.
It is based on a naïve assumption:
 All attributes are independent of each other, given the class label.
P(C∣X1,X2,...,Xn) = P(C)⋅P(X1∣C)⋅P(X2∣C)⋅…⋅P(Xn∣C) / P(X1,X2,...,Xn)

Where:
 C = class
 X1,X2,...,Xn = attribute values

🔹 Steps in Bayesian Classification:


1. Calculate prior probabilities for each class
2. Calculate conditional probabilities for each attribute given a class
3. Apply Bayes’ theorem to find the posterior probability
4. Choose the class with the highest posterior probability

🔹 Example: Email Spam Detection


Suppose we want to classify emails as Spam or Not Spam based on keywords like
"free", "win", "offer".
 Calculate:
o P(Spam), P(NotSpam)
o P(word∣Spam), P(word∣NotSpam)

Then apply Bayes’ theorem to predict the class of a new email.
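
A minimal sketch of this spam example, assuming scikit-learn is installed;
the four training emails below are made up for illustration:

# Sketch: Naive Bayes spam filter (scikit-learn assumed; corpus is made up)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free offer now", "free offer win",
          "meeting agenda for monday", "project status report"]
labels = ["Spam", "Spam", "NotSpam", "NotSpam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)      # word counts per email
model = MultinomialNB().fit(X, labels)    # estimates P(C) and P(word|C)

new_email = vectorizer.transform(["claim your free offer"])
print(model.predict(new_email))           # class with the highest posterior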

🔹 Advantages of Bayesian Classification:


✅Simple and fast.
✅Works well even with small datasets.
✅Handles missing data effectively.
✅Performs well in text classification, like spam filtering.

🔹 Disadvantages:
❌ Assumes independence between features (which is rarely true).
❌ May not work well with highly correlated attributes.
✅ Conclusion
Bayesian Classification is a powerful, probabilistic technique that uses prior
knowledge and statistical rules to classify data.
It is especially effective in domains like text mining, document classification, and
spam filtering.

4. RULE BASED CLASSIFICATION


🔹 What is Rule-Based Classification?
Rule-Based Classification is a method in data mining where a set of IF–THEN rules
is used to classify data into different categories (classes).
Each rule connects a set of conditions (on attribute values) to a class label.

🔹 Structure of a Rule
A rule has two parts:
IF <condition>
THEN <class label>
 Condition: a combination of attribute values
 Class label: the predicted class for data matching the condition

🔹 Example Rule
IF age > 50 AND cholesterol = high
THEN class = heart_disease
This rule means:
If a person is older than 50 and has high cholesterol, classify them as having heart
disease.

🔹 Rule-Based Classifier (Classifier = Rule Set)


The model built by rule-based classification is called a classifier, and it consists of
a set of IF-THEN rules.
When a new record is to be classified:
1. The system checks each rule to see if the condition is satisfied.
2. If a rule matches, it assigns the class in the THEN part to the data.
3. If multiple rules match, a conflict resolution strategy is used (e.g., first
match, most specific, highest confidence).
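
A minimal first-match rule engine, sketched in plain Python; the rules and
the sample record are illustrative:

# Sketch: rule-based classifier with first-match conflict resolution
rules = [
    (lambda r: r["age"] > 50 and r["cholesterol"] == "high", "heart_disease"),
    (lambda r: r["age"] <= 30 and r["cholesterol"] == "normal", "healthy"),
]

def classify(record, default="unknown"):
    for condition, label in rules:
        if condition(record):     # first matching rule wins
            return label
    return default                # no rule fired

print(classify({"age": 62, "cholesterol": "high"}))   # -> heart_disease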

🔹 How Are Rules Generated?


Rules can be:
 Extracted from decision trees
 Learned directly from training data using algorithms like:
o RIPPER (Repeated Incremental Pruning to Produce Error Reduction)
o FOIL (First Order Inductive Learner)
o CN2

🔹 Advantages of Rule-Based Classification:


✅Easy to understand and interpret
✅Can handle discrete and continuous data
✅Rules are explicit and human-readable
✅ Supports incremental learning (rules can be updated)

🔹 Disadvantages:
❌May overfit the training data
❌Conflict may occur when multiple rules match
❌ Rule generation may become complex if data is large or noisy

🔹 Example Use Cases:


 Medical diagnosis
 Fraud detection
 Spam email classification
 Customer segmentation
✅ Conclusion
Rule-Based Classification is a transparent and intuitive method that uses a set of
IF–THEN rules to predict class labels.
It is useful in domains where interpretability and clarity are important, making it
a popular choice in expert systems and knowledge-based applications.
5. CLASSIFICATION BY BACKPROPAGATION
🔹 Introduction
 Backpropagation is a supervised learning algorithm used for training
artificial neural networks (ANNs).
 It is mainly used in classification problems, such as image recognition,
speech recognition, and medical diagnosis.
 The term “backpropagation” stands for "backward propagation of error".

🔹 What is an Artificial Neural Network (ANN)?


 An ANN is a computational model inspired by the human brain.
 It consists of neurons (nodes) arranged in layers:
o Input Layer: Takes input features
o Hidden Layer(s): Performs intermediate processing
o Output Layer: Produces final class label or output

🔹 How Backpropagation Works


It has two main phases:
1. Forward Pass:
 The input is passed through the network layer by layer.
 The output is calculated using activation functions (like sigmoid or ReLU).
 The output is compared to the actual (target) value.
 Error is calculated using a loss function (e.g., Mean Squared Error).
2. Backward Pass (Backpropagation):
 The error is propagated backward from output to input.
 The network adjusts the weights of the connections to reduce the error.
 This is done using a method called Gradient Descent.

🔹 Steps in Backpropagation Algorithm


1. Initialize weights randomly.
2. Forward propagate input to compute output.
3. Calculate error (Loss) between predicted and actual output.
4. Backpropagate the error:
o Compute gradient of the loss with respect to each weight.
5. Update weights using gradient descent:
o w_new = w_old − η · (∂Error / ∂w)
o where η is the learning rate.
6. Repeat the process for multiple iterations (epochs) until error is minimized.
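
A minimal NumPy sketch of these steps for a tiny 2-4-1 network; the
XOR-style data, network size, learning rate, and epoch count are illustrative:

# Sketch: backpropagation with gradient descent (NumPy assumed)
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # targets (XOR)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))  # hidden -> output
eta = 0.5                                           # learning rate

for epoch in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error gradient through the sigmoids
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates: w_new = w_old - eta * dError/dw
    W2 -= eta * (h.T @ d_out); b2 -= eta * d_out.sum(axis=0, keepdims=True)
    W1 -= eta * (X.T @ d_h);   b1 -= eta * d_h.sum(axis=0, keepdims=True)

print(out.round(2))  # should move toward [[0], [1], [1], [0]] (varies with seed)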

🔹 Example:
If you're classifying handwritten digits (0–9), each image is input to the network,
and the output layer will have 10 nodes (one for each digit).
The node with the highest value is the predicted digit.

🔹 Advantages of Backpropagation:
✅Works well for complex, non-linear classification problems
✅Can be used for both binary and multi-class classification
✅Learns from data without needing explicit rules

🔹 Disadvantages:
❌Requires large amounts of training data
❌Can be computationally expensive
❌May get stuck in local minima
❌ Sensitive to choice of learning rate

✅ Conclusion
Backpropagation is the core algorithm for training neural networks and is widely
used for classification tasks in modern AI applications.
It uses error signals to adjust weights, helping the network to learn patterns in
the data effectively.

6. SUPPORT VECTOR MACHINES


🔹 Introduction
 Support Vector Machine (SVM) is a supervised machine learning algorithm
used for classification and regression tasks.
 It is especially powerful for binary classification problems.
 SVM aims to find the best decision boundary (also called hyperplane) that
separates the classes in the dataset.

🔹 Basic Concept
 SVM looks for a hyperplane that:
o Separates the data points of different classes with the maximum
margin.
o Margin = distance between the hyperplane and the nearest data
points from both classes (called support vectors).
 The wider the margin, the better the generalization on unseen data.

🔹 Important Terms

Term              Meaning

Hyperplane        A line (in 2D) or a plane (in higher dimensions) that
                  separates the classes

Support Vectors   The data points closest to the hyperplane; they “support”
                  the decision boundary

Margin            The distance between the hyperplane and the support vectors

🔹 Types of SVM
1. Linear SVM:
o Used when the data is linearly separable.
o Finds a straight-line hyperplane.
2. Non-Linear SVM:
o Used when data cannot be separated by a straight line.
o Uses a Kernel Trick to map data to a higher-dimensional space where
it can be separated linearly.
🔹 Kernel Functions
Kernels help transform data into higher dimensions. Common ones include:
 Linear Kernel
 Polynomial Kernel
 Radial Basis Function (RBF) / Gaussian Kernel
 Sigmoid Kernel

🔹 How SVM Works – Steps


1. Plot the data in a space.
2. Choose the best hyperplane that separates the classes with maximum
margin.
3. If not linearly separable, apply kernel function.
4. Use optimization to minimize classification error and maximize margin.
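
A minimal sketch of these steps, assuming scikit-learn is installed and
using a built-in toy dataset that is not linearly separable:

# Sketch: SVM with the RBF kernel trick (scikit-learn assumed)
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # non-linear data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel="rbf" maps the data implicitly to a higher-dimensional space;
# C trades off a wider margin against classification errors.
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))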

🔹 Advantages of SVM
✅ Works well in high-dimensional spaces
✅ Effective when number of features > number of samples
✅ Robust to overfitting (especially with proper kernel and regularization)
✅ Can handle non-linear data using kernels

🔹 Disadvantages
❌ Not suitable for large datasets (slow training)
❌ Choice of kernel and parameters is critical
❌ Doesn’t work well if classes overlap a lot
🔹 Example Use Cases
 Email spam detection
 Face recognition
 Disease diagnosis
 Text classification

✅ Conclusion
Support Vector Machines are powerful classifiers that aim to create a decision
boundary with maximum margin between classes.
They are highly accurate and work well in complex, high-dimensional
classification tasks, making them popular in many real-world applications.

7. Associative Classification (AC)


🔹 Introduction
 Associative Classification is a classification technique that combines two
data mining approaches:
1. Association Rule Mining
2. Classification
 Instead of using IF–THEN rules derived from decision trees, AC generates
classification rules using association rules.
 It is based on the idea that strong associations between attribute-value
pairs and class labels can be used to classify new data.

🔹 What is Association Rule Mining?


Association rule mining is about finding frequent patterns or relationships in data,
typically in the form:
IF A AND B THEN C
In Associative Classification, such rules must predict a class label:
IF (X1 = v1) AND (X2 = v2) THEN class = C
These are called Class Association Rules (CARs).
🔹 How Associative Classification Works
1. Mine all frequent itemsets where one item is a class label.
2. From those itemsets, generate class association rules (CARs).
3. Use a rule selection and ordering strategy (like confidence, support) to
select the best rules.
4. Apply rules to classify new data. The rule that matches the new record and
has the highest confidence is typically chosen.

🔹 Example:
Suppose we have transaction data about customers buying products and their
purchase category:
IF (buys = milk) AND (buys = bread) THEN class = grocery
This means: If a customer buys both milk and bread, they are likely classified
under “grocery” category.
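
A minimal sketch of applying already-mined CARs, in plain Python; the rules,
confidence values, and sample basket are illustrative:

# Sketch: classify with class association rules (highest confidence wins)
cars = [
    ({"milk", "bread"}, "grocery", 0.90),      # (items, class, confidence)
    ({"milk"}, "grocery", 0.70),
    ({"shampoo"}, "personal_care", 0.85),
]

def classify(basket, default="unknown"):
    # Keep rules whose item set is fully contained in the basket
    matches = [(conf, label) for items, label, conf in cars if items <= basket]
    return max(matches)[1] if matches else default

print(classify({"milk", "bread", "eggs"}))   # -> grocery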

🔹 Benefits of Associative Classification


✅Generates highly accurate rules
✅Uses frequent patterns found in the data
✅Rules are interpretable and easy to understand
✅More flexible than decision tree rules
🔹 Challenges / Disadvantages
❌Can generate too many rules → need pruning
❌Rule selection can be difficult when multiple rules match
❌Requires efficient mining of class association rules
🔹 Common Algorithms Used
 CBA (Classification Based on Associations)
 CMAR (Classification based on Multiple Association Rules)
 CPAR (Classification based on Predictive Association Rules)

🔹 Applications
 Market basket analysis
 Customer behavior prediction
 Text and document classification
 Medical diagnosis

✅ Conclusion
Associative Classification is a powerful hybrid method that builds accurate,
interpretable classifiers using association rules linked to class labels.
By combining the strengths of association mining and classification, it is well-
suited for domains that require strong rule-based prediction.

8. LAZY LEARNERS
🔹 Introduction
In machine learning, algorithms are generally categorized into two types based on
when they generalize the model:
 Eager Learners: Build the model during training (e.g., Decision Tree, SVM)
 Lazy Learners: Delay model building until a query (test instance) is made

🔹 What is a Lazy Learner?


 A Lazy Learner is a type of learning algorithm that stores the training data
and waits until a test (query) comes in to do any real computation.
 It does not build a global model in advance.
 It performs local generalization at prediction time.

🔹 Working of Lazy Learners


1. During training:
o It stores the entire training dataset as it is.
o No training or model construction happens.
2. During testing:
o It compares the new test instance to all stored training instances.
o It finds the most similar examples and uses them to make a
prediction.

🔹 Popular Lazy Learning Algorithm: K-Nearest Neighbors (KNN)


 KNN is the most common lazy learner.
 It predicts the class of a test instance based on the majority class of the 'k'
closest neighbors.
Example:
If a new fruit looks similar to 3 apples and 2 oranges in the dataset, and k=5, it is
classified as an apple.
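
A minimal sketch of this fruit example, assuming scikit-learn is installed;
the two numeric features (weight and a color score) are made up:

# Sketch: KNN lazy learner (scikit-learn assumed; features are made up)
from sklearn.neighbors import KNeighborsClassifier

X_train = [[150, 0.80], [160, 0.75], [155, 0.82],   # apples: [weight_g, redness]
           [120, 0.40], [115, 0.45]]                # oranges
y_train = ["apple", "apple", "apple", "orange", "orange"]

# "Training" merely stores the data; distances are computed at query time
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.predict([[148, 0.70]]))   # majority class of the 5 nearest neighbors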

🔹 Advantages of Lazy Learners


✅No need for long training time
✅Very simple to implement
✅Naturally handles dynamic data
✅Can model complex decision boundaries

🔹 Disadvantages of Lazy Learners


❌Slow at prediction time (because they search the entire dataset)
❌Need large memory to store all training data
❌Sensitive to irrelevant features or noise
❌Poor performance with large datasets

🔹 Comparison: Lazy vs Eager Learners

Feature            Lazy Learner          Eager Learner

Model Building     Delayed (at query)    Done during training

Example            KNN, Case-based       Decision Tree, Naive Bayes

Speed (Training)   Fast                  Slow

Speed (Testing)    Slow                  Fast

Memory Usage       High                  Usually Low

🔹 Applications
 Recommender systems
 Pattern recognition
 Handwriting and face recognition
 Medical diagnosis (case-based reasoning)
✅ Conclusion
Lazy Learners are a simple but powerful approach in machine learning that defer
computation until prediction time.
They are useful when training time is limited, but they require efficient search
and memory handling for large datasets.

9. OTHER CLASSIFICATION METHODS


🔹 Introduction
Classification is a fundamental task in machine learning, and several algorithms
are available to solve classification problems. Besides the popular ones like
Decision Trees, SVM, and Naïve Bayes, other classification methods include k-
Nearest Neighbors (KNN), Logistic Regression, Random Forests, and Artificial
Neural Networks (ANNs).

🔹 1. k-Nearest Neighbors (KNN)


 Instance-based learning method (a type of lazy learner).
 How it works: KNN classifies a test instance based on the majority class of
its 'k' nearest neighbors in the training dataset.
 Training process: No explicit model is built; the algorithm simply stores
the training data.
 Prediction process: For each new test instance, it finds the k closest
training instances (using distance metrics like Euclidean distance) and
assigns the most common class.
Advantages:
 Simple and intuitive
 Works well with small datasets
Disadvantages:
 Slow for large datasets
 Sensitive to irrelevant features

🔹 2. Logistic Regression
 Logistic Regression is a statistical method used for binary classification
(though it can be extended to multiclass classification).
 How it works: It estimates the probability that a given input belongs to a
certain class using the sigmoid function:
P(y=1∣X) = 1 / (1 + e^−(b0 + b1x1 + b2x2 + … + bnxn))
where y is the class label and X is the input vector.
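
A minimal sketch, assuming scikit-learn is installed; the one-feature dataset
below is made up:

# Sketch: logistic regression for binary classification (scikit-learn assumed)
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
# predict_proba applies the sigmoid 1 / (1 + e^-(b0 + b1*x)) internally
print(model.predict_proba([[3.5]]))   # [P(y=0), P(y=1)] for x = 3.5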
Advantages:
 Simple and efficient
 Interpretable model
Disadvantages:
 Assumes a linear relationship between inputs and output
 Performs poorly with highly non-linear data

🔹 3. Random Forest
 Random Forest is an ensemble learning method that uses multiple decision
trees.
 How it works: It creates many decision trees by randomly selecting subsets
of features and data. Then, for classification, it uses the majority vote from
all the trees.
 Random forests are a combination of bagging and random feature
selection.
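
A minimal sketch, assuming scikit-learn is installed and using a built-in
dataset:

# Sketch: random forest = bagging + random feature selection (scikit-learn assumed)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each fit on a bootstrap sample; prediction is the majority vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))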
Advantages:
 Reduces overfitting by averaging multiple trees
 Works well with both categorical and continuous data
Disadvantages:
 Complex and slower for prediction
 Difficult to interpret (black-box model)

🔹 4. Artificial Neural Networks (ANNs)


 ANNs are inspired by the structure of the human brain and consist of layers
of interconnected nodes (neurons).
 How it works: Each neuron in a layer is connected to neurons in the next
layer, and these connections have weights. The network learns by adjusting
these weights during training (typically using backpropagation and gradient
descent).
 ANNs are flexible models that can handle complex and non-linear
classification problems.
Advantages:
 Highly flexible and capable of modeling complex patterns
 Works well for large datasets in tasks like image, speech, and text
classification
Disadvantages:
 Computationally expensive
 Hard to interpret (black-box model)

🔹 5. Decision Trees (Alternative Method)


 A Decision Tree is a flowchart-like tree structure where:
o Each internal node represents a test on an attribute (e.g., is age >
30?).
o Each branch represents the outcome of the test.
o Each leaf node represents a class label.
Advantages:
 Easy to understand and interpret
 Works well with both categorical and numerical data
Disadvantages:
 Prone to overfitting, especially with noisy data
 Sensitive to small variations in the data (can cause different trees)

🔹 6. Naïve Bayes (Alternative Method)


 A probabilistic classifier based on Bayes' Theorem.
 How it works: Assumes that features are independent, and calculates the
probability of each class label given the input features. The class with the
highest probability is selected as the predicted label.
Advantages:
 Simple and fast
 Works well with high-dimensional data (e.g., text classification)
Disadvantages:
 Assumes feature independence, which is often not true
 Can perform poorly if the assumptions are violated

✅ Conclusion
In addition to popular classification methods like SVM and decision trees, there
are many other effective classification algorithms like KNN, logistic regression,
random forests, and artificial neural networks.
Each method has its strengths and weaknesses, and the choice of which to use
depends on the data characteristics, problem complexity, and interpretability
needs.

10. PREDICTION IN DATA MINING


🔹 Introduction
 Prediction is a data mining technique used to forecast future data values
or trends based on historical data.
 It is a type of supervised learning where the goal is to predict a continuous
(numerical) value, unlike classification which predicts categorical labels.

🔹 What is Prediction?
 Prediction involves building a model that maps input variables (features) to
a target output (numeric value).
Example: Predicting house prices, sales revenue, stock prices, or temperature.

🔹 Difference: Classification vs Prediction

Feature           Classification                Prediction (Regression)
Output type       Categorical (e.g., yes/no)    Continuous (e.g., price = 500)
Example           Spam or not spam              Predicting future sales
Algorithm type    Classification algorithms     Regression algorithms
🔹 Common Prediction Techniques


1. Linear Regression
o Finds a straight-line relationship between input (X) and output (Y).
o Equation:
Y = b0 + b1X1 + b2X2 + … + bnXn
2. Multiple Regression
o Like linear regression but uses multiple input variables to predict one
output variable.
3. Non-linear Regression
o Used when the relationship between input and output is non-linear.
4. Regression Trees
o Similar to decision trees, but used for predicting numeric values
instead of class labels.
5. Neural Networks
o Used for complex prediction tasks with non-linear relationships.

🔹 Steps in Prediction
1. Collect and clean data
2. Select relevant features
3. Choose a prediction model (e.g., linear regression)
4. Train the model using historical data
5. Test the model with new/unseen data
6. Use the model to predict future values
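
A minimal sketch of these steps, assuming scikit-learn is installed; the
house-size and price figures are made up:

# Sketch: numeric prediction with linear regression (scikit-learn assumed)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.array([[800], [1000], [1200], [1500], [1800], [2000]])   # size (sq ft)
y = np.array([120000, 150000, 175000, 215000, 255000, 280000])  # price

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)   # learns Y = b0 + b1*X

print(model.predict([[1600]]))   # forecast the price of a 1600 sq ft house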

🔹 Applications of Prediction
 Sales forecasting
 Weather prediction
 Stock market trends
 Risk assessment in finance and healthcare
 Predicting customer behavior

🔹 Advantages
✅Helps in decision-making using future forecasts
✅Identifies hidden patterns in data
✅Can lead to cost savings and better planning

🔹 Challenges
❌Requires clean and sufficient data
❌Models can be affected by noise or outliers
❌May become inaccurate over time if data changes

✅ Conclusion
Prediction is an essential data mining task used to forecast continuous values
based on patterns learned from past data.
It is widely used in business, science, and industry to support planning,
forecasting, and optimization.

11. Accuracy and Error Measures in Classification and Prediction


🔹 Introduction
In data mining and machine learning, once a model is built, it is
important to evaluate its performance.
We use accuracy and error measures to find out how well the
model is working.
There are two main goals in evaluation:
 For classification: How many predictions are correct or
incorrect?
 For prediction (regression): How close are the predicted
values to the actual values?

🔹 1. Accuracy (for Classification)


 Accuracy measures the percentage of correct predictions
made by the model.
Accuracy = (Number of Correct Predictions / Total Predictions) × 100
Example: If a model makes 80 correct predictions out of 100,
Accuracy = (80 / 100) × 100 = 80%

🔹 2. Confusion Matrix (for Classification)


A confusion matrix helps evaluate a classifier's performance
using four values:
                   Predicted Positive     Predicted Negative

Actual Positive    True Positive (TP)     False Negative (FN)

Actual Negative    False Positive (FP)    True Negative (TN)

From this, we can calculate:
 Precision = TP / (TP + FP) → How many predicted positives were correct
 Recall = TP / (TP + FN) → How many actual positives were correctly predicted
 F1-Score = Harmonic mean of Precision and Recall
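
A minimal sketch of computing these measures, assuming scikit-learn is
installed; the two label vectors are made up:

# Sketch: classification accuracy and related measures (scikit-learn assumed)
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model's predictions

print(confusion_matrix(y_true, y_pred))    # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))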

🔹 3. Error Rate
 Error Rate = 1 − Accuracy
It shows the percentage of wrong predictions.
Error Rate = (Number of Incorrect Predictions / Total Predictions) × 100

🔹 4. Measures for Prediction (Numerical Data)


For prediction or regression problems, we use different error
measures to compare predicted values and actual values:
a) Mean Absolute Error (MAE)
 Measures the average of absolute errors.
MAE = (1/n) · Σ (i=1 to n) |yi − ŷi|
Where:
 yi = actual value
 ŷi = predicted value
 n = total number of data points
b) Mean Squared Error (MSE)
 Measures the average of squared differences between actual
and predicted values.
MSE = (1/n) · Σ (i=1 to n) (yi − ŷi)²
c) Root Mean Squared Error (RMSE)
 The square root of MSE; it gives the error in the same unit as the
original data.
RMSE = √MSE
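
A minimal NumPy sketch of the three error measures; the actual and predicted
values are made up:

# Sketch: MAE, MSE, and RMSE (NumPy assumed)
import numpy as np

y_actual    = np.array([500.0, 520.0, 480.0, 610.0])
y_predicted = np.array([510.0, 505.0, 490.0, 600.0])

mae  = np.mean(np.abs(y_actual - y_predicted))   # average absolute error
mse  = np.mean((y_actual - y_predicted) ** 2)    # average squared error
rmse = np.sqrt(mse)                              # same units as the data

print(mae, mse, rmse)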

🔹 5. R-Squared (R²) – for Regression


 Measures how well the predicted values fit the actual data.
 Values typically range from 0 to 1; the closer to 1, the better
the fit.
✅ Conclusion

Accuracy and error measures are important to check how well a model
performs.
They help us understand if a model is good, needs improvement, or is
overfitting/underfitting.
Different types of models use different measures depending on whether the
output is a label or a number.
12. Ensemble Methods
🔹 Introduction
 Ensemble methods are techniques that combine multiple models (called
weak learners) to create a stronger and more accurate model.
 The idea is that a group of weak models can perform better together than
any single model alone.

🔹 Why Use Ensemble Methods?


 A single classifier may make errors.
 Combining several classifiers helps:
o Increase accuracy
o Reduce overfitting
o Handle noise or complex patterns in data

🔹 Types of Ensemble Methods


1. Bagging (Bootstrap Aggregating)
 Multiple models are trained using different random samples of the original
dataset (with replacement).
 Their results are combined (usually by voting).
 Reduces variance and improves stability.
✅ Popular Example: Random Forest

2. Boosting
 Models are trained sequentially, each one learning from the errors of the
previous one.
 Focuses more on difficult cases.
 Final output is a weighted sum of all models.
✅ Popular Examples: AdaBoost, Gradient Boosting, XGBoost

3. Stacking (Stacked Generalization)


 Combines multiple models (called base learners), and then uses another
model (called a meta-learner) to make the final prediction.
 Can use different types of algorithms together.
✅ Example: Combining Decision Tree, SVM, and Logistic Regression, with a Neural
Network as a meta-learner.
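
A minimal sketch of all three approaches, assuming scikit-learn is installed
and using a built-in dataset:

# Sketch: bagging, boosting, and stacking (scikit-learn assumed)
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "bagging (random forest)": RandomForestClassifier(random_state=0),
    "boosting (AdaBoost)": AdaBoostClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
        final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
    ),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))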

🔹 Comparison Table
Method     Approach                        Key Benefit
Bagging    Parallel, uses random subsets   Reduces variance
Boosting   Sequential, focuses on errors   Improves accuracy
Stacking   Combines different models       Uses model diversity

🔹 Advantages of Ensemble Methods


✅Higher accuracy than individual models
✅Better generalization to new data
✅ Less likely to overfit (especially with bagging)

🔹 Disadvantages
❌More complex and slower
❌Harder to interpret
❌ Requires more resources (memory, computation)

✅ Conclusion
Ensemble methods are powerful techniques in data mining and machine learning.
By combining multiple weak models, they achieve better performance,
robustness, and accuracy than single classifiers.

13. Model Selection in Data Mining


🔹 Introduction
Model selection is the process of choosing the best model from a set of candidate
models for solving a specific data mining problem (such as classification or
prediction).
It helps ensure the model is accurate, efficient, and generalizes well to unseen
data.

🔹 Why Model Selection is Important


 There are many algorithms available (e.g., decision tree, SVM, neural
networks).
 Not every model works well for every type of dataset.
 We need to select the model that gives the best performance based on the
problem and data.

🔹 Steps in Model Selection


1. Define the Problem
o Is it classification, regression, clustering, etc.?
o Understand the data type, target variable, and business goal.
2. Choose Candidate Models
o Pick a few algorithms that are suitable (e.g., Decision Tree, SVM,
Naive Bayes for classification).
3. Split the Dataset
o Use training and testing sets (e.g., 70% training, 30% testing).
o Optionally use cross-validation for better evaluation.
4. Train the Models
o Fit the candidate models to the training data.
5. Evaluate Models
o Use evaluation metrics:
 For classification: Accuracy, Precision, Recall, F1-score
 For prediction: MAE, MSE, RMSE, R²
6. Compare Performance
o Choose the model that performs best on testing data.
o Consider both accuracy and complexity.
7. Select the Best Model
o Finalize the model and tune its hyperparameters if needed.

🔹 Model Selection Techniques


✅ A. Hold-Out Method
 Divide dataset into training and testing.
 Simple and fast.
✅ B. Cross-Validation
 Data is split into k folds.
 Each fold is used once as test, rest as training.
 More reliable.
✅ C. Bootstrapping
 Random sampling with replacement to create training sets.
 Useful with small datasets.
🔹 Bias-Variance Trade-off
 Good model selection balances:
o Bias (error due to simplifying assumptions)
o Variance (sensitivity to small changes in data)
A good model has low bias and low variance.
🔹 Factors Affecting Model Selection
 Size of data
 Noise in data
 Interpretability of model
 Training time and resources
 Scalability

🔹 Example
For email spam detection:
 Try models like Naive Bayes, Decision Tree, SVM.
 Use 10-fold cross-validation.
 Evaluate each model and choose the one with the highest accuracy and
lowest error.
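
A minimal sketch of this comparison, assuming scikit-learn is installed;
a built-in dataset stands in for the spam data:

# Sketch: model selection via 10-fold cross-validation (scikit-learn assumed)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")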

✅ Conclusion
Model selection is a critical step in data mining.
By carefully comparing models using appropriate evaluation techniques, we can
choose a model that gives the best performance for our task while ensuring it
generalizes well to new data.
