
UNIT 3

CLASSIFICATION AND PREDICTION


1. Issues Regarding Classification and Prediction
Classification and prediction are major tasks in data mining, where the goal is to
build models that can classify data or predict unknown values. However, several
challenges affect their accuracy and performance.

🔹 1. Data Quality Issues


 Incomplete Data: Missing values can mislead model training.
 Noisy Data: Errors or outliers in data reduce model accuracy.
 Inconsistent Data: Different formats or units across datasets can confuse
the classifier.

🔹 2. Overfitting and Underfitting


 Overfitting: Model performs well on training data but poorly on new data.
 Underfitting: Model is too simple to capture the data patterns.

🔹 3. Irrelevant or Redundant Features


 Too many unnecessary attributes can reduce accuracy.
 It makes the model more complex and harder to interpret.

🔹 4. Imbalanced Classes
 If one class has much more data than others, the model may ignore the
minority class.
 Common in fraud detection or disease prediction.

🔹 5. Lack of Sufficient Training Data


 Small datasets may not represent the real-world scenario properly.
 Leads to poor generalization and weak predictions.

🔹 6. Difficulty in Selecting the Right Model


 Different algorithms work better for different types of data.
 Choosing the wrong one can lead to poor performance.

🔹 7. Scalability and Efficiency


 When data is very large (like in data warehouses), classification algorithms
can become slow and resource-heavy.

🔹 8. Real-time Prediction Issues


 In some cases, predictions must be made instantly (e.g., fraud detection).
 Not all models are optimized for real-time decision-making.

🔹 9. Data Integration Problems


 When data comes from multiple sources, differences in schema or format
may affect prediction quality.

Conclusion:
Effective classification and prediction require clean, balanced, and well-
structured data along with the right model and enough training examples.
Overcoming these issues is crucial for building reliable and useful data mining
systems.

2. Classification by Decision Tree – Introduction


🔹 What is Classification?
 Classification is a data mining technique used to predict the class or
category of a given data point.
 It uses a model trained on historical (labeled) data to classify new data.

🔹1. What is a Decision Tree?


 A Decision Tree is a tree-shaped structure used for classification.
 It breaks the dataset down into smaller and smaller subsets while the
corresponding tree is incrementally developed.
 The final result is a tree where:
o Internal nodes represent tests or decisions on attributes
o Branches represent outcomes of those tests
o Leaf nodes represent class labels (final decision)

🔹2. Why Use Decision Trees for Classification?


 Easy to understand and interpret (like a flowchart)
 Handles both numerical and categorical data
 No need for domain knowledge
 Works well for large datasets
🔹 Basic Example:
Suppose we want to classify whether a person will play tennis or not based on
weather conditions:

Outlook    Temperature  Humidity  Wind    Play Tennis

Sunny      Hot          High      Weak    No
Overcast   Mild         High      Strong  Yes

A simple decision tree for this might be:


                [Outlook]
               /    |    \
          Sunny  Overcast  Rain
           /        |        \
    [Humidity]     Yes      [Wind]
      /    \               /     \
    High  Normal        Weak   Strong
     No     Yes          Yes      No

🔹 3. How Decision Trees Work


1. Start with the full dataset.
2. Choose the best attribute to split data (based on measures like Information
Gain or Gini Index).
3. Create branches for each value of the attribute.
4. Repeat recursively for each subset of data until:
o All records belong to the same class, or
o No more attributes are left.
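
To make these steps concrete, here is a minimal sketch in Python, assuming
scikit-learn and pandas are installed; the toy weather rows and the ordinal
encoding are illustrative, not part of the original example:

# Sketch: decision tree on toy weather data (scikit-learn assumed)
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":    ["Sunny", "Overcast", "Rain",   "Sunny"],
    "Humidity":   ["High",  "High",     "Normal", "Normal"],
    "Wind":       ["Weak",  "Strong",   "Weak",   "Weak"],
    "PlayTennis": ["No",    "Yes",      "Yes",    "Yes"],
})

X = OrdinalEncoder().fit_transform(data[["Outlook", "Humidity", "Wind"]])
y = data["PlayTennis"]

# criterion="entropy" splits by Information Gain (ID3/C4.5 style);
# the default "gini" corresponds to CART's Gini Index.
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["Outlook", "Humidity", "Wind"]))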
🔹 4. Key Algorithms Used
 ID3 (Iterative Dichotomiser 3): Uses Information Gain.
 C4.5: Improvement over ID3, handles missing values and pruning.
 CART (Classification and Regression Tree): Uses Gini Index.

🔹 5. Advantages
 Easy to understand and interpret.
 Can handle both numerical and categorical data.
 Performs well on large datasets.

🔹 6. Disadvantages
 Can overfit the data if the tree is too deep.
 Sensitive to small changes in data.
 Pruning is required to avoid complexity.

🔹 7. Applications
 Customer segmentation
 Medical diagnosis
 Fraud detection
 Credit risk analysis
Conclusion
Classification using decision trees is a powerful and popular method in data
mining.
It provides a clear and visual model that can be used for decision-making,
especially in areas like:
 Customer classification
 Fraud detection
 Medical diagnosis
 Loan approval

3. BAYESIAN CLASSIFICATION
🔹 Introduction
Bayesian Classification is a statistical approach to classification based on Bayes’
Theorem.
It is widely used in machine learning for predicting the class of a given data point
based on probability.

🔹 What is Bayes' Theorem?


Bayes' Theorem gives a way to calculate posterior probability using prior
knowledge.
P(H∣X) = P(X∣H) ⋅ P(H) / P(X)
Where:
 P(H∣X): Probability of hypothesis H given the data X (posterior)
 P(H): Probability of H being true (prior)
 P(X∣H): Probability of data X given hypothesis H (likelihood)
 P(X): Probability of data X (evidence)

🔹 Bayesian Classifier:
A Bayesian Classifier uses this formula to predict the most probable class for a
given input.
It assumes a probabilistic model of the data and calculates the probability for each
class.

🔹 Naïve Bayes Classifier


One of the most popular Bayesian classifiers is the Naïve Bayes Classifier.
It is based on a naïve assumption:
 All attributes are independent of each other, given the class label.
P(C∣X1,X2,...,Xn) = P(C)⋅P(X1∣C)⋅P(X2∣C)⋅…⋅P(Xn∣C) / P(X1,X2,...,Xn)

Where:
 C = class
 X1,X2,...,Xn = attribute values

🔹 Steps in Bayesian Classification:


1. Calculate prior probabilities for each class
2. Calculate conditional probabilities for each attribute given a class
3. Apply Bayes’ theorem to find the posterior probability
4. Choose the class with the highest posterior probability

🔹 Example: Email Spam Detection


Suppose we want to classify emails as Spam or Not Spam based on keywords like
"free", "win", "offer".
 Calculate:
o P(Spam), P(NotSpam)
o P(word∣Spam), P(word∣NotSpam)

Then apply Bayes’ theorem to predict the class of a new email.
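
A minimal sketch of this spam example, assuming scikit-learn is installed;
the four training emails below are made up for illustration:

# Sketch: Naive Bayes spam filter (scikit-learn assumed; corpus is made up)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free offer now", "free offer win",
          "meeting agenda for monday", "project status report"]
labels = ["Spam", "Spam", "NotSpam", "NotSpam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)      # word counts per email
model = MultinomialNB().fit(X, labels)    # estimates P(C) and P(word|C)

new_email = vectorizer.transform(["claim your free offer"])
print(model.predict(new_email))           # class with the highest posterior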

🔹 Advantages of Bayesian Classification:


✅Simple and fast.
✅Works well even with small datasets.
✅Handles missing data effectively.
✅Performs well in text classification, like spam filtering.

🔹 Disadvantages:
❌ Assumes independence between features (which is rarely true).
❌ May not work well with highly correlated attributes.
✅ Conclusion
Bayesian Classification is a powerful, probabilistic technique that uses prior
knowledge and statistical rules to classify data.
It is especially effective in domains like text mining, document classification, and
spam filtering.

4. RULE BASED CLASSIFICATION


🔹 What is Rule-Based Classification?
Rule-Based Classification is a method in data mining where a set of IF–THEN rules
is used to classify data into different categories (classes).
Each rule connects a set of conditions (on attribute values) to a class label.

🔹 Structure of a Rule
A rule has two parts:
IF <condition>
THEN <class label>
 Condition: a combination of attribute values
 Class label: the predicted class for data matching the condition

🔹 Example Rule
IF age > 50 AND cholesterol = high
THEN class = heart_disease
This rule means:
If a person is older than 50 and has high cholesterol, classify them as having heart
disease.

🔹 Rule-Based Classifier (Classifier = Rule Set)


The model built by rule-based classification is called a classifier, and it consists of
a set of IF-THEN rules.
When a new record is to be classified:
1. The system checks each rule to see if the condition is satisfied.
2. If a rule matches, it assigns the class in the THEN part to the data.
3. If multiple rules match, a conflict resolution strategy is used (e.g., first
match, most specific, highest confidence).
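
A minimal first-match rule engine, sketched in plain Python; the rules and
the sample record are illustrative:

# Sketch: rule-based classifier with first-match conflict resolution
rules = [
    (lambda r: r["age"] > 50 and r["cholesterol"] == "high", "heart_disease"),
    (lambda r: r["age"] <= 30 and r["cholesterol"] == "normal", "healthy"),
]

def classify(record, default="unknown"):
    for condition, label in rules:
        if condition(record):     # first matching rule wins
            return label
    return default                # no rule fired

print(classify({"age": 62, "cholesterol": "high"}))   # -> heart_disease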

🔹 How Are Rules Generated?


Rules can be:
 Extracted from decision trees
 Learned directly from training data using algorithms like:
o RIPPER (Repeated Incremental Pruning to Produce Error Reduction)
o FOIL (First Order Inductive Learner)
o CN2

🔹 Advantages of Rule-Based Classification:


✅Easy to understand and interpret
✅Can handle discrete and continuous data
✅Rules are explicit and human-readable
✅ Supports incremental learning (rules can be updated)

🔹 Disadvantages:
❌May overfit the training data
❌Conflict may occur when multiple rules match
❌ Rule generation may become complex if data is large or noisy

🔹 Example Use Cases:


 Medical diagnosis
 Fraud detection
 Spam email classification
 Customer segmentation
✅ Conclusion
Rule-Based Classification is a transparent and intuitive method that uses a set of
IF–THEN rules to predict class labels.
It is useful in domains where interpretability and clarity are important, making it
a popular choice in expert systems and knowledge-based applications.
5. CLASSIFICATION BY BACKPROPAGATION
🔹 Introduction
 Backpropagation is a supervised learning algorithm used for training
artificial neural networks (ANNs).
 It is mainly used in classification problems, such as image recognition,
speech recognition, and medical diagnosis.
 The term “backpropagation” stands for "backward propagation of error".

🔹 What is an Artificial Neural Network (ANN)?


 An ANN is a computational model inspired by the human brain.
 It consists of neurons (nodes) arranged in layers:
o Input Layer: Takes input features
o Hidden Layer(s): Performs intermediate processing
o Output Layer: Produces final class label or output

🔹 How Backpropagation Works


It has two main phases:
1. Forward Pass:
 The input is passed through the network layer by layer.
 The output is calculated using activation functions (like sigmoid or ReLU).
 The output is compared to the actual (target) value.
 Error is calculated using a loss function (e.g., Mean Squared Error).
2. Backward Pass (Backpropagation):
 The error is propagated backward from output to input.
 The network adjusts the weights of the connections to reduce the error.
 This is done using a method called Gradient Descent.

🔹 Steps in Backpropagation Algorithm


1. Initialize weights randomly.
2. Forward propagate input to compute output.
3. Calculate error (Loss) between predicted and actual output.
4. Backpropagate the error:
o Compute gradient of the loss with respect to each weight.
5. Update weights using gradient descent:
o w_new = w_old − η · (∂Error / ∂w)
o where η is the learning rate.
6. Repeat the process for multiple iterations (epochs) until error is minimized.
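
A minimal NumPy sketch of these steps for a tiny 2-4-1 network; the
XOR-style data, network size, learning rate, and epoch count are illustrative:

# Sketch: backpropagation with gradient descent (NumPy assumed)
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # targets (XOR)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))  # hidden -> output
eta = 0.5                                           # learning rate

for epoch in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error gradient through the sigmoids
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates: w_new = w_old - eta * dError/dw
    W2 -= eta * (h.T @ d_out); b2 -= eta * d_out.sum(axis=0, keepdims=True)
    W1 -= eta * (X.T @ d_h);   b1 -= eta * d_h.sum(axis=0, keepdims=True)

print(out.round(2))  # should move toward [[0], [1], [1], [0]] (varies with seed)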

🔹 Example:
If you're classifying handwritten digits (0–9), each image is input to the network,
and the output layer will have 10 nodes (one for each digit).
The node with the highest value is the predicted digit.

🔹 Advantages of Backpropagation:
✅Works well for complex, non-linear classification problems
✅Can be used for both binary and multi-class classification
✅Learns from data without needing explicit rules

🔹 Disadvantages:
❌Requires large amounts of training data
❌Can be computationally expensive
❌May get stuck in local minima
❌ Sensitive to choice of learning rate

✅ Conclusion
Backpropagation is the core algorithm for training neural networks and is widely
used for classification tasks in modern AI applications.
It uses error signals to adjust weights, helping the network to learn patterns in
the data effectively.

6. SUPPORT VECTOR MACHINES


🔹 Introduction
 Support Vector Machine (SVM) is a supervised machine learning algorithm
used for classification and regression tasks.
 It is especially powerful for binary classification problems.
 SVM aims to find the best decision boundary (also called hyperplane) that
separates the classes in the dataset.

🔹 Basic Concept
 SVM looks for a hyperplane that:
o Separates the data points of different classes with the maximum
margin.
o Margin = distance between the hyperplane and the nearest data
points from both classes (called support vectors).
 The wider the margin, the better the generalization on unseen data.

🔹 Important Terms

Term              Meaning

Hyperplane        A line (in 2D) or a plane (in higher dimensions) that
                  separates the classes

Support Vectors   The data points closest to the hyperplane; they “support”
                  the decision boundary

Margin            The distance between the hyperplane and the support vectors

🔹 Types of SVM
1. Linear SVM:
o Used when the data is linearly separable.
o Finds a straight-line hyperplane.
2. Non-Linear SVM:
o Used when data cannot be separated by a straight line.
o Uses a Kernel Trick to map data to a higher-dimensional space where
it can be separated linearly.
🔹 Kernel Functions
Kernels help transform data into higher dimensions. Common ones include:
 Linear Kernel
 Polynomial Kernel
 Radial Basis Function (RBF) / Gaussian Kernel
 Sigmoid Kernel

🔹 How SVM Works – Steps


1. Plot the data in a space.
2. Choose the best hyperplane that separates the classes with maximum
margin.
3. If not linearly separable, apply kernel function.
4. Use optimization to minimize classification error and maximize margin.
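
A minimal sketch of these steps, assuming scikit-learn is installed and
using a built-in toy dataset that is not linearly separable:

# Sketch: SVM with the RBF kernel trick (scikit-learn assumed)
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # non-linear data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel="rbf" maps the data implicitly to a higher-dimensional space;
# C trades off a wider margin against classification errors.
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))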

🔹 Advantages of SVM
✅ Works well in high-dimensional spaces
✅ Effective when number of features > number of samples
✅ Robust to overfitting (especially with proper kernel and regularization)
✅ Can handle non-linear data using kernels

🔹 Disadvantages
❌ Not suitable for large datasets (slow training)
❌ Choice of kernel and parameters is critical
❌ Doesn’t work well if classes overlap a lot
🔹 Example Use Cases
 Email spam detection
 Face recognition
 Disease diagnosis
 Text classification

✅ Conclusion
Support Vector Machines are powerful classifiers that aim to create a decision
boundary with maximum margin between classes.
They are highly accurate and work well in complex, high-dimensional
classification tasks, making them popular in many real-world applications.

7. Associative Classification (AC)


🔹 Introduction
 Associative Classification is a classification technique that combines two
data mining approaches:
1. Association Rule Mining
2. Classification
 Instead of using IF–THEN rules derived from decision trees, AC generates
classification rules using association rules.
 It is based on the idea that strong associations between attribute-value
pairs and class labels can be used to classify new data.

🔹 What is Association Rule Mining?


Association rule mining is about finding frequent patterns or relationships in data,
typically in the form:
IF A AND B THEN C
In Associative Classification, such rules must predict a class label:
IF (X1 = v1) AND (X2 = v2) THEN class = C
These are called Class Association Rules (CARs).
🔹 How Associative Classification Works
1. Mine all frequent itemsets where one item is a class label.
2. From those itemsets, generate class association rules (CARs).
3. Use a rule selection and ordering strategy (like confidence, support) to
select the best rules.
4. Apply rules to classify new data. The rule that matches the new record and
has the highest confidence is typically chosen.

🔹 Example:
Suppose we have transaction data about customers buying products and their
purchase category:
IF (buys = milk) AND (buys = bread) THEN class = grocery
This means: If a customer buys both milk and bread, they are likely classified
under “grocery” category.
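
A minimal sketch of applying already-mined CARs, in plain Python; the rules,
confidence values, and sample basket are illustrative:

# Sketch: classify with class association rules (highest confidence wins)
cars = [
    ({"milk", "bread"}, "grocery", 0.90),      # (items, class, confidence)
    ({"milk"}, "grocery", 0.70),
    ({"shampoo"}, "personal_care", 0.85),
]

def classify(basket, default="unknown"):
    # Keep rules whose item set is fully contained in the basket
    matches = [(conf, label) for items, label, conf in cars if items <= basket]
    return max(matches)[1] if matches else default

print(classify({"milk", "bread", "eggs"}))   # -> grocery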

🔹 Benefits of Associative Classification


✅Generates highly accurate rules
✅Uses frequent patterns found in the data
✅Rules are interpretable and easy to understand
✅More flexible than decision tree rules
🔹 Challenges / Disadvantages
❌Can generate too many rules → need pruning
❌Rule selection can be difficult when multiple rules match
❌Requires efficient mining of class association rules
🔹 Common Algorithms Used
 CBA (Classification Based on Associations)
 CMAR (Classification based on Multiple Association Rules)
 CPAR (Classification based on Predictive Association Rules)

🔹 Applications
 Market basket analysis
 Customer behavior prediction
 Text and document classification
 Medical diagnosis

✅ Conclusion
Associative Classification is a powerful hybrid method that builds accurate,
interpretable classifiers using association rules linked to class labels.
By combining the strengths of association mining and classification, it is well-
suited for domains that require strong rule-based prediction.

8. LAZY LEARNERS
🔹 Introduction
In machine learning, algorithms are generally categorized into two types based on
when they generalize the model:
 Eager Learners: Build the model during training (e.g., Decision Tree, SVM)
 Lazy Learners: Delay model building until a query (test instance) is made

🔹 What is a Lazy Learner?


 A Lazy Learner is a type of learning algorithm that stores the training data
and waits until a test (query) comes in to do any real computation.
 It does not build a global model in advance.
 It performs local generalization at prediction time.

🔹 Working of Lazy Learners


1. During training:
o It stores the entire training dataset as it is.
o No training or model construction happens.
2. During testing:
o It compares the new test instance to all stored training instances.
o It finds the most similar examples and uses them to make a
prediction.

🔹 Popular Lazy Learning Algorithm: K-Nearest Neighbors (KNN)


 KNN is the most common lazy learner.
 It predicts the class of a test instance based on the majority class of the 'k'
closest neighbors.
Example:
If a new fruit looks similar to 3 apples and 2 oranges in the dataset, and k=5, it is
classified as an apple.
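
A minimal sketch of this fruit example, assuming scikit-learn is installed;
the two numeric features (weight and a color score) are made up:

# Sketch: KNN lazy learner (scikit-learn assumed; features are made up)
from sklearn.neighbors import KNeighborsClassifier

X_train = [[150, 0.80], [160, 0.75], [155, 0.82],   # apples: [weight_g, redness]
           [120, 0.40], [115, 0.45]]                # oranges
y_train = ["apple", "apple", "apple", "orange", "orange"]

# "Training" merely stores the data; distances are computed at query time
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.predict([[148, 0.70]]))   # majority class of the 5 nearest neighbors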

🔹 Advantages of Lazy Learners


✅No need for long training time
✅Very simple to implement
✅Naturally handles dynamic data
✅Can model complex decision boundaries

🔹 Disadvantages of Lazy Learners


❌Slow at prediction time (because they search the entire dataset)
❌Need large memory to store all training data
❌Sensitive to irrelevant features or noise
❌Poor performance with large datasets

🔹 Comparison: Lazy vs Eager Learners

Feature            Lazy Learner          Eager Learner

Model Building     Delayed (at query)    Done during training

Example            KNN, Case-based       Decision Tree, Naive Bayes

Speed (Training)   Fast                  Slow

Speed (Testing)    Slow                  Fast

Memory Usage       High                  Usually Low

🔹 Applications
 Recommender systems
 Pattern recognition
 Handwriting and face recognition
 Medical diagnosis (case-based reasoning)
✅ Conclusion
Lazy Learners are a simple but powerful approach in machine learning that defer
computation until prediction time.
They are useful when training time is limited, but they require efficient search
and memory handling for large datasets.

9. OTHER CLASSIFICATION METHODS


🔹 Introduction
Classification is a fundamental task in machine learning, and several algorithms
are available to solve classification problems. Besides the popular ones like
Decision Trees, SVM, and Naïve Bayes, other classification methods include k-
Nearest Neighbors (KNN), Logistic Regression, Random Forests, and Artificial
Neural Networks (ANNs).

🔹 1. k-Nearest Neighbors (KNN)


 Instance-based learning method (a type of lazy learner).
 How it works: KNN classifies a test instance based on the majority class of
its 'k' nearest neighbors in the training dataset.
 Training process: No explicit model is built; the algorithm simply stores
the training data.
 Prediction process: For each new test instance, it finds the k closest
training instances (using distance metrics like Euclidean distance) and
assigns the most common class.
Advantages:
 Simple and intuitive
 Works well with small datasets
Disadvantages:
 Slow for large datasets
 Sensitive to irrelevant features

🔹 2. Logistic Regression
 Logistic Regression is a statistical method used for binary classification
(though it can be extended to multiclass classification).
 How it works: It estimates the probability that a given input belongs to a
certain class using the sigmoid function:
P(y=1∣X) = 1 / (1 + e^−(b0 + b1x1 + b2x2 + … + bnxn))
where y is the class label and X is the input vector.
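
A minimal sketch, assuming scikit-learn is installed; the one-feature dataset
below is made up:

# Sketch: logistic regression for binary classification (scikit-learn assumed)
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
# predict_proba applies the sigmoid 1 / (1 + e^-(b0 + b1*x)) internally
print(model.predict_proba([[3.5]]))   # [P(y=0), P(y=1)] for x = 3.5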
Advantages:
 Simple and efficient
 Interpretable model
Disadvantages:
 Assumes a linear relationship between inputs and output
 Performs poorly with highly non-linear data

🔹 3. Random Forest
 Random Forest is an ensemble learning method that uses multiple decision
trees.
 How it works: It creates many decision trees by randomly selecting subsets
of features and data. Then, for classification, it uses the majority vote from
all the trees.
 Random forests are a combination of bagging and random feature
selection.
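
A minimal sketch, assuming scikit-learn is installed and using a built-in
dataset:

# Sketch: random forest = bagging + random feature selection (scikit-learn assumed)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each fit on a bootstrap sample; prediction is the majority vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))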
Advantages:
 Reduces overfitting by averaging multiple trees
 Works well with both categorical and continuous data
Disadvantages:
 Complex and slower for prediction
 Difficult to interpret (black-box model)

🔹 4. Artificial Neural Networks (ANNs)


 ANNs are inspired by the structure of the human brain and consist of layers
of interconnected nodes (neurons).
 How it works: Each neuron in a layer is connected to neurons in the next
layer, and these connections have weights. The network learns by adjusting
these weights during training (typically using backpropagation and gradient
descent).
 ANNs are flexible models that can handle complex and non-linear
classification problems.
Advantages:
 Highly flexible and capable of modeling complex patterns
 Works well for large datasets in tasks like image, speech, and text
classification
Disadvantages:
 Computationally expensive
 Hard to interpret (black-box model)

🔹 5. Decision Trees (Alternative Method)


 A Decision Tree is a flowchart-like tree structure where:
o Each internal node represents a test on an attribute (e.g., is age >
30?).
o Each branch represents the outcome of the test.
o Each leaf node represents a class label.
Advantages:
 Easy to understand and interpret
 Works well with both categorical and numerical data
Disadvantages:
 Prone to overfitting, especially with noisy data
 Sensitive to small variations in the data (can cause different trees)

🔹 6. Naïve Bayes (Alternative Method)


 A probabilistic classifier based on Bayes' Theorem.
 How it works: Assumes that features are independent, and calculates the
probability of each class label given the input features. The class with the
highest probability is selected as the predicted label.
Advantages:
 Simple and fast
 Works well with high-dimensional data (e.g., text classification)
Disadvantages:
 Assumes feature independence, which is often not true
 Can perform poorly if the assumptions are violated

✅ Conclusion
In addition to popular classification methods like SVM and decision trees, there
are many other effective classification algorithms like KNN, logistic regression,
random forests, and artificial neural networks.
Each method has its strengths and weaknesses, and the choice of which to use
depends on the data characteristics, problem complexity, and interpretability
needs.

10. PREDICTION IN DATA MINING


🔹 Introduction
 Prediction is a data mining technique used to forecast future data values
or trends based on historical data.
 It is a type of supervised learning where the goal is to predict a continuous
(numerical) value, unlike classification which predicts categorical labels.

🔹 What is Prediction?
 Prediction involves building a model that maps input variables (features) to
a target output (numeric value).
Example: Predicting house prices, sales revenue, stock prices, or temperature.

🔹 Difference: Classification vs Prediction

Feature           Classification                Prediction (Regression)
Output type       Categorical (e.g., yes/no)    Continuous (e.g., price = 500)
Example           Spam or not spam              Predicting future sales
Algorithm type    Classification algorithms     Regression algorithms
🔹 Common Prediction Techniques


1. Linear Regression
o Finds a straight-line relationship between input (X) and output (Y).
o Equation:
Y = b0 + b1X1 + b2X2 + … + bnXn
2. Multiple Regression
o Like linear regression but uses multiple input variables to predict one
output variable.
3. Non-linear Regression
o Used when the relationship between input and output is non-linear.
4. Regression Trees
o Similar to decision trees, but used for predicting numeric values
instead of class labels.
5. Neural Networks
o Used for complex prediction tasks with non-linear relationships.

🔹 Steps in Prediction
1. Collect and clean data
2. Select relevant features
3. Choose a prediction model (e.g., linear regression)
4. Train the model using historical data
5. Test the model with new/unseen data
6. Use the model to predict future values
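
A minimal sketch of these steps, assuming scikit-learn is installed; the
house-size and price figures are made up:

# Sketch: numeric prediction with linear regression (scikit-learn assumed)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.array([[800], [1000], [1200], [1500], [1800], [2000]])   # size (sq ft)
y = np.array([120000, 150000, 175000, 215000, 255000, 280000])  # price

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)   # learns Y = b0 + b1*X

print(model.predict([[1600]]))   # forecast the price of a 1600 sq ft house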

🔹 Applications of Prediction
 Sales forecasting
 Weather prediction
 Stock market trends
 Risk assessment in finance and healthcare
 Predicting customer behavior

🔹 Advantages
✅Helps in decision-making using future forecasts
✅Identifies hidden patterns in data
✅Can lead to cost savings and better planning

🔹 Challenges
❌Requires clean and sufficient data
❌Models can be affected by noise or outliers
❌May become inaccurate over time if data changes

✅ Conclusion
Prediction is an essential data mining task used to forecast continuous values
based on patterns learned from past data.
It is widely used in business, science, and industry to support planning,
forecasting, and optimization.

11. Accuracy and Error Measures in Classification and Prediction


🔹 Introduction
In data mining and machine learning, once a model is built, it is
important to evaluate its performance.
We use accuracy and error measures to find out how well the
model is working.
There are two main goals in evaluation:
 For classification: How many predictions are correct or
incorrect?
 For prediction (regression): How close are the predicted
values to the actual values?

🔹 1. Accuracy (for Classification)


 Accuracy measures the percentage of correct predictions
made by the model.
Accuracy = (Number of Correct Predictions / Total Predictions) × 100
Example: If a model makes 80 correct predictions out of 100,
Accuracy = (80 / 100) × 100 = 80%

🔹 2. Confusion Matrix (for Classification)


A confusion matrix helps evaluate a classifier's performance
using four values:
                   Predicted Positive     Predicted Negative

Actual Positive    True Positive (TP)     False Negative (FN)

Actual Negative    False Positive (FP)    True Negative (TN)

From this, we can calculate:
 Precision = TP / (TP + FP) → How many predicted positives were correct
 Recall = TP / (TP + FN) → How many actual positives were correctly predicted
 F1-Score = Harmonic mean of Precision and Recall
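
A minimal sketch of computing these measures, assuming scikit-learn is
installed; the two label vectors are made up:

# Sketch: classification accuracy and related measures (scikit-learn assumed)
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model's predictions

print(confusion_matrix(y_true, y_pred))    # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))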

🔹 3. Error Rate
 Error Rate = 1 − Accuracy
It shows the percentage of wrong predictions.
Error Rate = (Number of Incorrect Predictions / Total Predictions) × 100

🔹 4. Measures for Prediction (Numerical Data)


For prediction or regression problems, we use different error
measures to compare predicted values and actual values:
a) Mean Absolute Error (MAE)
 Measures the average of absolute errors.
MAE = (1/n) · Σ (i=1 to n) |yi − ŷi|
Where:
 yi = actual value
 ŷi = predicted value
 n = total number of data points
b) Mean Squared Error (MSE)
 Measures the average of squared differences between actual
and predicted values.
MSE = (1/n) · Σ (i=1 to n) (yi − ŷi)²
c) Root Mean Squared Error (RMSE)
 The square root of MSE; it gives the error in the same unit as the
original data.
RMSE = √MSE
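
A minimal NumPy sketch of the three error measures; the actual and predicted
values are made up:

# Sketch: MAE, MSE, and RMSE (NumPy assumed)
import numpy as np

y_actual    = np.array([500.0, 520.0, 480.0, 610.0])
y_predicted = np.array([510.0, 505.0, 490.0, 600.0])

mae  = np.mean(np.abs(y_actual - y_predicted))   # average absolute error
mse  = np.mean((y_actual - y_predicted) ** 2)    # average squared error
rmse = np.sqrt(mse)                              # same units as the data

print(mae, mse, rmse)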

🔹 5. R-Squared (R²) – for Regression


 Measures how well the predicted values fit the actual data.
 Values typically range from 0 to 1; the closer to 1, the better
the fit.
✅ Conclusion

Accuracy and error measures are important to check how well a model
performs.
They help us understand if a model is good, needs improvement, or is
overfitting/underfitting.
Different types of models use different measures depending on whether the
output is a label or a number.
12. Ensemble Methods
🔹 Introduction
 Ensemble methods are techniques that combine multiple models (called
weak learners) to create a stronger and more accurate model.
 The idea is that a group of weak models can perform better together than
any single model alone.

🔹 Why Use Ensemble Methods?


 A single classifier may make errors.
 Combining several classifiers helps:
o Increase accuracy
o Reduce overfitting
o Handle noise or complex patterns in data

🔹 Types of Ensemble Methods


1. Bagging (Bootstrap Aggregating)
 Multiple models are trained using different random samples of the original
dataset (with replacement).
 Their results are combined (usually by voting).
 Reduces variance and improves stability.
✅ Popular Example: Random Forest

2. Boosting
 Models are trained sequentially, each one learning from the errors of the
previous one.
 Focuses more on difficult cases.
 Final output is a weighted sum of all models.
✅ Popular Examples: AdaBoost, Gradient Boosting, XGBoost

3. Stacking (Stacked Generalization)


 Combines multiple models (called base learners), and then uses another
model (called a meta-learner) to make the final prediction.
 Can use different types of algorithms together.
✅ Example: Combining Decision Tree, SVM, and Logistic Regression, with a Neural
Network as a meta-learner.
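
A minimal sketch of all three approaches, assuming scikit-learn is installed
and using a built-in dataset:

# Sketch: bagging, boosting, and stacking (scikit-learn assumed)
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "bagging (random forest)": RandomForestClassifier(random_state=0),
    "boosting (AdaBoost)": AdaBoostClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
        final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
    ),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))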

🔹 Comparison Table
Method     Approach                        Key Benefit
Bagging    Parallel, uses random subsets   Reduces variance
Boosting   Sequential, focuses on errors   Improves accuracy
Stacking   Combines different models       Uses model diversity

🔹 Advantages of Ensemble Methods


✅Higher accuracy than individual models
✅Better generalization to new data
✅ Less likely to overfit (especially with bagging)

🔹 Disadvantages
❌More complex and slower
❌Harder to interpret
❌ Requires more resources (memory, computation)

✅ Conclusion
Ensemble methods are powerful techniques in data mining and machine learning.
By combining multiple weak models, they achieve better performance,
robustness, and accuracy than single classifiers.

13. Model Selection in Data Mining


🔹 Introduction
Model selection is the process of choosing the best model from a set of candidate
models for solving a specific data mining problem (such as classification or
prediction).
It helps ensure the model is accurate, efficient, and generalizes well to unseen
data.

🔹 Why Model Selection is Important


 There are many algorithms available (e.g., decision tree, SVM, neural
networks).
 Not every model works well for every type of dataset.
 We need to select the model that gives the best performance based on the
problem and data.

🔹 Steps in Model Selection


1. Define the Problem
o Is it classification, regression, clustering, etc.?
o Understand the data type, target variable, and business goal.
2. Choose Candidate Models
o Pick a few algorithms that are suitable (e.g., Decision Tree, SVM,
Naive Bayes for classification).
3. Split the Dataset
o Use training and testing sets (e.g., 70% training, 30% testing).
o Optionally use cross-validation for better evaluation.
4. Train the Models
o Fit the candidate models to the training data.
5. Evaluate Models
o Use evaluation metrics:
 For classification: Accuracy, Precision, Recall, F1-score
 For prediction: MAE, MSE, RMSE, R²
6. Compare Performance
o Choose the model that performs best on testing data.
o Consider both accuracy and complexity.
7. Select the Best Model
o Finalize the model and tune its hyperparameters if needed.

🔹 Model Selection Techniques


✅ A. Hold-Out Method
 Divide dataset into training and testing.
 Simple and fast.
✅ B. Cross-Validation
 Data is split into k folds.
 Each fold is used once as test, rest as training.
 More reliable.
✅ C. Bootstrapping
 Random sampling with replacement to create training sets.
 Useful with small datasets.
🔹 Bias-Variance Trade-off
 Good model selection balances:
o Bias (error due to simplifying assumptions)
o Variance (sensitivity to small changes in data)
A good model has low bias and low variance.
🔹 Factors Affecting Model Selection
 Size of data
 Noise in data
 Interpretability of model
 Training time and resources
 Scalability

🔹 Example
For email spam detection:
 Try models like Naive Bayes, Decision Tree, SVM.
 Use 10-fold cross-validation.
 Evaluate each model and choose the one with the highest accuracy and
lowest error.
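
A minimal sketch of this comparison, assuming scikit-learn is installed;
a built-in dataset stands in for the spam data:

# Sketch: model selection via 10-fold cross-validation (scikit-learn assumed)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")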

✅ Conclusion
Model selection is a critical step in data mining.
By carefully comparing models using appropriate evaluation techniques, we can
choose a model that gives the best performance for our task while ensuring it
generalizes well to new data.
