
DATA MINING FOR BUSINESS

ANALYTICS IN R

Galit Shmueli, Peter C. Bruce, Inbal Yahav,
Nitin R. Patel, Kenneth C. Lichtendahl, Jr.

Indian Adaptation by
O.P. Wali, Professor, Indian Institute of Foreign Trade

Copyright © 2022 by John Wiley & Sons, Inc. All rights reserved.
CHAPTER 4

Dimension Reduction
4.1 INTRODUCTION

 In data mining, one often encounters situations where there is a large number of variables in the database.
 Even when the initial number of variables is small, this set quickly expands in the data preparation step, where new
derived variables are created.
 Including highly correlated variables in a classification or prediction model, or including variables that are
unrelated to the outcome of interest, can lead to overfitting, and accuracy and reliability can suffer.
 A large number of variables also poses computational problems for some supervised as well as unsupervised
algorithms.
 In model deployment, superfluous variables can increase costs due to the collection and processing of these
variables.
4.2 CURSE OF DIMENSIONALITY
 The dimensionality of a model is the number of predictors or input variables used by the model.
 The curse of dimensionality is the affliction caused by adding variables to multivariate data models.
 As variables are added, the data space becomes increasingly sparse, and classification and prediction models fail
because the available data are insufficient to provide a useful model across so many variables.
 An important consideration is the fact that the difficulties posed by adding a variable increase exponentially with
the addition of each variable.
 One way to think of this intuitively is to consider the location of an object on a chessboard: two coordinates locate it among 64 squares, but each added dimension multiplies the number of possible locations, so a fixed amount of data covers an ever smaller fraction of the space.
 In statistical distance terms, the proliferation of variables means that nothing is close to anything else anymore—
too much noise has been added and patterns and structure are no longer discernible.
 One of the key steps in data mining, therefore, is finding ways to reduce dimensionality with minimal sacrifice of
accuracy.
 In the artificial intelligence literature, dimension reduction is often referred to as feature selection or feature
extraction.
4.3 PRACTICAL CONSIDERATIONS

❖ Example 1: Service Feedback and Usage


4.4 DATA SUMMARIES

 Summary Statistics
 R has several functions and facilities that assist in summarizing data.
 The function summary() gives an overview of the entire set of variables in the data. The functions mean(), sd(),
min(), max(), median(), and length() are also very helpful for learning about the characteristics of each variable.
 The mean and median give a sense of the central values of that variable, and a large deviation between the two
also indicates skew.
 The standard deviation gives a sense of how dispersed the data are.
 For numerical variables, we can compute a complete matrix of correlations between each pair of variables, using
the R function cor().
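A minimal R sketch of these functions (the data frame and variable names below are made up for illustration; substitute your own data):

# Hypothetical data frame; replace with your own dataset
housing.df <- data.frame(MEDV = c(24.0, 21.6, 34.7, 33.4, 36.2),
                         AGE  = c(65.2, 78.9, 61.1, 45.8, 54.2))

summary(housing.df)        # overview of all variables at once
mean(housing.df$MEDV)      # central value
median(housing.df$MEDV)    # compare with the mean: a large gap indicates skew
sd(housing.df$MEDV)        # dispersion
min(housing.df$MEDV)
max(housing.df$MEDV)
length(housing.df$MEDV)    # number of values
round(cor(housing.df), 2)  # matrix of pairwise correlations (see Section 4.5)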
4.4 DATA SUMMARIES

 Aggregation and Pivot Tables

 Another very useful approach for exploring the data is aggregation by one or more variables.

 For aggregation by a single variable, we can use table() (see the sketch below).

 The aggregate() function can be used for aggregating one or more variables, and computing a range of summary
statistics (count, average, percentage, etc.).
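A brief sketch of table() and aggregate() on a small made-up data frame (all names are hypothetical):

# Hypothetical data: customer usage by region
df <- data.frame(region = c("N", "N", "S", "S", "E"),
                 usage  = c(10, 12, 7, 9, 11))

table(df$region)                                    # counts by a single variable
aggregate(usage ~ region, data = df, FUN = mean)    # average usage per region
aggregate(usage ~ region, data = df, FUN = length)  # counts per region via aggregate()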
4.5 CORRELATION ANALYSIS

 In datasets with a large number of variables (which are likely to serve as predictors), there is usually much overlap
in the information covered by the set of variables.
 One simple way to find redundancies is to look at a correlation matrix.
 Pairs that have a very strong (positive or negative) correlation contain a lot of overlap in information and are
good candidates for data reduction by removing one of the variables.
 Correlation analysis is also a good method for detecting duplicate variables in the data: the same variable may
accidentally appear more than once in the dataset (under a different name) because the dataset was merged from
multiple sources, the same phenomenon is measured in different units, and so on.
 Using correlation table heatmaps can make the task of identifying strong correlations easier (see the sketch below).
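A minimal sketch of a correlation matrix and a quick base-R heatmap (the data are synthetic; in practice apply cor() to the numerical columns of your dataset):

# Synthetic data containing one nearly duplicated variable
set.seed(1)
x1 <- rnorm(100)
num.df <- data.frame(x1 = x1,
                     x2 = x1 + rnorm(100, sd = 0.1),   # almost a copy of x1
                     x3 = rnorm(100))

round(cor(num.df), 2)   # pairs with |r| close to 1 are candidates for removal
heatmap(cor(num.df), Rowv = NA, Colv = NA, scale = "none")   # simple correlation heatmap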
4.6 REDUCING THE NUMBER OF CATEGORIES IN CATEGORICAL
VARIABLES

 When a categorical variable has many categories and is destined to be a predictor, many data mining
methods will require converting it into many dummy variables.
 A variable with m categories will be transformed into either m or m – 1 dummy variables (depending on the
method).
 Even if we have very few original categorical variables, they can greatly inflate the dimension of the dataset. One
way to handle this is to reduce the number of categories by combining close or similar categories.
 Combining categories requires incorporating expert knowledge and common sense.
 Categories that contain very few observations are good candidates for combining with other categories.
 In classification tasks (with a categorical outcome variable), a pivot table broken down by the outcome classes can
help identify categories that do not separate the classes (see the sketch below).
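As an illustration of the pivot-table idea, here is a sketch on synthetic data (the names RP and UL echo Figure 4.1, but the values are made up):

# Synthetic classification data: categorical predictor RP, binary outcome UL
set.seed(2)
df <- data.frame(RP = sample(c("A", "B", "C", "D"), 200, replace = TRUE),
                 UL = sample(0:1, 200, replace = TRUE))

# Class proportions within each category: categories with similar proportions
# do not separate the classes and are candidates for combining
round(prop.table(table(df$RP, df$UL), margin = 1), 2)

# Combine two similar (or sparse) categories into one
df$RP <- ifelse(df$RP %in% c("C", "D"), "C_D", as.character(df$RP))
table(df$RP)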
4.6 REDUCING THE NUMBER OF CATEGORIES IN CATEGORICAL
VARIABLES

FIGURE 4.1 Distribution of UL (black denotes low usage, coded as “0”) across the categories of RP; categories with
similar usage levels can be combined, so RP can be transformed into three categories
4.7 CONVERTING A CATEGORICAL VARIABLE TO A NUMERICAL
VARIABLE

 Sometimes the categories in a categorical variable represent intervals. Common examples are age group or
income bracket. If the interval values are known, we can replace the categorical value with the mid-interval value
(for example, the midpoint of an age group of 20–30 is 25). The result is a numerical variable that no longer requires multiple dummy variables (see the sketch below).
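A tiny sketch of the mid-interval replacement (the interval labels and midpoints are hypothetical):

# Age groups stored as interval labels
age.group <- c("20-30", "30-40", "20-30", "40-50")

# Replace each interval with its midpoint to obtain a single numerical variable
midpoints   <- c("20-30" = 25, "30-40" = 35, "40-50" = 45)
age.numeric <- as.numeric(midpoints[age.group])
age.numeric   # 25 35 25 45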

FIGURE 4.2 Quarterly revenues of Toys “R” Us, 1992–1995


4.8 PRINCIPAL COMPONENTS ANALYSIS

 Principal components analysis (PCA) is a useful method for dimension reduction, especially when the number of
variables is large.
 PCA is especially valuable when we have subsets of measurements that are measured on the same scale and are
highly correlated.
 PCA is intended for use with numerical variables.
 For categorical variables, other methods such as correspondence analysis are more suitable.
4.8 PRINCIPAL COMPONENTS ANALYSIS
❖ Example 2: Breakfast Cereals
Data were collected on the nutritional information and consumer rating of 77 breakfast cereals. The consumer
rating is a rating of cereal “healthiness” for consumer information. For each cereal, the data include 13 numerical
variables, and we are interested in reducing this dimension.
We focus first on two variables: calories and consumer rating. These are given in Table 4.9. The average calories across
the 77 cereals is 106.88 and the average consumer rating is 42.67. The estimated covariance matrix between the two
variables is

          calories    rating
calories    379.63   -188.68
rating     -188.68    197.32
It can be seen that the two variables are strongly correlated, with a negative correlation of -0.69.
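These summary values can be reproduced in R along the following lines (the file name cereals.csv and the column names calories and rating are assumptions about how the data are stored):

cereals.df <- read.csv("cereals.csv")           # assumed file name

mean(cereals.df$calories)                       # about 106.88
mean(cereals.df$rating)                         # about 42.67
cov(cereals.df[, c("calories", "rating")])      # estimated covariance matrix
cor(cereals.df$calories, cereals.df$rating)     # about -0.69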
4.8 PRINCIPAL COMPONENTS ANALYSIS

FIGURE 4.4 Scatter plot of rating vs. calories for 77 breakfast cereals, with the two principal component directions
4.8 PRINCIPAL COMPONENTS ANALYSIS

❖ Principal Components
Let us formalize the procedure described above so that it can easily be generalized to p > 2 variables. Denote the
original p variables by X1 , X2 , . . ., Xp . In PCA, we are looking for a set of new variables Z1 , Z2 , . . ., Zp that are
weighted averages of the original variables (after subtracting their mean):

Z1 = a1,1(X1 − X̄1) + a1,2(X2 − X̄2) + . . . + a1,p(Xp − X̄p)
Z2 = a2,1(X1 − X̄1) + a2,2(X2 − X̄2) + . . . + a2,p(Xp − X̄p)
. . .
Zp = ap,1(X1 − X̄1) + ap,2(X2 − X̄2) + . . . + ap,p(Xp − X̄p)

where each pair of Z’s has correlation = 0. We then order the resulting Z’s by their variance, with Z1 having the
largest variance and Zp having the smallest variance. The software computes the weights ai,j , which are then used in
computing the principal component scores.
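In R, the weights and scores can be obtained with prcomp(); a sketch continuing the cereal example (cereals.df and its column names as assumed above):

pcs <- prcomp(na.omit(cereals.df[, c("calories", "rating")]))

pcs$rotation   # the weights ai,j that define each principal component
pcs$sdev^2     # component variances, ordered from largest (Z1) to smallest
head(pcs$x)    # principal component scores for the first records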
4.8 PRINCIPAL COMPONENTS ANALYSIS

❖ Normalizing the Data


▪ A further use of PCA is to understand the structure of the data.
▪ This is done by examining the weights to see how the original variables contribute to the different principal
components.
▪ Normalization (or standardization) means replacing each original variable by a standardized version of the variable
that has unit variance.
▪ The effect of this normalization is to give all variables equal importance in terms of variability.
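A sketch of normalized PCA on all numerical cereal variables (continuing the assumed cereals.df; scale. = TRUE standardizes each variable to unit variance):

cereals.num <- cereals.df[, sapply(cereals.df, is.numeric)]   # keep numerical columns only
pcs.norm <- prcomp(na.omit(cereals.num), scale. = TRUE)

summary(pcs.norm)            # proportion of variance explained by each component
round(pcs.norm$rotation, 2)  # weights show how the original variables load on each component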
4.8 PRINCIPAL COMPONENTS ANALYSIS

❖ Using Principal Components for Classification and Prediction


▪ When the goal of the data reduction is to have a smaller set of variables that will serve as predictors, we can
proceed as follows (see the sketch after this list):
▪ Apply PCA to the predictors using the training data.
▪ Use the output to determine the number of principal components to be retained.
▪ The predictors in the model now use the (reduced number of) principal score columns.
▪ Apply the same weights (e.g., via predict() on the fitted PCA object) to the validation or new data to obtain their principal scores before scoring the model.
▪ One disadvantage of using a subset of principal components as predictors in a supervised task is that we might
lose predictive information that is nonlinear.
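A self-contained sketch of this procedure on synthetic data (all names are made up); the key point is that PCA is fitted on the training predictors only, and the same rotation is then applied to the validation data:

set.seed(3)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$x4 <- df$x1 + rnorm(100, sd = 0.1)    # a highly correlated predictor
df$y  <- 2 * df$x1 - df$x2 + rnorm(100)

train <- df[1:70, ]
valid <- df[71:100, ]

pca <- prcomp(train[, c("x1", "x2", "x3", "x4")], scale. = TRUE)   # PCA on training predictors only
k <- 2                                                             # number of components retained
train.scores <- data.frame(pca$x[, 1:k], y = train$y)
fit <- lm(y ~ ., data = train.scores)                              # model built on the score columns

valid.scores <- data.frame(predict(pca, newdata = valid)[, 1:k])   # same weights applied to new data
pred <- predict(fit, newdata = valid.scores)
head(pred)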
4.9 DIMENSION REDUCTION USING REGRESSION MODELS

▪ Fitted regression models can also be used to further combine similar categories:
▪ Categories that have coefficients that are not statistically significant can be combined with the reference category,
because their distinction from the reference category appears to have no significant effect on the outcome
variable.
▪ Categories that have similar coefficient values (and the same sign) can often be combined, because their effect on
the outcome variable is similar.
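A hedged sketch of reading such hints from the coefficient table (synthetic data; the category names are arbitrary):

set.seed(4)
df <- data.frame(cat = factor(sample(c("A", "B", "C", "D"), 300, replace = TRUE)))
df$y <- ifelse(df$cat %in% c("C", "D"), 5, 0) + rnorm(300)   # C and D behave alike; B behaves like the reference A

fit <- lm(y ~ cat, data = df)
round(summary(fit)$coefficients, 3)
# expect catB to be near zero / not significant  -> candidate to merge with the reference category A
# expect catC and catD to have similar positive coefficients -> candidates to merge with each other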
4.10 DIMENSION REDUCTION USING CLASSIFICATION AND
REGRESSION TREES

 Another method for reducing the number of columns, and for combining categories of a categorical variable, is to
apply classification and regression trees.
 Classification trees are used for classification tasks and regression trees for prediction tasks.
 In both cases, the algorithm creates binary splits on the predictors that best classify/predict the outcome variable.
 Predictors (numerical or categorical) that do not appear in the tree can be removed. Similarly, categories that do
not appear in the tree can be combined.
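A small sketch with the rpart package (assumed to be installed); predictors that never appear in the fitted tree are candidates for removal:

library(rpart)

set.seed(5)
df <- data.frame(x1  = rnorm(200),
                 x2  = rnorm(200),
                 cat = factor(sample(c("A", "B", "C"), 200, replace = TRUE)))
df$y <- factor(ifelse(df$x1 + (df$cat == "A") > 0.5, "yes", "no"))

fit <- rpart(y ~ ., data = df, method = "class")
unique(as.character(fit$frame$var))   # split variables used by the tree ("<leaf>" marks terminal nodes)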
