
DATA MINING FOR BUSINESS

ANALYTICS IN R

Galit Shmueli, Peter C. Bruce, Inbal Yahav,
Nitin R. Patel, Kenneth C. Lichtendahl, Jr.

Indian Adaptation by
O.P. Wali, Professor, Indian Institute of Foreign Trade

Copyright © 2022 by John Wiley & Sons, Inc. All rights reserved.
CHAPTER 4

Dimension Reduction
4.1 INTRODUCTION

 In data mining, one often encounters situations where there is a large number of variables in the database.
 Even when the initial number of variables is small, this set quickly expands in the data preparation step, where new
derived variables are created.
 Including highly correlated variables in a classification or prediction model, or including variables that are
unrelated to the outcome of interest, can lead to overfitting, and accuracy and reliability can suffer.
 A large number of variables also poses computational problems for some supervised as well as unsupervised
algorithms.
 In model deployment, superfluous variables can increase costs due to the collection and processing of these
variables.
4.2 CURSE OF DIMENSIONALITY
 The dimensionality of a model is the number of predictors or input variables used by the model.
 The curse of dimensionality is the affliction caused by adding variables to multivariate data models.
 As variables are added, the data space becomes increasingly sparse, and classification and prediction models fail
because the available data are insufficient to provide a useful model across so many variables.
 An important consideration is the fact that the difficulties posed by adding a variable increase exponentially with
the addition of each variable.
 One way to think of this intuitively is to consider the location of an object on a chessboard: two coordinates locate it among 64 squares, but each added dimension multiplies the number of possible locations, so a fixed amount of data covers an ever smaller fraction of the space.
 In statistical distance terms, the proliferation of variables means that nothing is close to anything else anymore—
too much noise has been added and patterns and structure are no longer discernible.
 One of the key steps in data mining, therefore, is finding ways to reduce dimensionality with minimal sacrifice of
accuracy.
 In the artificial intelligence literature, dimension reduction is often referred to as feature selection or feature
extraction.
4.3 PRACTICAL CONSIDERATIONS

❖ Example 1: Service Feedback and Usage


4.4 DATA SUMMARIES

 Summary Statistics
 R has several functions and facilities that assist in summarizing data.
 The function summary() gives an overview of the entire set of variables in the data. The functions mean(), sd(),
min(), max(), median(), and length() are also very helpful for learning about the characteristics of each variable.
 The mean and median give a sense of the central values of that variable, and a large deviation between the two
also indicates skew.
 The standard deviation gives a sense of how dispersed the data are.
 For numerical variables, we can compute a complete matrix of correlations between each pair of variables, using
the R function cor().
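A minimal R sketch of these functions (the data frame and variable names below are made up for illustration; substitute your own data):

# Hypothetical data frame; replace with your own dataset
housing.df <- data.frame(MEDV = c(24.0, 21.6, 34.7, 33.4, 36.2),
                         AGE  = c(65.2, 78.9, 61.1, 45.8, 54.2))

summary(housing.df)        # overview of all variables at once
mean(housing.df$MEDV)      # central value
median(housing.df$MEDV)    # compare with the mean: a large gap indicates skew
sd(housing.df$MEDV)        # dispersion
min(housing.df$MEDV)
max(housing.df$MEDV)
length(housing.df$MEDV)    # number of values
round(cor(housing.df), 2)  # matrix of pairwise correlations (see Section 4.5)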
4.4 DATA SUMMARIES

 Aggregation and Pivot Tables

 Another very useful approach for exploring the data is aggregation by one or more variables.

 For aggregation by a single variable, we can use table() (see the sketch below).

 The aggregate() function can be used for aggregating one or more variables, and computing a range of summary
statistics (count, average, percentage, etc.).
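A brief sketch of table() and aggregate() on a small made-up data frame (all names are hypothetical):

# Hypothetical data: customer usage by region
df <- data.frame(region = c("N", "N", "S", "S", "E"),
                 usage  = c(10, 12, 7, 9, 11))

table(df$region)                                    # counts by a single variable
aggregate(usage ~ region, data = df, FUN = mean)    # average usage per region
aggregate(usage ~ region, data = df, FUN = length)  # counts per region via aggregate()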
4.5 CORRELATION ANALYSIS

 In datasets with a large number of variables (which are likely to serve as predictors), there is usually much overlap
in the information covered by the set of variables.
 One simple way to find redundancies is to look at a correlation matrix.
 Pairs that have a very strong (positive or negative) correlation contain a lot of overlap in information and are
good candidates for data reduction by removing one of the variables.
 Correlation analysis is also a good method for detecting duplicate variables in the data: the same variable may
accidentally appear more than once in the dataset (under a different name) because the dataset was merged from
multiple sources, the same phenomenon is measured in different units, and so on.
 Using correlation table heatmaps can make the task of identifying strong correlations easier (see the sketch below).
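A minimal sketch of a correlation matrix and a quick base-R heatmap (the data are synthetic; in practice apply cor() to the numerical columns of your dataset):

# Synthetic data containing one nearly duplicated variable
set.seed(1)
x1 <- rnorm(100)
num.df <- data.frame(x1 = x1,
                     x2 = x1 + rnorm(100, sd = 0.1),   # almost a copy of x1
                     x3 = rnorm(100))

round(cor(num.df), 2)   # pairs with |r| close to 1 are candidates for removal
heatmap(cor(num.df), Rowv = NA, Colv = NA, scale = "none")   # simple correlation heatmap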
4.6 REDUCING THE NUMBER OF CATEGORIES IN CATEGORICAL
VARIABLES

 When a categorical variable has many categories and is destined to be a predictor, many data mining
methods will require converting it into many dummy variables.
 A variable with m categories will be transformed into either m or m – 1 dummy variables (depending on the
method).
 Even if we have very few original categorical variables, they can greatly inflate the dimension of the dataset. One
way to handle this is to reduce the number of categories by combining close or similar categories.
 Combining categories requires incorporating expert knowledge and common sense.
 Categories that contain very few observations are good candidates for combining with other categories.
 In classification tasks (with a categorical outcome variable), a pivot table broken down by the outcome classes can
help identify categories that do not separate the classes (see the sketch below).
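As an illustration of the pivot-table idea, here is a sketch on synthetic data (the names RP and UL echo Figure 4.1, but the values are made up):

# Synthetic classification data: categorical predictor RP, binary outcome UL
set.seed(2)
df <- data.frame(RP = sample(c("A", "B", "C", "D"), 200, replace = TRUE),
                 UL = sample(0:1, 200, replace = TRUE))

# Class proportions within each category: categories with similar proportions
# do not separate the classes and are candidates for combining
round(prop.table(table(df$RP, df$UL), margin = 1), 2)

# Combine two similar (or sparse) categories into one
df$RP <- ifelse(df$RP %in% c("C", "D"), "C_D", as.character(df$RP))
table(df$RP)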
4.6 REDUCING THE NUMBER OF CATEGORIES IN CATEGORICAL
VARIABLES

FIGURE 4.1 Distribution of UL (black denotes low usage, coded as “0”) across the categories of RP; categories with
similar usage levels can be combined, so RP can be transformed into three categories
4.7 CONVERTING A CATEGORICAL VARIABLE TO A NUMERICAL
VARIABLE

 Sometimes the categories in a categorical variable represent intervals. Common examples are age group or
income bracket. If the interval values are known, we can replace the categorical value with the mid-interval value
(for example, the midpoint of an age group of 20–30 is 25). The result is a numerical variable that no longer requires multiple dummy variables (see the sketch below).
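A tiny sketch of the mid-interval replacement (the interval labels and midpoints are hypothetical):

# Age groups stored as interval labels
age.group <- c("20-30", "30-40", "20-30", "40-50")

# Replace each interval with its midpoint to obtain a single numerical variable
midpoints   <- c("20-30" = 25, "30-40" = 35, "40-50" = 45)
age.numeric <- as.numeric(midpoints[age.group])
age.numeric   # 25 35 25 45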

FIGURE 4.2 Quarterly revenues of Toys “R” Us, 1992–1995


4.8 PRINCIPAL COMPONENTS ANALYSIS

 Principal components analysis (PCA) is a useful method for dimension reduction, especially when the number of
variables is large.
 PCA is especially valuable when we have subsets of measurements that are measured on the same scale and are
highly correlated.
 PCA is intended for use with numerical variables.
 For categorical variables, other methods such as correspondence analysis are more suitable.
4.8 PRINCIPAL COMPONENTS ANALYSIS
❖ Example 2: Breakfast Cereals
Data were collected on the nutritional information and consumer rating of 77 breakfast cereals. The consumer
rating is a rating of cereal “healthiness” for consumer information. For each cereal, the data include 13 numerical
variables, and we are interested in reducing this dimension.
We focus first on two variables: calories and consumer rating. These are given in Table 4.9. The average calories across
the 77 cereals is 106.88 and the average consumer rating is 42.67. The estimated covariance matrix between the two
variables is

          calories    rating
calories    379.63   -188.68
rating     -188.68    197.32
It can be seen that the two variables are strongly correlated, with a negative correlation of -0.69.
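These summary values can be reproduced in R along the following lines (the file name cereals.csv and the column names calories and rating are assumptions about how the data are stored):

cereals.df <- read.csv("cereals.csv")           # assumed file name

mean(cereals.df$calories)                       # about 106.88
mean(cereals.df$rating)                         # about 42.67
cov(cereals.df[, c("calories", "rating")])      # estimated covariance matrix
cor(cereals.df$calories, cereals.df$rating)     # about -0.69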
4.8 PRINCIPAL COMPONENTS ANALYSIS

FIGURE 4.4 Scatter plot of rating vs. calories for 77 breakfast cereals, with the two principal component directions
4.8 PRINCIPAL COMPONENTS ANALYSIS

❖ Principal Components
Let us formalize the procedure described above so that it can easily be generalized to p > 2 variables. Denote the
original p variables by X1 , X2 , . . ., Xp . In PCA, we are looking for a set of new variables Z1 , Z2 , . . ., Zp that are
weighted averages of the original variables (after subtracting their mean):

Z1 = a1,1(X1 − X̄1) + a1,2(X2 − X̄2) + . . . + a1,p(Xp − X̄p)
Z2 = a2,1(X1 − X̄1) + a2,2(X2 − X̄2) + . . . + a2,p(Xp − X̄p)
. . .
Zp = ap,1(X1 − X̄1) + ap,2(X2 − X̄2) + . . . + ap,p(Xp − X̄p)

where each pair of Z’s has correlation = 0. We then order the resulting Z’s by their variance, with Z1 having the
largest variance and Zp having the smallest variance. The software computes the weights ai,j , which are then used in
computing the principal component scores.
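In R, the weights and scores can be obtained with prcomp(); a sketch continuing the cereal example (cereals.df and its column names as assumed above):

pcs <- prcomp(na.omit(cereals.df[, c("calories", "rating")]))

pcs$rotation   # the weights ai,j that define each principal component
pcs$sdev^2     # component variances, ordered from largest (Z1) to smallest
head(pcs$x)    # principal component scores for the first records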
4.8 PRINCIPAL COMPONENTS ANALYSIS

❖ Normalizing the Data


▪ A further use of PCA is to understand the structure of the data.
▪ This is done by examining the weights to see how the original variables contribute to the different principal
components.
▪ Normalization (or standardization) means replacing each original variable by a standardized version of the variable
that has unit variance.
▪ The effect of this normalization is to give all variables equal importance in terms of variability.
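A sketch of normalized PCA on all numerical cereal variables (continuing the assumed cereals.df; scale. = TRUE standardizes each variable to unit variance):

cereals.num <- cereals.df[, sapply(cereals.df, is.numeric)]   # keep numerical columns only
pcs.norm <- prcomp(na.omit(cereals.num), scale. = TRUE)

summary(pcs.norm)            # proportion of variance explained by each component
round(pcs.norm$rotation, 2)  # weights show how the original variables load on each component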
4.8 PRINCIPAL COMPONENTS ANALYSIS

❖ Using Principal Components for Classification and Prediction


▪ When the goal of the data reduction is to have a smaller set of variables that will serve as predictors, we can
proceed as follows (see the sketch after this list):
▪ Apply PCA to the predictors using the training data.
▪ Use the output to determine the number of principal components to be retained.
▪ The predictors in the model now use the (reduced number of) principal score columns.
▪ Apply the same weights (e.g., via predict() on the fitted PCA object) to the validation or new data to obtain their principal scores before scoring the model.
▪ One disadvantage of using a subset of principal components as predictors in a supervised task is that we might
lose predictive information that is nonlinear.
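A self-contained sketch of this procedure on synthetic data (all names are made up); the key point is that PCA is fitted on the training predictors only, and the same rotation is then applied to the validation data:

set.seed(3)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$x4 <- df$x1 + rnorm(100, sd = 0.1)    # a highly correlated predictor
df$y  <- 2 * df$x1 - df$x2 + rnorm(100)

train <- df[1:70, ]
valid <- df[71:100, ]

pca <- prcomp(train[, c("x1", "x2", "x3", "x4")], scale. = TRUE)   # PCA on training predictors only
k <- 2                                                             # number of components retained
train.scores <- data.frame(pca$x[, 1:k], y = train$y)
fit <- lm(y ~ ., data = train.scores)                              # model built on the score columns

valid.scores <- data.frame(predict(pca, newdata = valid)[, 1:k])   # same weights applied to new data
pred <- predict(fit, newdata = valid.scores)
head(pred)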
4.9 DIMENSION REDUCTION USING REGRESSION MODELS

▪ Fitted regression models can also be used to further combine similar categories:
▪ Categories that have coefficients that are not statistically significant can be combined with the reference category,
because their distinction from the reference category appears to have no significant effect on the outcome
variable.
▪ Categories that have similar coefficient values (and the same sign) can often be combined, because their effect on
the outcome variable is similar.
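A hedged sketch of reading such hints from the coefficient table (synthetic data; the category names are arbitrary):

set.seed(4)
df <- data.frame(cat = factor(sample(c("A", "B", "C", "D"), 300, replace = TRUE)))
df$y <- ifelse(df$cat %in% c("C", "D"), 5, 0) + rnorm(300)   # C and D behave alike; B behaves like the reference A

fit <- lm(y ~ cat, data = df)
round(summary(fit)$coefficients, 3)
# expect catB to be near zero / not significant  -> candidate to merge with the reference category A
# expect catC and catD to have similar positive coefficients -> candidates to merge with each other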
4.10 DIMENSION REDUCTION USING CLASSIFICATION AND
REGRESSION TREES

 Another method for reducing the number of columns, and for combining categories of a categorical variable, is to
apply classification and regression trees.
 Classification trees are used for classification tasks and regression trees for prediction tasks.
 In both cases, the algorithm creates binary splits on the predictors that best classify/predict the outcome variable.
 Predictors (numerical or categorical) that do not appear in the tree can be removed. Similarly, categories that do
not appear in the tree can be combined.
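A small sketch with the rpart package (assumed to be installed); predictors that never appear in the fitted tree are candidates for removal:

library(rpart)

set.seed(5)
df <- data.frame(x1  = rnorm(200),
                 x2  = rnorm(200),
                 cat = factor(sample(c("A", "B", "C"), 200, replace = TRUE)))
df$y <- factor(ifelse(df$x1 + (df$cat == "A") > 0.5, "yes", "no"))

fit <- rpart(y ~ ., data = df, method = "class")
unique(as.character(fit$frame$var))   # split variables used by the tree ("<leaf>" marks terminal nodes)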
