
Exploratory Data Analysis

CSDS3202
5.1 STATISTICS
The science of statistics involves ...
 Collection of data
 Analysis of data
 Interpretation of data
 Presentation of data



5.2 DATA
Data are the actual values of variables.
• Qualitative Data
  • Qualitative data can be described in words or symbols rather than in numbers.
  • Example: colors, blood types, etc.
• Quantitative Data
  • Quantitative data is described by numbers.
  • Two categories:
    • Discrete Data: discrete data are quantitative data that are counted.
    • Continuous Data: continuous data are quantitative data that are measured.



5.3 FREQUENCY TABLE

 Frequency is the number of times that a particular result occurs.


 Frequency tables are used to organize data. A basic frequency table consists of a
column of data followed by a column of frequencies.
 Example:
Look at the table below. It shows the three different ages represented in a pre-
school class. The table columns show the ages (3-5) and how many students there
are of those three ages in that class. Notice that the data is sorted in order from
smallest to largest.
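
A minimal sketch of building such a frequency table with pandas; the ages below are made-up illustrations, since the slide's actual counts are not reproduced here:

import pandas as pd

# Hypothetical ages of the children in a pre-school class
ages = [3, 4, 3, 5, 4, 4, 3, 5, 4, 5, 4]

# value_counts() tallies how often each age occurs;
# sort_index() orders the rows from the smallest age to the largest
freq_table = pd.Series(ages).value_counts().sort_index()
print(freq_table)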



5.4 DESCRIPTIVE STATISTICS
Quartiles
• Quartiles divide an ordered set (smallest to largest) of data into quarters.
• Consider the following ordered set of 17 data values: {2, 2, 3, 3.5, 4, 4, 4, 6, 7.5, 8, 8, 10, 10,
11.5, 12, 12, 12}
• The value that divides the set in halves is called the second quartile (Q2). The second quartile, Q2,
is equal to 7.5. The second quartile is also called the median and the 50th percentile.
• The lower half of the data is 2, 2, 3, 3.5, 4, 4, 4, 6 .The value that divides the lower half into
halves is called the first quartile (Q1). The first quartile, Q1, is between the two middle values 3.5
and the first 4.
Q1 = (3.5 + 4)/2 = 3.75 [ Notice that 3.75 is not part of the data ]
• The upper half of the data is: 8, 8, 10, 10, 11.5, 12, 12, 12. The value that divides the upper half
into halves is called the third quartile (Q3). The third quartile, Q3, is between the two middle
values 10 and 11.5.
Q3 = (10 + 11.5)/2 = 10.75 [ Notice that 10.75 is not part of the data ]
DESCRIPTIVE STATISTICS…

Quartiles…
• The data that falls below Q1= 3.75 is (2, 2, 3, 3.5) and is 25% of the data. We say that
25% of the data falls below Q1 = 3.75.

• The data that is more than Q1 = 3.75 but less than Q2 = 7.5 is (4, 4, 4, 6) and is 25% of
the data. We say that 25% of the data falls between Q1 =3.75 and Q2 = 7.5.

• The data that is more than Q2 = 7.5 but less than Q3 = 10.75 (8, 8, 10, 10) is 25% of the
data. We say that 25% of the data falls between Q2 = 7.5 and Q3 = 10.75.

• The data that falls above Q3 = 10.75 (11.5,12, 12, 12) is 25% of the data. We say that
25% of the data falls above Q3 = 10.75.
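
These values can be checked with Python's built-in statistics module; a minimal sketch (statistics.quantiles with its default 'exclusive' method reproduces the hand calculation above, whereas NumPy's default percentile interpolation may give slightly different answers):

import statistics

data = [2, 2, 3, 3.5, 4, 4, 4, 6, 7.5, 8, 8, 10, 10, 11.5, 12, 12, 12]

# n=4 returns the three cut points Q1, Q2 and Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
print(q1, q2, q3)   # 3.75  7.5  10.75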



DESCRIPTIVE STATISTICS…

Percentiles
 Percentiles divide an ordered set (smallest to largest) of data into
hundredths.
 Consider the ordered set of the 100 numbers 1, 2, 3, 4, 5, ..., 99, 100. Ten
percent of 100 numbers is 10 numbers. The 10 numbers 1, 2, 3, 4, 5, 6, 7, 8, 9,
10 fall below the 10th percentile. This means that the 10th percentile is
between 10 and 11. The 10th percentile (10th %ile) is equal to 10.5. Similarly,
the 90th percentile (90th %ile) is equal to 90.5.
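
A minimal NumPy sketch that matches this "midpoint" definition of a percentile; the method argument matters here, since NumPy's default linear interpolation would give 10.9 instead (method= requires NumPy 1.22+; older versions call it interpolation=):

import numpy as np

data = np.arange(1, 101)   # the numbers 1, 2, ..., 100

p10 = np.percentile(data, 10, method='midpoint')   # average of the two bracketing values
p90 = np.percentile(data, 90, method='midpoint')
print(p10, p90)   # 10.5  90.5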



DESCRIPTIVE STATISTICS…

Mean
• The mean is the same as the average. To find the mean, add all the values and divide by the total number of values.
• Example: {2, 3, 5, 6}. The mean = (2 + 3 + 5 + 6)/4 = 16/4 = 4.
• The letter x̄ (x with a bar over it) represents the sample mean.

Mode
• The mode is the most frequent value in the set of numbers.
• Example: In the data set 52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95, the most frequent value is 78. The mode = 78.
• Example: In the data set 52, 53, 53, 53, 60, 67, 72, 72, 72, 90, both 53 and 72 occur the most often (3 times each), so there are two modes, 53 and 72. We call this set of data bimodal, meaning it has two modes.
DESCRIPTIVE STATISTICS…

Median
• The median is the middle value of a set of numbers that has been ordered from smallest to largest. The upper case letter M is used for the median.
• Example: A sample of statistics exam scores for 14 students is (in order from smallest to largest) as follows: 53, 59, 63, 63, 72, 72, 76, 78, 81, 83, 84, 84, 90, 93
  • Notice that 14 is an even number. The median is the average of the 7th and 8th values (the middle two values), so M = (76 + 78)/2 = 77.
• Example: A second sample of statistics exam scores for 15 students is (in order from smallest to largest) as follows: 52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95
  • Notice that 15 is an odd number. The median is the 8th value (the middle value). The 8th value is 76, so the median M = 76.



DESCRIPTIVE STATISTICS…

Variance
• The variance is the average of the squares of the deviations. A deviation is the difference between a value and the mean and is written as x − x̄.
• Example: {2, 3, 5, 6} is a set of data. The sample mean is 4. The deviations are:
2 − 4 = −2
3 − 4 = −1
5 − 4 = 1
6 − 4 = 2
• The deviations squared are:
(−2)² = 4
(−1)² = 1
(1)² = 1
(2)² = 4
• The average of the squared deviations (dividing by n − 1 = 3 for a sample) is the sample variance:
s² = (4 + 1 + 1 + 4)/3 = 10/3 ≈ 3.33
DESCRIPTIVE STATISTICS…

Standard Deviation
• The standard deviation is a special average of the deviations. It measures how spread out the data is from its mean.
• The standard deviation is the square root of the variance and has the same units as the mean. The letter s represents the sample standard deviation and the Greek letter σ represents the population standard deviation.
• Example: In the variance example above, the sample variance was s² = 3.33 (to 2 decimal places). The sample standard deviation is s = √3.33 ≈ 1.8, rounded to one decimal place.
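
These summary statistics can be reproduced with Python's built-in statistics module; a minimal sketch using the examples above:

import statistics

scores = [52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95]
print(statistics.median(scores))   # 76
print(statistics.mode(scores))     # 78
print(statistics.multimode([52, 53, 53, 53, 60, 67, 72, 72, 72, 90]))   # [53, 72]

sample = [2, 3, 5, 6]
print(statistics.mean(sample))       # 4
print(statistics.variance(sample))   # sample variance, about 3.33
print(statistics.stdev(sample))      # sample standard deviation, about 1.8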



5.5.1 THE STANDARD NORMAL PROBABILITY DISTRIBUTION

Standard Normal
 The standard normal distribution is a normal probability distribution of standardized
values called z-scores.
 The standard normal has a mean of 0 and a standard deviation of 1. Z is commonly used
as the random variable.
• Notation: Z ~ N(0, 1)
• Z-Scores
• The formula for a z-score is z = (x − μ)/σ, where x is the value being standardized, μ is the mean, and σ is the standard deviation.
 A z-score is measured in terms of the standard deviation.
 So, if z = 2, then 2 is the standardized score for the value of X that is 2 standard deviations above
(positive z-score) the mean.
 If z = -1, then -1 is the standardized score for the value of X that is 1 standard deviation below
(negative z-score) the mean.
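
A small worked example (the exam-score numbers here are made up for illustration, not taken from the slides):

# A score of 90 on an exam with mean 70 and standard deviation 10
mu, sigma = 70, 10
x = 90
z = (x - mu) / sigma
print(z)   # 2.0 -> the score is 2 standard deviations above the mean
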
5.5.2 TYPES OF DISTRIBUTIONS

 Uniform Distribution
 Normal Distribution
 Binomial Distribution
 Bernoulli Distribution
 Poisson Distribution
 Exponential Distribution



UNIFORM DISTRIBUTION



NORMAL DISTRIBUTION



EXPONENTIAL DISTRIBUTION

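The three slides above showed example plots of the uniform, normal and exponential distributions. A minimal NumPy/Matplotlib sketch that draws comparable histograms from random samples (the parameters chosen here are arbitrary illustrations):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = {
    'Uniform(0, 1)': rng.uniform(0, 1, 10000),
    'Normal(0, 1)': rng.normal(0, 1, 10000),
    'Exponential(mean=1)': rng.exponential(1, 10000),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, (name, x) in zip(axes, samples.items()):
    ax.hist(x, bins=50)   # the histogram approximates the probability density
    ax.set_title(name)
plt.tight_layout()
plt.show()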


5.6 CORRELATION COEFFICIENT

 If a scatter plot shows a possible linear relationship, then the correlation coefficient indicates how
strong the relationship is between x and y. We use the letter r for the correlation coefficient.

• If r = 1 or r = −1, there is "perfect correlation." This means that the points lie exactly on a straight line. In the real world, perfect correlation is very unlikely to happen.
• The closer r is to 1 or −1, the better the correlation between x and y, because the data points are closer to the line of best fit.
• There is positive correlation if y increases as x increases, or y decreases as x decreases. If there is positive correlation, then the line of best fit has a positive slope.
• There is negative correlation if y decreases as x increases, or y increases as x decreases. If there is negative correlation, then the line of best fit has a negative slope.
• There is no correlation if the correlation coefficient is 0 (r = 0). This means there is no linear relationship between x and y. If there is no correlation, then the slope of the line of best fit is 0.
• High correlation does not necessarily mean that x causes y or y causes x.

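A minimal sketch of computing r with NumPy; the x and y values below are made up for illustration:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]
print(r)   # close to 1 -> a strong positive linear relationship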


CORRELATION COEFFICIENT…

Examples of scatter diagrams with different values of correlation coefficient (ρ)



5.7 DIMENSIONALITY REDUCTION

Dimensionality reduction techniques reduce the number of features in a dataset without losing much information, while keeping (or even improving) the model's performance.

Benefits of applying dimensionality reduction to a dataset:
• Space required to store the data is reduced as the number of dimensions comes down.
• Fewer dimensions lead to less computation/training time.
• Some algorithms do not perform well when the data has a large number of dimensions, so reducing the dimensionality is needed for these algorithms to be useful.
• It takes care of multicollinearity by removing redundant features.
• It helps in visualizing data.
DIMENSIONALITY REDUCTION…

Dimensionality reduction can be done in two different ways:


 Feature Selection
 Dimensionality Reduction
 Components/Factor Based
 Factor Analysis
 Principal Component Analysis(PCA)
 Singular Value Decomposition(SVD)
 Independent Component Analysis(ICA)
 Projections Based
 ISOMAP
 t-Distributed Stochastic Neighbor Embedding (t-SNE)
 Uniform Manifold Approximation and Projection (UMAP)



5.7.1 FACTOR ANALYSIS

 Suppose we have two variables: Income and Education. These variables will
potentially have a high correlation as people with a higher education level tend to
have significantly higher income, and vice versa.
 In the Factor Analysis technique, variables are grouped by their correlations, i.e.,
all variables in a particular group will have a high correlation among themselves,
but a low correlation with variables of other group(s). Here, each group is known
as a factor. These factors are small in number as compared to the original
dimensions of the data. However, these factors are difficult to observe.



FACTOR ANALYSIS…
Read in the Fashion-MNIST training images (stored as one image per row of a CSV file):

import numpy as np
import pandas as pd

train = pd.read_csv("../input/fashionmnist/fashion-mnist_train.csv", sep=',')

Convert these images into a numpy array format:

train_data = np.array(train, dtype='float32')
image = []
for i in range(0, 60000):
    img = train_data[i, 1:].flatten()   # column 0 holds the label; the remaining 784 values are pixels
    image.append(img)
image = np.array(image)



FACTOR ANALYSIS…

Create a dataframe containing the pixel values of every individual pixel present in each image, and also their corresponding labels:

train = pd.read_csv("../input/fashionmnist/fashion-mnist_train.csv", sep=',')   # give the complete path of your train.csv file

feat_cols = ['pixel' + str(i) for i in range(image.shape[1])]
df = pd.DataFrame(image, columns=feat_cols)
df['label'] = train['label']

Decompose the dataset using Factor Analysis:

from sklearn.decomposition import FactorAnalysis
fa = FactorAnalysis(n_components=3).fit_transform(df[feat_cols].values)



FACTOR ANALYSIS…

Visualize the results:


%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 10))
plt.title('Factor Analysis Components')
plt.scatter(fa[:, 0], fa[:, 1], c='r', s=10)
plt.scatter(fa[:, 1], fa[:, 2], c='b', s=10)
plt.scatter(fa[:, 2], fa[:, 0], c='g', s=10)
plt.legend(("First Factor", "Second Factor", "Third Factor"))



5.7.2 UNIFORM MANIFOLD APPROXIMATION AND PROJECTION (UMAP)
• Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can preserve both the local structure and much of the global structure of the data.
 Key advantages of UMAP are:
 It can handle large datasets and high dimensional data without too much difficulty
 It combines the power of visualization with the ability to reduce the dimensions of the
data
 Along with preserving the local structure, it also preserves the global structure of the
data. UMAP maps nearby points on the manifold to nearby points in the low
dimensional representation, and does the same for far away points.
 This method uses the concept of k-nearest neighbor and optimizes the results
using stochastic gradient descent.



UMAP…

import umap

umap_data = umap.UMAP(n_neighbors=5, min_dist=0.3,
                      n_components=3).fit_transform(df[feat_cols][:6000].values)

Here,
• n_neighbors determines the number of neighboring points used
• min_dist controls how tightly the embedding is allowed to pack points together; larger values ensure the embedded points are more evenly distributed



UMAP…

Visualize the transformation:
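
The original slide showed only the resulting plot; a minimal sketch of one way to draw it, assuming umap_data and df from the previous slides and coloring the points by their clothing label:

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
plt.title('Decomposition using UMAP')
plt.scatter(umap_data[:, 0], umap_data[:, 1],
            c=df['label'][:6000], s=10, cmap='Spectral')
plt.colorbar()
plt.show()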



5.8 FEATURE SELECTION

 The process of choosing a subset of input features that contribute the most to
the output feature for use in model construction.
 Important if we have datasets with high dimensionality (i.e., large number of
features).
• Helps to mitigate the problems of high dimensionality by selecting features that have high importance to the model, so that the data dimensionality can be reduced without much loss of the total information.
 Benefits of feature selection are:
 Reduce training time
 Reduce the risk of overfitting
 Potentially increase model's performance
 Reduce model's complexity such that interpretation becomes easier



5.8.1 METHODS OF FEATURE SELECTION

 Filter Methods
 ANOVA F-value
 Variance Threshold
 Mutual Information
 Wrapper Methods
 Exhaustive feature selection (EFS)
 Sequential forward selection (SFS)
 Sequential backward selection (SBS)
 Embedded Methods
 Random forest



5.8.1.1 NECESSARY PYTHON LIBRARIES

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt


import seaborn as sns
sns.set(style="whitegrid")

import warnings
warnings.filterwarnings('ignore')



IRIS FLOWER DATASET FROM SCIKIT-LEARN
# Load Iris dataset from Scikit-learn
from sklearn.datasets import load_iris

# Create input and output features


feature_names = load_iris().feature_names
X_data = pd.DataFrame(load_iris().data, columns=feature_names)
y_data = load_iris().target

# Show the first five rows of the dataset


X_data.head()



5.8.1.2 ANOVA F-VALUE

 ANOVA F-value method estimates the degree of linearity between the input
feature (i.e., predictor) and the output feature.
 A high F-value indicates high degree of linearity and a low F-value indicates low
degree of linearity.
 The main disadvantage of using ANOVA F-value is it only captures linear
relationships between the input and output features.
• In other words, any non-linear relationship cannot be detected by the F-value.
• Scikit-learn has two functions to calculate F-values:
  • f_classif, which calculates the F-value between input and output features for classification tasks
  • f_regression, which calculates the F-value between input and output features for regression tasks



ANOVA F-VALUE…

• Use f_classif because the Iris dataset entails a classification task


# Import f_classif from Scikit-learn
from sklearn.feature_selection import f_classif

# Compute the F-value (and p-value) of each feature
f_value = f_classif(X_data, y_data)

# Print the name and F-value of each feature
for feature in zip(feature_names, f_value[0]):
    print(feature)



ANOVA F-VALUE…

Visualize the results by creating a bar chart:


# Create a bar chart for visualizing the F-values
plt.figure(figsize=(4,4))
plt.bar(x=feature_names, height=f_value[0], color='tomato')
plt.xticks(rotation='vertical')
plt.ylabel('F-value')
plt.title('F-value Comparison')
plt.show()
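
Of the filter methods listed in 5.8.1, only the ANOVA F-value is demonstrated on these slides. A minimal sketch of the other two (Variance Threshold and Mutual Information) applied to the same Iris data; the 0.2 variance cutoff is an arbitrary illustration:

# Variance Threshold: drop features whose variance falls below a cutoff
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.2)
X_high_variance = selector.fit_transform(X_data)
print(X_high_variance.shape)

# Mutual Information: captures non-linear as well as linear dependence on the target
from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(X_data, y_data, random_state=0)
for feature in zip(feature_names, mi):
    print(feature)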



5.8.1.3 EXHAUSTIVE FEATURE SELECTION (EFS)

 EFS finds the best subset of features by evaluating all feature combinations.
 Suppose we have a dataset with three features. EFS will evaluate the
following feature combinations:
 feature_1
 feature_2
 feature_3
 feature_1 and feature_2
 feature_1 and feature_3
 feature_2 and feature_3
 feature_1, feature_2, and feature_3
 EFS selects a subset that generates the best performance (e.g., accuracy,
precision, recall, etc.) of the model being considered.
• Mlxtend provides the ExhaustiveFeatureSelector class to perform EFS.
EXHAUSTIVE FEATURE SELECTION (EFS)…

 EFS has five important parameters:


 estimator: the classifier that we intend to train
 min_features: the minimum number of features to select
 max_features: the maximum number of features to select
 scoring: the metric to use to evaluate the classifier
 cv: the number of cross-validations to perform



EXHAUSTIVE FEATURE SELECTION (EFS)…

# Import ExhaustiveFeatureSelector from Mlxtend
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

# Import logistic regression from Scikit-learn
from sklearn.linear_model import LogisticRegression

# Create a logistic regression classifier
lr = LogisticRegression()

# Create an EFS object
efs = EFS(estimator=lr,        # Use logistic regression as the classifier/estimator
          min_features=1,      # The minimum number of features to consider is 1
          max_features=4,      # The maximum number of features to consider is 4
          scoring='accuracy',  # The metric to use to evaluate the classifier is accuracy
          cv=5)                # The number of cross-validations to perform is 5


EXHAUSTIVE FEATURE SELECTION (EFS)…

# Train EFS with our dataset
efs = efs.fit(X_data, y_data)

# Print the results
print('Best accuracy score: %.2f' % efs.best_score_)                  # best_score_ shows the best score
print('Best subset (indices):', efs.best_idx_)                        # best_idx_ shows the indices of the features that yield the best score
print('Best subset (corresponding names):', efs.best_feature_names_)  # best_feature_names_ shows the names of the features that yield the best score



EXHAUSTIVE FEATURE SELECTION (EFS)…

Transform the dataset into a new dataset containing only the subset of features that generates the best score by using the transform method.

# Transform the dataset
X_data_new = efs.transform(X_data)

# Print the results
print('Number of features before transformation: {}'.format(X_data.shape[1]))
print('Number of features after transformation: {}'.format(X_data_new.shape[1]))

# Show the performance of each subset of features
efs_results = pd.DataFrame.from_dict(efs.get_metric_dict()).T
efs_results.sort_values(by='avg_score', ascending=True, inplace=True)
efs_results



EXHAUSTIVE FEATURE SELECTION (EFS)…

Visualize the performance of each subset of features by creating a horizontal bar chart:

# Create a horizontal bar chart for visualizing
# the performance of each subset of features
fig, ax = plt.subplots(figsize=(12, 9))
y_pos = np.arange(len(efs_results))
ax.barh(y_pos, efs_results['avg_score'],
        xerr=efs_results['std_dev'], color='tomato')
ax.set_yticks(y_pos)
ax.set_yticklabels(efs_results['feature_names'])
ax.set_xlabel('Accuracy')
plt.show()



5.8.1.4 FEATURE SELECTION USING RANDOM FOREST

 Random forest is one of the most popular learning algorithms used for feature
selection in a data science workflow.
• Split the dataset into training and test sets, because feature selection is part of the training process.
• Use the Gini criterion to define feature importance.



FEATURE SELECTION USING RANDOM FOREST…
# Import RandomForestClassifier from Scikit-learn
from sklearn.ensemble import RandomForestClassifier

# Import train_test_split from Scikit-learn
from sklearn.model_selection import train_test_split

# Split the dataset into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=0)



FEATURE SELECTION USING RANDOM FOREST…
# Create a random forest classifier
rfc = RandomForestClassifier(random_state=0, criterion='gini')  # Use the gini criterion to define feature importance

# Train the classifier
rfc.fit(X_train, y_train)

# Print the name and gini importance of each feature
for feature in zip(feature_names, rfc.feature_importances_):
    print(feature)

If we add up all the importance scores, the result is 100% (the scores sum to 1). As we can see, petal length and petal width correspond to 83% of the total importance score. They are clearly the most important features!
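
To actually keep only the most important features, one option (not shown on the slides) is scikit-learn's SelectFromModel; a minimal sketch using the classifier trained above:

# Keep features whose importance exceeds the mean importance
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(rfc, threshold='mean', prefit=True)
X_train_selected = sfm.transform(X_train)

print('Features kept:', [name for name, keep in zip(feature_names, sfm.get_support()) if keep])
print('Shape before/after:', X_train.shape, X_train_selected.shape)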



REFERENCES
J. Han and M. Kamber (2011). Data Mining: Concepts and Techniques. 3rd ed. Morgan Kaufmann.

Rahil Shaikh (2018). Feature Selection Techniques in Machine Learning with Python. Accessed 2023-02-22 at https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e

