Exploratory Data Analysis
CSDS3202
5.1 STATISTICS
The science of statistics involves ...
Collection of data
Analysis of data
Interpretation of data
Presentation of data
Quartiles…
• The data that falls below Q1 = 3.75 is (2, 2, 3, 3.5) and is 25% of the data. We say that
25% of the data falls below Q1 = 3.75.
• The data that is more than Q1 = 3.75 but less than Q2 = 7 is (4, 4, 4, 6) and is 25% of
the data. We say that 25% of the data falls between Q1 = 3.75 and Q2 = 7.
• The data that is more than Q2 = 7 but less than Q3 = 10.75 is (8, 8, 10, 10) and is 25% of the
data. We say that 25% of the data falls between Q2 = 7 and Q3 = 10.75.
• The data that falls above Q3 = 10.75 is (11.5, 12, 12, 12) and is 25% of the data. We say that
25% of the data falls above Q3 = 10.75.
Percentiles
Percentiles divide an ordered set (smallest to largest) of data into
hundredths.
Consider the ordered set of the 100 numbers 1, 2, 3, 4, 5, ..., 99, 100. Ten
percent of 100 numbers is 10 numbers. The 10 numbers 1, 2, 3, 4, 5, 6, 7, 8, 9,
10 fall below the 10th percentile. This means that the 10th percentile is
between 10 and 11. The 10th percentile (10th %ile) is equal to 10.5. Similarly,
the 90th percentile (90th %ile) is equal to 90.5.
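These values can be verified with NumPy's percentile function; a minimal sketch, noting that the 'midpoint' interpolation method is the one that matches the halfway-between convention used above (in older NumPy versions the keyword is interpolation rather than method):

import numpy as np

data = np.arange(1, 101)  # the ordered set 1, 2, ..., 100

# 'midpoint' takes the halfway point between the two neighboring values
print(np.percentile(data, 10, method='midpoint'))  # 10.5
print(np.percentile(data, 90, method='midpoint'))  # 90.5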
Mean
The mean is the same as the average. To find the mean, add all the values and divide by
the total number of values.
Example: {2, 3, 5, 6}
The mean = (2 + 3 + 5 + 6) / 4 = 16 / 4 = 4.
The letter x with a bar over it, x̄, represents the sample mean.
Mode
The mode is the most frequent value in the set of numbers.
Example: In the data set 52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95, the most
frequent value is 78. The mode = 78.
Example: In the data set 52, 53, 53, 53, 60, 67, 72, 72, 72, 90, both 53 and 72 occur most
often (3 times each), so there are two modes, 53 and 72. We call this data set
bimodal, meaning it has two modes.
DESCRIPTIVE STATISTICS…
Median
The median is the middle value of a set of numbers that has been ordered from smallest
to largest. The upper case letter M is used for the median.
Example: A sample of statistics exam scores for 14 students are (in order from smallest
to largest) as follows: 53, 59, 63, 63, 72, 72, 76, 78, 81, 83, 84, 84, 90, 93
Notice that 14 is an even number. The median is the average of the 7th and 8th values (the middle two
values): M = (76 + 78) / 2 = 77.
Example: A second sample of statistics exam scores for 15 students are (in order from
smallest to largest) as follows: 52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95
Notice that 15 is an odd number. The median is the 8th value (the middle value). The 8th value is
76 so the median M = 76.
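The mean, mode, and median examples above can be checked with Python's built-in statistics module; a minimal sketch (multimode requires Python 3.8+):

import statistics

# Mean of {2, 3, 5, 6}
print(statistics.mean([2, 3, 5, 6]))  # 4

# Both modes of the bimodal data set
data = [52, 53, 53, 53, 60, 67, 72, 72, 72, 90]
print(statistics.multimode(data))  # [53, 72]

# Median of the 14 exam scores: even count, so the two middle values are averaged
scores = [53, 59, 63, 63, 72, 72, 76, 78, 81, 83, 84, 84, 90, 93]
print(statistics.median(scores))  # 77.0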
Variance
The variance is the average of the squares of the deviations. A deviation is the difference
between a value and the mean and is written as x - x̄ for sample data (or x - μ for population data).
Example: {2, 3, 5, 6} is a set of data. The sample mean is 4. The deviations are:
2 - 4 = -2
3 - 4 = -1
5 - 4 = 1
6 - 4 = 2
The deviations squared are:
(-2)² = 4
(-1)² = 1
(1)² = 1
(2)² = 4
An average of the deviations squared, dividing by n - 1 = 3 for a sample, gives the sample
variance: s² = (4 + 1 + 1 + 4) / 3 = 10 / 3 ≈ 3.33.
DESCRIPTIVE STATISTICS…
Standard Deviation
The standard deviation is a special average of the deviations. It measures how the data is
spread out from its mean.
The standard deviation is the square root of the variance and has the same units as the
mean. The letter s represents the sample standard deviation and the Greek
letter σ represents the population standard deviation.
Example: In the variance example above, the sample variance was s² = 3.33 (to 2 decimal
places). The sample standard deviation is s = √3.33 ≈ 1.83.
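As a check on the arithmetic above, NumPy computes both quantities directly; a minimal sketch, where ddof=1 requests the sample versions (dividing by n - 1):

import numpy as np

data = np.array([2, 3, 5, 6])

print(np.var(data, ddof=1))  # sample variance: about 3.33
print(np.std(data, ddof=1))  # sample standard deviation: about 1.83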
Standard Normal
The standard normal distribution is a normal probability distribution of standardized
values called z-scores.
The standard normal has a mean of 0 and a standard deviation of 1. Z is commonly used
as the random variable.
Notation: Z ~ N(0, 1)
Z-Scores
The formula for a z-score is z = (x - μ) / σ,
where x is the value being standardized, μ is the mean, and σ is the standard deviation.
A z-score is measured in terms of the standard deviation.
So, if z = 2, then 2 is the standardized score for the value of X that is 2 standard deviations above
(positive z-score) the mean.
If z = -1, then -1 is the standardized score for the value of X that is 1 standard deviation below
(negative z-score) the mean.
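Standardizing a whole data set is a one-liner; this is a minimal sketch that reuses the 15 exam scores from the median example as illustrative data:

import numpy as np

scores = np.array([52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95])

mu = scores.mean()
sigma = scores.std(ddof=1)  # sample standard deviation

z = (scores - mu) / sigma  # each score expressed in standard deviations from the mean
print(z.round(2))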
5.5.2 TYPES OF DISTRIBUTIONS
Uniform Distribution
Normal Distribution
Binomial Distribution
Bernoulli Distribution
Poisson Distribution
Exponential Distribution
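One way to get a feel for these distributions is to draw samples from each with NumPy's random generator; this is a minimal sketch, and the parameter values are purely illustrative:

import numpy as np

rng = np.random.default_rng(0)

uniform     = rng.uniform(low=0, high=1, size=1000)
normal      = rng.normal(loc=0, scale=1, size=1000)
binomial    = rng.binomial(n=10, p=0.5, size=1000)
bernoulli   = rng.binomial(n=1, p=0.5, size=1000)   # Bernoulli is binomial with n = 1
poisson     = rng.poisson(lam=3, size=1000)
exponential = rng.exponential(scale=1, size=1000)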
If a scatter plot shows a possible linear relationship, then the correlation coefficient indicates how
strong the relationship is between x and y. We use the letter r for the correlation coefficient.
If r = 1 or r = -1, there is "perfect correlation." This means that the points are already in a straight
line. In the real world, perfect correlation is very unlikely to happen.
The closer r is to 1 or -1, the better the correlation between x and y because the data points are
closer to the line of best fit.
There is positive correlation if y increases as x increases, or y decreases as x decreases. If
there is positive correlation, then the line has a positive slope.
There is negative correlation if y decreases as x increases, or y increases as x decreases. If
there is negative correlation, then the line has a negative slope.
There is no correlation if the correlation coefficient is 0 (r = 0). This means there is no linear
relationship between x and y. If there is no correlation, then the slope of the line is 0.
High correlation does not necessarily mean that x causes y or y causes x.
Suppose we have two variables: Income and Education. These variables will
potentially have a high correlation as people with a higher education level tend to
have significantly higher income, and vice versa.
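A minimal sketch of computing r with NumPy, using made-up x and y values that move together:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 4, 5, 4, 6, 7])

r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(r)  # about 0.92: a strong positive correlation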
In the Factor Analysis technique, variables are grouped by their correlations, i.e.,
all variables in a particular group will have a high correlation among themselves,
but a low correlation with variables of other group(s). Here, each group is known
as a factor. These factors are small in number as compared to the original
dimensions of the data. However, these factors are difficult to observe.
Create a dataframe containing the pixel values of every individual pixel present in each
image, together with their corresponding labels:
import pandas as pd

# Give the complete path of your train.csv file
train = pd.read_csv("../input/fashionmnist/fashion-mnist_train.csv", sep=',')

image = train.drop('label', axis=1).values  # pixel values only (assumed layout: one image per row)
feat_cols = ['pixel' + str(i) for i in range(image.shape[1])]
df = pd.DataFrame(image, columns=feat_cols)
df['label'] = train['label']
Decompose the dataset using Factor Analysis:
from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=3).fit_transform(df[feat_cols].values)
import umap

umap_data = umap.UMAP(n_neighbors=5, min_dist=0.3, n_components=3).fit_transform(df[feat_cols][:6000].values)
Here,
n_neighbors determines the number of neighboring points used to approximate the local
structure of the data.
min_dist controls how tightly the embedding is allowed to pack points together.
Larger values ensure embedded points are more evenly distributed.
Feature selection is the process of choosing a subset of input features that contribute the most to
the output feature for use in model construction.
It is important when we have datasets with high dimensionality (i.e., a large number of
features).
It helps to mitigate the problems of high dimensionality by selecting features that have high
importance to the model, so that the data dimensionality can be reduced
without much loss of the total information.
Benefits of feature selection are:
Reduce training time
Reduce the risk of overfitting
Potentially increase the model's performance
Reduce the model's complexity so that interpretation becomes easier
Filter Methods
ANOVA F-value
Variance Threshold
Mutual Information
Wrapper Methods
Exhaustive feature selection (EFS)
Sequential forward selection (SFS)
Sequential backward selection (SBS)
Embedded Methods
Random forest
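Of the filter methods listed above, Variance Threshold and Mutual Information are available directly in scikit-learn; a minimal sketch on the Iris data, with an illustrative threshold value:

from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Variance Threshold: drop features whose variance falls below the cutoff
selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X)
print(X.shape, X_reduced.shape)

# Mutual Information: score each feature's dependency on the target
mi_scores = mutual_info_classif(X, y, random_state=0)
print(mi_scores)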
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
The ANOVA F-value method estimates the degree of linearity between an input
feature (i.e., predictor) and the output feature.
A high F-value indicates a high degree of linearity and a low F-value indicates a low
degree of linearity.
The main disadvantage of using the ANOVA F-value is that it only captures linear
relationships between the input and output feature.
In other words, any non-linear relationship cannot be detected by the F-value.
Scikit-learn has two functions to calculate the F-value:
f_classif, which calculates the F-value between input and output features for classification
tasks
f_regression, which calculates the F-value between input and output features for
regression tasks
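A minimal sketch of using f_classif with SelectKBest on the Iris data (k = 2 is illustrative):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the k features with the highest F-values
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)      # F-value for each feature
print(X.shape, X_new.shape)  # (150, 4) -> (150, 2)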
EFS finds the best subset of features by evaluating all feature combinations.
Suppose we have a dataset with three features. EFS will evaluate the
following feature combinations:
feature_1
feature_2
feature_3
feature_1 and feature_2
feature_1 and feature_3
feature_2 and feature_3
feature_1, feature_2, and feature_3
EFS selects the subset that generates the best performance (e.g., accuracy,
precision, or recall) for the model being considered.
Mlxtend provides the ExhaustiveFeatureSelector function to perform EFS.
EXHAUSTIVE FEATURE SELECTION (EFS)…
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
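Before transform can be called, the selector has to be fitted to the data; a minimal sketch, assuming X_data and y_data hold the predictors and target (the feature-count bounds and cv setting here are illustrative):

from mlxtend.feature_selection import ExhaustiveFeatureSelector

# Evaluate every feature combination between min_features and max_features
efs = ExhaustiveFeatureSelector(lr, min_features=1, max_features=4,
                                scoring='accuracy', cv=5)
efs = efs.fit(X_data, y_data)  # X_data and y_data are assumed to exist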
Transform the dataset into a new dataset containing only the subset of features that
generates the best score by using the transform method:
# Transform the dataset
X_data_new = efs.transform(X_data)

# Print the results
print('Number of features before transformation: {}'.format(X_data.shape[1]))
print('Number of features after transformation: {}'.format(X_data_new.shape[1]))
# Show the performance of each subset of features
efs_results = pd.DataFrame.from_dict(efs.get_metric_dict()).T
efs_results.sort_values(by='avg_score', ascending=True, inplace=True)
efs_results
Visualize the performance of each subset of features by creating a horizontal bar chart:
import matplotlib.pyplot as plt

# Create a horizontal bar chart for visualizing
fig, ax = plt.subplots(figsize=(12, 9))
y_pos = np.arange(len(efs_results))
ax.barh(y_pos, efs_results['avg_score'], xerr=efs_results['std_dev'], color='tomato')
ax.set_yticks(y_pos)
ax.set_yticklabels(efs_results['feature_names'])
ax.set_xlabel('Accuracy')
plt.show()
Random forest is one of the most popular learning algorithms used for feature
selection in a data science workflow.
Split the dataset into train and test sets because feature selection is part of the
training process.
Use the gini criterion to define feature importance.
If we add up all the importance scores, the result is 100%. As we can see, petal length and petal
width correspond to 83% of the total importance score. They are clearly the most important features!
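The petal length/width figures suggest the Iris data set; a minimal sketch of the workflow under that assumption:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# criterion='gini' is the default; importances are derived from impurity decrease
rf = RandomForestClassifier(criterion='gini', random_state=0)
rf.fit(X_train, y_train)

# The importance scores sum to 1 (i.e., 100%)
importances = pd.Series(rf.feature_importances_, index=iris.feature_names)
print(importances.sort_values(ascending=False))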