
Top 9 Feature Engineering Techniques with Python

rubikscode.net/2021/06/29/top-9-feature-engineering-techniques

In the previous couple of articles, we focused specifically on the performance of machine
learning models. First, we talked about how to quantify machine learning model performance
and how to improve it with regularization. Then we covered other optimization
techniques, both basic ones like Gradient Descent and advanced ones like Adam.
Finally, we saw how to perform hyperparameter optimization and get the
best “configuration” for your model.

However, what we haven’t considered so far is how we can improve performance by modifying
the data itself. So far we were focused on the model. In our articles about
SVM and clustering, we applied some techniques (like scaling) to our data, but we haven’t
done a deeper analysis of this process and of how manipulating the dataset can help us
with performance improvements. In this article we do exactly that: we explore the most effective
feature engineering techniques, which are often required in order to get good results.


Don’t get me wrong, feature engineering is not there just to optimize models. Sometimes we
need to apply these techniques so our data is compatible with the machine learning
algorithm. Machine learning algorithms sometimes expect data formatted in a certain way,
and that is where feature engineering can help us. Apart from that, it is important to note
that data scientists and engineers spend most of their time on data preprocessing. That is
why it is important to master these techniques. In this article, we explore the most important of these techniques.

Dataset & Prerequisites


For the purpose of this tutorial, make sure that you have installed the following
Python libraries:

NumPy – Follow this guide if you need help with installation.


SciKit Learn – Follow this guide if you need help with installation.
Pandas – Follow this guide if you need help with installation.
Matplotlib – Follow this guide if you need help with installation.
SeaBorn – Follow this guide if you need help with installation.

Once installed, make sure that you have imported all the necessary modules used in
this tutorial.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_classif

The data that we use in this article is from the PalmerPenguins dataset. This dataset was
recently introduced as an alternative to the famous Iris dataset. It was created by Dr. Kristen
Gorman and the Palmer Station, Antarctica LTER. You can obtain this dataset here, or via
Kaggle. It is essentially composed of two datasets, each containing data on 344
penguins. Just like in the Iris dataset, there are 3 different species of penguins, coming from 3
islands in the Palmer Archipelago.

Also, these datasets contain culmen dimensions for each species. The culmen is the upper
ridge of a bird’s bill. In the simplified penguin data, culmen length and depth are renamed
to the variables culmen_length_mm and culmen_depth_mm. Loading this dataset is done using
Pandas:

data = pd.read_csv('./data/penguins_size.csv')
data.head()

1. Imputation
The data that we get from clients can come in all shapes and forms. Often it is sparse, meaning
some samples may be missing data for some features. We need to detect those instances and
either remove those samples or replace their empty values with something. Depending on the rest of the
dataset, we may apply different strategies for replacing those missing values. For example,
we may fill the empty slots with the average feature value or with the maximal feature value. However,
let’s first detect missing data. For that we can use Pandas:

print(data.isnull().sum())

species 0
island 0
culmen_length_mm 2
culmen_depth_mm 2
flipper_length_mm 2
body_mass_g 2
sex 10

This means that there are instances in our dataset that are missing values in some of the
features. There are two instances that are missing the culmen_length_mm feature value and
10 instances that are missing the sex feature. We can see that even in the first couple
of samples (NaN means Not a Number, i.e. a missing value):
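
A quick way to list those rows is with the following minimal sketch (not part of the original article):

# Show the first few rows that contain at least one missing value
data[data.isnull().any(axis=1)].head()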

The easiest way to deal with missing values is to drop the samples that contain them from the dataset;
in fact, some machine learning platforms automatically do that for you. However, this may
hurt performance because of the reduced dataset size. The easiest way to do it
is again using Pandas:

data = pd.read_csv('./data/penguins_size.csv')
data = data.dropna()
data.head()

Note that the third sample, which had missing values, is removed from the dataset. This is not
optimal, but sometimes it is necessary since most machine learning algorithms don’t
work with sparse data. The other way is to use imputation, meaning to replace missing
values. To do so we can pick some fixed value, or use the mean or median value of the feature,
etc. Still, we need to be careful. Observe the missing values in the row with index 3:

If we just replace them with a simple value, we apply the same value to both categorical and numerical
features:

data = data.fillna(0)

This is not good. So, here is the proper way. We detected missing data in the numerical features
culmen_length_mm, culmen_depth_mm, flipper_length_mm and body_mass_g. For the
imputation value of these features, we will use the mean value of the feature. For the
categorical feature ‘sex‘, we use the most frequent value. Here is how we do it:

data = pd.read_csv('./data/penguins_size.csv')

data['culmen_length_mm'].fillna((data['culmen_length_mm'].mean()), inplace=True)
data['culmen_depth_mm'].fillna((data['culmen_depth_mm'].mean()), inplace=True)
data['flipper_length_mm'].fillna((data['flipper_length_mm'].mean()), inplace=True)
data['body_mass_g'].fillna((data['body_mass_g'].mean()), inplace=True)

data['sex'].fillna((data['sex'].value_counts().index[0]), inplace=True)

data.reset_index()
data.head()

Observe how the mentioned third sample looks now:
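
A minimal sketch for checking that row again (assuming the default integer index is preserved):

# Inspect the previously incomplete row with index 3 after imputation
data.loc[[3]]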

Often, data is not missing, but it has an invalid value. For example, we know that for the
‘sex‘ feature we can have two values: FEMALE and MALE. We can check if we have values
other than this:

data.loc[(data['sex'] != 'FEMALE') & (data['sex'] != 'MALE')]

As it turns out, we have one record with the value ‘.’ for this feature, which is not correct. We
can treat these instances as missing data and drop them or replace them:

data = data.drop([336])
data.reset_index()

2. Categorical Encoding
One way to improve your predictions is by handling categorical variables cleverly. These
variables, as the name suggests, have discrete values and represent some sort of category
or class. For example, color can be a categorical variable (‘red’, ‘blue‘, ‘green‘).

The challenge is including these variables in data analysis and using them with machine
learning algorithms. Some machine learning algorithms support categorical variables
without further manipulation, but some don’t. That is why we use categorical encoding.
In this tutorial, we cover several types of categorical encoding, but before we continue, let’s
extract those variables from our dataset into a separate variable and mark them as the
categorical type:

data["species"] = data["species"].astype('category')
data["island"] = data["island"].astype('category')
data["sex"] = data["sex"].astype('category')
data.dtypes

species category
island category
culmen_length_mm float64
culmen_depth_mm float64
flipper_length_mm float64
body_mass_g float64
sex category

categorical_data = data.drop(['culmen_length_mm', 'culmen_depth_mm', \
                              'flipper_length_mm', 'body_mass_g'], axis=1)
categorical_data.head()

Ok, now we are ready to roll. We start with the simplest form of encoding: Label Encoding.

2.1 Label Encoding
Label encoding converts each categorical value into a number. For example, the
‘species‘ feature contains 3 categories. We can assign the value 0 to Adelie, 1 to Gentoo and 2 to
Chinstrap. To perform this technique we can use Pandas:

categorical_data["species_cat"] = categorical_data["species"].cat.codes
categorical_data["island_cat"] = categorical_data["island"].cat.codes
categorical_data["sex_cat"] = categorical_data["sex"].cat.codes
categorical_data.head()

As you can see, we added three new features, each containing the encoded categorical values.
From the first five instances, we can see that the species category Adelie is encoded with the
value 0, the island category Torgersen is encoded with the value 2, and the sex
categories FEMALE and MALE are encoded with the values 0 and 1 respectively.
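
As an aside, a similar encoding can be obtained with scikit-learn's LabelEncoder. This is a minimal sketch, not part of the original article, applied directly to the string column:

from sklearn.preprocessing import LabelEncoder

# Fit a label encoder on the 'species' column and store the integer codes
label_encoder = LabelEncoder()
categorical_data["species_cat_sk"] = label_encoder.fit_transform(categorical_data["species"])
categorical_data.head()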

2.2 One-Hot Encoding


This is one of the most popular categorical encoding techniques. It spreads the values in a
feature to multiple flag features and assigns values 0 or 1 to them. This binary value
represents the relationship between non-encoded and encoded features.

For example, in our dataset, we have two possible values in ‘sex‘ feature: FEMALE and
MALE. This technique will create two separate features labeled let’s say ‘sex_female‘ and
‘sex_male‘. If in the ‘sex‘ feature we have value ‘FEMALE‘ for some sample, the ‘sex_female‘
will be assigned value 1 and ‘sex_male‘ will be assigned value 0. In the same way, if in the
‘sex‘ feature we have the value ‘MALE‘ for some sample, the ‘sex_male‘ will be assigned value
1 and ‘sex_female‘ will be assigned value 0. Let’s apply this technique to our categorical data
and see what we get:

encoded_spicies = pd.get_dummies(categorical_data['species'])
encoded_island = pd.get_dummies(categorical_data['island'])
encoded_sex = pd.get_dummies(categorical_data['sex'])

categorical_data = categorical_data.join(encoded_spicies)
categorical_data = categorical_data.join(encoded_island)
categorical_data = categorical_data.join(encoded_sex)

As you can see, we got some new columns there. Essentially, every category in each feature got a
separate column. Often, only the one-hot encoded values are used as input to a machine learning
algorithm.

2.3 Count Encoding

Count encoding converts each categorical value into its frequency, i.e. the number of times
it appears in the dataset. For example, if the ‘species‘ feature contains 6 occurrences of the class
Adelie, we will replace every Adelie value with the number 6. Here is how we do that in the
code:

categorical_data = data.drop(['culmen_length_mm', 'culmen_depth_mm', \
                              'flipper_length_mm', 'body_mass_g'], axis=1)

species_count = categorical_data['species'].value_counts()
island_count = categorical_data['island'].value_counts()
sex_count = categorical_data['sex'].value_counts()

categorical_data['species_count_enc'] = categorical_data['species'].map(species_count)
categorical_data['island_count_enc'] = categorical_data['island'].map(island_count)
categorical_data['sex_count_enc'] = categorical_data['sex'].map(sex_count)

categorical_data

Notice how every category value is replaced with the number of occurrences.

2.4 Target Encoding


Unlike the previous techniques, this one is a little bit more complicated. It replaces a categorical
value with the average value of the output (i.e. target) for that value of the feature.
Essentially, all you need to do is calculate the average output for all the rows with a specific
category value. This is quite straightforward when the output value is numerical. If the
output is categorical, like in our PalmerPenguins dataset, we need to apply one of the
previous techniques to it first.

Often this average value is blended with the outcome probability over the entire dataset in
order to reduce the variance of values with few occurrences. It is important to note that since
category values are calculated based on the output value, these calculations should be done
on the training dataset and then applied to other datasets. Otherwise, we would face
information leakage, meaning that we would include information about the output values
from the test set inside of the training set. This would render our tests invalid or give us false
confidence. Ok, let’s see how we can do this in code:

categorical_data["species"] = categorical_data["species"].cat.codes

island_means = categorical_data.groupby('island')['species'].mean()
sex_means = categorical_data.groupby('sex')['species'].mean()

Here we used label encoding for the output feature and then calculated the mean values for the
categorical features ‘island‘ and ‘sex‘. Here is what we get for the ‘island‘ feature:

island_means

island
Biscoe 1.473054
Dream 0.548387
Torgersen 0.000000

This means that values Biscoe, Dream and Torgersen will be replaced with values 1.473054,
0.548387 and 0 respectively. For the ‘sex‘ feature we have a similar situation:

sex_means

sex
FEMALE 0.909091
MALE 0.921348

Meaning that values FEMALE and MALE will be replaced with 0.909091 and 0.921348
respectively. Here is what that looks like in the dataset:

categorical_data['island_target_enc'] = categorical_data['island'].map(island_means)
categorical_data['sex_target_enc'] = categorical_data['sex'].map(sex_means)
categorical_data
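
As a sketch of the blending idea mentioned above (not part of the original article), we can shrink each category mean toward the global mean; the smoothing weight m below is an arbitrary assumption:

# Smoothed target encoding: blend category means with the global mean
m = 10  # smoothing weight (assumption, tune as needed)
global_mean = categorical_data['species'].mean()

island_stats = categorical_data.groupby('island')['species'].agg(['mean', 'count'])
island_smooth = (island_stats['count'] * island_stats['mean'] + m * global_mean) / (island_stats['count'] + m)

categorical_data['island_target_smooth'] = categorical_data['island'].map(island_smooth)
categorical_data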

2.5 Leave One Out Target Encoding

The final type of encoding that we explore in this tutorial is built on top of Target Encoding.
It works in the same way as Target Encoding, with one difference: when calculating the
mean output value for a sample, we exclude that sample. Here is how it is done in
code. First, we define a function that does this:

def leave_one_out_mean(series):
    series = (series.sum() - series) / (len(series) - 1)
    return series

And then we apply it to categorical values in our dataset:

categorical_data['island_loo_enc'] = categorical_data.groupby('island')['species'].apply(leave_one_out_mean)
categorical_data['sex_loo_enc'] = categorical_data.groupby('sex')['species'].apply(leave_one_out_mean)
categorical_data

3. Handling Outliers
Outliers are values that deviate from the overall distribution of the data. Sometimes
these values are mistakes or wrong measurements and should be removed from the dataset,
but sometimes they carry valuable edge-case information. This means that sometimes we
want to leave these values in the dataset, since they may carry important information,
while other times we want to remove those samples because they carry wrong information.

In a nutshell, we can use the inter-quartile range to detect these points. The inter-quartile
range, or IQR, indicates where 50 percent of the data is located. When looking for this
value, we first find the median, since it splits the data in half. Then we locate the
median of the lower half of the data (denoted as Q1) and the median of the upper half of the
data (denoted as Q3).

The data between Q1 and Q3 is the IQR. Outliers are defined as samples that fall below Q1 –
1.5·IQR or above Q3 + 1.5·IQR. We can visualize this using a boxplot, whose purpose
is to visualize the distribution. In essence, it includes the important points: the maximum value,
the minimum value, the median, and the two IQR points (Q1, Q3).

Let’s apply it to the PalmerPenguins dataset:

fig, axes = plt.subplots(nrows=4, ncols=1)
fig.set_size_inches(10, 30)
sb.boxplot(data=data, y="culmen_length_mm", x="species", orient="v", ax=axes[0], palette="Oranges")
sb.boxplot(data=data, y="culmen_depth_mm", x="species", orient="v", ax=axes[1], palette="Oranges")
sb.boxplot(data=data, y="flipper_length_mm", x="species", orient="v", ax=axes[2], palette="Oranges")
sb.boxplot(data=data, y="body_mass_g", x="species", orient="v", ax=axes[3], palette="Oranges")

The other way to detect and remove outliers is by using the standard deviation.

factor = 2
upper_lim = data['culmen_length_mm'].mean() + data['culmen_length_mm'].std() * factor
lower_lim = data['culmen_length_mm'].mean() - data['culmen_length_mm'].std() * factor

no_outliers = data[(data['culmen_length_mm'] < upper_lim) & (data['culmen_length_mm'] > lower_lim)]
no_outliers

Note that now we have only 100 samples left after this operation. Here we need to define the
factor by which we multiply the standard deviation. Usually, we use values between 2 and 4
for this purpose.

Finally, we can detect outliers using percentiles. We can treat a certain percentage of values
from the top or the bottom as outliers. Again, the percentile value we use as the outlier
border depends on the distribution of the data. Here is what we can do on the
PalmerPenguins dataset:

upper_lim = data['culmen_length_mm'].quantile(.95)
lower_lim = data['culmen_length_mm'].quantile(.05)

no_outliers = data[(data['culmen_length_mm'] < upper_lim) & (data['culmen_length_mm'] > lower_lim)]
no_outliers

After this operation, we have 305 samples in our dataset. With this approach we need to be
extremely careful since it reduces the dataset size and highly depends on the data
distribution.

4. Binning
Binning is a simple technique that groups different values into bins. For example, when we
bin numerical features, it might look something like this:

0-10 – Low
10-50 – Medium
50-100 – High

In this particular case, we replace numerical features with categorical ones.

However, we can bin categorical values too. For example, we can bin countries by the
continent they are on (see the sketch after this list):

Serbia – Europe
Germany – Europe
Japan – Asia
China – Asia
USA – North America
Canada – North America
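
A minimal sketch of this kind of categorical binning, using a hypothetical 'country' column that is not part of the PalmerPenguins dataset:

# Map each country to its continent bin
country_data = pd.DataFrame({'country': ['Serbia', 'Germany', 'Japan', 'China', 'USA', 'Canada']})

continent_map = {'Serbia': 'Europe', 'Germany': 'Europe', 'Japan': 'Asia',
                 'China': 'Asia', 'USA': 'North America', 'Canada': 'North America'}

country_data['continent_bin'] = country_data['country'].map(continent_map)
country_data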

The trade-off with binning is that it can degrade performance, but it can also prevent
overfitting and increase the robustness of the machine learning model. Here is what that
looks like in the code:

bin_data = data[['culmen_length_mm']]
bin_data['culmen_length_bin'] = pd.cut(data['culmen_length_mm'], bins=[0, 40, 50, 100], \
                                       labels=["Low", "Mid", "High"])
bin_data

5. Scaling
In previous articles, we often had a chance to see how scaling helps machine learning models
make better predictions. Scaling is done for one simple reason: if features are not in the
same range, they will be treated differently by the machine learning algorithm. To put it in
layman's terms, if we have one feature with a range of values from 0 to 10 and another from 0 to 100, a
machine learning algorithm might deduce that the second feature is more important than
the first one just because it has higher values.

As we already know, that is not always the case. On the other hand, it is unrealistic to expect
that real data comes in the same range. That is why we use scaling: to put our numerical
features into the same range. This standardization of data is a common requirement for
many machine learning algorithms. Some of them even require that features look like
standard normally distributed data. There are several ways we can scale and standardize the
data, but before we go through them, let's observe one feature of the PalmerPenguins dataset,
‘body_mass_g‘:

scaled_data = data[['body_mass_g']]

print('Mean:', scaled_data['body_mass_g'].mean())
print('Standard Deviation:', scaled_data['body_mass_g'].std())

Mean: 4199.791570763644
Standard Deviation: 799.9508688401579

Also, observe the distribution of this feature:

First, let’s explore scaling techniques that preserve distribution.

5.1 Standard Scaling


This type of scaling removes the mean and scales the data to unit variance. It is defined by the
formula:

scaled_value = (value - mean) / std

where mean is the mean of the training samples and std is the standard deviation of the
training samples. The best way to understand it is to look at it in practice. For that we use
SciKit Learn and the StandardScaler class:

standard_scaler = StandardScaler()
scaled_data['body_mass_scaled'] = standard_scaler.fit_transform(scaled_data[['body_mass_g']])

print('Mean:', scaled_data['body_mass_scaled'].mean())
print('Standard Deviation:', scaled_data['body_mass_scaled'].std())

Mean: -1.6313481178165566e-16
Standard Deviation: 1.0014609211587777

We can see that the original distribution of the data is preserved. However, the data is now
roughly in the range -3 to 3.

5.2 Min-Max Scaling (Normalization)

The most popular scaling technique is normalization (also called min-max normalization
or min-max scaling). It scales all data to the 0 to 1 range. This technique is defined by the
formula:

scaled_value = (value - min) / (max - min)

If we use MinMaxScaler from the SciKit Learn library:

minmax_scaler = MinMaxScaler()
scaled_data['body_mass_min_max_scaled'] = minmax_scaler.fit_transform(scaled_data[['body_mass_g']])

print('Mean:', scaled_data['body_mass_min_max_scaled'].mean())
print('Standard Deviation:', scaled_data['body_mass_min_max_scaled'].std())

Mean: 0.4166087696565679
Standard Deviation: 0.2222085746778217

The distribution is preserved, but the data is now in the range from 0 to 1.

5.3 Quantile Transformation

As we mentioned, machine learning algorithms sometimes require that the distribution of
our data is uniform or normal. We can achieve that using the QuantileTransformer class from
SciKit Learn. First, here is what it looks like when we transform our data to a uniform
distribution:

qtrans = QuantileTransformer()
scaled_data['body_mass_q_trans_uniform'] = qtrans.fit_transform(scaled_data[['body_mass_g']])

print('Mean:', scaled_data['body_mass_q_trans_uniform'].mean())
print('Standard Deviation:', scaled_data['body_mass_q_trans_uniform'].std())

Mean: 0.5002855778903038
Standard Deviation: 0.2899458384920982

Here is the code that puts your data into normal distribution:

qtrans = QuantileTransformer(output_distribution='normal', random_state=0)
scaled_data['body_mass_q_trans_normal'] = qtrans.fit_transform(scaled_data[['body_mass_g']])

print('Mean:', scaled_data['body_mass_q_trans_normal'].mean())
print('Standard Deviation:', scaled_data['body_mass_q_trans_normal'].std())

Mean: 0.0011584329410665568
Standard Deviation: 1.0603614567765762

Essentially, we use the output_distribution parameter in the constructor to define the type of
distribution. Finally, we can observe the scaled values of all features under the different types of
scaling:
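
One minimal sketch for such a comparison (not part of the original article; it assumes seaborn's histplot is available):

# Compare the distributions of the differently scaled columns side by side
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(20, 4))
scaled_columns = ['body_mass_scaled', 'body_mass_min_max_scaled',
                  'body_mass_q_trans_uniform', 'body_mass_q_trans_normal']
for ax, column in zip(axes, scaled_columns):
    sb.histplot(scaled_data[column], ax=ax)
plt.show()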

6. Log Transform
One of the most popular mathematical transformations of data is logarithm
transformation. Essentially, we just apply the log function to the current values. It is
important to note that the data must be positive, so if needed, scale or normalize the data
beforehand. This transformation brings many benefits. One of them is that the distribution
of the data becomes more normal. In turn, this helps us handle skewed data and
decreases the impact of outliers. Here is what that looks like in the code:

log_data = data[['body_mass_g']]
log_data['body_mass_log'] = (data['body_mass_g'] + 1).transform(np.log)
log_data

If we check the distribution of the non-transformed data and the transformed data, we can see
that the transformed data is closer to the normal distribution:
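
Here is one minimal sketch for that check (not part of the original article; it assumes seaborn's histplot):

# Plot the raw and log-transformed body mass side by side
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
sb.histplot(log_data['body_mass_g'], kde=True, ax=axes[0])
sb.histplot(log_data['body_mass_log'], kde=True, ax=axes[1])
plt.show()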

7. Feature Selection
Datasets that come from clients are often huge. We can have hundreds or even
thousands of features, especially if we perform some of the techniques from above. A large
number of features can lead to overfitting. Apart from that, optimizing hyperparameters
and training algorithms will, in general, take longer. That is why we want to pick the most
relevant features from the beginning.

There are several techniques when it comes to feature selection; however, in this tutorial, we
cover only the simplest (and most often used) one – Univariate Feature Selection.
This method is based on univariate statistical tests. It calculates how strongly the output
feature depends on each feature from the dataset using a statistical test (like χ2). In this
example, we utilize SelectKBest, which has several options when it comes to the statistical
test used (the default is the ANOVA F-test, f_classif, and that is the one we use in this example). Here is how we can do
it:

feature_sel_data = data.drop(['species'], axis=1)

feature_sel_data["island"] = feature_sel_data["island"].cat.codes
feature_sel_data["sex"] = feature_sel_data["sex"].cat.codes

# Use 3 features
selector = SelectKBest(f_classif, k=3)

selected_data = selector.fit_transform(feature_sel_data, data['species'])
selected_data

array([[ 39.1,  18.7, 181. ],
       [ 39.5,  17.4, 186. ],
       [ 40.3,  18. , 195. ],
       ...,
       [ 50.4,  15.7, 222. ],
       [ 45.2,  14.8, 212. ],
       [ 49.9,  16.1, 213. ]])

Using the hyperparameter k, we specified that we want to keep the 3 most influential features from
the dataset. The output of this operation is a NumPy array that contains the selected features. To
turn it into a Pandas DataFrame, we need to do the following:

selected_features = pd.DataFrame(selector.inverse_transform(selected_data),
index=data.index,
columns=feature_sel_data.columns)

selected_columns = selected_features.columns[selected_features.var() != 0]
selected_features[selected_columns].head()
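
To see how strongly each feature scored in the test, a quick sketch (not part of the original article) is to inspect the fitted selector's scores_ attribute:

# Rank the features by their ANOVA F-test score
pd.Series(selector.scores_, index=feature_sel_data.columns).sort_values(ascending=False)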

8. Feature Grouping
The dataset that we have observed so far is an almost perfect example of so-called
“tidiness“. This means that each feature has its own column, each observation is a
row, and each type of observational unit is a table.

However, sometimes we have observations that are spread over several rows. The goal of
Feature Grouping is to merge these rows into a single one and then use those
aggregated rows. The main question when doing so is which type of aggregation function to
apply to the features. This is especially complicated for categorical features.

As we mentioned, the PalmerPenguins dataset is very tidy, so the following example is purely
educational, showing the code that can be used for this operation:

grouped_data = data.groupby('species')

sums_data = grouped_data[['culmen_length_mm', 'culmen_depth_mm']].sum().add_suffix('_sum')
avgs_data = grouped_data[['culmen_length_mm', 'culmen_depth_mm']].mean().add_suffix('_mean')

sumed_averaged = pd.concat([sums_data, avgs_data], axis=1)
sumed_averaged

Here we grouped the data by the species value and, for each numerical feature, created two new
features holding the sum and the mean value.

9. Feature Split
Sometimes, data is not spread over rows, but over columns. For example, imagine you
have a list of names in one of the features:

data.names

0 Andjela Zivkovic
1 Vanja Zivkovic
2 Petar Zivkovic
3 Veljko Zivkovic
4 Nikola Zivkovic

So, if we want to extract only the first name from this feature, we can do the following:

data.names.str.split(" ").map(lambda x: x[0])

0 Andjela
1 Vanja
2 Petar
3 Veljko
4 Nikola

This technique is called feature splitting and it is often used with string data.

Where to go from here?


In general, Machine Learning algorithms are a part of some bigger application. More often
than not, we need to use features in other parts of the application as well. This is just one of the
many reasons why taking Machine Learning applications to production is especially
challenging. One way to mitigate this problem is by using so-called Feature Stores. This is a
relatively new concept in data architecture; however, it is already applied by companies such
as Uber and Gojek.

The feature store is both a computational and a storage service. In essence, feature stores expose
features so they can be discovered and used as part of Machine Learning pipelines and
online applications. Since they need to both store large volumes of data and
provide low computational latency, Feature Stores are implemented as dual-database
systems: on one end there is a low-latency key-value store or real-time database, and on the
other end there is a SQL database that can store a lot of data. This is an interesting concept
that is worth exploring further.
