

Photo by Simon Basler on Unsplash [Image [0]]

HOW TO TRAIN YOUR NEURAL NET

Pytorch [Tabular] — Regression


This blog post takes you through an implementation of regression on tabular data using
PyTorch.

Akshaj Verma
Mar 28 · 8 min read

We will use the red wine quality dataset available on Kaggle. This dataset has 12
columns where the first 11 are the features and the last column is the target column.
The data set has 1599 rows.

Import Libraries
We’re using tqdm to enable progress bars for training and testing loops.


import numpy as np
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Read Data

df = pd.read_csv("data/tabular/classification/winequality-red.csv")

df.head()

Input dataframe [Image [2]]
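As a quick sanity check (not shown in the original post), the dataframe's shape should match the 1599 rows and 12 columns mentioned above:

print(df.shape)

###################### OUTPUT ######################

(1599, 12)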

EDA and Preprocessing


First off, we plot the output column to observe the class distribution. There's a lot of
imbalance here. Classes 3, 4, and 8 have very few samples.

We will not treat the output variable as a set of classes here because we're performing
regression. We will convert the output column, which contains integers, to float values.

sns.countplot(x = 'quality', data=df)


Output Distribution [Image [3]]

Create Input and Output Data


In order to split our data into train, validation, and test sets, we need to separate out
our inputs and outputs.

Input X is all but the last column. Output y is the last column.

X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]

Train — Validation — Test


To create the train-val-test split, we’ll use train_test_split() from Sklearn.

First, we’ll split our data into train+val and test sets. Then, we'll further split our
train+val set to create our train and val sets.

Because there’s a “class” imbalance, we want to have equal distribution of all output
classes in our train, validation, and test sets.

To do that, we use the stratify option in train_test_split().

Remember that stratification only works with classes, not continuous numbers. In
general, you can bin a numeric target into classes using quartiles, deciles, a histogram
(np.histogram()), and so on. You would then create a new dataframe containing the
output and its "class", and stratify on that class column; a minimal sketch follows.
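The sketch below is not from the original post; pd.qcut is just one assumed way to do the quartile binning described above, and the variable names are hypothetical:

# Hypothetical example: bin a continuous target into quartile "classes"
# so that train_test_split(..., stratify=...) can balance the splits.
y_class = pd.qcut(y, q=4, labels=False, duplicates='drop')
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y_class, random_state=69)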


In our case, let's use the numbers as is because they already behave like classes. After we
split our data, we can convert the output to float (because we're performing regression).

# Train - Test
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=69)

# Split train into train-val
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.1, stratify=y_trainval, random_state=21)

Normalize Input
Neural networks train more reliably when the input features lie in a small, consistent
range such as (0,1). There's a ton of material available online on why this helps.

To scale our values, we'll use the MinMaxScaler() from Sklearn. The MinMaxScaler
transforms features by scaling each feature to a given range, which is (0,1) in our case.

x_scaled = (x - min(x)) / (max(x) - min(x))
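As a quick sanity check (not in the original post), you can verify on a toy array that MinMaxScaler matches this formula:

x = np.array([[2.0], [4.0], [10.0]])
print(MinMaxScaler().fit_transform(x).ravel())        # [0.   0.25 1.  ]
print(((x - x.min()) / (x.max() - x.min())).ravel())  # same values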


Notice that we use .fit_transform() on X_train while we use .transform() on X_val
and X_test.

We do this because we want to scale the validation and test set with the same
parameters as that of the train set to avoid data leakage. fit_transform() calculates
scaling values and applies them while .transform() only applies the calculated values.

scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

X_train, y_train = np.array(X_train), np.array(y_train)
X_val, y_val = np.array(X_val), np.array(y_val)
X_test, y_test = np.array(X_test), np.array(y_test)

Visualize Class Distribution in Train, Val, and Test


Once we’ve split our data into train, validation, and test sets, let’s make sure the
distribution of classes is equal in all three sets.

To do that, let's create a function called get_class_distribution(). This function takes
as input the object y, i.e. y_train, y_val, or y_test. Inside the function, we initialize a
dictionary which contains the output classes as keys and their counts as values. The
counts are all initialized to 0.

We then loop through our y object and update our dictionary.

def get_class_distribution(obj):
    count_dict = {
        "rating_3": 0,
        "rating_4": 0,
        "rating_5": 0,
        "rating_6": 0,
        "rating_7": 0,
        "rating_8": 0,
    }

    for i in obj:
        if i == 3:
            count_dict['rating_3'] += 1
        elif i == 4:
            count_dict['rating_4'] += 1
        elif i == 5:
            count_dict['rating_5'] += 1
        elif i == 6:
            count_dict['rating_6'] += 1
        elif i == 7:
            count_dict['rating_7'] += 1
        elif i == 8:
            count_dict['rating_8'] += 1
        else:
            print("Check classes.")

    return count_dict

Once we have the dictionary of counts, we use the Seaborn library to plot the bar charts.

To make the plot, we first convert our dictionary to a dataframe using
pd.DataFrame.from_dict([get_class_distribution(y_train)]).

Subsequently, we .melt() our dataframe to convert it into the long format and finally
use sns.barplot() to build the plots.
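For intuition, here's a hedged sketch (not output copied from the original post) of what the melted dataframe looks like:

melted = pd.DataFrame.from_dict([get_class_distribution(y_train)]).melt()
print(melted.columns.tolist())   # ['variable', 'value']
# One row per rating class: 'variable' holds names like "rating_5",
# 'value' holds the corresponding count.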


fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(25,7))

# Train
sns.barplot(data=pd.DataFrame.from_dict([get_class_distribution(y_train)]).melt(),
            x="variable", y="value", hue="variable",
            ax=axes[0]).set_title('Class Distribution in Train Set')

# Val
sns.barplot(data=pd.DataFrame.from_dict([get_class_distribution(y_val)]).melt(),
            x="variable", y="value", hue="variable",
            ax=axes[1]).set_title('Class Distribution in Val Set')

# Test
sns.barplot(data=pd.DataFrame.from_dict([get_class_distribution(y_test)]).melt(),
            x="variable", y="value", hue="variable",
            ax=axes[2]).set_title('Class Distribution in Test Set')

Output distribution after train-val-test split [Image [4]]

Convert Output Variable to Float

y_train, y_test, y_val = y_train.astype(float), y_test.astype(float), y_val.astype(float)

Neural Network
Initialize Dataset


class RegressionDataset(Dataset):

    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data

    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]

    def __len__(self):
        return len(self.X_data)

train_dataset = RegressionDataset(torch.from_numpy(X_train).float(), torch.from_numpy(y_train).float())
val_dataset = RegressionDataset(torch.from_numpy(X_val).float(), torch.from_numpy(y_val).float())
test_dataset = RegressionDataset(torch.from_numpy(X_test).float(), torch.from_numpy(y_test).float())

Model Params

EPOCHS = 150
BATCH_SIZE = 64
LEARNING_RATE = 0.001

NUM_FEATURES = len(X.columns)

Initialize Dataloader

train_loader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)

val_loader = DataLoader(dataset=val_dataset, batch_size=1)

test_loader = DataLoader(dataset=test_dataset, batch_size=1)

Define Neural Network Architecture


We have a simple feedforward neural net with three hidden layers. We use ReLU as the
activation at every hidden layer; the output layer is linear, as is usual for regression.


class MultipleRegression(nn.Module):
    def __init__(self, num_features):
        super(MultipleRegression, self).__init__()

        self.layer_1 = nn.Linear(num_features, 16)
        self.layer_2 = nn.Linear(16, 32)
        self.layer_3 = nn.Linear(32, 16)
        self.layer_out = nn.Linear(16, 1)

        self.relu = nn.ReLU()

    def forward(self, inputs):
        x = self.relu(self.layer_1(inputs))
        x = self.relu(self.layer_2(x))
        x = self.relu(self.layer_3(x))
        x = self.layer_out(x)
        return x

    def predict(self, test_inputs):
        # Same computation as forward(); kept as a separate
        # convenience method for inference.
        x = self.relu(self.layer_1(test_inputs))
        x = self.relu(self.layer_2(x))
        x = self.relu(self.layer_3(x))
        x = self.layer_out(x)
        return x

Check for GPU

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print(device)

###################### OUTPUT ######################

cuda:0

Initialize the model, optimizer, and loss function. Transfer the model to GPU.

We are using the Mean Squared Error loss.

model = MultipleRegression(NUM_FEATURES)
model.to(device)

print(model)


criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

###################### OUTPUT ######################

MultipleRegression(
(layer_1): Linear(in_features=11, out_features=16, bias=True)
(layer_2): Linear(in_features=16, out_features=32, bias=True)
(layer_3): Linear(in_features=32, out_features=16, bias=True)
(layer_out): Linear(in_features=16, out_features=1, bias=True)
(relu): ReLU()
)

Train Model
Before we start our training, let’s define a dictionary which will store the loss/epoch
for both train and validation sets.

loss_stats = {
'train': [],
"val": []
}

Let the training begin.

print("Begin training.")

for e in tqdm(range(1, EPOCHS+1)):

# TRAINING
train_epoch_loss = 0

model.train()
for X_train_batch, y_train_batch in train_loader:
X_train_batch, y_train_batch = X_train_batch.to(device),
y_train_batch.to(device)
optimizer.zero_grad()

y_train_pred = model(X_train_batch)

train_loss = criterion(y_train_pred,
y_train_batch.unsqueeze(1))

train_loss.backward()
optimizer.step()

train_epoch_loss += train_loss.item()

    # VALIDATION
    with torch.no_grad():

        val_epoch_loss = 0

        model.eval()
        for X_val_batch, y_val_batch in val_loader:
            X_val_batch, y_val_batch = X_val_batch.to(device), y_val_batch.to(device)

            y_val_pred = model(X_val_batch)

            val_loss = criterion(y_val_pred, y_val_batch.unsqueeze(1))

            val_epoch_loss += val_loss.item()

    loss_stats['train'].append(train_epoch_loss/len(train_loader))
    loss_stats['val'].append(val_epoch_loss/len(val_loader))

    print(f'Epoch {e+0:03}: | Train Loss: {train_epoch_loss/len(train_loader):.5f} | Val Loss: {val_epoch_loss/len(val_loader):.5f}')

###################### OUTPUT ######################

Epoch 001: | Train Loss: 31.22514 | Val Loss: 30.50931

Epoch 002: | Train Loss: 30.02529 | Val Loss: 28.97327

.
.
.

Epoch 149: | Train Loss: 0.42277 | Val Loss: 0.37748


Epoch 150: | Train Loss: 0.42012 | Val Loss: 0.37028

You can see we've put a model.train() before the loop. model.train() tells PyTorch
that you're in training mode.

Why do we need to do that? If you're using layers such as Dropout or BatchNorm,
which behave differently during training and evaluation (for example, dropout is
disabled during evaluation), you need to tell PyTorch to act accordingly.

Similarly, we'll call model.eval() when we test our model. We'll see that below.
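As a hedged aside (not from the original post), here's the train/eval difference in action on a standalone dropout layer:

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 4)

drop.train()    # training mode: dropout is active
print(drop(x))  # roughly half the entries zeroed, survivors scaled to 2.0

drop.eval()     # evaluation mode: dropout is a no-op
print(drop(x))  # tensor([[1., 1., 1., 1.]])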


Back to training; we start a for-loop. At the top of this for-loop, we initialize our loss per
epoch to 0. After every epoch, we’ll print out the loss and reset it back to 0.

Then we have another for-loop. This for-loop is used to get our data in batches from the
train_loader .

We do optimizer.zero_grad() before we make any predictions. Since the backward()
function accumulates gradients, we need to set them to 0 manually per mini-batch.
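A minimal sketch (not from the original post) of why this matters:

w = torch.ones(1, requires_grad=True)
(2 * w).backward()
print(w.grad)    # tensor([2.])
(2 * w).backward()
print(w.grad)    # tensor([4.]) -- accumulated, not replaced
w.grad.zero_()   # conceptually what optimizer.zero_grad() does per parameter
print(w.grad)    # tensor([0.])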

From our defined model, we then obtain a prediction, compute the loss for that
mini-batch, and perform back-propagation using loss.backward() and optimizer.step().

Finally, we add up the losses of all mini-batches and divide by the number of
mini-batches, i.e. len(train_loader), to obtain the average loss per epoch.

The procedure we follow for validation is exactly the same as for training, except that
we wrap it in torch.no_grad() and do not perform any back-propagation.
torch.no_grad() tells PyTorch that we do not need to track gradients, which reduces
memory usage and speeds up computation.
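A quick hedged illustration (not from the original post) of what no_grad disables:

x = torch.randn(1, NUM_FEATURES, device=device)
with torch.no_grad():
    out = model(x)
print(out.requires_grad)   # False -- no autograd graph was built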

Visualize Loss

To plot the loss curves, we again create a dataframe from the loss_stats dictionary.

train_val_loss_df = pd.DataFrame.from_dict(loss_stats).reset_index().melt(id_vars=['index']).rename(columns={"index":"epochs"})

plt.figure(figsize=(15,8))

sns.lineplot(data=train_val_loss_df, x="epochs", y="value", hue="variable").set_title('Train-Val Loss/Epoch')


Train-Val loss curve [Image [6]]

Test Model
After training is done, we need to test how our model fared. Note that we've used
model.eval() before we run our testing code. To tell PyTorch that we do not want to
perform back-propagation during inference, we use torch.no_grad(), just like we did
for the validation loop above.

y_pred_list = []

with torch.no_grad():
    model.eval()
    for X_batch, _ in test_loader:
        X_batch = X_batch.to(device)
        y_test_pred = model(X_batch)
        y_pred_list.append(y_test_pred.cpu().numpy())

# Flatten the per-batch arrays into a single list of scalar predictions.
y_pred_list = [a.squeeze().tolist() for a in y_pred_list]

Let’s check the MSE and R-squared metrics.
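As a refresher (not spelled out in the original post), written in the same style as the scaling formula above:

mse = (1/n) * sum((y_i - y_pred_i)^2)
r_square = 1 - sum((y_i - y_pred_i)^2) / sum((y_i - y_mean)^2)

An R^2 close to 1 means the model explains most of the variance in the target; a value near 0 means it does little better than always predicting the mean.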

mse = mean_squared_error(y_test, y_pred_list)
r_square = r2_score(y_test, y_pred_list)

print("Mean Squared Error :", mse)
print("R^2 :", r_square)

###################### OUTPUT ######################

Mean Squared Error : 0.40861496703609534


R^2 : 0.36675687655886924

. . .

Thank you for reading. Suggestions and constructive criticism are welcome. :)

This blog post is a part of the column — "How to train your Neural Net". You can find
the column here.

You can find me on LinkedIn and Twitter. If you liked this, check out my other
blogposts.
