PyTorch [Tabular] — Regression
Akshaj Verma
Mar 28 · 8 min read
We will use the red wine quality dataset available on Kaggle. This dataset has 12
columns, where the first 11 are the features and the last column is the target.
The dataset has 1599 rows.
Import Libraries
We’re using tqdm to enable progress bars for training and testing loops.
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
Read Data
df = pd.read_csv("data/tabular/classification/winequality-red.csv")
df.head()
We will not treat the output variable as a set of classes here because we’re performing
regression. We will convert the output column, which contains all integers, to float values.
Input X is all but the last column. Output y is the last column.
X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]
First, we’ll split our data into train+val and test sets. Then, we'll further split our
train+val set to create our train and val sets.
Because there’s a “class” imbalance, we want the distribution of output
classes to be the same across our train, validation, and test sets.
Remember that stratification only works with classes, not continuous numbers. So, in general, we
can bin our numbers into classes using quartiles, deciles, a histogram ( np.histogram() ),
and so on. You would then create a new dataframe which contains the output
along with its "class", where the "class" was obtained using one of the above-mentioned methods
(a sketch of this binning idea follows below).
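For illustration, here’s a minimal sketch of that binning idea using pd.qcut() ; the variable name y_binned is ours and not part of the dataset.

# Hypothetical example: bin a continuous target into quartile "classes"
# so it can be passed to train_test_split(stratify=...).
y_binned = pd.qcut(y, q=4, labels=False, duplicates="drop")

# y_binned now holds integer bin labels that act as classes for
# stratification, while y itself stays continuous.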
In our case, let’s use the numbers as-is because they already behave like classes. After we
split our data, we can convert the outputs to floats (because we’re doing regression).
# Train+Val - Test
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=69)

# Split train+val into train and val (this 0.1 split and seed are one reasonable choice)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.1, stratify=y_trainval, random_state=21)
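Since torch.from_numpy() (used further below) expects NumPy arrays rather than pandas objects, we also convert the outputs to float arrays at this point. A minimal sketch:

# Convert targets to float NumPy arrays; torch.from_numpy() below
# needs arrays, and regression targets should be floats.
y_train, y_val, y_test = (
    np.array(y_train, dtype=float),
    np.array(y_val, dtype=float),
    np.array(y_test, dtype=float),
)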
Normalize Input
Neural networks train much better when the input features lie on a similar scale,
typically within the range (0, 1). There’s a ton of material available online on why this helps.
To scale our values, we’ll use the MinMaxScaler() from Sklearn. The MinMaxScaler
transforms features by scaling each feature to a given range, which is (0, 1) in our case:
each value becomes (x - x_min) / (x_max - x_min) per feature. Note that we call
.fit_transform() on X_train but only .transform() on X_val and X_test .
We do this because we want to scale the validation and test sets with the same
parameters as the train set to avoid data leakage. fit_transform() calculates the
scaling values and applies them, while .transform() only applies the calculated values.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
Once we’ve split our data into train, validation, and test sets, let’s make sure the
distribution of classes is equal in all three sets.
To do that, let’s define a function called get_class_distribution() that takes
as input the object y , i.e. y_train , y_val , or y_test . Inside the function, we initialize a
dictionary which contains the output classes as keys and their counts as values. The
counts are all initialized to 0.
def get_class_distribution(obj):
    count_dict = {
        "rating_3": 0,
        "rating_4": 0,
        "rating_5": 0,
        "rating_6": 0,
        "rating_7": 0,
        "rating_8": 0,
    }

    for i in obj:
        if i == 3:
            count_dict['rating_3'] += 1
        elif i == 4:
            count_dict['rating_4'] += 1
        elif i == 5:
            count_dict['rating_5'] += 1
        elif i == 6:
            count_dict['rating_6'] += 1
        elif i == 7:
            count_dict['rating_7'] += 1
        elif i == 8:
            count_dict['rating_8'] += 1
        else:
            print("Check classes.")

    return count_dict
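As an aside, an equivalent one-liner (not from the original walkthrough) uses pandas directly:

# Equivalent count of samples per rating, using pandas
print(pd.Series(y_train).value_counts().sort_index())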
Once we have the dictionary of counts, we use the Seaborn library to plot the bar charts.
We .melt() our dataframe to convert it into the long format and finally
use sns.barplot() to build the plots.
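To see what .melt() does, here’s a tiny sketch with illustrative counts:

# Hypothetical wide-format counts -> long format for sns.barplot()
wide = pd.DataFrame([{"rating_5": 681, "rating_6": 638}])
print(wide.melt())
#    variable  value
# 0  rating_5    681
# 1  rating_6    638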
# Create one figure with three side-by-side axes (this line is implied
# by axes[0..2] below; the figsize is a reasonable choice)
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(25, 7))

# Train
sns.barplot(data=pd.DataFrame.from_dict([get_class_distribution(y_train)]).melt(),
            x="variable", y="value", hue="variable",
            ax=axes[0]).set_title('Class Distribution in Train Set')

# Val
sns.barplot(data=pd.DataFrame.from_dict([get_class_distribution(y_val)]).melt(),
            x="variable", y="value", hue="variable",
            ax=axes[1]).set_title('Class Distribution in Val Set')

# Test
sns.barplot(data=pd.DataFrame.from_dict([get_class_distribution(y_test)]).melt(),
            x="variable", y="value", hue="variable",
            ax=axes[2]).set_title('Class Distribution in Test Set')
Neural Network
Initialize Dataset
class RegressionDataset(Dataset):
    # A standard map-style Dataset: store the tensors and index into them
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data

    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]

    def __len__(self):
        return len(self.X_data)


train_dataset = RegressionDataset(torch.from_numpy(X_train).float(), torch.from_numpy(y_train).float())
val_dataset = RegressionDataset(torch.from_numpy(X_val).float(), torch.from_numpy(y_val).float())
test_dataset = RegressionDataset(torch.from_numpy(X_test).float(), torch.from_numpy(y_test).float())
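A Dataset supports integer indexing, which is what the DataLoader uses under the hood. For example:

# Each item is a (features, target) pair of tensors
features, target = train_dataset[0]
print(features.shape)  # torch.Size([11]) - one row of 11 features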
Model Params
EPOCHS = 150
BATCH_SIZE = 64
LEARNING_RATE = 0.001
NUM_FEATURES = len(X.columns)
Initialize Dataloader
train_loader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)

# val_loader and test_loader are used below; batch_size=1 is one simple choice
val_loader = DataLoader(dataset=val_dataset, batch_size=1)
test_loader = DataLoader(dataset=test_dataset, batch_size=1)
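To sanity-check the loader, we can pull one batch and inspect its shape:

# One mini-batch: features are [batch, num_features], targets are [batch]
X_batch, y_batch = next(iter(train_loader))
print(X_batch.shape, y_batch.shape)  # torch.Size([64, 11]) torch.Size([64])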
class MultipleRegression(nn.Module):
    def __init__(self, num_features):
        super(MultipleRegression, self).__init__()
        # Layer sizes follow the printed model summary below
        self.layer_1 = nn.Linear(num_features, 16)
        self.layer_2 = nn.Linear(16, 32)
        self.layer_3 = nn.Linear(32, 16)
        self.layer_out = nn.Linear(16, 1)
        self.relu = nn.ReLU()

    def forward(self, inputs):
        x = self.relu(self.layer_1(inputs))
        x = self.relu(self.layer_2(x))
        x = self.relu(self.layer_3(x))
        x = self.layer_out(x)
        return (x)


# Select the GPU if one is available (this definition was cut off in the source)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
cuda:0
Initialize the model, optimizer, and loss function. Transfer the model to GPU.
model = MultipleRegression(NUM_FEATURES)
model.to(device)
print(model)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
MultipleRegression(
(layer_1): Linear(in_features=11, out_features=16, bias=True)
(layer_2): Linear(in_features=16, out_features=32, bias=True)
(layer_3): Linear(in_features=32, out_features=16, bias=True)
(layer_out): Linear(in_features=16, out_features=1, bias=True)
(relu): ReLU()
)
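As a quick sanity check (our addition, not part of the original walkthrough), a dummy batch should produce one prediction per row:

# Forward a random batch of 4 rows through the untrained model
dummy = torch.randn(4, NUM_FEATURES).to(device)
print(model(dummy).shape)  # torch.Size([4, 1])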
Train Model
Before we start our training, let’s define a dictionary which will store the loss/epoch
for both train and validation sets.
loss_stats = {
'train': [],
"val": []
}
print("Begin training.")
# TRAINING
train_epoch_loss = 0
model.train()
for X_train_batch, y_train_batch in train_loader:
X_train_batch, y_train_batch = X_train_batch.to(device),
y_train_batch.to(device)
optimizer.zero_grad()
y_train_pred = model(X_train_batch)
train_loss = criterion(y_train_pred,
y_train_batch.unsqueeze(1))
train_loss.backward()
optimizer.step()
train_epoch_loss += train_loss.item()
    # VALIDATION
    with torch.no_grad():
        val_epoch_loss = 0
        model.eval()
        for X_val_batch, y_val_batch in val_loader:
            X_val_batch, y_val_batch = X_val_batch.to(device), y_val_batch.to(device)

            y_val_pred = model(X_val_batch)
            val_loss = criterion(y_val_pred, y_val_batch.unsqueeze(1))

            val_epoch_loss += val_loss.item()

    loss_stats['train'].append(train_epoch_loss / len(train_loader))
    loss_stats['val'].append(val_epoch_loss / len(val_loader))

    # Print the average loss per epoch (as described below)
    print(f'Epoch {e:03}: | Train Loss: {train_epoch_loss/len(train_loader):.5f} | Val Loss: {val_epoch_loss/len(val_loader):.5f}')
. . .
You can see we’ve put a model.train() call before the training loop. model.train() tells
PyTorch that you’re in training mode.
Why do we need to do that? If you’re using layers such as Dropout or BatchNorm,
which behave differently during training and evaluation (for example, dropout is
disabled during evaluation), you need to tell PyTorch which mode to act in.
Similarly, we’ll call model.eval() when we test our model. We’ll see that below.
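A small illustration of that difference (our own sketch, not from the walkthrough):

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 4)

drop.train()
print(drop(x))  # some entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))  # identity in eval mode: tensor([[1., 1., 1., 1.]])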
Back to training: we start a for-loop over epochs. At the top of this for-loop, we initialize our loss per
epoch to 0. After every epoch, we print out the loss and reset it back to 0.
Then we have another for-loop, which fetches our data in batches from the
train_loader .
From our defined model, we obtain a prediction, compute the loss for
that mini-batch, perform back-propagation using train_loss.backward() , and update
the weights with optimizer.step() .
Finally, we add up the losses of all the mini-batches and divide the sum by the number of mini-
batches, i.e. the length of train_loader , to obtain the average loss per epoch.
The procedure we follow for validation is exactly the same as for training, except
that we wrap it in torch.no_grad() and do not perform any back-propagation.
torch.no_grad() tells PyTorch that we do not need gradients to be tracked,
which reduces memory usage and speeds up computation.
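A minimal demonstration of what torch.no_grad() changes:

w = torch.randn(3, requires_grad=True)

y = (w * 2).sum()
print(y.requires_grad)  # True: a graph was built, backward() would work

with torch.no_grad():
    z = (w * 2).sum()
print(z.requires_grad)  # False: no graph, so no gradient bookkeeping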
Let’s plot the train and validation loss per epoch.

train_val_loss_df = pd.DataFrame.from_dict(loss_stats).reset_index().melt(id_vars=['index']).rename(columns={"index": "epochs"})

plt.figure(figsize=(15, 8))
# Draw the line plot the figure was set up for (this call was cut off in the source)
sns.lineplot(data=train_val_loss_df, x="epochs", y="value", hue="variable").set_title('Train-Val Loss/Epoch')
Test Model
After training is done, we need to test how our model fared. Note that we’ve used
model.eval() before we run our testing code. To tell PyTorch that we do not want to
perform back-propagation during inference, we use torch.no_grad() , just like we did
for the validation loop.
y_pred_list = []

with torch.no_grad():
    model.eval()
    for X_batch, _ in test_loader:
        X_batch = X_batch.to(device)
        y_test_pred = model(X_batch)
        y_pred_list.append(y_test_pred.cpu().numpy())
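To put a number on how the model fared, one common choice (a sketch using Sklearn metrics, not shown in the source) is MSE and R²:

from sklearn.metrics import mean_squared_error, r2_score

# Flatten the per-batch predictions into one flat list of floats
y_pred_flat = [p.squeeze().item() for p in y_pred_list]

print("MSE:", mean_squared_error(y_test, y_pred_flat))
print("R^2:", r2_score(y_test, y_pred_flat))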
. . .
Thank you for reading. Suggestions and constructive criticism are welcome. :)
This blogpost is a part of the column — “How to train your Neural Net”. You can find
the column here.
You can find me on LinkedIn and Twitter. If you liked this, check out my other
blogposts.