Lab 2.ipynb - Colab
Each file has nine columns with the car's attributes: model, year, price, transmission, mileage, fuel type, road tax, fuel consumption (mpg), and
engine size. Transmission, fuel type, and year are discrete/categorical attributes; the others are continuous. Our goal here is to predict the car's
price based on its other attributes.
We'll start by building a datapipe that reads all the information from the CSV files, and then we'll use this datapipe as a drop-in replacement
for the dataset we typically use with data loaders. Using datapipes in their functional form will illustrate some of the challenges of
dealing with real-world data.
To download the dataset, you'll need to create a Kaggle account. In the following sections, we're assuming the dataset was downloaded and
unzipped to a local folder named car_prices . Alternatively, you can download it from the following link:
https://github.com/dvgodoy/assets/raw/main/PyTorchInPractice/data/100KUsedCar/car_prices.zip
In Colab, you can run the following commands to download and unzip the dataset:
!wget https://github.com/dvgodoy/assets/raw/main/PyTorchInPractice/data/100KUsedCar/car_prices.zip
!unzip car_prices.zip -d car_prices
Archive: car_prices.zip
inflating: car_prices/audi.csv
inflating: car_prices/bmw.csv
inflating: car_prices/cclass.csv
inflating: car_prices/focus.csv
inflating: car_prices/ford.csv
inflating: car_prices/hyundi.csv
inflating: car_prices/merc.csv
inflating: car_prices/skoda.csv
inflating: car_prices/toyota.csv
inflating: car_prices/unclean cclass.csv
inflating: car_prices/unclean focus.csv
inflating: car_prices/vauxhall.csv
inflating: car_prices/vw.csv
First, we install torchdata and build the "dropdown" dictionaries used to preprocess the data:
!pip install torchdata==0.8.0
Collecting torchdata==0.8.0
  Downloading torchdata-0.8.0-cp310-cp310-manylinux1_x86_64.whl.metadata (5.4 kB)
Downloading torchdata-0.8.0-cp310-cp310-manylinux1_x86_64.whl (2.7 MB)
Installing collected packages: torchdata
Successfully installed torchdata-0.8.0
import os
import numpy as np
import pandas as pd
import torchdata.datapipes as dp
from torch.utils.data import DataLoader
def filter_for_data(filename):
    return ("unclean" not in filename) and ("focus" not in filename) and ("cclass" not in filename) and filename.endswith(".csv")

def get_manufacturer(content):
    path, data = content
    manuf = os.path.splitext(os.path.basename(path))[0].upper()
    data.extend([manuf])
    return data

def gen_encoder_dict(series):
    values = series.unique()
    return dict(zip(values, range(len(values))))
tmp_dp = dp.iter.FileLister('./car_prices')
tmp_dp = tmp_dp.filter(filter_fn=filter_for_data)
tmp_dp = tmp_dp.open_files(mode='rt')
tmp_dp = tmp_dp.parse_csv(delimiter=",", skip_lines=1, return_path=True)
tmp_dp = tmp_dp.map(get_manufacturer)
colnames = ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer']
df = pd.DataFrame(list(tmp_dp), columns=colnames)
N_ROWS = len(df)
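With the DataFrame in hand, the encoder ("dropdown") dictionaries can be built. The following is a minimal sketch; the cont_attr/cat_attr split below (with price as the target) is an assumption, chosen to be consistent with the n_cont and cat_list values printed further down:

# Assumed split of columns: price is the target, the rest are either continuous or categorical
cont_attr = ['year', 'mileage', 'road_tax', 'mpg', 'engine_size']
cat_attr = ['model', 'transmission', 'fuel_type', 'manufacturer']

# One encoder dictionary per categorical attribute, mapping each unique value to an integer code
dropdown_encoders = {col: gen_encoder_dict(df[col]) for col in cat_attr}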
def preproc(row):
    colnames = ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer']
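A minimal sketch of one way the rest of preproc() could look, assuming the cont_attr, cat_attr, and dropdown_encoders objects from the sketch above and the dictionary keys (cont_X, cat_X, label) used by the data loaders later on:

def preproc(row):
    colnames = ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer']
    cargo = dict(zip(colnames, row))
    return {
        # The target is the car's price
        'label': np.array([cargo['price']], dtype=np.float32),
        # Continuous attributes are simply cast to float
        'cont_X': np.array([cargo[name] for name in cont_attr], dtype=np.float32),
        # Categorical attributes are replaced by their integer ("dropdown") codes
        'cat_X': np.array([dropdown_encoders[name][cargo[name]] for name in cat_attr], dtype=int),
    }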
/usr/local/lib/python3.10/dist-packages/torchdata/datapipes/__init__.py:18: UserWarning:
################################################################################
WARNING!
The 'datapipes', 'dataloader2' modules are deprecated and will be removed in a
future torchdata release! Please see https://github.com/pytorch/data/issues/1196
to learn more and leave feedback.
################################################################################
deprecation_warning()
datapipe = dp.iter.FileLister('./car_prices')
datapipe = datapipe.filter(filter_fn=filter_for_data)
datapipe = datapipe.open_files(mode='rt')
datapipe = datapipe.parse_csv(delimiter=",", skip_lines=1, return_path=True)
datapipe = datapipe.map(get_manufacturer)
datapipe = datapipe.map(preproc)
datapipes = {}
datapipes['train'] = datapipe.random_split(total_length=N_ROWS, weights={"train": 0.8, "val": 0.1, "test": 0.1}, seed=11, target='train')
datapipes['val'] = datapipe.random_split(total_length=N_ROWS, weights={"train": 0.8, "val": 0.1, "test": 0.1}, seed=11, target='val')
datapipes['test'] = datapipe.random_split(total_length=N_ROWS, weights={"train": 0.8, "val": 0.1, "test": 0.1}, seed=11, target='test')
datapipes['train'] = datapipes['train'].shuffle(buffer_size=100000)
Once the datapipes are ready, we create data loaders so we can load mini-batches of data, one at a time:
dataloaders = {}
dataloaders['train'] = DataLoader(dataset=datapipes['train'], batch_size=128, drop_last=True, shuffle=True)
dataloaders['val'] = DataLoader(dataset=datapipes['val'], batch_size=128)
dataloaders['test'] = DataLoader(dataset=datapipes['test'], batch_size=128)
In the constructor method, you will define the parts that make up your model, such as linear layers and embeddings, as class attributes. Don't
forget to include a call to super().__init__() at the top of the method so it executes the code from the parent class before your own. In our
case, the model will receive the following arguments:
n_cont: the number of continuous attributes
cat_list: a list of arrays, each containing the unique values of one categorical attribute ("dropdown")
emb_dim: the number of dimensions of each embedding
The forward() method is where the magic happens, as you know. It receives an input x , which can be anything (e.g. a tensor, a tuple, a
dictionary), and forwards this input through your model's components, such as layers, activation functions, and embeddings. In the end, it
should return a prediction.
Don't forget your data loader is returning dictionaries now, so you'll need to make adjustments to how your model handles its inputs. Also, don't
forget to add a batch normalization layer to preprocess the continuous attributes; optionally, you can also add batch normalization layers
after each hidden linear layer. Please refer to the diagram below for the implementation.
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
class FFN(nn.Module):
    def __init__(self, n_cont, cat_list, emb_dim):
        super().__init__()

        # Embedding layers
        embedding_layers = []
        # Creates one embedding layer for each categorical feature
        # write your code here
        self.emb_layers = nn.ModuleList(embedding_layers)

        # Linear Layer(s)
        lin_layers = []
        # The input layer takes as many inputs as the number of continuous features
        # plus the total number of concatenated embeddings
        # The number of outputs is your own choice
        # Optionally, add more hidden layers, don't forget to match the dimensions if you do
        # write your code here
        lin_layers.append(nn.Linear(self.n_emb + self.n_cont, 100))
        self.lin_layers = nn.ModuleList(lin_layers)

        # The output layer must have as many inputs as there were outputs in the last hidden layer
        # write your code here
        self.output_layer = nn.Linear(self.lin_layers[-1].out_features, 1)

        # Layer initialization
        for lin_layer in self.lin_layers:
            nn.init.kaiming_normal_(lin_layer.weight.data, nonlinearity='relu')
        nn.init.kaiming_normal_(self.output_layer.weight.data, nonlinearity='relu')

    def forward(self, x):
        # write your code here

        # Run the inputs through each layer and apply an activation function and batch norm to each output
        for layer, bn_layer in zip(self.lin_layers, self.bn_layers):
            # write your code here
            x = layer(x)
            x = F.relu(x)
            x = bn_layer(x)

        # Run the output of the last linear layer through the output layer
        # write your code here
        x = self.output_layer(x)
        return x
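Since several of the blanks above are left for you to fill in, here is a minimal sketch of one possible completion, written as a separate class so it doesn't stand in for the official solution. The single hidden layer of size 100, the use of nn.Embedding, and the dictionary keys cont_X/cat_X (matching the evaluation loop further below) are choices and assumptions:

class FFNSketch(nn.Module):
    def __init__(self, n_cont, cat_list, emb_dim):
        super().__init__()
        self.n_cont = n_cont
        # One embedding layer per categorical feature, sized by its number of unique values
        self.emb_layers = nn.ModuleList([nn.Embedding(len(cats), emb_dim) for cats in cat_list])
        self.n_emb = emb_dim * len(cat_list)
        # Batch norm to standardize the continuous attributes
        self.bn_cont = nn.BatchNorm1d(n_cont)
        # A single hidden layer (its size is a free choice) and its batch norm
        self.lin_layers = nn.ModuleList([nn.Linear(self.n_emb + self.n_cont, 100)])
        self.bn_layers = nn.ModuleList([nn.BatchNorm1d(100)])
        self.output_layer = nn.Linear(self.lin_layers[-1].out_features, 1)
        # Layer initialization
        for lin_layer in self.lin_layers:
            nn.init.kaiming_normal_(lin_layer.weight.data, nonlinearity='relu')
        nn.init.kaiming_normal_(self.output_layer.weight.data, nonlinearity='relu')

    def forward(self, x):
        # The data loader returns a dictionary; fetch continuous and categorical inputs
        cont_X, cat_X = x['cont_X'], x['cat_X']
        # Look up one embedding per categorical column and concatenate them with the
        # (batch-normalized) continuous attributes
        embeddings = [emb(cat_X[:, i]) for i, emb in enumerate(self.emb_layers)]
        x = torch.cat([self.bn_cont(cont_X)] + embeddings, dim=1)
        for layer, bn_layer in zip(self.lin_layers, self.bn_layers):
            x = layer(x)
            x = F.relu(x)
            x = bn_layer(x)
        return self.output_layer(x)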
Just run the cell below as is to populate a few variables and visualize the outputs:
n_cont = len(cont_attr)
cat_list = [np.array(list(dropdown_encoders[name].values())) for name in cat_attr]
n_cont, cat_list
(5,
[array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,
104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194]),
array([0, 1, 2, 3]),
array([0, 1, 2, 3, 4]),
array([0, 1, 2, 3, 4, 5, 6, 7, 8])])
The n_cont variable contains the number of continuous attributes you're using. The cat_list variable contains a list of arrays, each one
containing the unique encoded values corresponding to one of the categorical attributes ("dropdowns").
Both variables, together with the number of embedding dimensions you chose ( emb_dim ), should be used as arguments to create an instance
of your custom model class ( FFN ):
torch.manual_seed(42)
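For instance, a possible instantiation looks like the following; the emb_dim value is a free choice (5 is only an illustrative example):

model = FFN(n_cont=n_cont, cat_list=cat_list, emb_dim=5)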
Next, you will write the training loop using the data loaders to iterate through your training and validation data (these loops are written for you
already).
The training loop itself is pretty much the same as in the previous lab, but don't forget your data loaders return dictionaries now, so you'll need
to adjust the way your data is being sent to the appropriate device.
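The loop below also relies on a device, a loss function, an optimizer, and tqdm for its progress bar. A minimal setup sketch, where the MSE loss, the Adam optimizer, and the learning rate are assumptions:

from tqdm.auto import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'
loss_fn = nn.MSELoss()                               # assumed loss for this regression task
optimizer = optim.Adam(model.parameters(), lr=3e-3)  # assumed optimizer and learning rate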
n_epochs = 20
losses = torch.empty(n_epochs)
val_losses = torch.empty(n_epochs)
best_loss = torch.inf
best_epoch = -1
patience = 3
model.to(device)
progress_bar = tqdm(range(n_epochs))
for epoch in progress_bar:
    ## Training
    batch_losses = []
    for i, batch in enumerate(dataloaders['train']):
        # Set the model to training mode
        # write your code here
        model.train()

        # write your code here

        batch_losses.append(loss.item())
    losses[epoch] = torch.tensor(batch_losses).mean()

    ## Validation
    with torch.inference_mode():
        batch_losses = []
        for i, batch in enumerate(dataloaders['val']):
            # write your code here

            batch_losses.append(loss.item())
        val_losses[epoch] = torch.tensor(batch_losses).mean()
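For reference, here is a minimal sketch of one way to fill in the blanks above; the device transfers, the forward/backward steps using loss_fn and optimizer from the setup sketch, and the simple early-stopping rule at the end are assumptions, not the lab's official solution:

for epoch in progress_bar:
    ## Training
    model.train()
    batch_losses = []
    for i, batch in enumerate(dataloaders['train']):
        # Send every tensor in the dictionary to the device
        batch['cont_X'] = batch['cont_X'].to(device)
        batch['cat_X'] = batch['cat_X'].to(device)
        batch['label'] = batch['label'].to(device)
        # Forward pass, loss, backward pass, and parameter update
        yhat = model(batch)
        loss = loss_fn(yhat, batch['label'])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        batch_losses.append(loss.item())
    losses[epoch] = torch.tensor(batch_losses).mean()

    ## Validation
    model.eval()
    with torch.inference_mode():
        batch_losses = []
        for i, batch in enumerate(dataloaders['val']):
            batch['cont_X'] = batch['cont_X'].to(device)
            batch['cat_X'] = batch['cat_X'].to(device)
            batch['label'] = batch['label'].to(device)
            yhat = model(batch)
            loss = loss_fn(yhat, batch['label'])
            batch_losses.append(loss.item())
        val_losses[epoch] = torch.tensor(batch_losses).mean()

    # Keep track of the best validation loss and stop early once it stops improving
    if val_losses[epoch] < best_loss:
        best_loss = val_losses[epoch]
        best_epoch = epoch
    elif epoch - best_epoch > patience:
        break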
Let's check the evolution of the losses. Run the cell below as is to plot your losses:
import matplotlib.pyplot as plt
plt.plot(losses[:epoch], label='Training')
plt.plot(val_losses[:epoch], label='Validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.yscale('log')
plt.legend()
<matplotlib.legend.Legend at 0x7ea7fc680970>
Then, let's compare predicted and actual values in the validation set.
Run the cell below as is to visualize a scatterplot comparing predicted and actual car prices. A perfect prediction
corresponds to the dashed diagonal line.
split = 'val'
y_hat = []
y_true = []

for batch in dataloaders[split]:
    model.eval()
    batch['cont_X'] = batch['cont_X'].to(device)
    batch['cat_X'] = batch['cat_X'].to(device)
    batch['label'] = batch['label'].to(device)
    y_hat.extend(model(batch).tolist())
    y_true.extend(batch['label'].tolist())
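A minimal sketch of the scatterplot described above; the set_ylim() range is an arbitrary choice (the same limit referred to in the tip below):

# Flatten the (possibly nested) lists of predictions and labels
y_hat_arr = np.array(y_hat).squeeze()
y_true_arr = np.array(y_true).squeeze()

fig, ax = plt.subplots()
ax.scatter(y_true_arr, y_hat_arr, alpha=0.3)
# Dashed diagonal: a perfect prediction falls exactly on this line
ax.plot([y_true_arr.min(), y_true_arr.max()], [y_true_arr.min(), y_true_arr.max()], linestyle='--', color='k')
ax.set_xlabel('Actual Price')
ax.set_ylabel('Predicted Price')
ax.set_ylim([0, 100000])  # arbitrary range; try removing it to spot extreme or negative predictions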
Ideally, you'll see a cloud of points around the diagonal line. What about the R2 score?
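One way to compute it is with scikit-learn's r2_score (an assumption; the lab may compute it differently):

from sklearn.metrics import r2_score

r2_score(y_true, y_hat)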
0.9350898010156775
If your cloud of points was indeed around the diagonal line, you're probably expecting a high R2 score (>0.8). If you got a surprisingly low
value instead, can you guess why?
TIP: Try removing the set_ylim() range and look for extreme or negative values. If, for some reason, your model is producing extreme
predictions (even if there are only a few of them), they may negatively impact the R2 score.