Lab 2.ipynb - Colab
Each file has nine columns with the car's attributes: model, year, price, transmission, mileage, fuel type, road tax, fuel consumption (mpg), and
engine size. Transmission, fuel type, and year are discrete/categorical attributes; the others are continuous. Our goal here is to predict the car's
price based on its other attributes.
We'll start by building a datapipe that reads all the information from the CSV files, and then we'll use this datapipe as a drop-in replacement
for the dataset we typically use with data loaders. Using datapipes in their functional form will illustrate some of the challenges of
dealing with real-world data.
To download the dataset, you'll need to create a Kaggle account. In the following sections, we're assuming the dataset was downloaded and
unzipped to a local folder named car_prices . Alternatively, you can download it from the following link:
https://github.com/dvgodoy/assets/raw/main/PyTorchInPractice/data/100KUsedCar/car_prices.zip
In Colab, you can run the following commands to download and unzip the dataset:
!wget https://github.com/dvgodoy/assets/raw/main/PyTorchInPractice/data/100KUsedCar/car_prices.zip
!unzip car_prices.zip -d car_prices
Archive: car_prices.zip
inflating: car_prices/audi.csv
inflating: car_prices/bmw.csv
inflating: car_prices/cclass.csv
inflating: car_prices/focus.csv
inflating: car_prices/ford.csv
inflating: car_prices/hyundi.csv
inflating: car_prices/merc.csv
inflating: car_prices/skoda.csv
inflating: car_prices/toyota.csv
inflating: car_prices/unclean cclass.csv
inflating: car_prices/unclean focus.csv
inflating: car_prices/vauxhall.csv
inflating: car_prices/vw.csv
First, we install torchdata and build the "dropdown" dictionaries used to preprocess the data:
!pip install torchdata==0.8.0
Collecting torchdata==0.8.0
  Downloading torchdata-0.8.0-cp310-cp310-manylinux1_x86_64.whl.metadata (5.4 kB)
Downloading torchdata-0.8.0-cp310-cp310-manylinux1_x86_64.whl (2.7 MB)
Installing collected packages: torchdata
Successfully installed torchdata-0.8.0
import os
import numpy as np
import pandas as pd
import torchdata.datapipes as dp
from torch.utils.data import DataLoader
def filter_for_data(filename):
    return ("unclean" not in filename) and ("focus" not in filename) and ("cclass" not in filename) and filename.endswith(".csv")

def get_manufacturer(content):
    path, data = content
    manuf = os.path.splitext(os.path.basename(path))[0].upper()
    data.extend([manuf])
    return data

def gen_encoder_dict(series):
    values = series.unique()
    return dict(zip(values, range(len(values))))
tmp_dp = dp.iter.FileLister('./car_prices')
tmp_dp = tmp_dp.filter(filter_fn=filter_for_data)
tmp_dp = tmp_dp.open_files(mode='rt')
tmp_dp = tmp_dp.parse_csv(delimiter=",", skip_lines=1, return_path=True)
tmp_dp = tmp_dp.map(get_manufacturer)
colnames = ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer']
df = pd.DataFrame(list(tmp_dp), columns=colnames)
N_ROWS = len(df)
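With the DataFrame in hand, the encoder ("dropdown") dictionaries can be built. The following is a minimal sketch; the cont_attr/cat_attr split below (with price as the target) is an assumption, chosen to be consistent with the n_cont and cat_list values printed further down:

# Assumed split of columns: price is the target, the rest are either continuous or categorical
cont_attr = ['year', 'mileage', 'road_tax', 'mpg', 'engine_size']
cat_attr = ['model', 'transmission', 'fuel_type', 'manufacturer']

# One encoder dictionary per categorical attribute, mapping each unique value to an integer code
dropdown_encoders = {col: gen_encoder_dict(df[col]) for col in cat_attr}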
def preproc(row):
    colnames = ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer']
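A minimal sketch of one way the rest of preproc() could look, assuming the cont_attr, cat_attr, and dropdown_encoders objects from the sketch above and the dictionary keys (cont_X, cat_X, label) used by the data loaders later on:

def preproc(row):
    colnames = ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer']
    cargo = dict(zip(colnames, row))
    return {
        # The target is the car's price
        'label': np.array([cargo['price']], dtype=np.float32),
        # Continuous attributes are simply cast to float
        'cont_X': np.array([cargo[name] for name in cont_attr], dtype=np.float32),
        # Categorical attributes are replaced by their integer ("dropdown") codes
        'cat_X': np.array([dropdown_encoders[name][cargo[name]] for name in cat_attr], dtype=int),
    }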
/usr/local/lib/python3.10/dist-packages/torchdata/datapipes/__init__.py:18: UserWarning:
################################################################################
WARNING!
The 'datapipes', 'dataloader2' modules are deprecated and will be removed in a
future torchdata release! Please see https://github.com/pytorch/data/issues/1196
to learn more and leave feedback.
################################################################################
deprecation_warning()
datapipe = dp.iter.FileLister('./car_prices')
datapipe = datapipe.filter(filter_fn=filter_for_data)
datapipe = datapipe.open_files(mode='rt')
datapipe = datapipe.parse_csv(delimiter=",", skip_lines=1, return_path=True)
datapipe = datapipe.map(get_manufacturer)
datapipe = datapipe.map(preproc)
datapipes = {}
datapipes['train'] = datapipe.random_split(total_length=N_ROWS, weights={"train": 0.8, "val": 0.1, "test": 0.1}, seed=11, target='train')
datapipes['val'] = datapipe.random_split(total_length=N_ROWS, weights={"train": 0.8, "val": 0.1, "test": 0.1}, seed=11, target='val')
datapipes['test'] = datapipe.random_split(total_length=N_ROWS, weights={"train": 0.8, "val": 0.1, "test": 0.1}, seed=11, target='test')
datapipes['train'] = datapipes['train'].shuffle(buffer_size=100000)
Once the datapipes are ready, we create data loaders so we can load mini-batches of data, one at a time:
dataloaders = {}
dataloaders['train'] = DataLoader(dataset=datapipes['train'], batch_size=128, drop_last=True, shuffle=True)
dataloaders['val'] = DataLoader(dataset=datapipes['val'], batch_size=128)
dataloaders['test'] = DataLoader(dataset=datapipes['test'], batch_size=128)
In the constructor method, you will define the parts that make up your model, such as linear layers and embeddings, as class attributes. Don't
forget to include a call to super().__init__() at the top of the method so it executes the code from the parent class before your own. In our
case, the model will receive the following arguments:
n_cont: the number of continuous attributes
cat_list: a list of arrays, each containing the unique values of one categorical attribute ("dropdown")
emb_dim: the number of dimensions of each embedding
The forward() method is where the magic happens, as you know. It receives an input x , which can be anything (e.g. a tensor, a tuple, a
dictionary), and forwards this input through your model's components, such as layers, activation functions, and embeddings. In the end, it
should return a prediction.
Don't forget your data loader is returning dictionaries now, so you'll need to make adjustments to how your model handles its inputs. Also, don't
forget to add a batch normalization layer to preprocess the continuous attributes; optionally, you can also add batch normalization layers
after each hidden linear layer. Please refer to the diagram below for the implementation.
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
class FFN(nn.Module):
    def __init__(self, n_cont, cat_list, emb_dim):
        super().__init__()

        # Embedding layers
        embedding_layers = []
        # Creates one embedding layer for each categorical feature
        # write your code here
        self.emb_layers = nn.ModuleList(embedding_layers)

        # Linear Layer(s)
        lin_layers = []
        # The input layer takes as many inputs as the number of continuous features
        # plus the total number of concatenated embeddings
        # The number of outputs is your own choice
        # Optionally, add more hidden layers, don't forget to match the dimensions if you do
        # write your code here
        lin_layers.append(nn.Linear(self.n_emb + self.n_cont, 100))
        self.lin_layers = nn.ModuleList(lin_layers)

        # The output layer must have as many inputs as there were outputs in the last hidden layer
        # write your code here
        self.output_layer = nn.Linear(self.lin_layers[-1].out_features, 1)

        # Layer initialization
        for lin_layer in self.lin_layers:
            nn.init.kaiming_normal_(lin_layer.weight.data, nonlinearity='relu')
        nn.init.kaiming_normal_(self.output_layer.weight.data, nonlinearity='relu')

    def forward(self, x):
        # write your code here

        # Run the inputs through each layer and apply an activation function and batch norm to each output
        for layer, bn_layer in zip(self.lin_layers, self.bn_layers):
            # write your code here
            x = layer(x)
            x = F.relu(x)
            x = bn_layer(x)

        # Run the output of the last linear layer through the output layer
        # write your code here
        x = self.output_layer(x)
        return x
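Since several of the blanks above are left for you to fill in, here is a minimal sketch of one possible completion, written as a separate class so it doesn't stand in for the official solution. The single hidden layer of size 100, the use of nn.Embedding, and the dictionary keys cont_X/cat_X (matching the evaluation loop further below) are choices and assumptions:

class FFNSketch(nn.Module):
    def __init__(self, n_cont, cat_list, emb_dim):
        super().__init__()
        self.n_cont = n_cont
        # One embedding layer per categorical feature, sized by its number of unique values
        self.emb_layers = nn.ModuleList([nn.Embedding(len(cats), emb_dim) for cats in cat_list])
        self.n_emb = emb_dim * len(cat_list)
        # Batch norm to standardize the continuous attributes
        self.bn_cont = nn.BatchNorm1d(n_cont)
        # A single hidden layer (its size is a free choice) and its batch norm
        self.lin_layers = nn.ModuleList([nn.Linear(self.n_emb + self.n_cont, 100)])
        self.bn_layers = nn.ModuleList([nn.BatchNorm1d(100)])
        self.output_layer = nn.Linear(self.lin_layers[-1].out_features, 1)
        # Layer initialization
        for lin_layer in self.lin_layers:
            nn.init.kaiming_normal_(lin_layer.weight.data, nonlinearity='relu')
        nn.init.kaiming_normal_(self.output_layer.weight.data, nonlinearity='relu')

    def forward(self, x):
        # The data loader returns a dictionary; fetch continuous and categorical inputs
        cont_X, cat_X = x['cont_X'], x['cat_X']
        # Look up one embedding per categorical column and concatenate them with the
        # (batch-normalized) continuous attributes
        embeddings = [emb(cat_X[:, i]) for i, emb in enumerate(self.emb_layers)]
        x = torch.cat([self.bn_cont(cont_X)] + embeddings, dim=1)
        for layer, bn_layer in zip(self.lin_layers, self.bn_layers):
            x = layer(x)
            x = F.relu(x)
            x = bn_layer(x)
        return self.output_layer(x)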
Just run the cell below as is to populate a few variables and visualize the outputs:
n_cont = len(cont_attr)
cat_list = [np.array(list(dropdown_encoders[name].values())) for name in cat_attr]
n_cont, cat_list
(5,
[array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,
104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194]),
array([0, 1, 2, 3]),
array([0, 1, 2, 3, 4]),
array([0, 1, 2, 3, 4, 5, 6, 7, 8])])
The n_cont variable contains the number of continuous attributes you're using. The cat_list variable contains a list of arrays, each one
containing the unique encoded values corresponding to one of the categorical attributes ("dropdowns").
Both variables, together with the number of embedding dimensions you chose ( emb_dim ), should be used as arguments to create an instance
of your custom model class ( FFN ):
torch.manual_seed(42)
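For instance, a possible instantiation looks like the following; the emb_dim value is a free choice (5 is only an illustrative example):

model = FFN(n_cont=n_cont, cat_list=cat_list, emb_dim=5)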
Next, you will write the training loop using the data loaders to iterate through your training and validation data (these loops are written for you
already).
The training loop itself is pretty much the same as in the previous lab, but don't forget your data loaders return dictionaries now, so you'll need
to adjust the way your data is being sent to the appropriate device.
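The loop below also relies on a device, a loss function, an optimizer, and tqdm for its progress bar. A minimal setup sketch, where the MSE loss, the Adam optimizer, and the learning rate are assumptions:

from tqdm.auto import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'
loss_fn = nn.MSELoss()                               # assumed loss for this regression task
optimizer = optim.Adam(model.parameters(), lr=3e-3)  # assumed optimizer and learning rate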
n_epochs = 20
losses = torch.empty(n_epochs)
val_losses = torch.empty(n_epochs)
best_loss = torch.inf
best_epoch = -1
patience = 3
model.to(device)
progress_bar = tqdm(range(n_epochs))
for epoch in progress_bar:
    ## Training
    batch_losses = []
    for i, batch in enumerate(dataloaders['train']):
        # Set the model to training mode
        # write your code here
        model.train()

        # write your code here

        batch_losses.append(loss.item())
    losses[epoch] = torch.tensor(batch_losses).mean()

    ## Validation
    with torch.inference_mode():
        batch_losses = []
        for i, batch in enumerate(dataloaders['val']):
            # write your code here

            batch_losses.append(loss.item())
        val_losses[epoch] = torch.tensor(batch_losses).mean()
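For reference, here is a minimal sketch of one way to fill in the blanks above; the device transfers, the forward/backward steps using loss_fn and optimizer from the setup sketch, and the simple early-stopping rule at the end are assumptions, not the lab's official solution:

for epoch in progress_bar:
    ## Training
    model.train()
    batch_losses = []
    for i, batch in enumerate(dataloaders['train']):
        # Send every tensor in the dictionary to the device
        batch['cont_X'] = batch['cont_X'].to(device)
        batch['cat_X'] = batch['cat_X'].to(device)
        batch['label'] = batch['label'].to(device)
        # Forward pass, loss, backward pass, and parameter update
        yhat = model(batch)
        loss = loss_fn(yhat, batch['label'])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        batch_losses.append(loss.item())
    losses[epoch] = torch.tensor(batch_losses).mean()

    ## Validation
    model.eval()
    with torch.inference_mode():
        batch_losses = []
        for i, batch in enumerate(dataloaders['val']):
            batch['cont_X'] = batch['cont_X'].to(device)
            batch['cat_X'] = batch['cat_X'].to(device)
            batch['label'] = batch['label'].to(device)
            yhat = model(batch)
            loss = loss_fn(yhat, batch['label'])
            batch_losses.append(loss.item())
        val_losses[epoch] = torch.tensor(batch_losses).mean()

    # Keep track of the best validation loss and stop early once it stops improving
    if val_losses[epoch] < best_loss:
        best_loss = val_losses[epoch]
        best_epoch = epoch
    elif epoch - best_epoch > patience:
        break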
Let's check the evolution of the losses. Run the cell below as is to plot your losses:
import matplotlib.pyplot as plt
plt.plot(losses[:epoch], label='Training')
plt.plot(val_losses[:epoch], label='Validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.yscale('log')
plt.legend()
<matplotlib.legend.Legend at 0x7ea7fc680970>
Then, let's compare predicted and actual values in the validation set.
Run the cell below as is to visualize a scatterplot comparing predicted and actual car prices. A perfect prediction
corresponds to the dashed diagonal line.
split = 'val'
y_hat = []
y_true = []

for batch in dataloaders[split]:
    model.eval()
    batch['cont_X'] = batch['cont_X'].to(device)
    batch['cat_X'] = batch['cat_X'].to(device)
    batch['label'] = batch['label'].to(device)
    y_hat.extend(model(batch).tolist())
    y_true.extend(batch['label'].tolist())
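A minimal sketch of the scatterplot described above; the set_ylim() range is an arbitrary choice (the same limit referred to in the tip below):

# Flatten the (possibly nested) lists of predictions and labels
y_hat_arr = np.array(y_hat).squeeze()
y_true_arr = np.array(y_true).squeeze()

fig, ax = plt.subplots()
ax.scatter(y_true_arr, y_hat_arr, alpha=0.3)
# Dashed diagonal: a perfect prediction falls exactly on this line
ax.plot([y_true_arr.min(), y_true_arr.max()], [y_true_arr.min(), y_true_arr.max()], linestyle='--', color='k')
ax.set_xlabel('Actual Price')
ax.set_ylabel('Predicted Price')
ax.set_ylim([0, 100000])  # arbitrary range; try removing it to spot extreme or negative predictions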
Ideally, you'll see a cloud of points around the diagonal line. What about the R2 score?
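One way to compute it is with scikit-learn's r2_score (an assumption; the lab may compute it differently):

from sklearn.metrics import r2_score

r2_score(y_true, y_hat)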
0.9350898010156775
If your cloud of points was indeed around the diagonal line, you're probably expecting a high R2 score (>0.8). If you got a surprisingly low
value instead, can you guess why?
TIP: Try removing the set_ylim() range and look for extreme or negative values. If, for some reason, your model is producing extreme
predictions (even if there are only a few of them), they may negatively impact the R2 score.