CV Group 68

Group Members:

1. DEVBRAT KUMAR 2022ac05567@wilp.bits-pilani.ac.in
2. DHOLAKIA YATHARTH MANISH 2023aa05496@wilp.bits-pilani.ac.in
3. GUJJARI MANIDEEP 2023aa05543@wilp.bits-pilani.ac.in
4. VIPULA VIJAY KUMAR 2023aa05691@wilp.bits-pilani.ac.in

Problem Statement 1
To develop a deep learning fusion model that integrates both CT and MRI data using cross-
modal attention mechanisms to perform feature-level fusion, enhancing multimodal
representation learning.


from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

from tqdm import tqdm  # For progress bars
import pandas as pd  # For creating tables
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
import torchvision.models as models
from torch.utils.data import DataLoader, Dataset
from PIL import Image
import numpy as np
import os
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

1. Data Preprocessing
A pair of CT and MRI images at a given index is loaded, undergoes a series of transformations including resizing, tensor conversion, and normalization, and is returned along with a simulated label. This section handles data loading from specified directories, initializes a CTMRIDataset object, and sets up a DataLoader for efficient batch processing.

# ===========================
# 1. Data Preprocessing
# ===========================
class CTMRIDataset(Dataset):
    def __init__(self, ct_dir, mri_dir, transform=None):
        self.ct_images = sorted(os.listdir(ct_dir))
        self.mri_images = sorted(os.listdir(mri_dir))
        self.ct_dir = ct_dir
        self.mri_dir = mri_dir
        self.transform = transform

    def __len__(self):
        return min(len(self.ct_images), len(self.mri_images))

    def __getitem__(self, idx):
        ct_path = os.path.join(self.ct_dir, self.ct_images[idx])
        mri_path = os.path.join(self.mri_dir, self.mri_images[idx])

        ct_image = Image.open(ct_path).convert("RGB")
        mri_image = Image.open(mri_path).convert("RGB")

        if self.transform:
            ct_image = self.transform(ct_image)
            mri_image = self.transform(mri_image)

        label = torch.randint(0, 2, (1,)).item()  # Simulated binary label

        return ct_image, mri_image, label

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5])
])

# Load Data
ct_dir = "/content/drive/MyDrive/Colab Notebooks/3rd Sem/CV/ct_dir"
mri_dir = "/content/drive/MyDrive/Colab Notebooks/3rd Sem/CV/mri_dir"
dataset = CTMRIDataset(ct_dir, mri_dir, transform)
data_loader = DataLoader(dataset, batch_size=1, shuffle=True)
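As a quick sanity check of the pipeline, one batch can be drawn from the loader (an illustrative snippet, not part of the original notebook):

# Draw a single batch to verify tensor shapes and the simulated label
ct_batch, mri_batch, label = next(iter(data_loader))
print(ct_batch.shape)   # torch.Size([1, 3, 224, 224])
print(mri_batch.shape)  # torch.Size([1, 3, 224, 224])
print(label)            # tensor([0]) or tensor([1])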

2. Model Development
The cross-modal attention mechanism is implemented to fuse information from the CT and MRI features by calculating attention weights. The FusionViT class defines the main deep learning model for multimodal fusion. It uses two pre-trained Vision Transformers (ViTs) to extract features from the CT and MRI images separately. The extracted features are then fused by the CrossModalAttention module, which enhances complementary information across modalities. Finally, a fully connected layer performs the classification.

# ===========================
# 2. Model Development
# ===========================
class CrossModalAttention(nn.Module):
    def __init__(self, embed_dim):
        super(CrossModalAttention, self).__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, ct_features, mri_features):
        # CT features form the queries; MRI features supply keys and values
        Q = self.query(ct_features)
        K = self.key(mri_features)
        V = self.value(mri_features)
        attention_weights = self.softmax(torch.matmul(Q, K.transpose(-2, -1)))
        return torch.matmul(attention_weights, V)
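Note that this forward pass computes unscaled dot-product attention. The standard formulation divides the scores by the square root of the embedding dimension before the softmax, which keeps the logits in a range where gradients remain stable. A minimal sketch of that variant (illustrative, not part of the original notebook; scaled_cross_attention is a hypothetical helper):

import math

def scaled_cross_attention(Q, K, V):
    # Standard scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d)
    return torch.matmul(torch.softmax(scores, dim=-1), V)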

class FusionViT(nn.Module):
    def __init__(self, embed_dim=768, num_classes=2):
        super(FusionViT, self).__init__()
        self.vit_ct = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
        self.vit_mri = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
        self.vit_ct.heads = nn.Identity()   # strip the classification heads
        self.vit_mri.heads = nn.Identity()  # to expose the 768-d CLS features
        self.attention = CrossModalAttention(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, ct_image, mri_image):
        # Each backbone returns a (batch, 768) CLS embedding;
        # unsqueeze(1)[:, 0, :] leaves that tensor unchanged
        ct_features = self.vit_ct(ct_image).unsqueeze(1)[:, 0, :]
        mri_features = self.vit_mri(mri_image).unsqueeze(1)[:, 0, :]
        fused_features = self.attention(ct_features, mri_features)
        return self.fc(fused_features)
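A throwaway forward pass on random tensors confirms the expected output shape (illustrative only; fusion_demo is a hypothetical instance not used elsewhere):

fusion_demo = FusionViT().to(device)
ct = torch.randn(1, 3, 224, 224, device=device)
mri = torch.randn(1, 3, 224, 224, device=device)
with torch.no_grad():
    print(fusion_demo(ct, mri).shape)  # torch.Size([1, 2])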

3. Training Pipeline
The train_function handles the training process by iterating over the dataset for a specified number of epochs. It takes the model, data loader, loss function, and optimizer as inputs. During each epoch, it processes the data, computes the loss, and updates the model's parameters to improve performance.

# ===========================
# 3. Training Pipeline
# ===========================
def train_function(model, data_loader, criterion, optimizer, epochs=10, is_fusion=True, modality="ct"):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for ct_image, mri_image, labels in data_loader:
            ct_image, mri_image, labels = ct_image.to(device), mri_image.to(device), labels.to(device)
            optimizer.zero_grad()

            # Conditional forward pass based on model type
            if is_fusion:
                outputs = model(ct_image, mri_image)
            elif modality == "mri":
                outputs = model(mri_image)  # MRI-only model sees MRI inputs
            else:
                outputs = model(ct_image)   # CT-only model sees CT inputs

            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {total_loss/len(data_loader):.4f}")

4. Evaluation Metrics
The evaluate function calculates the performance of the trained model, taking the model and data loader as input. It computes and prints the accuracy, precision, recall, and F1-score.

# ===========================
# 4. Evaluation Metrics
# ===========================
def evaluate(model, data_loader, is_fusion=True, modality="ct"):
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for ct_image, mri_image, labels in data_loader:
            ct_image, mri_image = ct_image.to(device), mri_image.to(device)

            # Conditional forward pass based on model type
            if is_fusion:
                outputs = model(ct_image, mri_image)
            elif modality == "mri":
                outputs = model(mri_image)  # MRI-only model sees MRI inputs
            else:
                outputs = model(ct_image)   # CT-only model sees CT inputs

            predicted = torch.argmax(outputs, dim=1).cpu().numpy()
            all_preds.extend(predicted)
            all_labels.extend(labels.numpy())

    acc = accuracy_score(all_labels, all_preds)
    prec = precision_score(all_labels, all_preds, average='weighted', zero_division=1)
    rec = recall_score(all_labels, all_preds, average='weighted', zero_division=1)
    f1 = f1_score(all_labels, all_preds, average='weighted', zero_division=1)
    print(f"Accuracy: {acc:.2f}, Precision: {prec:.2f}, Recall: {rec:.2f}, F1-score: {f1:.2f}")
    return acc, prec, rec, f1
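For diagnosing the precision/recall gap discussed in the analysis below, a confusion matrix is often more informative than weighted averages alone. A minimal illustration on hypothetical labels (not the notebook's data):

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 0]  # hypothetical ground-truth labels
y_pred = [0, 0, 1, 0, 0, 0]  # hypothetical predictions skewed toward class 0
print(confusion_matrix(y_true, y_pred))
# [[3 0]
#  [2 1]]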

5. Single-Modality Models and Comparison
In this section, instances of the CTViT and MRIViT classes are created and trained using train_function. An instance of the FusionViT class is likewise initialized and trained with the same function. After training, the fusion, CT, and MRI models are evaluated using the evaluate function, and their performance is compared.

# ===========================
# 5. Single-Modality Models and Comparison
# ===========================
class CTViT(nn.Module):
    def __init__(self, embed_dim=768, num_classes=2):
        super(CTViT, self).__init__()
        self.vit = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
        self.vit.heads = nn.Identity()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, ct_image):
        return self.fc(self.vit(ct_image).unsqueeze(1)[:, 0, :])

class MRIViT(nn.Module):
    def __init__(self, embed_dim=768, num_classes=2):
        super(MRIViT, self).__init__()
        self.vit = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
        self.vit.heads = nn.Identity()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, mri_image):
        return self.fc(self.vit(mri_image).unsqueeze(1)[:, 0, :])

# Instantiate and Train Single-Modality Models
ct_model, mri_model = CTViT().to(device), MRIViT().to(device)
ct_optimizer = optim.Adam(ct_model.parameters(), lr=0.0001)
mri_optimizer = optim.Adam(mri_model.parameters(), lr=0.0001)

# Define the loss function (criterion)
criterion = nn.CrossEntropyLoss()

train_function(ct_model, data_loader, criterion, ct_optimizer, epochs=10, is_fusion=False)
train_function(mri_model, data_loader, criterion, mri_optimizer, epochs=10, is_fusion=False, modality="mri")

# Instantiate the Fusion Model
model = FusionViT().to(device)  # Create an instance of the FusionViT class
fusion_optimizer = optim.Adam(model.parameters(), lr=0.0001)  # Optimizer for the fusion model
train_function(model, data_loader, criterion, fusion_optimizer, epochs=10, is_fusion=True)

# Evaluate All Models
print("\nEvaluation Results:")
print("\nFusion Model:")
fusion_results = evaluate(model, data_loader, is_fusion=True)
print("\nCT Model:")
ct_results = evaluate(ct_model, data_loader, is_fusion=False)
print("\nMRI Model:")
mri_results = evaluate(mri_model, data_loader, is_fusion=False, modality="mri")

print("\nPerformance Comparison:")
print(f"Fusion: {fusion_results}, CT: {ct_results}, MRI: {mri_results}")

Downloading: "https://download.pytorch.org/models/vit_b_16-c867db91.pth" to /root/.cache/tor


100%|██████████| 330M/330M [00:02<00:00, 118MB/s]
Epoch [1/10], Loss: 0.7419
Epoch [2/10], Loss: 0.7252
Epoch [3/10], Loss: 0.7137
Epoch [4/10], Loss: 0.7048
Epoch [5/10], Loss: 0.7092
Epoch [6/10], Loss: 0.7014
Epoch [7/10], Loss: 0.7068
Epoch [8/10], Loss: 0.7015
Epoch [9/10], Loss: 0.7018
Epoch [10/10], Loss: 0.7013
Epoch [1/10], Loss: 0.7692
Epoch [2/10], Loss: 0.7066
Epoch [3/10], Loss: 0.7100
Epoch [4/10], Loss: 0.7036
Epoch [5/10], Loss: 0.7008
Epoch [6/10], Loss: 0.6969
Epoch [7/10], Loss: 0.6957
Epoch [8/10], Loss: 0.6950
Epoch [9/10], Loss: 0.6949
Epoch [10/10], Loss: 0.6957
Epoch [1/10], Loss: 0.7346
Epoch [2/10], Loss: 0.7078
Epoch [3/10], Loss: 0.7074
Epoch [4/10], Loss: 0.6979
Epoch [5/10], Loss: 0.6987
Epoch [6/10], Loss: 0.6995
Epoch [7/10], Loss: 0.6947
Epoch [8/10], Loss: 0.6986
Epoch [9/10], Loss: 0.6974
Epoch [10/10], Loss: 0.6955

Evaluation Results:

Fusion Model:
Accuracy: 0.51, Precision: 0.75, Recall: 0.51, F1-score: 0.35

CT Model:
Accuracy: 0.54, Precision: 0.75, Recall: 0.54, F1-score: 0.38

MRI Model:
Accuracy: 0.49, Precision: 0.75, Recall: 0.49, F1-score: 0.32

Performance Comparison:
Fusion: (0.510752688172043, 0.7501156203029252, 0.510752688172043, 0.3453487927141928), CT:
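Since pandas was imported for creating tables but never used in the original cells, the comparison could also be tabulated. An optional sketch (results_df is a hypothetical name; assumes the result tuples above):

results_df = pd.DataFrame(
    [fusion_results, ct_results, mri_results],
    index=["Fusion", "CT", "MRI"],
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
)
print(results_df.round(3))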

Accuracy comparison

The results indicate that the fusion model does not outperform the single-modality CT model
and might require further tuning. Accuracy comparison is as below:

Fusion Model: 51.0%

CT Model: 54.2% (Highest)

MRI Model: 48.9% (Lowest)

The CT-only model achieves better accuracy than the fusion model, which suggests that the
fusion mechanism might not be effectively combining information from both modalities.

Precision Analysis
Fusion Model: 75.0%

CT Model: 75.2% (Highest)

MRI Model: 75.0%

The precision is nearly identical across all models. Note, however, that precision is computed with zero_division=1, which assigns a precision of 1.0 to any class a model never predicts; with roughly balanced simulated labels, a model that collapses toward a single class therefore lands near 0.75 weighted precision. The uniform values are thus more likely an artifact of this setting than a sign that all three models are equally confident in their positive predictions.

Recall (Sensitivity)

Fusion Model: 51.0%

CT Model: 54.2% (Highest)

MRI Model: 48.9% (Lowest)

The fusion model doesn't improve recall, which indicates it does not capture more relevant
instances than the single-modality CT model.

F1-Score (Balance of Precision & Recall)

Fusion Model: 0.35

CT Model: 0.38 (Highest)

MRI Model: 0.32 (Lowest)

The CT model achieves the best F1-score, indicating that it has a better trade-off between
precision and recall compared to the fusion model.

Interpretation of Results

The CT model outperforms the fusion and MRI models in accuracy and F1-score, suggesting that CT data alone is more informative than MRI for the given classification task.

The fusion model does not show significant improvement. Despite the integration of both CT and MRI data, the fusion model's accuracy is slightly lower than the CT-only model's, and its F1-score also underperforms, indicating that the cross-modal attention mechanism is not leveraging complementary features.

The MRI model performs poorly. The MRI-only model achieves the lowest accuracy and F1-score, suggesting that MRI images, in isolation, contain less discriminative information than CT scans for this task.

Reasons for the Results' Trend

a) Fusion model not fully exploiting cross-modal features

The fusion mechanism is not effectively integrating features from CT and MRI, and the cross-modal attention layer is not optimally weighting important information from both modalities. In fact, because each modality is reduced to a single feature vector and the batch size is 1, the softmax in CrossModalAttention is taken over a 1x1 score matrix: the attention weight is always exactly 1, so the fused output reduces to the value projection of the MRI features and the CT features do not influence the fused representation at all. Fine-tuning the attention mechanism (for example, attending over patch tokens rather than a single pooled vector) or experimenting with feature concatenation could help overcome this problem, as sketched below.
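A minimal concatenation-fusion head, one of the alternatives suggested above (an illustrative sketch, not part of the original notebook; ConcatFusionHead is a hypothetical class):

class ConcatFusionHead(nn.Module):
    # Fuses the two 768-d embeddings by concatenation instead of attention
    def __init__(self, embed_dim=768, num_classes=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, num_classes),
        )

    def forward(self, ct_features, mri_features):
        return self.fc(torch.cat([ct_features, mri_features], dim=-1))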

b) Data mismatch between CT and MRI

If the CT and MRI scans are not well aligned and contain different levels of detail, the model struggles to extract shared patterns. A possible solution is to improve data preprocessing, for example using image registration techniques to align the CT and MRI scans better.

c) Overfitting or underfitting

The relatively high precision (0.75) but low recall (Fusion: 0.51, CT: 0.54, MRI: 0.49) suggests that the models are confident in their predictions but fail to recall all relevant instances. Increasing the dataset size to improve generalization, and applying regularization techniques such as dropout or L2 regularization, can help prevent overfitting. Evaluating on a held-out split rather than the training loader would also give a truer picture, as sketched below.
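A sketch of a held-out split (the original notebook trains and evaluates on the same loader); random_split is standard torch.utils.data:

from torch.utils.data import random_split

n_train = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=1, shuffle=True)
test_loader = DataLoader(test_set, batch_size=1, shuffle=False)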

d) Class imbalance

If one class dominates the dataset, models may favor the majority class, leading to high precision but low recall. Data augmentation or class-weighted loss functions can balance the predictions and resolve this issue, as in the sketch below.
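A class-weighted CrossEntropyLoss, with hypothetical label counts for illustration (in practice, derive the counts from the actual dataset):

class_counts = torch.tensor([60.0, 40.0])                 # hypothetical label counts
class_weights = class_counts.sum() / (2 * class_counts)   # inverse-frequency weighting
weighted_criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))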

