Computer Vision Learning
CV Group 68
Group Members:
Problem Statement 1
To develop a deep learning fusion model that integrates both CT and MRI data using cross-
modal attention mechanisms to perform feature-level fusion, enhancing multimodal
representation learning.
Mounted at /content/drive
1. Data Preprocessing
A pair of CT and MRI images at a given index is loaded, undergoes a series of transformations
including resizing, tensor conversion, and normalization, and is returned along with a simulated
label. This section handles data loading from specified directories, initializes a CTMRIDataset
object, and sets up a DataLoader for efficient batch processing.
1 of 9 11-03-2025, 16:01
CV_assignment_2_group_68.ipynb - Colab https://colab.research.google.com/drive/1U9JpH0WZytzRLmf5XhcqN...
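# ===========================
# 1. Data Preprocessing
# ===========================
import os

import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms


class CTMRIDataset(Dataset):
    """Pairs CT and MRI images by their position in the sorted directory listings."""

    def __init__(self, ct_dir, mri_dir, transform=None):
        self.ct_images = sorted(os.listdir(ct_dir))
        self.mri_images = sorted(os.listdir(mri_dir))
        self.ct_dir = ct_dir
        self.mri_dir = mri_dir
        self.transform = transform

    def __len__(self):
        # Use the shorter listing so every index has both modalities.
        return min(len(self.ct_images), len(self.mri_images))

    def __getitem__(self, idx):
        ct_path = os.path.join(self.ct_dir, self.ct_images[idx])
        mri_path = os.path.join(self.mri_dir, self.mri_images[idx])
        ct_image = Image.open(ct_path).convert("RGB")
        mri_image = Image.open(mri_path).convert("RGB")
        if self.transform:
            ct_image = self.transform(ct_image)
            mri_image = self.transform(mri_image)
        # Assumption: the export omits how the simulated label is produced;
        # a random binary label is used here as a placeholder.
        label = torch.randint(0, 2, (1,)).item()
        return ct_image, mri_image, label


transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    # A single mean/std value is broadcast across the three RGB channels.
    transforms.Normalize(mean=[0.5], std=[0.5]),
])

# Load Data
ct_dir = "/content/drive/MyDrive/Colab Notebooks/3rd Sem/CV/ct_dir"
mri_dir = "/content/drive/MyDrive/Colab Notebooks/3rd Sem/CV/mri_dir"
dataset = CTMRIDataset(ct_dir, mri_dir, transform)
data_loader = DataLoader(dataset, batch_size=1, shuffle=True)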
2. Model Development
The cross-modal attention mechanism is implemented to fuse information from both CT and
MRI features by calculating attention weights. The FusionViT class defines the main deep
learning model for multimodal fusion. It utilizes two pre-trained Vision Transformers (ViTs) to
extract features from CT and MRI images separately. The extracted features are then fused
using the CrossModalAttention module, which enhances complementary information across
modalities. Finally, a fully connected layer is used for classification.
# ===========================
# 2. Model Development
# ===========================
import torch.nn as nn
from torchvision import models


class CrossModalAttention(nn.Module):
    """Query/key/value projections for attention between CT and MRI features."""

    def __init__(self, embed_dim):
        super(CrossModalAttention, self).__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.softmax = nn.Softmax(dim=-1)


class FusionViT(nn.Module):
    """Two ViT-B/16 backbones, a cross-modal attention module, and a linear classifier."""

    def __init__(self, embed_dim=768, num_classes=2):
        super(FusionViT, self).__init__()
        # One pre-trained backbone per modality.
        self.vit_ct = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
        self.vit_mri = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
        # Replace the classification heads so each backbone returns its features.
        self.vit_ct.heads = nn.Identity()
        self.vit_mri.heads = nn.Identity()
        self.attention = CrossModalAttention(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)
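The export shows only the constructors of the attention and fusion classes, so the following is a minimal sketch of what a forward pass for a cross-modal attention module could look like, not the notebook's implementation. It assumes scaled dot-product attention with CT features as queries and MRI features as keys and values. The class name `CrossModalAttentionSketch`, the scaling factor, and the choice of which modality supplies the queries are all assumptions. Random tensors stand in for ViT features so the block runs without downloading weights.

```python
import math

import torch
import torch.nn as nn


class CrossModalAttentionSketch(nn.Module):
    """Hypothetical forward pass: CT features attend to MRI features."""

    def __init__(self, embed_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.softmax = nn.Softmax(dim=-1)
        self.scale = math.sqrt(embed_dim)

    def forward(self, ct_feat, mri_feat):
        # ct_feat, mri_feat: (batch, tokens, embed_dim)
        q = self.query(ct_feat)
        k = self.key(mri_feat)
        v = self.value(mri_feat)
        # Attention weights over MRI tokens for every CT token.
        weights = self.softmax(q @ k.transpose(-2, -1) / self.scale)
        return weights @ v, weights


torch.manual_seed(0)
attn = CrossModalAttentionSketch(embed_dim=768)

# Token sequences: every CT token attends over four MRI tokens.
ct_tokens = torch.randn(2, 4, 768)
mri_tokens = torch.randn(2, 4, 768)
fused, weights = attn(ct_tokens, mri_tokens)
print(fused.shape, weights.shape)

# Pooled vectors with a token axis of length 1, as produced by unsqueeze(1).
_, single_weights = attn(torch.randn(2, 1, 768), torch.randn(2, 1, 768))
print(single_weights.flatten())
```

One design point follows from this sketch. When each modality contributes a single token, the softmax is taken over one key, so every attention weight equals 1 and the fused output reduces to the value projection of the MRI feature regardless of the CT input. A torchvision ViT with `heads` replaced by `nn.Identity()` returns one pooled vector per image, so attention of this form would need the full token sequence from each backbone to weight anything.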
3. Training Pipeline
The train_function handles the training process by iterating over the dataset for a specified
number of epochs. It takes the model, data loader, loss function, and optimizer as inputs. During
each iteration, the function processes the data, computes the loss, and updates the model’s
parameters to improve performance.
# ===========================
# 3. Training Pipeline
# ===========================
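The body of the training cell is not present in the export. As a rough illustration of the loop described above (model, data loader, loss function, optimizer, fixed number of epochs), here is a sketch that runs on random tensors. The names `train_sketch` and `ToyFusion`, the `(ct, mri, label)` batch layout, and the hyperparameters are assumptions made for the example, and the toy model is a stand-in rather than the notebook's FusionViT.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class ToyFusion(nn.Module):
    """Tiny stand-in for a two-input fusion model (not the notebook's FusionViT)."""

    def __init__(self, in_dim=8, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(2 * in_dim, num_classes)

    def forward(self, ct, mri):
        return self.fc(torch.cat([ct, mri], dim=1))


def train_sketch(model, data_loader, criterion, optimizer, epochs=5):
    """Run a basic supervised loop and return the mean loss of each epoch."""
    history = []
    model.train()
    for _ in range(epochs):
        running = 0.0
        for ct, mri, labels in data_loader:
            optimizer.zero_grad()
            loss = criterion(model(ct, mri), labels)
            loss.backward()
            optimizer.step()
            running += loss.item()
        history.append(running / len(data_loader))
    return history


torch.manual_seed(0)
ct = torch.randn(32, 8)
mri = torch.randn(32, 8)
labels = (ct[:, 0] > 0).long()  # learnable toy labels derived from the CT input
loader = DataLoader(TensorDataset(ct, mri, labels), batch_size=8, shuffle=True)
model = ToyFusion()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
losses = train_sketch(model, loader, nn.CrossEntropyLoss(), optimizer, epochs=20)
print(f"first epoch loss {losses[0]:.3f}, last epoch loss {losses[-1]:.3f}")
```

Returning the per-epoch loss makes it easy to confirm that the loss is actually decreasing. If the labels are random, as simulated labels may be, the loss would be expected to stay near ln 2 for a two-class problem.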
4. Evaluation Metrics
The evaluate function measures the performance of the trained model, taking the model and
data loader as input. It computes and prints the accuracy, precision, recall, and F1-score.
# ===========================
# 4. Evaluation Metrics
# ===========================
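The evaluation cell's body is also missing from the export. The sketch below shows only the metric computation, in plain Python, for accuracy and support-weighted precision, recall, and F1. The function name `weighted_metrics`, the weighted averaging, and the `zero_division` parameter (the precision given to a class that is never predicted) are assumptions, since the notebook's settings are not shown.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred, zero_division=1.0):
    """Return accuracy and support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        # A class with no predictions has undefined precision; use zero_division.
        p_cls = tp / predicted if predicted else zero_division
        r_cls = tp / count
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = count / n
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return accuracy, precision, recall, f1


y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
acc, prec, rec, f1 = weighted_metrics(y_true, y_pred)
print(f"Accuracy: {acc:.2f}, Precision: {prec:.2f}, Recall: {rec:.2f}, F1-score: {f1:.2f}")
```

With support weighting, the weighted recall always equals the accuracy, which is the pattern seen in the reported results.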
# ===========================
# 5. Single-Modality Models and Comparison
# ===========================
import torch.nn as nn
from torchvision import models

class CTViT(nn.Module):
    """CT-only baseline: ViT-B/16 backbone with a linear classifier."""
    def __init__(self, embed_dim=768, num_classes=2):
        super(CTViT, self).__init__()
        self.vit = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
        self.vit.heads = nn.Identity()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, ct_image):
        # unsqueeze(1)[:, 0, :] adds and then removes a token axis: (batch, embed_dim).
        return self.fc(self.vit(ct_image).unsqueeze(1)[:, 0, :])

class MRIViT(nn.Module):
    """MRI-only baseline with the same architecture as CTViT."""
    def __init__(self, embed_dim=768, num_classes=2):
        super(MRIViT, self).__init__()
        self.vit = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
        self.vit.heads = nn.Identity()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, mri_image):
        return self.fc(self.vit(mri_image).unsqueeze(1)[:, 0, :])
print("\nPerformance Comparison:")
print(f"Fusion: {fusion_results}, CT: {ct_results}, MRI: {mri_results}")
Evaluation Results:
Fusion Model:
Accuracy: 0.51, Precision: 0.75, Recall: 0.51, F1-score: 0.35
CT Model:
Accuracy: 0.54, Precision: 0.75, Recall: 0.54, F1-score: 0.38
MRI Model:
Accuracy: 0.49, Precision: 0.75, Recall: 0.49, F1-score: 0.32
Performance Comparison:
Fusion: (0.510752688172043, 0.7501156203029252, 0.510752688172043, 0.3453487927141928), CT:
Accuracy comparison
The results indicate that the fusion model does not outperform the single-modality CT model
and might require further tuning. Accuracy comparison is as below:
Fusion Model: 51%
CT Model: 54%
MRI Model: 49%
The CT-only model achieves higher accuracy than the fusion model, which suggests that the
fusion mechanism might not be effectively combining information from both modalities. All
three accuracies are close to 50%, the level expected from random guessing on a balanced
two-class problem.
Precision Analysis
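Fusion Model: 75.0%
The reported precision is 0.75 for all three models. Precision measures the fraction of
predictions for a class that are correct; it does not measure how confident a model is. An
identical precision alongside near-chance accuracy can also arise when a model predicts the
same class for almost every sample, so this value should be interpreted with care.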
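The full-precision fusion results printed in the performance comparison can be reproduced with simple arithmetic under one specific scenario: the classifier outputs the same class for every evaluation sample, the metrics are support-weighted averages, and a class that is never predicted is assigned a precision of 1. None of these settings is stated in the notebook, and the sample counts below (186 samples, 95 of them in the predicted class) are inferred from the reported accuracy fraction. The sketch is a consistency check under those assumptions, not a statement of what the model did.

```python
# Assumed counts, inferred from the reported accuracy 0.510752688172043 = 95/186.
n_total = 186
n_majority = 95              # samples whose true label is the predicted class
n_minority = n_total - n_majority

# Scenario: the classifier outputs the same class for every sample.
accuracy = n_majority / n_total

# Per-class precision: 95/186 for the predicted class; the never-predicted
# class has no predictions, so its precision is set to 1 (assumed zero_division=1).
precision_majority = n_majority / n_total
precision_minority = 1.0
weighted_precision = (n_majority * precision_majority
                      + n_minority * precision_minority) / n_total

# Per-class recall: 1 for the predicted class, 0 for the other.
weighted_recall = (n_majority * 1.0 + n_minority * 0.0) / n_total

# Per-class F1: harmonic mean for the predicted class, 0 for the other.
f1_majority = 2 * precision_majority * 1.0 / (precision_majority + 1.0)
weighted_f1 = n_majority * f1_majority / n_total

print(accuracy, weighted_precision, weighted_recall, weighted_f1)
```

Under these assumptions the four values agree with the reported fusion tuple to many decimal places. That makes it worth inspecting the distribution of predicted classes before reading the 0.75 precision as a sign of reliable positive predictions.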
Recall (Sensitivity)
The fusion model does not improve recall (0.51, against 0.54 for the CT model), which indicates
that it does not capture more relevant instances than the single-modality CT model.
F1-score
The CT model achieves the best F1-score (0.38, against 0.35 for the fusion model and 0.32 for
the MRI model), indicating that it has a better trade-off between precision and recall than the
fusion model.
Interpretation of Results
The CT model outperforms the fusion and MRI models in accuracy and F1-score, suggesting that
CT data alone is more informative than MRI for the given classification task.
The fusion model does not show significant improvement. Despite the integration of both CT and
MRI data, its accuracy (0.51) is slightly lower than that of the CT-only model (0.54), and its
F1-score (0.35) is also below the CT model's (0.38), indicating that the cross-modal attention
mechanism is not leveraging complementary features.
The MRI model performs poorly. The MRI-only model achieves the lowest accuracy (0.49) and
F1-score (0.32). This suggests that MRI images, in isolation, contain less discriminative
information than CT scans for this task.
The fusion mechanism is not effectively integrating features from CT and MRI. The cross-modal
attention layer may not be weighting important information from both modalities well.
Fine-tuning the attention mechanism and experimenting with feature concatenation could help
overcome this problem.
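Feature concatenation, mentioned above as an alternative to attention-based fusion, can be sketched as a small classification head that joins the two pooled feature vectors before classifying. The class name `ConcatFusionHead`, the hidden size, and the use of random tensors in place of backbone features are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class ConcatFusionHead(nn.Module):
    """Hypothetical baseline: concatenate CT and MRI features, then classify."""

    def __init__(self, embed_dim=768, num_classes=2, hidden_dim=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, ct_feat, mri_feat):
        # ct_feat, mri_feat: (batch, embed_dim) pooled features from each backbone
        fused = torch.cat([ct_feat, mri_feat], dim=-1)
        return self.classifier(fused)


torch.manual_seed(0)
head = ConcatFusionHead()
logits = head(torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)
```

Because both feature vectors reach the classifier unchanged, a head like this gives a simple reference point against which an attention-based fusion module can be compared.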
The CT and MRI scans may not be well aligned and may contain different levels of detail, so the
model struggles to extract shared patterns. A possible solution is to improve data preprocessing,
for example with image registration techniques that align CT and MRI scans better.
c) Overfitting or underfitting
The relatively high precision (0.75) but low recall (Fusion: 0.51, CT: 0.54, MRI: 0.49) suggests
that the predictions the models make for a class are often correct but that they fail to recall
all relevant instances. Increasing the dataset size to improve generalization and applying
regularization techniques such as dropout or L2 regularization can help prevent overfitting.
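As a small illustration of the two regularization options named above, the sketch below adds dropout in front of a linear classification head and configures weight decay in the optimizer. The layer sizes, dropout rate, learning rate, and decay value are assumptions chosen for the example. `AdamW` applies decoupled weight decay, which plays a role similar to an L2 penalty without being identical to one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical classification head operating on 768-d fused features.
dropout = nn.Dropout(p=0.3)
head = nn.Sequential(dropout, nn.Linear(768, 2))

# AdamW applies decoupled weight decay to every parameter passed to it.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4, weight_decay=1e-2)

features = torch.ones(4, 768)

head.train()
train_features = dropout(features)  # some entries zeroed, the rest scaled by 1 / (1 - p)

head.eval()
eval_features = dropout(features)   # dropout is a no-op in evaluation mode

print((train_features == 0).float().mean().item())
print(torch.equal(eval_features, features))
```

Dropout is only active in training mode, so calling `model.eval()` before evaluation matters for getting deterministic metrics.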
d) Class imbalance
If one class dominates the dataset, models may favor the majority class, leading to high
precision but low recall. Data augmentation or class-weighted loss functions can be used to
balance predictions.
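A class-weighted loss can be sketched with the `weight` argument of `nn.CrossEntropyLoss`. The label counts below are hypothetical and are not taken from the notebook's data. The inverse-frequency formula is one common choice among several.

```python
import torch
import torch.nn as nn

# Hypothetical label counts for an imbalanced binary problem.
labels = torch.tensor([0] * 90 + [1] * 10)
counts = torch.bincount(labels, minlength=2).float()

# Inverse-frequency weights: n_samples / (n_classes * count_per_class).
class_weights = counts.sum() / (len(counts) * counts)

# Errors on the rare class now contribute more to the loss.
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.zeros(len(labels), 2)   # an uninformative model: equal scores
loss = criterion(logits, labels)
print(class_weights, loss.item())
```

Before applying such weights, it is worth checking the actual label distribution, since weighting has little effect when the classes are already close to balanced.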