Assignment 10
NAME: PRAKSHAL JAIN
ENROLMENT NUMBER: 21102157
ANS: To compare the cost efficiency of upgrading from a dual-CPU system to a GPU
system for processing 1 million images, we need to consider both the initial hardware costs
and the ongoing operational costs over a period of 1 year.
Step 1: Identify Key Metrics
Cost:
Dual-CPU system: $5,000 per CPU × 2 = $10,000
GPU system: $30,000
Performance:
Dual-CPU system: 100 images per hour
GPU system: 500 images per hour
Operational Costs:
Let’s assume the following hypothetical operational costs:
Dual-CPU system: $200 per month
GPU system: $400 per month
Step 2: Calculate Time to Process 1 Million Images
Dual-CPU system: 1,000,000 images ÷ 100 images per hour = 10,000 hours (about 417 days)
GPU system: 1,000,000 images ÷ 500 images per hour = 2,000 hours (about 83 days)
Step 3: Calculate Total Operational Costs Over 1 Year
Dual-CPU System:
Operational Costs = 12 months × $200/month = $2,400
GPU System:
Operational Costs = 12 months × $400/month = $4,800
Step 4: Calculate Total Costs Over 1 Year
1. Dual-CPU System:
Total Cost = Initial Cost + Operational Costs = $10,000 + $2,400 = $12,400
2. GPU System:
Total Cost = Initial Cost + Operational Costs = $30,000 + $4,800 = $34,800
Over one year, the GPU system therefore costs about 2.8× more ($34,800 vs. $12,400), but it processes the 1 million images in 2,000 hours instead of 10,000 hours, a 5× throughput advantage.
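The arithmetic above can be verified with a short script; the sketch below uses only the hardware prices, throughput figures, and assumed monthly operating costs stated in this answer.

# Cost-comparison sketch using the figures assumed above.
IMAGES = 1_000_000
MONTHS = 12

systems = {
    "Dual-CPU": {"hardware": 10_000, "monthly_opex": 200, "images_per_hour": 100},
    "GPU": {"hardware": 30_000, "monthly_opex": 400, "images_per_hour": 500},
}

for name, s in systems.items():
    hours = IMAGES / s["images_per_hour"]                # time to process 1 million images
    total = s["hardware"] + MONTHS * s["monthly_opex"]   # 1-year total cost
    print(f"{name}: {hours:,.0f} hours for {IMAGES:,} images, 1-year total cost = ${total:,}")

Running this prints 10,000 hours and $12,400 for the dual-CPU system, and 2,000 hours and $34,800 for the GPU system, matching the hand calculation.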
To calculate the scaling efficiency when using 4 GPUs compared to a single GPU, we follow
these steps:
1. Determine the Speedup: This is the ratio of the time taken with a single GPU to the
time taken with multiple GPUs.
2. Calculate the Scaling Efficiency: This is the speedup divided by the number of
GPUs.
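No timings are given for the multi-GPU case, so the short sketch below uses hypothetical processing times (t_single and t_multi are assumed values) purely to illustrate the two formulas.

# Scaling-efficiency sketch; the timings below are hypothetical, not given in the assignment.
t_single = 8.0   # hours to finish the workload on 1 GPU (assumed)
t_multi = 2.5    # hours to finish the same workload on 4 GPUs (assumed)
n_gpus = 4

speedup = t_single / t_multi      # step 1: speedup over a single GPU
efficiency = speedup / n_gpus     # step 2: scaling efficiency

print(f"Speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
# With these assumed numbers: 3.20x speedup, 80% scaling efficiency.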
To determine the energy efficiency (in images per watt-hour) for both the GPU-based server
and the CPU-based server, we will follow these steps:
1. Calculate the total energy consumption for each server.
2. Calculate the energy efficiency in terms of images per watt-hour for each server.
3. Compare the energy efficiencies.
GPU-based server:
o Power consumption: 400 watts
o Processing time: 2 hours
o Images processed: 250,000
CPU-based server:
o Power consumption: 250 watts
o Processing time: 10 hours
o Images processed: 250,000
Energy consumption (in watt-hours) is calculated by multiplying the power consumption (in
watts) by the time (in hours).
1. GPU-based server:
Energy Consumption = Power × Time = 400 watts × 2 hours = 800 watt-hours
2. CPU-based server:
Energy Consumption = Power × Time = 250 watts × 10 hours = 2,500 watt-hours
Energy efficiency is calculated by dividing the number of images processed by the total
energy consumption.
1. GPU-based server:
Energy Efficiency = Images Processed / Energy Consumption = 250,000 images / 800 watt-hours = 312.5 images per watt-hour.
2. CPU-based server:
Energy Efficiency = Images Processed / Energy Consumption = 250,000 images / 2,500 watt-hours = 100 images per watt-hour.
Comparison: the GPU-based server is therefore about 3.1× more energy-efficient (312.5 vs. 100 images per watt-hour), while also finishing the workload five times faster.
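The same two-step calculation can be scripted directly from the wattages, runtimes, and image counts given above.

# Energy-efficiency sketch using the server figures stated in this problem.
servers = {
    "GPU-based": {"watts": 400, "hours": 2, "images": 250_000},
    "CPU-based": {"watts": 250, "hours": 10, "images": 250_000},
}

for name, s in servers.items():
    watt_hours = s["watts"] * s["hours"]      # energy consumption in watt-hours
    efficiency = s["images"] / watt_hours     # images per watt-hour
    print(f"{name} server: {watt_hours} Wh, {efficiency:.1f} images per watt-hour")
# Prints 800 Wh and 312.5 images/Wh for the GPU server, 2500 Wh and 100.0 images/Wh for the CPU server.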
Given Data
Number of filters: 128
Filter size: 3×3
Input feature map size: 64×64
Stride: 1
Padding: None (valid convolution)
Number of input channels: 64
Dimensions of the Output Feature Map
The formula for the output size of a convolutional layer is given by:
Output Size = (Input Size − Filter Size + 2 × Padding) / Stride + 1
Since there is no padding and the stride is 1:
Output Size = (64 − 3 + 2 × 0) / 1 + 1 = 62.
So, the output feature map size is 62×62.
Number of FLOPs for a Single Convolution Operation
Each filter is applied to a 3×3 region across all 64 input channels. Therefore, the number of multiply-accumulate (MAC) operations per filter application is:
MACs per filter application = 3 × 3 × 64 = 576.
Each MAC consists of a multiply and an add, so each filter application costs twice that many FLOPs:
Total FLOPs per filter application = 2 × 576 = 1152.
Number of Output Elements
The output feature map has dimensions 62×62 and there are 128 filters. So, the total number
of output elements is:
62×62×128
Total Number of FLOPs
The total number of FLOPs required to compute the output feature map is:
Total FLOPs=1152×62×62×128
Let's calculate this step-by-step:
1. Calculate 62 × 62:
62 × 62 = 3,844
2. Multiply by 128:
3,844 × 128 = 492,032
3. Multiply by 1152:
492,032 × 1152 = 566,820,864.
Conclusion
The total number of floating-point operations (FLOPs) required to compute the output feature
map for the given convolutional layer is 566,820,864 (approximately 0.57 GFLOPs).
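As a cross-check, the output size and FLOP count can be reproduced from the layer parameters with a few lines of Python:

# FLOP count for the convolutional layer described above.
H = W = 64           # input feature map size
K = 3                # filter size (3x3)
C_in = 64            # input channels
C_out = 128          # number of filters
stride, padding = 1, 0

out_size = (H - K + 2 * padding) // stride + 1   # = 62
macs_per_output = K * K * C_in                   # = 576 multiply-accumulates
flops_per_output = 2 * macs_per_output           # = 1152 (one multiply and one add per MAC)
total_flops = flops_per_output * out_size * out_size * C_out

print(out_size, total_flops)   # 62 566820864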
To provide a comprehensive analysis of how three popular deep learning
frameworks (TensorFlow, PyTorch, and Keras) utilize GPU acceleration, we
will cover the following:
1. Comparative Analysis of Frameworks:
o Introduction to each framework.
o Overview of GPU acceleration in each framework.
o Key features and ease of use.
o Performance and efficiency considerations.
2. Implementation of a Simple Neural Network for MNIST Digit
Classification:
o Source code for each framework using GPU support.
3. Summary of Training Time and Accuracy Results:
o A table comparing the training times and accuracies for each implementation.
1. Comparative Analysis of Frameworks
TensorFlow
Overview:
TensorFlow, developed by Google, is a highly flexible and comprehensive open-
source platform for machine learning.
It offers extensive support for both research and production, with capabilities for deep
learning and other ML tasks.
GPU Acceleration:
TensorFlow provides built-in support for GPU acceleration using CUDA and cuDNN.
With a CUDA-enabled TensorFlow installation, operations are placed on available GPUs automatically, and users can pin specific operations to a device with tf.device.
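As a minimal sketch (assuming a CUDA-enabled TensorFlow installation), GPU visibility can be checked and a device context set explicitly:

import tensorflow as tf

# List the GPUs visible to TensorFlow; an empty list means it will fall back to the CPU.
print(tf.config.list_physical_devices('GPU'))

# Optionally pin work to the first GPU with an explicit device context.
with tf.device('/GPU:0'):
    x = tf.random.uniform((1024, 1024))
    y = tf.matmul(x, x)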
Key Features:
Flexible and powerful, suitable for both high-level and low-level operations.
TensorFlow Hub for reusable pre-trained models.
TensorFlow Extended (TFX) for production deployment.
Performance:
TensorFlow is optimized for performance with large-scale ML tasks, including
distributed training.
PyTorch
Overview:
PyTorch, developed by Facebook's AI Research lab, is known for its dynamic
computation graph, making it more intuitive and flexible.
It is widely used in both academia and industry for research and development.
GPU Acceleration:
PyTorch offers seamless GPU acceleration. Tensor operations are easy to move
between CPU and GPU.
It uses CUDA for GPU support and allows dynamic graph building, which can be
particularly useful for certain applications.
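A minimal sketch of this device handling (assuming a CUDA build of PyTorch):

import torch

# Select the GPU when available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.randn(1024, 1024, device=device)   # tensor created directly on the chosen device
y = (x @ x).cpu()                            # compute on the device, then move the result back to host memory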
Key Features:
Dynamic computation graph (eager execution).
Strong community support and extensive tutorials.
Integrates well with Python's ecosystem.
Performance:
PyTorch is designed for flexibility and ease of use, with competitive performance,
especially in research contexts.
Keras
Overview:
Keras is a high-level neural networks API, written in Python and capable of running
on top of TensorFlow, Theano, and other frameworks.
It is user-friendly and fast to prototype with.
GPU Acceleration:
Keras supports GPU acceleration through its backend frameworks, primarily
TensorFlow.
With a GPU-enabled TensorFlow backend installed, Keras places operations on the GPU automatically; no code changes are required to switch between CPU and GPU.
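A minimal check of this behaviour (assuming the TensorFlow backend) is shown below:

import tensorflow as tf
from tensorflow import keras

# Keras uses whatever devices its backend exposes; this just confirms a GPU is visible.
print("GPUs visible to the backend:", tf.config.list_physical_devices('GPU'))

model = keras.Sequential([keras.layers.Dense(10, activation='softmax', input_shape=(784,))])
model.compile(optimizer='adam', loss='categorical_crossentropy')
# model.fit(...) will run on the GPU automatically when one is available.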
Key Features:
Simple and consistent interface for building neural networks.
Extensive library of pre-trained models.
Strong support for prototyping and rapid development.
Performance:
Keras prioritizes ease of use and rapid development, with performance largely
dependent on the backend used (e.g., TensorFlow).
2. Implementation of a Simple Neural Network for MNIST Digit
Classification
TensorFlow Implementation
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.utils import to_categorical

# Load data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train, y_test = to_categorical(y_train), to_categorical(y_test)

# Define model
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

# Compile model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train model on the GPU
with tf.device('/GPU:0'):
    history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"Test accuracy: {test_acc}")
PyTorch Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Load data
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)

# Define model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28*28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = Net().cuda()

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train model
for epoch in range(10):
    model.train()
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

# Evaluate model
correct = 0
total = 0
model.eval()
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.cuda(), target.cuda()
        output = model(data)
        _, predicted = torch.max(output.data, 1)
        total += target.size(0)
        correct += (predicted == target).sum().item()

test_acc = correct / total
print(f"Test accuracy: {test_acc}")
Keras Implementation
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.utils import to_categorical
# Load data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train, y_test = to_categorical(y_train), to_categorical(y_test)
# Define model
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])
# Compile model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train model
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
# Evaluate model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"Test accuracy: {test_acc}")
Comprehensive Report: Real-Time Object Detection for Autonomous
Driving Using GPU Computing
1. Introduction
Autonomous driving is one of the most transformative technologies of the
modern era. At its core, it involves enabling vehicles to navigate and operate
without human intervention. A critical component of autonomous driving is
real-time object detection, which allows the vehicle to recognize and respond to
various objects and obstacles on the road, such as other vehicles, pedestrians,
traffic signs, and more. This report delves into the importance of GPU
computing in solving this problem and implements a simplified version of a
deep learning model using GPU acceleration to demonstrate its efficacy.
2. Problem Description
Real-Time Object Detection in Autonomous Driving
Autonomous vehicles must process a vast amount of visual data in real time to
detect and classify objects accurately. The challenge lies in the need for high-
speed processing to ensure safety and reliability. Traditional CPU-based
systems struggle with the computational demands of real-time object detection
due to their limited parallel processing capabilities.
Challenges:
High-speed image processing
Accurate object detection and classification
Handling diverse and dynamic environments
Ensuring safety and reliability
3. Role of GPU Computing
Why GPU Computing is Crucial
1. Parallel Processing Capabilities: GPUs are designed to handle thousands
of simultaneous threads, making them ideal for the parallel nature of
deep learning computations.
2. Speed: GPUs significantly reduce the time required to train and infer
deep learning models, which is essential for real-time applications.
3. Efficiency: Handling large-scale data and complex models is more
efficient with GPUs, enabling faster and more accurate object detection.
4. Implementation
Simplified Model for Real-Time Object Detection
We will implement a simplified version of the YOLO (You Only Look Once)
model for object detection. YOLO is known for its speed and accuracy, making
it suitable for real-time applications.
4.1 Data Preparation
For simplicity, we'll use a subset of a well-known object detection dataset like
COCO (Common Objects in Context).
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Assuming the dataset is already downloaded and preprocessed

# Load and preprocess data
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

train_generator = datagen.flow_from_directory(
    'data/train', target_size=(416, 416),
    batch_size=32, class_mode='categorical',
    subset='training'
)
val_generator = datagen.flow_from_directory(
    'data/val', target_size=(416, 416),
    batch_size=32, class_mode='categorical',
    subset='validation'
)
4.2 Model Definition
We will define a simplified version of the YOLO model.
from tensorflow.keras.layers import Conv2D, Input, BatchNormalization, LeakyReLU, ZeroPadding2D
from tensorflow.keras.models import Model

def yolo_body(inputs, num_anchors, num_classes):
    x = Conv2D(32, (3, 3), padding='same', use_bias=False)(inputs)
    x = BatchNormalization()(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Add more layers as needed to match the simplified YOLO architecture
    x = Conv2D(num_anchors * (num_classes + 5), (1, 1), padding='same', use_bias=False)(x)
    return Model(inputs, x)

inputs = Input(shape=(416, 416, 3))
model = yolo_body(inputs, num_anchors=3, num_classes=80)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model with GPU acceleration
with tf.device('/GPU:0'):
    history = model.fit(train_generator, epochs=10, validation_data=val_generator)
5. Performance Evaluation
Training Time and Accuracy
# Evaluate the model
test_loss, test_acc = model.evaluate(val_generator, verbose=2)
print(f"Validation accuracy: {test_acc}")
# Summarize performance
import matplotlib.pyplot as plt
# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
6. Impact of GPU Computing
Speed and Efficiency
1. Training Time Reduction: GPU acceleration drastically reduces the
training time from hours (or days) to minutes (or hours), allowing for
quicker model iterations.
2. Scalability: Handling larger datasets and more complex models becomes
feasible, enabling more accurate and robust real-time object detection.
3. Real-Time Processing: With GPUs, real-time processing of video frames
is achievable, which is crucial for autonomous driving.
Example Comparison (illustrative figures; a measurement sketch follows this list):
CPU-based training: 10 hours
GPU-based training: 1 hour
Inference Speed:
CPU-based inference: 5 frames per second (fps)
GPU-based inference: 30 fps
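These numbers are illustrative rather than measured; actual speedups depend on the model and hardware. A rough way to estimate inference throughput on each device is sketched below (assuming TensorFlow; measure_fps is a helper introduced here for illustration, applied to the simplified model from Section 4.2).

import time
import numpy as np
import tensorflow as tf

def measure_fps(model, device, frames=100):
    """Rough frames-per-second estimate for single-image inference on one device."""
    batch = np.random.rand(1, 416, 416, 3).astype('float32')
    with tf.device(device):
        model(batch)                      # warm-up call
        start = time.time()
        for _ in range(frames):
            model(batch)
    return frames / (time.time() - start)

# Example usage with the simplified YOLO model defined in Section 4.2:
# print("CPU:", measure_fps(model, '/CPU:0'), "fps")
# print("GPU:", measure_fps(model, '/GPU:0'), "fps")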
7. Conclusion
GPU computing plays a pivotal role in the advancement of autonomous driving
by enabling real-time object detection. The ability to process vast amounts of
data quickly and accurately not only enhances the capabilities of autonomous
vehicles but also improves safety and reliability. This simplified implementation
demonstrates the significant impact of GPU acceleration, highlighting its
necessity in real-world applications.
Deliverables
Comprehensive Report: A 6-8 page document detailing the problem, the
role of GPU computing, and the implementation, including performance
evaluation and impact analysis.
Source Code: Provided above, demonstrating the implementation of a
deep learning model using TensorFlow and GPU acceleration.
Performance Evaluation and Impact Analysis: Graphs and metrics
illustrating the training time, accuracy, and the benefits of GPU
computing.
The final report should include sections for introduction, problem description,
role of GPU computing, implementation details, performance evaluation,
impact analysis, and conclusion, along with appropriate references and
appendices for the source code.