PC File
Practical 1
Introduction to OpenMP
Aim:
To write a simple program in OpenMP using threads in C.
Theory:
OpenMP consists of a set of compiler directives, library routines, and environment variables that
influence run-time behavior.
OpenMP uses a portable, scalable model that gives programmers a simple and flexible
interface for developing parallel applications for platforms ranging from the standard
desktop computer to the supercomputer.
The threads then run concurrently, with the runtime environment allocating threads to
different processors.
The section of code that is meant to run in parallel is marked accordingly, with a
preprocessor directive that will cause the threads to form before the section is executed.
Each thread has an ID attached to it, which can be obtained by calling the function
omp_get_thread_num().
After the execution of the parallelized code, the threads join back into the master thread,
which continues onward to the end of the program.
Work-sharing constructs can be used to divide a task among the threads so that each thread
executes its allocated part of the code. Both task parallelism and data parallelism can be
achieved using OpenMP in this way.
The runtime environment allocates threads to processors depending on usage, machine load
and other factors.
The number of threads can be set by the runtime environment through environment
variables (for example, OMP_NUM_THREADS) or in code using library functions such as
omp_set_num_threads().
The OpenMP functions are declared in the header file omp.h in C/C++.
The output may also be garbled because of the race condition caused by multiple threads
sharing the standard output stream.
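Since the program listing for this practical is not reproduced above, the following is a minimal sketch of the kind of program described in the theory, assuming only the standard omp_get_thread_num() and omp_get_num_threads() calls; each thread prints its own ID, which also illustrates the shared-output race condition mentioned above.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Fork a team of threads; each thread executes this block concurrently. */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    /* The threads join back into the master thread here. */
    return 0;
}

Compile with the OpenMP flag, for example: gcc -fopenmp hello.c -o hello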
Output:
Conclusion:
Thus, we have implemented a simple program in OpenMP using threads in C.
Practical 2
Aim:
To write a C program to solve the producer-consumer problem using threads.
Theory:
In this program, the master thread acts as the producer while the other threads wait until the
master thread fills the buffer. Once the data has been added, the master notifies the other threads
through a shared flag variable, and they consume the data.
Program:
#include <stdio.h>
#include <omp.h>
int main()
{
    int i = 0;   /* shared buffer */
    int x = 0;   /* flag set to 1 once the producer has filled the buffer */
    #pragma omp parallel shared(i, x)
    {
        if (omp_get_thread_num() == 0)
        {
            /* The master thread acts as the producer. */
            printf("Master thread with Thread ID:%d\n", omp_get_thread_num());
            printf("Since it is the producer thread, it is adding some data to be consumed by the consumer threads\n");
            i += 10;
            x = 1;
        }
        else
        {
            /* Consumer threads wait until the producer sets the flag. */
            while (x == 0)
                printf("Waiting for buffer to be filled. Thread ID: %d\n", omp_get_thread_num());
            /* Only one consumer at a time may update the shared buffer. */
            #pragma omp critical
            {
                if (i > 0) {
                    printf("Data is consumed by Consumer with Thread ID: %d\n", omp_get_thread_num());
                    i -= 5;
                } else {
                    printf("Could not find any data for thread ID: %d\n", omp_get_thread_num());
                }
            }
        }
    }
    return 0;
}
Conclusion:
Thus, we have implemented and studied the producer-consumer problem using OpenMP.
OpenMP Programming
Practical 3(a)
Aim:
To write an OpenMP program for matrix-matrix multiplication.
Program:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(){
    int i, j, k, m, n, p;
    printf("Enter the number of rows in Matrix 1:");
    scanf("%d", &m);
    int *matrixA[m];
    printf("Enter the number of columns in Matrix 1:");
    scanf("%d", &n);
    for (i = 0; i < m; i++) {
        matrixA[i] = (int *)malloc(n * sizeof(int));
    }
    printf("<--Now Input the values for matrix 1 row-wise-->\n");
    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            scanf("%d", &matrixA[i][j]);
        }
    }
    printf("Enter the number of columns in Matrix 2:");
    scanf("%d", &p);
    int *matrixB[n];
    for (i = 0; i < n; i++) {
        matrixB[i] = (int *)malloc(p * sizeof(int));
    }
    printf("<--Now Input the values for matrix 2 row-wise-->\n");
    for (i = 0; i < n; i++) {
        for (j = 0; j < p; j++) {
            scanf("%d", &matrixB[i][j]);
        }
    }
    int matrixC[m][p];
    #pragma omp parallel private(i,j,k) shared(matrixA,matrixB,matrixC)
    {
        /* Rows of the result matrix are divided statically among the threads. */
        #pragma omp for schedule(static)
        for (i = 0; i < m; i++) {
            for (j = 0; j < p; j++) {
                matrixC[i][j] = 0;
                for (k = 0; k < n; k++)
                    matrixC[i][j] += matrixA[i][k] * matrixB[k][j];
            }
        }
    }
    printf("<--The product matrix is-->\n");
    for (i = 0; i < m; i++) {
        for (j = 0; j < p; j++)
            printf("%d\t", matrixC[i][j]);
        printf("\n");
    }
    return 0;
}
Practical 3(b)
Aim:
To write an OpenMP program to find the prime numbers between 2 and a given
number N, and store all the prime numbers in an array.
Program:
#include<stdio.h>
#include<omp.h>
int IsPrime(int number) {
int i;
for (i = 2; i < number; i++) {
if (number % i == 0 && i != number) return 0;
}
return 1;
}
int main(){
int noOfThreads,valueN,indexCount=0,arrayVal[10000],tempValue;
printf("Enter the Number of threads: ");
scanf("%d",&noOfThreads);
printf("Enter the value of N: ");
scanf("%d",&valueN);
omp_set_num_threads(noOfThreads);
/* The shared index must be updated inside a critical section; otherwise
   concurrent threads would overwrite each other's slots in arrayVal. */
#pragma omp parallel for
for(tempValue=2;tempValue<=valueN;tempValue++){
    if(IsPrime(tempValue)){
        #pragma omp critical
        {
            arrayVal[indexCount] = tempValue;
            indexCount++;
        }
    }
}
printf("Number of prime numbers between 2 and %d: %d\n",valueN,indexCount);
return 0;
}
Conclusion: Thus, I have implemented an OpenMP program for finding the prime numbers
between 2 and N.
Practical 3(c)
Aim:
To write an OpenMP program to print the largest element in an array.
Program:
#include<stdio.h>
#include<omp.h>
int main(){
int numberOfElements,currentMax=-1,iIterator,arrayInput[10000];
printf("Enter the Number of Elements: ");
scanf("%d",&numberOfElements);
for(iIterator=0;iIterator<numberOfElements;iIterator++){
scanf("%d",&arrayInput[iIterator]);
}
#pragma omp parallel for shared(currentMax)
for(iIterator=0;iIterator<numberOfElements;iIterator++){
#pragma omp critical
if(arrayInput[iIterator] > currentMax){
currentMax = arrayInput[iIterator];
}
}
printf("The Maximum Element is: %d\n",currentMax);
return 0;
}
Conclusion: Thus, I have implemented an OpenMP program for finding the largest element
in an array.
Practical 3(d)
Aim:
To write an OpenMP program for PI calculation
Program:
#include<stdio.h>
#include<omp.h>
int main(){
int num_steps=10000,i;
double aux,pi,step = 1.0/(double) num_steps,x=0.0,sum = 0.0;
#pragma omp parallel private(i,x,aux) shared(sum)
{
#pragma omp for schedule(static)
for (i=0; i<num_steps; i=i+1){
x=(i+0.5)*step;
aux=4.0/(1.0+x*x);
#pragma omp critical
sum = sum + aux;
}
}
pi=step*sum;
printf("The Value of PI is %lf\n",pi);
return 0;
}
Output:
MPI Programming
Practical 4(a)
Aim:
To write a simple MPI program to determine the rank of each process and the total number of processes.
Program:
#include <mpi.h>
#include <stdio.h>
int main (int argc, char* argv[])
{
int rank, size;
MPI_Init (&argc, &argv);
/* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
/* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size);
/* get number of processes */
printf( "Hello world from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}
Output:
Conclusion: Thus, I have implemented an MPI program for calculating the rank and
number of processes.
Practical 4(b)
Aim:
To write an MPI program for PI calculation.
Program:
#include "mpi.h"
#include <stdio.h>
#include <math.h>
int main( int argc, char *argv[] )
{
int n, myid, numprocs, i;
double PI25DT = 3.141592653589793238462643;
double mypi, pi, h, sum, x;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
while (1) {
if (myid == 0) {
printf("Enter the number of intervals: (0 quits) ");
scanf("%d",&n);
}
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (n == 0)
break;
else {
h = 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs) {
x = h * ((double)i - 0.5);
sum += (4.0 / (1.0 + x*x));
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
MPI_COMM_WORLD);
if (myid == 0)
printf("pi is approximately %.16f, Error is %.16f\n",
pi, fabs(pi - PI25DT));
}
}
MPI_Finalize();
return 0;
}
Practical 4(c)
Aim:
To write an advanced MPI program with a total of 4 processes, in which the process
with rank 0 sends the letters of "VJTI" to all the processes using the MPI_Scatter
call.
Program:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define SIZE 4
int main (int argc, char *argv[])
{
int numtasks, rank, sendcount, recvcount, source;
char sendbuf[SIZE][SIZE] = {
{'V','J','T','I'},
{'V','J','T','I'},
{'V','J','T','I'},
{'V','J','T','I'}};
char recvbuf[SIZE];
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if (numtasks == SIZE) {
source = 0;
sendcount = SIZE;
recvcount = SIZE;
MPI_Scatter(sendbuf,sendcount,MPI_CHAR,recvbuf,recvcount,
MPI_CHAR,source,MPI_COMM_WORLD);
printf("rank= %d Results: %c %c %c %c\n",rank,recvbuf[0],
recvbuf[1],recvbuf[2],recvbuf[3]);
}
else
printf("Must specify %d processors. Terminating.\n",SIZE);
MPI_Finalize();
}
Output:
Conclusion: Thus, I have implemented an advanced MPI program for scattering "VJTI"
to all the processes by the root process using the MPI_Scatter call.
Practical 4(d)
Aim:
To write an Advanced MPI program to find maximum value in array of six integers
with 6 processes and print the result in root process using MPI_Reduce call.
Program:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define SIZE 4
int main (int argc, char *argv[])
{
int rank,numtasks,array[6] = {100,600,300,800,250,720},i,inputNumber;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
printf("Local Input for process %d is %d\n",rank,array[rank]);
inputNumber = array[rank];
int maxNumber;
MPI_Reduce(&inputNumber, &maxNumber, 1, MPI_INT, MPI_MAX, 0,
MPI_COMM_WORLD);
// Print the result
if (rank == 0) {
printf("Maximum of all is: %d\n",maxNumber);
}
MPI_Finalize();
}
Output:
Conclusion: Thus, I have implemented advanced MPI program for finding the
maximum of all the elements in an array of 6 elements using 6 processes and
understood the use of MPI_Reduce Call
Practical 4(e)
Aim:
To write a MPI program for Ring topology
Program:
#include <stdio.h>
#include "mpi.h"
int main(int argc,char *argv[])
{
int MyRank, Numprocs, Root = 0;
int value, sum = 0;
int Source, Source_tag;
int Destination, Destination_tag;
MPI_Status status;
/* Initialize MPI */
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&Numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&MyRank);
if (MyRank == Root){
Destination = MyRank + 1;
Destination_tag = 0;
MPI_Send(&MyRank, 1, MPI_INT, Destination, Destination_tag,
MPI_COMM_WORLD);
}
else{
if(MyRank<Numprocs-1){
Source = MyRank - 1;
Source_tag = 0;
MPI_Recv(&value, 1, MPI_INT, Source, Source_tag,
MPI_COMM_WORLD, &status);
sum = MyRank + value;
Destination = MyRank + 1;
Destination_tag = 0;
MPI_Send(&sum, 1, MPI_INT, Destination, Destination_tag,
MPI_COMM_WORLD);
}
else{
/* The last process in the ring receives the accumulated sum and prints it. */
Source = MyRank - 1;
Source_tag = 0;
MPI_Recv(&value, 1, MPI_INT, Source, Source_tag,
MPI_COMM_WORLD, &status);
sum = MyRank + value;
printf("Sum accumulated around the ring = %d\n", sum);
}
}
MPI_Finalize();
return 0;
}
Output:
Numerical Computing Programming
Practical 5(a)
Aim:
To write a numerical computing program implementing the Trapezoidal Rule with MPI.
Program:
#include <stdio.h>
/* We'll be using MPI routines, definitions, etc. */
#include <mpi.h>
void Get_data(int p, int my_rank, double* a_p, double* b_p, int* n_p);
double Trap(double local_a, double local_b, int local_n,
double h); /* Calculate local area */
double f(double x); /* function we're integrating */
int main(int argc, char** argv) {
int my_rank;        /* My process rank             */
int p;              /* The number of processes     */
double a;           /* Left endpoint               */
double b;           /* Right endpoint              */
int n;              /* Number of trapezoids        */
double h;           /* Trapezoid base length       */
double local_a;     /* Left endpoint my process    */
double local_b;     /* Right endpoint my process   */
int local_n;        /* Number of trapezoids for my calculation */
double my_area;     /* Integral over my interval   */
double total;       /* Total area                  */
int source;         /* Process sending area        */
int dest = 0;       /* All messages go to 0        */
int tag = 0;
MPI_Status status;
/* Let the system do what it needs to start up MPI */
MPI_Init(&argc, &argv);
Output:
Practical 5(b)
Aim:
To write a numerical computing program implementing a Gaussian filter with MPI.
Theory:
In electronics and signal processing, a Gaussian filter is a filter whose impulse
response is a Gaussian function (or an approximation to it). Gaussian filters have the
properties of having no overshoot to a step function input while minimizing the rise
and fall time. This behavior is closely connected to the fact that the Gaussian filter
has the minimum possible group delay. It is considered the ideal time domain filter,
just as the sinc is the ideal frequency domain filter. These properties are important in
areas such as oscilloscopes and digital telecommunication systems.
In two dimensions, it is the product of two such Gaussians, one per direction:

G(x, y) = (1 / (2πσ²)) · exp( −(x² + y²) / (2σ²) )

where x is the distance from the origin along the horizontal axis, y is the distance from
the origin along the vertical axis, and σ is the standard deviation of the Gaussian
distribution.
[Figure: shape of the impulse response of a typical Gaussian filter]
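The program listing is not reproduced here; as an illustration of the formula above, the following is a minimal serial sketch that builds and prints a normalised 2D Gaussian kernel. The kernel size and sigma value are assumptions chosen for the example; an MPI version would distribute the image rows across processes before convolving them with this kernel.

#include <stdio.h>
#include <math.h>

#define KSIZE 5            /* kernel width (assumed, must be odd) */
#define SIGMA 1.0          /* standard deviation of the Gaussian (assumed) */

int main(void)
{
    double kernel[KSIZE][KSIZE], sum = 0.0;
    int r = KSIZE / 2;

    /* Fill the kernel with G(x, y) = exp(-(x^2 + y^2)/(2*sigma^2)) / (2*pi*sigma^2). */
    for (int y = -r; y <= r; y++)
        for (int x = -r; x <= r; x++) {
            kernel[y + r][x + r] = exp(-(x * x + y * y) / (2.0 * SIGMA * SIGMA))
                                   / (2.0 * 3.14159265358979 * SIGMA * SIGMA);
            sum += kernel[y + r][x + r];
        }

    /* Normalise the weights so they sum to 1, then print the kernel. */
    for (int i = 0; i < KSIZE; i++) {
        for (int j = 0; j < KSIZE; j++)
            printf("%8.5f ", kernel[i][j] / sum);
        printf("\n");
    }
    return 0;
}

Compile with: gcc gaussian_kernel.c -o gaussian_kernel -lm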
Conclusion: Thus, I have studied about Gaussian filter and its implementation details
CUDA Programming
Practical 6(a)
Aim:
To write a simple CUDA Program for Hello World.
Theory:
CUDA stands for Compute Unified Device Architecture. It is a parallel computing
platform and programming model created by NVIDIA and implemented by the
graphics processing units (GPUs) that they produce. CUDA gives developers direct
access to the virtual instruction set and memory of the parallel computational
elements in CUDA GPUs.
Using CUDA, the GPUs can be used for general purpose processing (i.e., not
exclusively graphics); this approach is known as GPGPU. Unlike CPUs, however,
GPUs have a parallel throughput architecture that emphasizes executing many
concurrent threads slowly, rather than executing a single thread very quickly.
Program:
#include <stdio.h>
#include <cuda.h>
#include <sys/time.h>
#include <assert.h>
__global__ void kernel (void)
{
    /* Empty kernel: it only demonstrates launching code on the GPU. */
}
int main(void){
    kernel<<<1, 1>>>();   /* launch one block containing a single thread */
    printf("Hello, World!\n");
    return 0;
}
Conclusion: Thus, I have studied the basics of CUDA programming and implemented a
Hello World program.
Practical 6(b)
Aim:
To write a CUDA program for Matrix addition
Program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <string.h>
#include <cuda.h>
#include <assert.h>
const int N = 4;
const int blocksize = 2;
__global__ void add_matrix_on_gpu( float* a, float *b, float *c, int N )
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int index = i + j*N;
if ( i < N && j < N )
c[index] = a[index] + b[index];
}
void add_matrix_on_cpu(float *a, float *b, float *d)
{
int i;
for(i = 0; i < N*N; i++)
d[i] = a[i]+b[i];
}
int main()
{
float *a = new float[N*N];
float *b = new float[N*N];
float *c = new float[N*N];
float *d = new float[N*N];
for ( int i = 0; i < N*N; ++i ) {
a[i] = 1.0f; b[i] = 3.5f; }
/*
printf("Matrix A:\n");
for(int i=0; i<N*N; i++)
{
printf("\t%f",a[i]);
if((i+1)%N==0)
printf("\n");
}
printf("Matrix B:\n");
for(int i=0; i<N*N; i++)
{
printf("\t%f",b[i]);
if((i+1)%N==0)
printf("\n");
}
*/
struct timeval TimeValue_Start;
struct timezone TimeZone_Start;
struct timeval TimeValue_Final;
struct timezone TimeZone_Final;
long time_start, time_end;
double time_overhead;
float *ad, *bd, *cd;
const int size = N*N*sizeof(float);
cudaMalloc( (void**)&ad, size );
cudaMalloc( (void**)&bd, size );
cudaMalloc( (void**)&cd, size );
cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );
dim3 dimBlock( blocksize, blocksize );
dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
gettimeofday(&TimeValue_Start, &TimeZone_Start);
add_matrix_on_gpu<<<dimGrid, dimBlock>>>( ad, bd, cd, N );
/* Copy the GPU result back to the host and compute the CPU reference for comparison. */
cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );
gettimeofday(&TimeValue_Final, &TimeZone_Final);
time_start = TimeValue_Start.tv_sec * 1000000 + TimeValue_Start.tv_usec;
time_end = TimeValue_Final.tv_sec * 1000000 + TimeValue_Final.tv_usec;
time_overhead = (time_end - time_start) / 1000000.0;
add_matrix_on_cpu( a, b, d );
/*
printf("result is:\n");
for(int i=0; i<N*N; i++)
{
printf("\t%f%f",c[i],d[i]);
if((i+1)%N==0)
printf("\n");
}
*/
for(int i=0; i<N*N; i++)
assert(c[i]==d[i]);
printf("\n\t\t Time in Seconds (T)
: %lf\n\n",time_overhead);
Page 36
Practical 6(c)
Aim:
To write a CUDA program for prefix Sum
Program:
#include<stdio.h>
#include<cuda.h>
#include <assert.h>
#include<sys/time.h>
#define N 5
#define BLOCKSIZE 5
__global__ void PrefixSum(float *dInArray, float *dOutArray, int arrayLen, int threadDim)
{
int tidx = threadIdx.x;
int tidy = threadIdx.y;
int tindex = (threadDim * tidx) + tidy;
int maxNumThread = threadDim * threadDim;
int pass = 0;
int count ;
int curEleInd;
float tempResult = 0.0;
while( (curEleInd = (tindex + maxNumThread * pass)) < arrayLen )
{
tempResult = 0.0f;
for( count = 0; count <= curEleInd; count++)
tempResult += dInArray[count];
dOutArray[curEleInd] = tempResult;
pass++;
}
__syncthreads();
}//end of Prefix sum function
void PrefixSum_cpu(float *x_h, float *z_h)
{
int i;
float sum = 0.0f;
/* Serial prefix sum used to verify the GPU result. */
for(i = 0; i < N; i++)
{
sum += x_h[i];
z_h[i] = sum;
}
}

printf("\n\t\t Time in Seconds (T) : %lf\n\n",time_overhead);
free(x_h);
free(y_h);
free(z_h);
cudaFree(x_d);
cudaFree(y_d);
return 0;
}
Conclusion: Thus, I have implemented CUDA program for prefix sum
Practical 6(d)
Aim:
To write a CUDA program for Matrix Transpose
Program:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <cuda.h>
#include <assert.h>
#include <sys/time.h>
const int N = 8;
const int blocksize = 4;
__global__ void transpose_naive( float *out, float *in, const int N ) {
unsigned int xIdx = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int yIdx = blockDim.y * blockIdx.y + threadIdx.y;
if ( xIdx < N && yIdx < N ) {
unsigned int idx_in = xIdx + N * yIdx;
unsigned int idx_out = yIdx + N * xIdx;
out[idx_out] = in[idx_in];
}
}
void mat_trans_cpu(float *a, float *c)
{
int mn = N*N;
/* N rows and N columns */
int q = mn - 1;
int i = 0;
/* Index of 1D array that represents the matrix */
do
{
int k = (i*N) % q;
while (k>i)
k = (N*k) % q;
if (k!=i)
{
c[k] = a[i];
printf("\n\t\t Time in Seconds (T) : %lf\n\n",time_overhead);
cudaFree( ad ); cudaFree( bd );
delete[] a; delete[] b; delete[] c;
return EXIT_SUCCESS;
}
Conclusion: Thus, I have implemented CUDA program for matrix transpose
Practical 6(e)
Aim:
To write a CUDA program for vector addition
Program:
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>
#define N 4096 // size of array
__global__ void vectorAdd(int *a,int *b, int *c)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if(tid < N){
c[tid] = a[tid]+b[tid];
}
}
int main(int argc, char *argv[])
{
int T = 10, B = 1; // threads per block and blocks per grid
int a[N],b[N],c[N]; // vectors, statically declared
int *dev_a, *dev_b, *dev_c;
printf("Size of array = %d\n", N);
do {
printf("Enter number of threads per block (1024 max, comp. cap. 2.x ");
scanf("%d",&T);
printf("\nEnter number of blocks per grid: ");
scanf("%d",&B);
if (T * B < N) printf("Error T x B < N, try again\n");
} while (T * B < N);
cudaEvent_t start, stop; // using cuda events to measure time
float elapsed_time_ms;
cudaMalloc((void**)&dev_a,N * sizeof(int));
cudaMalloc((void**)&dev_b,N * sizeof(int));
cudaMalloc((void**)&dev_c,N * sizeof(int));
/* Initialise the input vectors (values chosen arbitrarily for the test). */
for(int i=0;i<N;i++) {
a[i] = i;
b[i] = 2*i;
}
cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
vectorAdd<<<B,T>>>(dev_a, dev_b, dev_c);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms, start, stop);
cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
for(int i=0;i<N;i++) {
printf("%d+%d=%d\n",a[i],b[i],c[i]);
assert(c[i]==(a[i]+b[i]));
}
printf("Time to calculate results: %f ms.\n", elapsed_time_ms);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
Conclusion: Thus, I have implemented CUDA program for vector addition
Practical 6(f)
Aim:
To write a CUDA program for vector multiplication
Program:
#include <stdio.h>
#include <cuda.h>
#include <sys/time.h>
#include <assert.h>
__global__ void mult_vect(float * x, float * y, float * z, int n)
{
int idx= blockIdx.x * blockDim.x + threadIdx.x;
if(idx < n)
{
z[idx] = x[idx] * y[idx];
}
}
int main()
{
float *x_h, *y_h, *z_h;
float *x_d, *y_d, *z_d;
int n= 20,i;
size_t size= n * sizeof(float);
struct timeval TimeValue_Start;
struct timezone TimeZone_Start;
struct timeval TimeValue_Final;
struct timezone TimeZone_Final;
long time_start, time_end;
double time_overhead;
/* allocating memory on CPU */
x_h= (float *)malloc(size);
y_h= (float *)malloc(size);
z_h= (float *)malloc(size);
printf("\n\t\t Time in Seconds (T) : %lf\n\n",time_overhead);
free(x_h);
free(y_h);
free(z_h);
cudaFree(x_d);
cudaFree(y_d);
cudaFree(z_d);
return 0;
}
Conclusion: Thus, I have implemented CUDA program for vector multiplication
Practical 7
Case Study: Differentiating between CUDA Programming and OpenCL Programming
Performance
The first feature is Performance. Both CUDA and OpenCL are fast, and on GPU
devices they are much faster than the CPU, with 10X speedups commonly seen on
data-parallel problems.
Both CUDA and OpenCL can fully utilize the hardware. Performance depends upon a
slew of variables, including hardware type, algorithm type, and code quality.
Scalability
With respect to scalability, there are some other interesting developments of note. The
first is that there is new technology in CUDA called GPUDirect that is aimed at
reducing memory transfer overheads when communicating between multiple GPUs.
It has optimizations to reduce overhead by allowing peer-to-peer memory transfers
between GPUs on the same PCI express bus. It also has optimization to reduce the
overhead of moving data from GPU memory to a network interface card. This is
certainly an interesting development, but it is too new for us to say if it offers enough
benefit to be an important technology.
The second interesting development is in mobile GPU computing. OpenCL has
quickly become the most pervasive way to do GPU computing on mobile devices,
including smartphones and tablets. Companies like ARM, Imagination Technologies,
Freescale, Qualcomm, Samsung, and others are all enabling their mobile GPUs to run
OpenCL codes. There are more mobile devices sold each year than there are PCs, so
this is a huge community that is beginning to put its support behind OpenCL. At
AccelerEyes, we have done several GPU consulting projects on mobile GPUs and are
believers that there is big benefit to accelerating apps, especially computer vision and
video processing apps, directly on the phone or tablet.
Portability
The third feature is Portability. This is perhaps the most recognizable difference
between CUDA and OpenCL. CUDA only runs on NVIDIA GPUs, while OpenCL is
the open industry standard and runs on AMD, Intel, NVIDIA, and other hardware
devices.
Also, with respect to portability, CUDA does not provide CPU fallback. Currently,
developers using CUDA typically put if-statements in their code that distinguish
between the presence or absence of a GPU device at runtime. In contrast, with
OpenCL, CPU fallback is supported, which makes code maintenance much easier.
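As an illustration of the kind of runtime check described above, the sketch below uses cudaGetDeviceCount() to decide between a GPU path and a CPU path; add_on_cpu() is a hypothetical application routine used only for this example.

#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical CPU fallback routine used only for this example. */
void add_on_cpu(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

int main(void)
{
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);

    if (err == cudaSuccess && deviceCount > 0) {
        printf("Found %d CUDA device(s); taking the GPU path.\n", deviceCount);
        /* ... allocate device memory and launch CUDA kernels here ... */
    } else {
        printf("No CUDA device available; falling back to the CPU path.\n");
        /* ... call add_on_cpu() and the rest of the CPU implementation here ... */
    }
    return 0;
}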
Community
The fourth feature is Community. This is the feature that encompasses support,
longevity, commitment, etc. As those things are hard to measure, we put together a
proxy. It is interesting to look at the number of forum topics on NVIDIA's CUDA
forums at nearly 27,000 and AMD's OpenCL forums at about 4,000. Also, the neutral 3rd
party site Stackoverflow has tags for CUDA and OpenCL, with the number of CUDA
tags being over 3X the number of OpenCL tags. As you would expect, there are many
more people doing CUDA programming today due to the great investment NVIDIA
has put into building the ecosystem for GPU computing.
Programmability
The fifth and final feature is Programmability. Both CUDA and OpenCL are low-level. It is time consuming to do GPU kernel development in either of those
platforms. The bulk of that time is often spent in redesigning algorithms to exploit
data-parallelism.
Libraries really make all the difference in GPU computing. To compare and contrast
CUDA versus OpenCL, it is important to look at the libraries available on each
platform. OpenCL has better library support compared to CUDA.
Mini Project: Dijkstra's Algorithm