0% found this document useful (0 votes)

204 views

PC File

The MPI program calculates an approximation of pi in parallel. Each processor calculates a portion of the sum and then the results are reduced to calculate the total sum. The root processor broadcasts the number of intervals to all processors and gathers the results. The program calculates pi to within a specified error tolerance of the true value of pi.

Uploaded by

Avinash Vad

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

204 views

PC File

Uploaded by

Avinash Vad

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 57

Parallel Computing Lab

Practical 1
Introduction to OpenMP

OpenMP (Open Multi-Processing) is an API that supports multi-platform shared memory

multiprocessing programming in C,C++ and Fortran on most processor architectures and
operating systems, including Solaris, AIX, HP-UX, Linux, Mac OS X, and Windows
platforms.

It consists of a set of compiler directives, library routines, and environment variables that
influence run-time behavior.

OpenMP uses a portable, scalable model that gives programmers a simple and flexible
interface for developing parallel applications for platforms ranging from the standard
desktop computer to the supercomputer.

OpenMP is an implementation of multithreading, a method of parallelizing whereby a

master thread (a series of instructions executed consecutively) forks a specified number of
slave threads and a task is divided among them.

The threads then run concurrently, with the runtime environment allocating threads to
different processors.

The section of code that is meant to run in parallel is marked accordingly, with a
preprocessor directive that will cause the threads to form before the section is executed.

Each thread has an id attached to it which can be obtained using a function (called
omp_get_thread_num()).

The thread id is an integer, and the master thread has an id of 0.

After the execution of the parallelized code, the threads join back into the master thread,
which continues onward to the end of the program.

By default, each thread executes the parallelized section of code independently.

Work-sharing constructs can be used to divide a task among the threads so that each thread
executes its allocated part of the code. Both task parallelism and data parallelism can be
achieved using OpenMP in this way

The runtime environment allocates threads to processors depending on usage, machine load
and other factors.

The number of threads can be assigned by the runtime environment based on environment
variables or in code using functions.

The OpenMP functions are included in a header file labelled omp.h in C/C++.

Veermata Jijabai Technological Institute

Page 1

Parallel Computing Lab

The output may also be garbled because of the race condition caused from the two threads
sharing the standard output.

Fork Join Model

Program:
#include<stdio.h>
#include<omp.h>
int main(){
int nthreads, tid;
/* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(nthreads, tid)
{
/* Obtain thread number */
tid = omp_get_thread_num();
printf("Hello World from thread = %d\n", tid);
/* Only master thread does this */
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
} /* All threads join master thread and disband */
}

Veermata Jijabai Technological Institute

Page 2

Parallel Computing Lab

Output:

Conclusion:
Thus we have implemented a simple program in openMP using threads in C

Veermata Jijabai Technological Institute

Page 3

Parallel Computing Lab

PRACTICAL 2
Aim:
To write a C program to solve the producer consumer model using threads
Theory:
In this program, the master thread will act as a producer while the other threads will wait until the
master thread creates buffer, and once added the master notifies the threads using a shared variable
and all other threads will consume the data
Program:
#include<stdio.h>
#include<omp.h>
int main()
{
int i=0;
int x=0;
#pragma omp parallel shared(i)
{
if(omp_get_thread_num()==0)
{
printf("Master thread with Thread ID:%d\n", omp_get_thread_num());
printf("Since it is the producer thread It is adding some data to be consumed by other consumer
threads\n");
i+=10;
x=1;
} else
{
while(x==0)
printf("Waiting for buffer to be filled. Thread ID: %d\n",omp_get_thread_num());
#pragma critical
{
if(i>0){
printf("Data is consumed by Consumer with Thread ID: %d\n",omp_get_thread_num());
i-=5;
} else {
printf("Could not find any data for thread ID: %d\n",omp_get_thread_num());
}}
}
}
}

Veermata Jijabai Technological Institute

Page 4

Parallel Computing Lab

Output :

Conclusion:
Thus, we have implemented and studied the producer consumer problem using openMP

Veermata Jijabai Technological Institute

Page 5

Parallel Computing Lab

OpenMP
Programming

Veermata Jijabai Technological Institute

Page 6

Parallel Computing Lab

Practical 3(a)
Aim:
To write an OpenMP Program of Matrix Matrix Multiplication
Program:
#include<stdio.h>
#include<omp.h>
int main(){
int i,j,k,m,n,p;
printf("Enter the number of rows in Matrix 1:");
scanf("%d",&m);
int *matrixA[m];
printf("Enter the number of columns in Matrix 1:");
scanf("%d",&n);
for(i=0;i<m;i++){
matrixA[i] = (int *)malloc(n*sizeof(int));
}
printf("<--Now Input the values for matrix 1 row-wise-->\n");
for(i=0;i<m;i++){
for(j=0;j<n;j++){
scanf("%d",&matrixA[i][j]);
}
}
printf("Enter the number of columns in Matrix 2:");
scanf("%d",&p);
int *matrixB[n];
for(i=0;i<n;i++){
matrixB[i] = (int *)malloc(p*sizeof(int));
}
printf("<--Now Input the values for matrix 2 row-wise-->\n");
for(i=0;i<n;i++){
for(j=0;j<p;j++){
scanf("%d",&matrixB[i][j]);
}
}
int matrixC[m][p];
#pragma omp parallel private(i,j,k) shared(matrixA,matrixB,matrixC)
{
#pragma omp for schedule(static)
for (i=0; i<m; i=i+1){
Veermata Jijabai Technological Institute

Page 7

Parallel Computing Lab

for (j=0; j<p; j=j+1){
matrixC[i][j] = 0;
for (k=0; k<n; k=k+1){
matrixC[i][j]=(matrixC[i][j])+((matrixA[i][k])*(matrixB[k][j]));
}
}
}
}
printf("The output after Matrix Multiplication is: \n");
for(i=0;i<m;i++){
for(j=0;j<p;j++)
printf("%d \t",matrixC[i][j]);
printf("\n");
}
return 0;
}
Output:

Conclusion: Thus, I have implemented OpenMP program for matrix multiplication

Veermata Jijabai Technological Institute

Page 8

Parallel Computing Lab

Practical 3(b)
Aim:
To write the openMP program to find prime numbers between 2 and the given
number, and stores all the prime numbers in an array.
Program:
#include<stdio.h>
#include<omp.h>
int IsPrime(int number) {
int i;
for (i = 2; i < number; i++) {
if (number % i == 0 && i != number) return 0;
}
return 1;
}
int main(){
int noOfThreads,valueN,indexCount=0,arrayVal[10000],tempValue;
printf("Enter the Number of threads: ");
scanf("%d",&noOfThreads);
printf("Enter the value of N: ");
scanf("%d",&valueN);
omp_set_num_threads(noOfThreads);
#pragma omp parallel for reduction(+:indexCount)
for(tempValue=2;tempValue<=valueN;tempValue++){
if(IsPrime(tempValue)){
arrayVal[indexCount] = tempValue;
indexCount++;
}
}
printf("Number of prime numbers between 2 and %d: %d\n",valueN,indexCount);
return 0;
}

Veermata Jijabai Technological Institute

Page 9

Parallel Computing Lab

Output:

Conclusion: Thus, I have implemented OpenMP program for finding prime numbers
between 2 and N

Veermata Jijabai Technological Institute

Page 10

Parallel Computing Lab

Practical 3(c)
Aim:
To write an OpenMP program to print Largest of an element in an array
Program:
#include<stdio.h>
#include<omp.h>
int main(){
int numberOfElements,currentMax=-1,iIterator,arrayInput[10000];
printf("Enter the Number of Elements: ");
scanf("%d",&numberOfElements);
for(iIterator=0;iIterator<numberOfElements;iIterator++){
scanf("%d",&arrayInput[iIterator]);
}
#pragma omp parallel for shared(currentMax)
for(iIterator=0;iIterator<numberOfElements;iIterator++){
#pragma omp critical
if(arrayInput[iIterator] > currentMax){
currentMax = arrayInput[iIterator];
}
}
printf("The Maximum Element is: %d\n",currentMax);
return 0;

Veermata Jijabai Technological Institute

Page 11

Parallel Computing Lab

}Output:

Conclusion: Thus, I have implemented OpenMP program for finding largest element
in an array

Veermata Jijabai Technological Institute

Page 12

Parallel Computing Lab

Practical 3(d)
Aim:
To write an OpenMP program for PI calculation
Program:
#include<stdio.h>
#include<omp.h>
int main(){
int num_steps=10000,i;
double aux,pi,step = 1.0/(double) num_steps,x=0.0,sum = 0.0;
#pragma omp parallel private(i,x,aux) shared(sum)
{
#pragma omp for schedule(static)
for (i=0; i<num_steps; i=i+1){
x=(i+0.5)*step;
aux=4.0/(1.0+x*x);
#pragma omp critical
sum = sum + aux;
}
}
pi=step*sum;
printf("The Value of PI is %lf\n",pi);
return 0;
}
Output:

Conclusion: Thus, I have implemented OpenMP program for calculating value of PI

Veermata Jijabai Technological Institute

Page 13

Parallel Computing Lab

MPI Programming

Veermata Jijabai Technological Institute

Page 14

Parallel Computing Lab

Practical 4(a)
Aim:
To write a simple MPI program for calculating Rank and Number of processor.
Program:
#include <mpi.h>
#include <stdio.h>
int main (int argc, char* argv[])
{
int rank, size;
MPI_Init (&argc, &argv);
/* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
/* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size);
/* get number of processes
*/
printf( "Hello world from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}
Output:

Conclusion: Thus, I have implemented MPI program for calculating rank and
number of processors

Veermata Jijabai Technological Institute

Page 15

Parallel Computing Lab

Practical 4(b)
Aim:
To write a MPI program for PI calculation
Program:
#include "mpi.h"
#include <stdio.h>
#include <math.h>
int main( int argc, char *argv[] )
{
int n, myid, numprocs, i;
double PI25DT = 3.141592653589793238462643;
double mypi, pi, h, sum, x;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
while (1) {
if (myid == 0) {
printf("Enter the number of intervals: (0 quits) ");
scanf("%d",&n);
}
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (n == 0)
break;
else {
h = 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs) {
x = h * ((double)i - 0.5);
sum += (4.0 / (1.0 + x*x));
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
MPI_COMM_WORLD);
if (myid == 0)
printf("pi is approximately %.16f, Error is %.16f\n",
pi, fabs(pi - PI25DT));
}
}
Veermata Jijabai Technological Institute

Page 16

Parallel Computing Lab

MPI_Finalize();
return 0;
}
Output:

Conclusion: Thus, I have implemented MPI program for calculating value of PI

Veermata Jijabai Technological Institute

Page 17

Parallel Computing Lab

Practical 4(c)
Aim:
To write an Advanced MPI program that has a total number of 4 processes. The
process with rank 0 should send VJTI letter to all the processes using MPI_Scatter
call
Program:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define SIZE 4
int main (int argc, char *argv[])
{
int numtasks, rank, sendcount, recvcount, source;
char sendbuf[SIZE][SIZE] = {
{'V','J','T','I'},
{'V','J','T','I'},
{'V','J','T','I'},
{'V','J','T','I'}};
char recvbuf[SIZE];
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if (numtasks == SIZE) {
source = 0;
sendcount = SIZE;
recvcount = SIZE;
MPI_Scatter(sendbuf,sendcount,MPI_CHAR,recvbuf,recvcount,
MPI_CHAR,source,MPI_COMM_WORLD);
printf("rank= %d Results: %c %c %c %c\n",rank,recvbuf[0],
recvbuf[1],recvbuf[2],recvbuf[3]);
}
else
printf("Must specify %d processors. Terminating.\n",SIZE);
MPI_Finalize();
}
Veermata Jijabai Technological Institute

Page 18

Parallel Computing Lab

Output:

Conclusion: Thus, I have implemented advanced MPI program for Scattering VJTI
to all the processes be root process using MPI_Scatter Call

Veermata Jijabai Technological Institute

Page 19

Parallel Computing Lab

Practical 4(d)
Aim:
To write an Advanced MPI program to find maximum value in array of six integers
with 6 processes and print the result in root process using MPI_Reduce call.
Program:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define SIZE 4
int main (int argc, char *argv[])
{
int rank,numtasks,array[6] = {100,600,300,800,250,720},i,inputNumber;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
printf("Local Input for process %d is %d\n",rank,array[rank]);
inputNumber = array[rank];
int maxNumber;
MPI_Reduce(&inputNumber, &maxNumber, 1, MPI_INT, MPI_MAX, 0,
MPI_COMM_WORLD);
// Print the result
if (rank == 0) {
printf("Maximum of all is: %d\n",maxNumber);
}
MPI_Finalize();
}

Veermata Jijabai Technological Institute

Page 20

Parallel Computing Lab

Output:

Conclusion: Thus, I have implemented advanced MPI program for finding the
maximum of all the elements in an array of 6 elements using 6 processes and
understood the use of MPI_Reduce Call

Veermata Jijabai Technological Institute

Page 21

Parallel Computing Lab

Practical 4(e)
Aim:
To write a MPI program for Ring topology
Program:
#include <stdio.h>
#include "mpi.h"
int main(int argc,char *argv[])
{
int
MyRank, Numprocs, Root = 0;
int
value, sum = 0;
int
Source, Source_tag;
int
Destination, Destination_tag;
MPI_Status status;
/* Initialize MPI */
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&Numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&MyRank);
if (MyRank == Root){
Destination = MyRank + 1;
Destination_tag = 0;
MPI_Send(&MyRank, 1, MPI_INT, Destination, Destination_tag,
MPI_COMM_WORLD);
}
else{
if(MyRank<Numprocs-1){
Source = MyRank - 1;
Source_tag = 0;
MPI_Recv(&value, 1, MPI_INT, Source, Source_tag,
MPI_COMM_WORLD, &status);
sum = MyRank + value;
Destination = MyRank + 1;
Destination_tag = 0;
MPI_Send(&sum, 1, MPI_INT, Destination, Destination_tag,
MPI_COMM_WORLD);
Veermata Jijabai Technological Institute

Page 22

Parallel Computing Lab

}
else{
Source = MyRank - 1;
Source_tag = 0;
MPI_Recv(&value, 1, MPI_INT, Source, Source_tag,
MPI_COMM_WORLD, &status);
sum = MyRank + value;
}
}
if (MyRank == Root)
{
Source = Numprocs - 1;
Source_tag = 0;
MPI_Recv(&sum, 1, MPI_INT, Source, Source_tag,
MPI_COMM_WORLD, &status);
printf("MyRank %d Final SUM %d\n", MyRank, sum);
}
if(MyRank == (Numprocs - 1)){
Destination = 0;
Destination_tag = 0;
MPI_Send(&sum, 1, MPI_INT, Destination, Destination_tag,
MPI_COMM_WORLD);
}
MPI_Finalize();
}

Veermata Jijabai Technological Institute

Page 23

Parallel Computing Lab

Output:

Conclusion: Thus, I have implemented MPI program for ring topology

Veermata Jijabai Technological Institute

Page 24

Parallel Computing Lab

Numerical
Computing
Programming

Veermata Jijabai Technological Institute

Page 25

Parallel Computing Lab

Practical 5(a)
Aim:
To write any one numerical computing programming for implementing Trapezoid
Rule with MPI
Program:
#include <stdio.h>
/* We'll be using MPI routines, definitions, etc. */
#include <mpi.h>
void Get_data(int p, int my_rank, double* a_p, double* b_p, int* n_p);
double Trap(double local_a, double local_b, int local_n,
double h); /* Calculate local area */
double f(double x); /* function we're integrating */
int main(int argc, char** argv) {
int
my_rank; /* My process rank
*/
int
p;
/* The number of processes */
double
a;
/* Left endpoint
*/
double
b;
/* Right endpoint
*/
int
n;
/* Number of trapezoids
*/
double
h;
/* Trapezoid base length */
double
local_a; /* Left endpoint my process */
double
local_b; /* Right endpoint my process */
int
local_n; /* Number of trapezoids for */
/* my calculation
*/
double
my_area; /* Integral over my interval */
double
total; /* Total area
*/
int
source; /* Process sending area
*/
int
dest = 0; /* All messages go to 0
*/
int
tag = 0;
MPI_Status status;
/* Let the system do what it needs to start up MPI */
MPI_Init(&argc, &argv);
Veermata Jijabai Technological Institute

Page 26

Parallel Computing Lab

/* Get my process rank */
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
/* Find out how many processes are being used */
MPI_Comm_size(MPI_COMM_WORLD, &p);
Get_data(p, my_rank, &a, &b, &n);
h = (b-a)/n; /* h is the same for all processes */
local_n = n/p; /* So is the number of trapezoids */
/* Length of each process' interval of
* integration = local_n*h. So my interval
* starts at: */
local_a = a + my_rank*local_n*h;
local_b = local_a + local_n*h;
my_area = Trap(local_a, local_b, local_n, h);
/* Add up the areas calculated by each process */
if (my_rank == 0) {
total = my_area;
for (source = 1; source < p; source++) {
MPI_Recv(&my_area, 1, MPI_DOUBLE, source, tag,
MPI_COMM_WORLD, &status);
total = total + my_area;
}
} else {
MPI_Send(&my_area, 1, MPI_DOUBLE, dest,
tag, MPI_COMM_WORLD);
}
/* Print the result */
if (my_rank == 0) {
printf("With n = %d trapezoids, our estimate\n",
n);
printf("of the area from %f to %f = %.15f\n",
a, b, total);
}
/* Shut down MPI */
MPI_Finalize();
return 0;
Veermata Jijabai Technological Institute

Page 27

Parallel Computing Lab

} /* main */
/*-----------------------------------------------------------------* Function: Get_data
* Purpose:
Read in the data on process 0 and send to other
*
processes
* Input args: p, my_rank
* Output args: a_p, b_p, n_p
*/
void Get_data(int p, int my_rank, double* a_p, double* b_p, int* n_p) {
int
q;
MPI_Status status;
if (my_rank == 0) {
printf("Enter a, b, and n\n");
scanf("%lf %lf %d", a_p, b_p, n_p);
for (q = 1; q < p; q++) {
MPI_Send(a_p, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD);
MPI_Send(b_p, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD);
MPI_Send(n_p, 1, MPI_INT, q, 0, MPI_COMM_WORLD);
}
} else {
MPI_Recv(a_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
MPI_Recv(b_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
MPI_Recv(n_p, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
} /* Get_data */
/*-----------------------------------------------------------------* Function: Trap
* Purpose:
Estimate a definite area using the trapezoidal
*
rule
* Input args: local_a (my left endpoint)
*
local_b (my right endpoint)
*
local_n (my number of trapezoids)
*
h (stepsize = length of base of trapezoids)
* Return val: Trapezoidal rule estimate of area from
*
local_a to local_b
*/
double Trap(
double local_a /* in */,
double local_b /* in */,
Veermata Jijabai Technological Institute

Page 28

Parallel Computing Lab

int local_n /* in */,
double h
/* in */) {
double my_area; /* Store my result in my_area */
double x;
int i;
my_area = (f(local_a) + f(local_b))/2.0;
x = local_a;
for (i = 1; i <= local_n-1; i++) {
x = local_a + i*h;
my_area = my_area + f(x);
}
my_area = my_area*h;
return my_area;
} /* Trap */
/*-----------------------------------------------------------------* Function: f
* Purpose: Compute value of function to be integrated
* Input args: x
*/
double f(double x) {
double return_val;
return_val = x*x + 1.0;
return return_val;
} /* f */

Veermata Jijabai Technological Institute

Page 29

Parallel Computing Lab

Output:

Conclusion: Thus, I have implemented MPI program for Trapezoidal Rule

(Numerical Computing problem)

Veermata Jijabai Technological Institute

Page 30

Parallel Computing Lab

Practical 5(b)
Aim:
To write any one numerical computing programming for implementing Gaussian
Filter with MPI
Theory:
In electronics and signal processing, a Gaussian filter is a filter whose impulse
response is a Gaussian function (or an approximation to it). Gaussian filters have the
properties of having no overshoot to a step function input while minimizing the rise
and fall time. This behavior is closely connected to the fact that the Gaussian filter
has the minimum possible group delay. It is considered the ideal time domain filter,
just as the sinc is the ideal frequency domain filter. These properties are important in
areas such as oscilloscopes and digital telecommunication systems.
In two dimensions, it is the product of two such Gaussians, one per direction:

where x is the distance from the origin in the horizontal axis, y is the distance from
the origin in the vertical axis, and is the standard deviation of the Gaussian
distribution.
Shape of the impulse response of a typical Gaussian filter

Conclusion: Thus, I have studied about Gaussian filter and its implementation details

Veermata Jijabai Technological Institute

Page 31

Parallel Computing Lab

CUDA
Programming

Veermata Jijabai Technological Institute

Page 32

Parallel Computing Lab

Practical 6(a)
Aim:
To write a simple CUDA Program for Hello World.
Theory:
'CUDA ' stands for Compute Unified Device Architecture, it is a parallel computing
platform and programming model created by NVIDIA and implemented by the
graphics processing units (GPUs) that they produce. CUDA gives developers direct
access to the virtual instruction set and memory of the parallel computational
elements in CUDA GPUs.
Using CUDA, the GPUs can be used for general purpose processing (i.e., not
exclusively graphics); this approach is known as GPGPU. Unlike CPUs, however,
GPUs have a parallel throughput architecture that emphasizes executing many
concurrent threads slowly, rather than executing a single thread very quickly.
Program:
#include <stdio.h>
#include <cuda.h>
#include <sys/time.h>
#include <assert.h>
__global__ void kernel (void)
{
}
int main(void){
kernel<<input parameters>>();
printf(Hello, World);
return 0;
}
Conclusion: Thus, I have studied basic about CUDA programming and implemented
Hello World program

Veermata Jijabai Technological Institute

Page 33

Parallel Computing Lab

Practical 6(b)
Aim:
To write a CUDA program for Matrix addition
Program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <string.h>
#include <cuda.h>
#include <assert.h>
const int N = 4;
const int blocksize = 2;
__global__ void add_matrix_on_gpu( float* a, float *b, float *c, int N )
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int index = i + j*N;
if ( i < N && j < N )
c[index] = a[index] + b[index];
}
void add_matrix_on_cpu(float *a, float *b, float *d)
{
int i;
for(i = 0; i < N*N; i++)
d[i] = a[i]+b[i];
}
int main()
{
float *a = new float[N*N];
float *b = new float[N*N];
float *c = new float[N*N];
float *d = new float[N*N];
for ( int i = 0; i < N*N; ++i ) {
a[i] = 1.0f; b[i] = 3.5f; }
Veermata Jijabai Technological Institute

Page 34

Parallel Computing Lab

printf("Matrix A:\n");
for(int i=0; i<N*N; i++)
{
printf("\t%f",a[i]);
if((i+1)%N==0)
printf("\n");
}
printf("Matrix B:\n");
for(int i=0; i<N*N; i++)
{
printf("\t%f",b[i]);
if((i+1)%N==0)
printf("\n");
}

*/
struct timeval TimeValue_Start;
struct timezone TimeZone_Start;
struct timeval TimeValue_Final;
struct timezone TimeZone_Final;
long
time_start, time_end;
double
time_overhead;
float *ad, *bd, *cd;
const int size = N*N*sizeof(float);
cudaMalloc( (void**)&ad, size );
cudaMalloc( (void**)&bd, size );
cudaMalloc( (void**)&cd, size );
cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );
dim3 dimBlock( blocksize, blocksize );
dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
gettimeofday(&TimeValue_Start, &TimeZone_Start);
add_matrix_on_gpu<<<dimGrid, dimBlock>>>( ad, bd, cd, N );
Veermata Jijabai Technological Institute

Page 35

Parallel Computing Lab

gettimeofday(&TimeValue_Final, &TimeZone_Final);
cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );
add_matrix_on_cpu(a,b,d);
time_end = TimeValue_Final.tv_sec * 1000000 + TimeValue_Final.tv_usec;
time_start = TimeValue_Start.tv_sec * 1000000 + TimeValue_Start.tv_usec;
time_overhead = (time_end - time_start)/1000000.0;
/*

printf("result is:\n");
for(int i=0; i<N*N; i++)
{
printf("\t%f%f",c[i],d[i]);
if((i+1)%N==0)
printf("\n");
}

*/
for(int i=0; i<N*N; i++)
assert(c[i]==d[i]);
printf("\n\t\t Time in Seconds (T)

: %lf\n\n",time_overhead);

cudaFree( ad ); cudaFree( bd ); cudaFree( cd );

delete[] a; delete[] b; delete[] c, delete[] d;
return EXIT_SUCCESS;
}
Conclusion: Thus, I implemented CUDA program for matrix addition

Veermata Jijabai Technological Institute

Page 36

Parallel Computing Lab

Practical 6(c)
Aim:
To write a CUDA program for prefix Sum
Program:
#include<stdio.h>
#include<cuda.h>
#include <assert.h>
#include<sys/time.h>
#define N 5
#define BLOCKSIZE 5
__global__ void PrefixSum(float *dInArray, float *dOutArray, int arrayLen, int
threadDim)
{
int tidx = threadIdx.x;
int tidy = threadIdx.y;
int tindex = (threadDim * tidx) + tidy;
int maxNumThread = threadDim * threadDim;
int pass = 0;
int count ;
int curEleInd;
float tempResult = 0.0;
while( (curEleInd = (tindex + maxNumThread * pass)) < arrayLen )
{
tempResult = 0.0f;
for( count = 0; count <= curEleInd; count++)
tempResult += dInArray[count];
dOutArray[curEleInd] = tempResult;
pass++;
}
__syncthreads();
}//end of Prefix sum function
void PrefixSum_cpu(float *x_h, float *z_h)
{
int i;
Veermata Jijabai Technological Institute

Page 37

Parallel Computing Lab

for(i=0; i<N; i++)
{
if(i==0)
z_h[i]=x_h[i];
else
z_h[i]=z_h[i-1]+x_h[i];
}
}
int main()
{
float *x_h, *y_h, *z_h;
float *x_d, *y_d;
int i;
struct timeval TimeValue_Start;
struct timezone TimeZone_Start;
struct timeval TimeValue_Final;
struct timezone TimeZone_Final;
long
time_start, time_end;
double
time_overhead;
size_t size = N*sizeof(float);
x_h = (float *)malloc(size);
y_h = (float *)malloc(size);
z_h = (float *)malloc(size);
cudaMalloc((void **)&x_d,size);
cudaMalloc((void **)&y_d,size);
for(i=0; i<N; i++)
{
x_h[i] = (float) i+1;
}
cudaMemcpy(x_d,x_h,size,cudaMemcpyHostToDevice);
dim3 dimBlock(BLOCKSIZE,BLOCKSIZE);
dim3 dimGrid(1,1);
Veermata Jijabai Technological Institute

Page 38

Parallel Computing Lab

gettimeofday(&TimeValue_Start, &TimeZone_Start);
PrefixSum<<<dimGrid, dimBlock>>>(x_d, y_d, N, BLOCKSIZE);
gettimeofday(&TimeValue_Final, &TimeZone_Final);
cudaMemcpy(y_h,y_d,size,cudaMemcpyDeviceToHost);
PrefixSum_cpu(x_h,z_h);
for(i = 0; i < N; i++)
assert(y_h[i]==z_h[i]);
time_end = TimeValue_Final.tv_sec * 1000000 + TimeValue_Final.tv_usec;
time_start = TimeValue_Start.tv_sec * 1000000 + TimeValue_Start.tv_usec;
time_overhead = (time_end - time_start)/1000000.0;
printf("\n\t\t Time in Seconds (T)

: %lf\n\n",time_overhead);

free(x_h);
free(y_h);
free(z_h);
cudaFree(x_d);
cudaFree(y_d);
return 0;
}
Conclusion: Thus, I have implemented CUDA program for prefix sum

Veermata Jijabai Technological Institute

Page 39

Parallel Computing Lab

Practical 6(d)
Aim:
To write a CUDA program for Matrix Transpose
Program:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <cuda.h>
#include <assert.h>
#include <sys/time.h>
const int N = 8;
const int blocksize = 4;
__global__ void transpose_naive( float *out, float *in, const int N ) {
unsigned int xIdx = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int yIdx = blockDim.y * blockIdx.y + threadIdx.y;
if ( xIdx < N && yIdx < N ) {
unsigned int idx_in = xIdx + N * yIdx;
unsigned int idx_out = yIdx + N * xIdx;
out[idx_out] = in[idx_in];
}
}
void mat_trans_cpu(float *a, float *c)
{
int mn = N*N;
/* N rows and N columns */
int q = mn - 1;
int i = 0;
/* Index of 1D array that represents the matrix */
do
{
int k = (i*N) % q;
while (k>i)
k = (N*k) % q;
if (k!=i)
{
c[k] = a[i];
Veermata Jijabai Technological Institute

Page 40

Parallel Computing Lab

c[i] = a[k];
}
else
c[i] = a[i];
} while ( ++i <= (mn -2) );
c[i]=a[i];
/* Update row and column */
/*
matrix.M = N;
matrix.N = M;*/
}
int main()
{
float *a = new float[N*N];
float *b = new float[N*N];
float *c = new float[N*N];
int i;
for ( i = 0; i < N*N; ++i ) {
a[i] = drand48();
}
struct timeval TimeValue_Start;
struct timezone TimeZone_Start;
struct timeval TimeValue_Final;
struct timezone TimeZone_Final;
long
time_start, time_end;
double
time_overhead;
/*
for ( i = 0; i < N*N; i++)
{
printf("\t%f",a[i]);
if(((i+1)%N == 0))
printf("\n");
}
*/
float *ad, *bd ;
const int size = N*N*sizeof(float);
cudaMalloc( (void**)&ad, size );
cudaMalloc( (void**)&bd, size );
cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
Veermata Jijabai Technological Institute

Page 41

Parallel Computing Lab

dim3 dimBlock( blocksize, blocksize );

dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
gettimeofday(&TimeValue_Start, &TimeZone_Start);
transpose_naive<<<dimGrid, dimBlock>>>( bd, ad, N );
gettimeofday(&TimeValue_Final, &TimeZone_Final);
cudaMemcpy( b, bd, size, cudaMemcpyDeviceToHost );
mat_trans_cpu(a,c);
/* To display uncomment this section */
/*
printf("result matrix\n");
for ( i = 0; i < N*N; ++i ){
printf("\t%f%f",b[i],c[i]);
if( ((i+1)%N == 0))
printf("\n");
}
*/
time_end = TimeValue_Final.tv_sec * 1000000 + TimeValue_Final.tv_usec;
time_start = TimeValue_Start.tv_sec * 1000000 + TimeValue_Start.tv_usec;
time_overhead = (time_end - time_start)/1000000.0;
for(i=0; i<N*N; i++)
assert(b[i]==c[i]);
printf("\n\t\t Time in Seconds (T)

: %lf\n\n",time_overhead);

cudaFree( ad ); cudaFree( bd );
delete[] a; delete[] b, delete[] c;
return EXIT_SUCCESS;
}
Conclusion: Thus, I have implemented CUDA program for matrix transpose

Veermata Jijabai Technological Institute

Page 42

Parallel Computing Lab

Practical 6(e)
Aim:
To write a CUDA program for vector addition
Program:
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>
#define N 4096 // size of array
__global__ void vectorAdd(int *a,int *b, int *c)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if(tid < N){
c[tid] = a[tid]+b[tid];
}
}
int main(int argc, char *argv[])
{
int T = 10, B = 1; // threads per block and blocks per grid
int a[N],b[N],c[N]; // vectors, statically declared
int *dev_a, *dev_b, *dev_c;
printf("Size of array = %d\n", N);
do {
printf("Enter number of threads per block (1024 max, comp. cap. 2.x ");
scanf("%d",&T);
printf("\nEnter number of blocks per grid: ");
scanf("%d",&B);
if (T * B < N) printf("Error T x B < N, try again\n");
} while (T * B < N);
cudaEvent_t start, stop; // using cuda events to measure time
float elapsed_time_ms;
cudaMalloc((void**)&dev_a,N * sizeof(int));
cudaMalloc((void**)&dev_b,N * sizeof(int));
Veermata Jijabai Technological Institute

Page 43

Parallel Computing Lab

cudaMalloc((void**)&dev_c,N * sizeof(int));
for(int i=0;i<N;i++) { // load arrays with some numbers
a[i] = i;
b[i] = i*2;
}
cudaMemcpy(dev_a, a , N*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b , N*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(dev_c, c , N*sizeof(int),cudaMemcpyHostToDevice);
cudaEventCreate( &start ); // instrument code to measure start time
cudaEventCreate( &stop );
cudaEventRecord( start, 0 );
vectorAdd<<<B,T>>>(dev_a,dev_b,dev_c);
cudaMemcpy(c,dev_c,N*sizeof(int),cudaMemcpyDeviceToHost);
cudaEventRecord( stop, 0 ); // instrument code to measure end time
cudaEventSynchronize( stop );
cudaEventElapsedTime( &elapsed_time_ms, start, stop );
//

for(int i=0;i<N;i++) {
printf("%d+%d=%d\n",a[i],b[i],c[i]);
assert(c[i]==(a[i]+b[i]));
}
printf("Time to calculate results: %f ms.\n", elapsed_time_ms);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;

}
Conclusion: Thus, I have implemented CUDA program for vector addition

Veermata Jijabai Technological Institute

Page 44

Parallel Computing Lab

Practical 6(f)
Aim:
To write a CUDA program for vector multiplication
Program:
#include <stdio.h>
#include <cuda.h>
#include <sys/time.h>
#include <assert.h>
__global__ void mult_vect(float * x, float * y, float * z, int n)
{
int idx= blockIdx.x * blockDim.x + threadIdx.x;
if(idx < n)
{
z[idx] = x[idx] * y[idx];
}
}
int main()
{
float *x_h, *y_h, *z_h;
float *x_d, *y_d, *z_d;
int n= 20,i;
size_t size= n * sizeof(float);
struct timeval TimeValue_Start;
struct timezone TimeZone_Start;
struct timeval TimeValue_Final;
struct timezone TimeZone_Final;
long
time_start, time_end;
double
time_overhead;
/* allocating memory on CPU */
x_h= (float *)malloc(size);
y_h= (float *)malloc(size);
z_h= (float *)malloc(size);
Veermata Jijabai Technological Institute

Page 45

Parallel Computing Lab

/* allocating memory on Device */

cudaMalloc( (void**)&x_d, size );
cudaMalloc( (void**)&y_d, size );
cudaMalloc( (void**)&z_d, size );
/* Initialization of Vectors */
for(i=0; i < n; i++)
{
x_h[i]= (float) i;
y_h[i]= (float) i;
}
/* Copying from Host to Device */
cudaMemcpy(x_d, x_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(y_d, y_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(z_d, z_h, size, cudaMemcpyHostToDevice);
int block_size= 4;
int num_blocks= (n + block_size - 1) / block_size;
gettimeofday(&TimeValue_Start, &TimeZone_Start);
/* kernel launching */
mult_vect <<<num_blocks, block_size>>> (x_d, y_d, z_d, n);
gettimeofday(&TimeValue_Final, &TimeZone_Final);
/* Copying from Device to Host */
cudaMemcpy(x_h, x_d, size, cudaMemcpyDeviceToHost);
cudaMemcpy(y_h, y_d, size, cudaMemcpyDeviceToHost);
cudaMemcpy(z_h, z_d, size, cudaMemcpyDeviceToHost);
time_end = TimeValue_Final.tv_sec * 1000000 + TimeValue_Final.tv_usec;
time_start = TimeValue_Start.tv_sec * 1000000 + TimeValue_Start.tv_usec;
time_overhead = (time_end - time_start)/1000000.0;
/* Checking whether the result is correct or not */
for(i = 0; i < n ; i++)
assert(z_h[i]==(x_h[i] * y_h[i]));
Veermata Jijabai Technological Institute

Page 46

Parallel Computing Lab

printf("\n\t\t Time in Seconds (T)

: %lf\n\n",time_overhead);

free(x_h);
free(y_h);
free(z_h);
cudaFree(x_d);
cudaFree(y_d);
cudaFree(z_d);
return 0;
}
Conclusion: Thus, I have implemented CUDA program for vector multiplication

Veermata Jijabai Technological Institute

Page 47

Parallel Computing Lab

Practical 7
Case study for Differentiate between CUDA
programming and OPEN CL programming
Performance
The first feature is Performance. Both CUDA and OpenCL are fast, and on GPU
devices they are much faster than the CPU for data-parallel codes, with 10X speedups
commonly seen on data-parallel problems.
Both CUDA and OpenCL can fully utilize the hardware. Performance depends upon a
slew of variables, including hardware type, algorithm type, and code quality.
Scalability
With respect to Scability, there are some other interesting developments of note. The
first is that there is new technology in CUDA called GPUDirect that is aimed at
reducing memory transfer overheads when communicating between multiple GPUs.
It has optimizations to reduce overhead by allowing peer-to-peer memory transfers
between GPUs on the same PCI express bus. It also has optimization to reduce the
overhead of moving data from GPU memory to a network interface card. This is
certainly an interesting development, but it is too new for us to say if it offers enough
benefit to be an important technology.
The second interesting development is in mobile GPU computing. OpenCL has
quickly become the most pervasive way to do GPU computing on mobile devices,
including smartphones and tablets. Companies like ARM, Imagination Technologies,
Freescale, Qualcomm, Samsung, and others are all enabling their mobile GPUs to run
OpenCL codes. There are more mobile devices sold each year than there are PCs, so
this is a huge community that is beginning to put its support behind OpenCL. At
AccelerEyes, we have done several GPU consulting projects on mobile GPUs and are
believers that there is big benefit to accelerating apps, especially computer vision and
video processing apps, directly on the phone or tablet.
Portability
The third feature is Portability. This is perhaps the most reconizable difference
between CUDA and OpenCL. CUDA only runs on NVIDIA GPUs, while OpenCL is
the open industry standard and runs on AMD, Intel, NVIDIA, and other hardware
devices.
Veermata Jijabai Technological Institute

Page 48

Parallel Computing Lab

Also, with respect to portability, CUDA does not provide CPU fallback. Currently,
developers using CUDA typically put if-statements in their code that distinguish
between the presense or absense of a GPU device at runtime. In contract, with
OpenCL, CPU fallback is supported and makes code maintenance much easier.
Community
The fourth feature is Community. This is the feature that encompasses support,
longevity, commitment, etc. As those things are hard to measure, we put together a
proxy. It is interesting to look at the number of forum topics on NVIDIAs CUDA
forms at nearly 27,000 and AMDs OpenCL forums at 4,000. Also, the neutral 3rd
party site Stackoverflow has tags for CUDA and OpenCL, with the number of CUDA
tags being over 3X the number of OpenCL tags. As you would expect, there are many
more people doing CUDA programming today due to the great investment NVIDIA
has put into building the ecosystem for GPU computing.
Programmability
The fifth and final feature is Programmability. Both CUDA and OpenCL are lowlevel. It is time consuming to do GPU kernel development in either of those
platforms. The bulk of that time is often spent in redesigning algorithms to exploit
data-parallelism.
Libraries really make all the difference in GPU computing. To compare and contract
CUDA versus OpenCL, it is important to look at the comparison of libraries
available. OpenCL has a better library support as compared to that of CUDA

Veermata Jijabai Technological Institute

Page 49

Parallel Computing Lab

Mini ProjectDijkstrer

Veermata Jijabai Technological Institute

Page 50

Parallel Computing Lab

Project Name:
Implementation of Dijkstras Algorithm in MPI
Theory:
Dijkstrer is a C application which calculate the shortest path of a graph using
Dijkstra's Algorithm. Dijkstrer uses MPI library to implement cluster functionality
where every member of the cluster calculates each of a node's childs. Dijkstrer will
use each process to calculate each child of every node. This means that when a node
has fewer childs connected than the number of processes, then some process( the
highers ) will not be executed.Use number of processes equal to the maximum
number of childs connected to a node.
There are many ways of implementing graphs. Dijkstrer uses two structures to
represent nodes and acnes. Each node keeps a list with references to every acne
connected to him. It's acne has a node member which points to the node connected.
By this way we are able to represent a path. By using recursion we are travelling
through the shortest path and calculating the childs of each node.
Compile & Run:
Dijkstrer requires MPICH2 library to be installed. Once you have installed it you can
compile Dijkstrer using the C compiler of your choice.
Example of compiling under Windows using gcc:
gcc -o dijkstrer.exe -I "C:\path\to\MPICH2\include" -L
"C:\path\to\MPICH2\lib" dijkstrer.c "C:\path\to\MPICH2\lib\mpi.lib
Then, once you have your cluster app and running copy the executable to each of the
cluster's members and run it. Dijkstrer will read the graph from a file defined in the
cmd or the default one which is dijkstra.dat.
Files content should be in the format:
nN
startNode endNode distance
startNode endNode distance
.
.
.
startNode endNode distance
where n is the total number of Nodes and N is the total number of Acnes.
The file must also be copied in every cluster's member in the same path.
Veermata Jijabai Technological Institute

Page 51

Parallel Computing Lab

Example of running Dijkstrer under Windows using MPICH2's mpiexec:

mpiexec -hosts 2 <host1> 1 <host2> 2 c:\dijkstrer.exe c:\dijkstra.dat
Don't forget to copy the executables and the data file.
Each of your cluster's member will create a process which will calculate a child in
every node.
Program:
#include "mpich2/mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct _node node;
typedef struct _acne acne;
int myid, numprocs;
/*
* There are two structures.
* 'Node' represent each node having childrens acnes represent each connected to him
nodes.
* 'Acne' represent each acne connecting two Nodes.
*/
struct _node {
int distance; // distance from the first node
struct _acne *children[100]; // going to nodes
int edgeno; // how many nodes are connected to
int name;
};
struct _acne {
int value; // distance between these edges
int from; // acne connecting from
int to; // to
struct _node *edge;
};
// Get the node with the minimum distance so we can go on #Dijkstra's Algorithm
int getMinDistanceNode(struct _acne *array[], int size){
Veermata Jijabai Technological Institute

Page 52

Parallel Computing Lab

int minnum=array[0]->edge->distance, minnode=0, j;
for(j=0; j< size;j++){
if(array[j]->edge->distance<minnum){
minnode = j;
minnum = array[j]->edge->distance;
}
}
return minnode;
}
// Get the maximum number of childs connented to one node
int getMaxEdgeNo(struct _node array[], int size){
int maxnum=array[0].edgeno, j;
for(j=0; j< size;j++){
if(array[j].edgeno > maxnum){
maxnum = array[j].edgeno;
}
}
return maxnum;
}
/*
* For every node we are going in (the parameter), we search its connected childs
nodes and
* calculate their distances. then we are going in again( recursion ) in the node with
the
* minimum distance
*
* */
int traverse(struct _node *node){
// the node with edgeno=0 is the last node
if(node->edgeno<=0)
return 0;
if (myid == 0) {
Veermata Jijabai Technological Institute

Page 53

Parallel Computing Lab

printf("start of node %d\n", node->name);
fflush(stdout);
printf("\tchildrens: %d\n", node->edgeno);
fflush(stdout);
}
int i;
// Since in every node, every process calculates each child, we don't want more
processes that the number of childs of each node to be executed
if(myid<node->edgeno){
// Dijkstra's algorithm
if(node->distance + node->children[myid]->value < node>children[myid]->edge->distance ){
node->children[myid]->edge->distance = node->distance + node>children[myid]->value;
}
// Every process should broadcast her results to the others, so they can
collaborate peacefully
// Let the world know my results
for (i = 0; i < node->edgeno; i++) {
MPI_Bcast( &node->children[i]->edge->distance, 1, MPI_INT, i,
MPI_COMM_WORLD );
}
printf("\tprocess id %d calculates node %d with distance %d\n", myid,
node->children[myid]->edge->name, node->children[myid]->edge->distance);
fflush(stdout);
}
if (myid == 0) {
printf("end of node\n-----------------------\n");
fflush(stdout);
printf("\tminnode :%d\n", node->children[getMinDistanceNode(node>children, node->edgeno)]->edge->name);
fflush(stdout);
Veermata Jijabai Technological Institute

Page 54

Parallel Computing Lab

}
// Called recursion
traverse(node->children[getMinDistanceNode(node->children, node>edgeno)]->edge);
return 0;
}
int main(int argc, char *argv[]){
// MPI initialization
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
FILE *ifp;
int from, to, distance;
int nodesNo = 0;
int edgesNo = 0;
int acnesCtr = 0; // count the acnes
int previous = 0;
int childsCtr = 0; // count childs for every node
char *file;
if(argc==2)
file = argv[argc-1];
else
file = "dijkstrer.dat";
ifp = fopen(file, "r");
if (ifp == NULL) {
fprintf(stderr, "Can't open input file!\n");
MPI_Finalize();
exit(1);
}
/*
* Files content should be in the format
*nN
* startNode endNode distance
* startNode endNode distance
Veermata Jijabai Technological Institute

Page 55

Parallel Computing Lab

*.
*.
*.
* startNode endNode distance
*
* where n is the total number of Nodes and N is the total number of Acnes
*
*/
// read the first line
fscanf(ifp, "%d %d", &nodesNo, &edgesNo);
struct _node nodes[nodesNo];
struct _acne acnes[edgesNo];
while (fscanf(ifp, "%d %d %d", &from, &to, &distance) == 3) { /* read from
file */
/*
* Then for every line we create a node and also
* an acne to place the connected edges.
*
*/
if (previous != from)
childsCtr = 0;
nodes[from].distance = 65000; // infinite
nodes[from].name = from;
acnes[acnesCtr].from = from;
acnes[acnesCtr].to = to;
acnes[acnesCtr].value = distance;
acnes[acnesCtr].edge = &nodes[to];
nodes[from].children[childsCtr] = &acnes[acnesCtr];
nodes[from].edgeno = childsCtr + 1;
acnesCtr++;
childsCtr++;
previous = from;
}
Veermata Jijabai Technological Institute

Page 56

Parallel Computing Lab

// set firsts node's distance
nodes[0].distance = 0;
// set last node
nodes[nodesNo - 1].distance = 65000;
nodes[nodesNo - 1].edgeno = 0;
nodes[nodesNo - 1].name = nodesNo - 1;
//
int maxnum = getMaxEdgeNo(nodes, sizeof(nodes)/sizeof(*nodes));
if(numprocs<maxnum){
fprintf(stderr, "Number of processes should be greater that the maximum
number of edges: %d!\n", maxnum);
MPI_Finalize();
exit(1);
}
// get in calculation
traverse(&nodes[0]);
// print out the results
if(myid == 0){
int j;
for(j=0; j< (sizeof(nodes)/sizeof(*nodes));j++){
printf("Node: %d, Distance from the root node: %d, Number of
edges: %d\n", nodes[j].name, nodes[j].distance, nodes[j].edgeno);
fflush(stdout);
}
}
MPI_Finalize();
return 0;
}
Conclusion:
Thus, Mini project - Dijkstrer is implemented using the MPI

Veermata Jijabai Technological Institute

Page 57

Apache Cassandra Administrator Associate - Exam Practice Tests
From Everand
Apache Cassandra Administrator Associate - Exam Practice Tests
Cristian Scutaru
No ratings yet
BOM F1-Pro v1.1
No ratings yet
BOM F1-Pro v1.1
1 page
Daa M-4
No ratings yet
Daa M-4
28 pages
University of Mumbai Bachelor of Management Studies (B.M.S)
No ratings yet
University of Mumbai Bachelor of Management Studies (B.M.S)
18 pages
722.6 Transmission Fluid and Filter Change
No ratings yet
722.6 Transmission Fluid and Filter Change
6 pages
Fiber Additives in Asphalt Mixtures
No ratings yet
Fiber Additives in Asphalt Mixtures
67 pages
C Technical
No ratings yet
C Technical
118 pages
Institute of Engineering Studies (IES Bangalore) Cse Mock Test-I
100% (1)
Institute of Engineering Studies (IES Bangalore) Cse Mock Test-I
12 pages
Erlang C Table PDF
0% (1)
Erlang C Table PDF
2 pages
TCS NQT Model Programming Coding Questions Paper
No ratings yet
TCS NQT Model Programming Coding Questions Paper
7 pages
Matlab File - Deepak - Yadav - Bca - 4TH - Sem - A50504819015
No ratings yet
Matlab File - Deepak - Yadav - Bca - 4TH - Sem - A50504819015
59 pages
ADA-Unit 1
100% (1)
ADA-Unit 1
49 pages
Chapter 6 Iot Systems Logical Design Using Python
No ratings yet
Chapter 6 Iot Systems Logical Design Using Python
33 pages
DAA R19 - Unit-5
No ratings yet
DAA R19 - Unit-5
12 pages
Coding Question
No ratings yet
Coding Question
36 pages
Web Services Lab Manual
No ratings yet
Web Services Lab Manual
6 pages
Raspberry Pi Int
No ratings yet
Raspberry Pi Int
95 pages
Lab Manual: Subject
No ratings yet
Lab Manual: Subject
48 pages
ADA Lab Manual
No ratings yet
ADA Lab Manual
34 pages
MCQ Camu MMS
No ratings yet
MCQ Camu MMS
5 pages
Difference Between OOP & Procedural Programming
100% (1)
Difference Between OOP & Procedural Programming
4 pages
Lab 5 - Mutual Exclusion
No ratings yet
Lab 5 - Mutual Exclusion
8 pages
PHP Variables
No ratings yet
PHP Variables
4 pages
PHP (Hypertext Pre-Processor)
No ratings yet
PHP (Hypertext Pre-Processor)
10 pages
Unit 2: Graphics Programming The Opengl: by Shubha Raj K.B
No ratings yet
Unit 2: Graphics Programming The Opengl: by Shubha Raj K.B
146 pages
PO-CO MAPPING FOR TOM - 5th SEM
No ratings yet
PO-CO MAPPING FOR TOM - 5th SEM
3 pages
Google App Engine
No ratings yet
Google App Engine
10 pages
CST308 - KQB KtuQbank
No ratings yet
CST308 - KQB KtuQbank
13 pages
Ec 467 Pattern Recognition
No ratings yet
Ec 467 Pattern Recognition
2 pages
CS402 Data Mining and Warehousing PDF
No ratings yet
CS402 Data Mining and Warehousing PDF
3 pages
Compiler Design Lab Manual
No ratings yet
Compiler Design Lab Manual
36 pages
Data Copy in Copy Out
No ratings yet
Data Copy in Copy Out
2 pages
APMC Prachi Synopsis
No ratings yet
APMC Prachi Synopsis
6 pages
Ethical Hacking PDF
No ratings yet
Ethical Hacking PDF
1 page
BSC It/Cs Department of Rizvi College of Arts, Science and Commerce University Questions With Answer
No ratings yet
BSC It/Cs Department of Rizvi College of Arts, Science and Commerce University Questions With Answer
25 pages
C# Program To Demonstrate Multilevel Inheritance
No ratings yet
C# Program To Demonstrate Multilevel Inheritance
5 pages
TCS CodeVita 2017
No ratings yet
TCS CodeVita 2017
12 pages
Mesh Warping
No ratings yet
Mesh Warping
6 pages
Ada Lab Manual
No ratings yet
Ada Lab Manual
64 pages
Multiple Choice Questions Related To Testing Knowledge About Time and Space Complexity of A Program - Tutorial - CodeChef Discuss
No ratings yet
Multiple Choice Questions Related To Testing Knowledge About Time and Space Complexity of A Program - Tutorial - CodeChef Discuss
59 pages
PL SQL Lab Work
No ratings yet
PL SQL Lab Work
10 pages
CFG To PDA Conversion
No ratings yet
CFG To PDA Conversion
11 pages
Chapter 1 PDF
No ratings yet
Chapter 1 PDF
127 pages
Ecs Cse 7thsem Unit 1 For VTU, Belgaum
No ratings yet
Ecs Cse 7thsem Unit 1 For VTU, Belgaum
81 pages
Time: 3 Hours Total Marks: 100: Printed Page 1 of 2 Sub Code:KNC302
No ratings yet
Time: 3 Hours Total Marks: 100: Printed Page 1 of 2 Sub Code:KNC302
2 pages
AJ - Lab Manual
No ratings yet
AJ - Lab Manual
97 pages
Practical Lab File Based ON Programing in C: Submitted by
No ratings yet
Practical Lab File Based ON Programing in C: Submitted by
6 pages
CS8381 Data Structures Record
No ratings yet
CS8381 Data Structures Record
107 pages
Industrial Extreme Programming: Submitted By: Group 3 Submitted To
No ratings yet
Industrial Extreme Programming: Submitted By: Group 3 Submitted To
7 pages
Chapter 3 - Greedy Method
No ratings yet
Chapter 3 - Greedy Method
24 pages
Compiler-Design Notes
No ratings yet
Compiler-Design Notes
5 pages
MD5 Algorithm in Cryptography and Network Security
No ratings yet
MD5 Algorithm in Cryptography and Network Security
8 pages
Design and Analysis of Algorithms: Time Space Trade Off
No ratings yet
Design and Analysis of Algorithms: Time Space Trade Off
6 pages
Ut1 Java - 1460182308240
No ratings yet
Ut1 Java - 1460182308240
16 pages
006 Practical List of DM-2023
No ratings yet
006 Practical List of DM-2023
1 page
Job Scheduling1
No ratings yet
Job Scheduling1
36 pages
Textbook of Engineering Chemistry
From Everand
Textbook of Engineering Chemistry
C. Parameswara Murthy
No ratings yet
Parallel Computing Lab Manual PDF
No ratings yet
Parallel Computing Lab Manual PDF
51 pages
Code: First Method:: (1) Write A C Program Using Open MP To Estimate The Value of PI (Use Minimum Two Methods)
No ratings yet
Code: First Method:: (1) Write A C Program Using Open MP To Estimate The Value of PI (Use Minimum Two Methods)
8 pages
Presentation2 HS OpenMP
No ratings yet
Presentation2 HS OpenMP
29 pages
High Performance Computing (HPC) - Lec3
No ratings yet
High Performance Computing (HPC) - Lec3
35 pages
MPI_tutorial_Fall_Break_2022
No ratings yet
MPI_tutorial_Fall_Break_2022
60 pages
Lecture 10 Shared Memory Programming with OpenMP.pptx
No ratings yet
Lecture 10 Shared Memory Programming with OpenMP.pptx
30 pages
First Independence War of India: Lord Dalhousie
No ratings yet
First Independence War of India: Lord Dalhousie
5 pages
Mantra From Rigveda
100% (2)
Mantra From Rigveda
1 page
Syllabi of Junior Engineer
No ratings yet
Syllabi of Junior Engineer
7 pages
Basel Committee On Banking
No ratings yet
Basel Committee On Banking
4 pages
Skin Cancer Project Report New
0% (1)
Skin Cancer Project Report New
13 pages
UPSC Preliminary Examination Syllabus 2015
No ratings yet
UPSC Preliminary Examination Syllabus 2015
2 pages
Practical 2: Cloud Computing and Network Security Lab 2015
No ratings yet
Practical 2: Cloud Computing and Network Security Lab 2015
5 pages
Exploring Methods To Improve Edge Detection With Canny Algorithm
No ratings yet
Exploring Methods To Improve Edge Detection With Canny Algorithm
21 pages
Secure Authentication and Access Control Systems
No ratings yet
Secure Authentication and Access Control Systems
17 pages
Experiment 4 Aim:: Service Oriented Architecture Lab, 2014-15
No ratings yet
Experiment 4 Aim:: Service Oriented Architecture Lab, 2014-15
4 pages
Hacking
No ratings yet
Hacking
5 pages
Practical No. 01: Name: Vikram Bhaktabahadur Karki Roll No: 121080907
No ratings yet
Practical No. 01: Name: Vikram Bhaktabahadur Karki Roll No: 121080907
3 pages
Multiple Chat Application
No ratings yet
Multiple Chat Application
1 page
CN LAB Vjti
No ratings yet
CN LAB Vjti
15 pages
Mokhada Icds Project Latest Updated
No ratings yet
Mokhada Icds Project Latest Updated
168 pages
Mangesh Software Testing Lab Manual
No ratings yet
Mangesh Software Testing Lab Manual
24 pages
SE - DBS Lab Vjti
No ratings yet
SE - DBS Lab Vjti
9 pages
Product Design Example Coursework
100% (2)
Product Design Example Coursework
5 pages
Function Based Index and Column Statistics
No ratings yet
Function Based Index and Column Statistics
6 pages
Basics of Content Writing
No ratings yet
Basics of Content Writing
6 pages
Crypto Introduction
No ratings yet
Crypto Introduction
26 pages
Tata Aria With 36 Features Lat
No ratings yet
Tata Aria With 36 Features Lat
88 pages
Cements For Long Term Isolation PDF
No ratings yet
Cements For Long Term Isolation PDF
14 pages
Process Compressors-1
No ratings yet
Process Compressors-1
11 pages
Failure Analysis and Prevention
No ratings yet
Failure Analysis and Prevention
25 pages
MTN Data I Ing: People Also Ask
No ratings yet
MTN Data I Ing: People Also Ask
1 page
Application Form Unilever General 2014v1
No ratings yet
Application Form Unilever General 2014v1
9 pages
Hybrid Fixed Rotary Wing UAV
No ratings yet
Hybrid Fixed Rotary Wing UAV
15 pages
Unit-4: Define The Domain For Clustering
No ratings yet
Unit-4: Define The Domain For Clustering
13 pages
ETABS 21.2 results losas
No ratings yet
ETABS 21.2 results losas
6 pages
Inventory Control
100% (3)
Inventory Control
3 pages
D.A.M. Quick VSi
No ratings yet
D.A.M. Quick VSi
1 page
Ontrac Coin Presentation-3
No ratings yet
Ontrac Coin Presentation-3
11 pages
MRM Guide e
100% (1)
MRM Guide e
125 pages
Electrical Engineering For Sustainable and Renewable Energy MSC
No ratings yet
Electrical Engineering For Sustainable and Renewable Energy MSC
2 pages
Diaphragm Pressure Gauge For The Process Industry Models 432.50, 433.50, Up To 10-Fold Overload Safety, Max. 40 Bar
No ratings yet
Diaphragm Pressure Gauge For The Process Industry Models 432.50, 433.50, Up To 10-Fold Overload Safety, Max. 40 Bar
4 pages
FLSmidth CrossBar Cooler v2
No ratings yet
FLSmidth CrossBar Cooler v2
8 pages
xk12emotor_iom_en
No ratings yet
xk12emotor_iom_en
13 pages
Nafteeday Gacalo9
No ratings yet
Nafteeday Gacalo9
3 pages
Improving Die Design
No ratings yet
Improving Die Design
1 page
Quiz 4
No ratings yet
Quiz 4
2 pages
Gottwald Products
100% (1)
Gottwald Products
4 pages
Chevrolet Impala (1980)
100% (1)
Chevrolet Impala (1980)
428 pages
Using Others Work
No ratings yet
Using Others Work
2 pages