
Parallel Computing Lab

Practical 1
Introduction to OpenMP

OpenMP (Open Multi-Processing) is an API that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran on most processor architectures and operating systems, including Solaris, AIX, HP-UX, Linux, Mac OS X, and Windows platforms.

It consists of a set of compiler directives, library routines, and environment variables that
influence run-time behavior.

OpenMP uses a portable, scalable model that gives programmers a simple and flexible
interface for developing parallel applications for platforms ranging from the standard
desktop computer to the supercomputer.

OpenMP is an implementation of multithreading, a method of parallelizing whereby a master thread (a series of instructions executed consecutively) forks a specified number of slave threads and a task is divided among them.

The threads then run concurrently, with the runtime environment allocating threads to
different processors.

The section of code that is meant to run in parallel is marked accordingly, with a
preprocessor directive that will cause the threads to form before the section is executed.

Each thread has an id attached to it which can be obtained using a function (called
omp_get_thread_num()).

The thread id is an integer, and the master thread has an id of 0.

After the execution of the parallelized code, the threads join back into the master thread,
which continues onward to the end of the program.

By default, each thread executes the parallelized section of code independently.

Work-sharing constructs can be used to divide a task among the threads so that each thread
executes its allocated part of the code. Both task parallelism and data parallelism can be
achieved using OpenMP in this way.

The runtime environment allocates threads to processors depending on usage, machine load
and other factors.

The number of threads can be assigned by the runtime environment based on environment
variables or in code using functions.

The OpenMP functions are included in a header file labelled omp.h in C/C++.
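As an illustration of the last few points, a minimal sketch (not part of the original manual) that sets the thread count in code and uses the for work-sharing construct to divide a loop among the threads:

#include<stdio.h>
#include<omp.h>
int main(){
int i, sum = 0;
omp_set_num_threads(4); /* request 4 threads from within the code */
/* the for work-sharing construct splits the loop iterations among the threads */
#pragma omp parallel for reduction(+:sum)
for(i = 1; i <= 100; i++){
sum += i;
}
printf("Sum of 1..100 computed by %d threads = %d\n", omp_get_max_threads(), sum);
return 0;
}

The reduction clause gives each thread a private copy of sum and combines the copies at the end, so the result is always 5050 regardless of how the iterations are divided.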


The output may also be garbled because of the race condition caused by multiple threads
sharing the standard output.

Fork Join Model


Program:
#include<stdio.h>
#include<omp.h>
int main(){
int nthreads, tid;
/* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(nthreads, tid)
{
/* Obtain thread number */
tid = omp_get_thread_num();
printf("Hello World from thread = %d\n", tid);
/* Only master thread does this */
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
} /* All threads join master thread and disband */
}
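The program can be compiled with any OpenMP-capable compiler; with GCC, assuming the source file is saved as hello_omp.c, a typical build and run is:
gcc -fopenmp hello_omp.c -o hello_omp
./hello_omp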


Output:

Conclusion:
Thus, we have implemented a simple OpenMP program using threads in C.


Practical 2
Aim:
To write a C program to solve the producer-consumer problem using OpenMP threads.
Theory:
In this program, the master thread acts as a producer while the other threads wait until the master thread fills the buffer; once the data is added, the master notifies the other threads through a shared variable, and all the other threads consume the data.
Program:
#include<stdio.h>
#include<omp.h>
int main()
{
int i=0;
int x=0;
#pragma omp parallel shared(i)
{
if(omp_get_thread_num()==0)
{
printf("Master thread with Thread ID:%d\n", omp_get_thread_num());
printf("Since it is the producer thread It is adding some data to be consumed by other consumer
threads\n");
i+=10;
x=1;
#pragma omp flush(i,x) /* make the new data and the flag visible to the consumer threads */
} else
{
while(x==0){
#pragma omp flush(x) /* re-read the shared flag so the wait loop sees the producer's update */
printf("Waiting for buffer to be filled. Thread ID: %d\n",omp_get_thread_num());
}
#pragma omp critical
{
if(i>0){
printf("Data is consumed by Consumer with Thread ID: %d\n",omp_get_thread_num());
i-=5;
} else {
printf("Could not find any data for thread ID: %d\n",omp_get_thread_num());
}}
}
}
}



Output :

Conclusion:
Thus, we have implemented and studied the producer-consumer problem using OpenMP.


OpenMP
Programming


Practical 3(a)
Aim:
To write an OpenMP program for matrix-matrix multiplication.
Program:
#include<stdio.h>
#include<stdlib.h> /* for malloc */
#include<omp.h>
int main(){
int i,j,k,m,n,p;
printf("Enter the number of rows in Matrix 1:");
scanf("%d",&m);
int *matrixA[m];
printf("Enter the number of columns in Matrix 1:");
scanf("%d",&n);
for(i=0;i<m;i++){
matrixA[i] = (int *)malloc(n*sizeof(int));
}
printf("<--Now Input the values for matrix 1 row-wise-->\n");
for(i=0;i<m;i++){
for(j=0;j<n;j++){
scanf("%d",&matrixA[i][j]);
}
}
printf("Enter the number of columns in Matrix 2:");
scanf("%d",&p);
int *matrixB[n];
for(i=0;i<n;i++){
matrixB[i] = (int *)malloc(p*sizeof(int));
}
printf("<--Now Input the values for matrix 2 row-wise-->\n");
for(i=0;i<n;i++){
for(j=0;j<p;j++){
scanf("%d",&matrixB[i][j]);
}
}
int matrixC[m][p];
#pragma omp parallel private(i,j,k) shared(matrixA,matrixB,matrixC)
{
#pragma omp for schedule(static)
for (i=0; i<m; i=i+1){
for (j=0; j<p; j=j+1){
matrixC[i][j] = 0;
for (k=0; k<n; k=k+1){
matrixC[i][j]=(matrixC[i][j])+((matrixA[i][k])*(matrixB[k][j]));
}
}
}
}
printf("The output after Matrix Multiplication is: \n");
for(i=0;i<m;i++){
for(j=0;j<p;j++)
printf("%d \t",matrixC[i][j]);
printf("\n");
}
return 0;
}
Output:

Conclusion: Thus, I have implemented an OpenMP program for matrix multiplication.


Practical 3(b)
Aim:
To write an OpenMP program that finds the prime numbers between 2 and a given
number N and stores all the prime numbers in an array.
Program:
#include<stdio.h>
#include<omp.h>
int IsPrime(int number) {
int i;
for (i = 2; i < number; i++) {
if (number % i == 0 && i != number) return 0;
}
return 1;
}
int main(){
int noOfThreads,valueN,indexCount=0,arrayVal[10000],tempValue;
printf("Enter the Number of threads: ");
scanf("%d",&noOfThreads);
printf("Enter the value of N: ");
scanf("%d",&valueN);
omp_set_num_threads(noOfThreads);
#pragma omp parallel for
for(tempValue=2;tempValue<=valueN;tempValue++){
if(IsPrime(tempValue)){
/* the shared index must be updated by one thread at a time so each prime gets its own slot */
#pragma omp critical
{
arrayVal[indexCount] = tempValue;
indexCount++;
}
}
}
printf("Number of prime numbers between 2 and %d: %d\n",valueN,indexCount);
return 0;
}



Output:

Conclusion: Thus, I have implemented an OpenMP program for finding prime numbers
between 2 and N.


Practical 3(c)
Aim:
To write an OpenMP program to find the largest element in an array.
Program:
#include<stdio.h>
#include<omp.h>
int main(){
int numberOfElements,currentMax=-1,iIterator,arrayInput[10000]; /* currentMax = -1 assumes the inputs are non-negative */
printf("Enter the Number of Elements: ");
scanf("%d",&numberOfElements);
for(iIterator=0;iIterator<numberOfElements;iIterator++){
scanf("%d",&arrayInput[iIterator]);
}
#pragma omp parallel for shared(currentMax)
for(iIterator=0;iIterator<numberOfElements;iIterator++){
#pragma omp critical
if(arrayInput[iIterator] > currentMax){
currentMax = arrayInput[iIterator];
}
}
printf("The Maximum Element is: %d\n",currentMax);
return 0;

}

Output:

Conclusion: Thus, I have implemented an OpenMP program for finding the largest element
in an array.


Practical 3(d)
Aim:
To write an OpenMP program for PI calculation
Program:
#include<stdio.h>
#include<omp.h>
int main(){
int num_steps=10000,i;
double aux,pi,step = 1.0/(double) num_steps,x=0.0,sum = 0.0;
#pragma omp parallel private(i,x,aux) shared(sum)
{
#pragma omp for schedule(static)
for (i=0; i<num_steps; i=i+1){
x=(i+0.5)*step;
aux=4.0/(1.0+x*x);
#pragma omp critical
sum = sum + aux;
}
}
pi=step*sum;
printf("The Value of PI is %lf\n",pi);
return 0;
}
Output:

Conclusion: Thus, I have implemented an OpenMP program for calculating the value of PI.


MPI Programming


Practical 4(a)
Aim:
To write a simple MPI program for finding the rank of each process and the number of processes.
Program:
#include <mpi.h>
#include <stdio.h>
int main (int argc, char* argv[])
{
int rank, size;
MPI_Init (&argc, &argv);
/* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
/* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size);
/* get number of processes */
printf( "Hello world from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}
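A typical way to build and launch the program, assuming the source file is saved as rank.c and an MPICH or Open MPI installation provides mpicc and mpiexec:
mpicc rank.c -o rank
mpiexec -n 4 ./rank
Each of the 4 launched processes prints its own rank, so the output lines may appear in any order.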
Output:

Conclusion: Thus, I have implemented an MPI program for finding the rank of each process
and the number of processes.


Practical 4(b)
Aim:
To write an MPI program for PI calculation.
Program:
#include "mpi.h"
#include <stdio.h>
#include <math.h>
int main( int argc, char *argv[] )
{
int n, myid, numprocs, i;
double PI25DT = 3.141592653589793238462643;
double mypi, pi, h, sum, x;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
while (1) {
if (myid == 0) {
printf("Enter the number of intervals: (0 quits) ");
scanf("%d",&n);
}
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (n == 0)
break;
else {
h = 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs) {
x = h * ((double)i - 0.5);
sum += (4.0 / (1.0 + x*x));
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
MPI_COMM_WORLD);
if (myid == 0)
printf("pi is approximately %.16f, Error is %.16f\n",
pi, fabs(pi - PI25DT));
}
}
MPI_Finalize();
return 0;
}
Output:

Conclusion: Thus, I have implemented an MPI program for calculating the value of PI.


Practical 4(c)
Aim:
To write an advanced MPI program with a total of 4 processes. The process with rank 0
should send the letters 'V J T I' to all the processes using the MPI_Scatter call.
Program:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define SIZE 4
int main (int argc, char *argv[])
{
int numtasks, rank, sendcount, recvcount, source;
char sendbuf[SIZE][SIZE] = {
{'V','J','T','I'},
{'V','J','T','I'},
{'V','J','T','I'},
{'V','J','T','I'}};
char recvbuf[SIZE];
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if (numtasks == SIZE) {
source = 0;
sendcount = SIZE;
recvcount = SIZE;
MPI_Scatter(sendbuf,sendcount,MPI_CHAR,recvbuf,recvcount,
MPI_CHAR,source,MPI_COMM_WORLD);
printf("rank= %d Results: %c %c %c %c\n",rank,recvbuf[0],
recvbuf[1],recvbuf[2],recvbuf[3]);
}
else
printf("Must specify %d processors. Terminating.\n",SIZE);
MPI_Finalize();
}
Output:

Conclusion: Thus, I have implemented an advanced MPI program that scatters 'VJTI'
from the root process to all the processes using the MPI_Scatter call.


Practical 4(d)
Aim:
To write an advanced MPI program to find the maximum value in an array of six integers
using 6 processes and print the result in the root process using the MPI_Reduce call.
Program:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[])
{
int rank,numtasks,array[6] = {100,600,300,800,250,720},i,inputNumber;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
printf("Local Input for process %d is %d\n",rank,array[rank]);
inputNumber = array[rank];
int maxNumber;
MPI_Reduce(&inputNumber, &maxNumber, 1, MPI_INT, MPI_MAX, 0,
MPI_COMM_WORLD);
// Print the result
if (rank == 0) {
printf("Maximum of all is: %d\n",maxNumber);
}
MPI_Finalize();
}


Output:

Conclusion: Thus, I have implemented an advanced MPI program for finding the
maximum of all the elements in an array of 6 elements using 6 processes, and
understood the use of the MPI_Reduce call.


Practical 4(e)
Aim:
To write an MPI program for a ring topology.
Program:
#include <stdio.h>
#include "mpi.h"
int main(int argc,char *argv[])
{
int MyRank, Numprocs, Root = 0;
int value, sum = 0;
int Source, Source_tag;
int Destination, Destination_tag;
MPI_Status status;
/* Initialize MPI */
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&Numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&MyRank);
if (MyRank == Root){
Destination = MyRank + 1;
Destination_tag = 0;
MPI_Send(&MyRank, 1, MPI_INT, Destination, Destination_tag,
MPI_COMM_WORLD);
}
else{
if(MyRank<Numprocs-1){
Source = MyRank - 1;
Source_tag = 0;
MPI_Recv(&value, 1, MPI_INT, Source, Source_tag,
MPI_COMM_WORLD, &status);
sum = MyRank + value;
Destination = MyRank + 1;
Destination_tag = 0;
MPI_Send(&sum, 1, MPI_INT, Destination, Destination_tag,
MPI_COMM_WORLD);
}
else{
Source = MyRank - 1;
Source_tag = 0;
MPI_Recv(&value, 1, MPI_INT, Source, Source_tag,
MPI_COMM_WORLD, &status);
sum = MyRank + value;
}
}
if (MyRank == Root)
{
Source = Numprocs - 1;
Source_tag = 0;
MPI_Recv(&sum, 1, MPI_INT, Source, Source_tag,
MPI_COMM_WORLD, &status);
printf("MyRank %d Final SUM %d\n", MyRank, sum);
}
if(MyRank == (Numprocs - 1)){
Destination = 0;
Destination_tag = 0;
MPI_Send(&sum, 1, MPI_INT, Destination, Destination_tag,
MPI_COMM_WORLD);
}
MPI_Finalize();
}


Output:

Conclusion: Thus, I have implemented an MPI program for a ring topology.


Numerical
Computing
Programming


Practical 5(a)
Aim:
To write a numerical computing program implementing the trapezoid rule with MPI.
Program:
#include <stdio.h>
/* We'll be using MPI routines, definitions, etc. */
#include <mpi.h>
void Get_data(int p, int my_rank, double* a_p, double* b_p, int* n_p);
double Trap(double local_a, double local_b, int local_n,
double h); /* Calculate local area */
double f(double x); /* function we're integrating */
int main(int argc, char** argv) {
int my_rank;       /* My process rank */
int p;             /* The number of processes */
double a;          /* Left endpoint */
double b;          /* Right endpoint */
int n;             /* Number of trapezoids */
double h;          /* Trapezoid base length */
double local_a;    /* Left endpoint my process */
double local_b;    /* Right endpoint my process */
int local_n;       /* Number of trapezoids for my calculation */
double my_area;    /* Integral over my interval */
double total;      /* Total area */
int source;        /* Process sending area */
int dest = 0;      /* All messages go to 0 */
int tag = 0;
MPI_Status status;
/* Let the system do what it needs to start up MPI */
MPI_Init(&argc, &argv);
/* Get my process rank */
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
/* Find out how many processes are being used */
MPI_Comm_size(MPI_COMM_WORLD, &p);
Get_data(p, my_rank, &a, &b, &n);
h = (b-a)/n; /* h is the same for all processes */
local_n = n/p; /* So is the number of trapezoids */
/* Length of each process' interval of
* integration = local_n*h. So my interval
* starts at: */
local_a = a + my_rank*local_n*h;
local_b = local_a + local_n*h;
my_area = Trap(local_a, local_b, local_n, h);
/* Add up the areas calculated by each process */
if (my_rank == 0) {
total = my_area;
for (source = 1; source < p; source++) {
MPI_Recv(&my_area, 1, MPI_DOUBLE, source, tag,
MPI_COMM_WORLD, &status);
total = total + my_area;
}
} else {
MPI_Send(&my_area, 1, MPI_DOUBLE, dest,
tag, MPI_COMM_WORLD);
}
/* Print the result */
if (my_rank == 0) {
printf("With n = %d trapezoids, our estimate\n",
n);
printf("of the area from %f to %f = %.15f\n",
a, b, total);
}
/* Shut down MPI */
MPI_Finalize();
return 0;
} /* main */
/*------------------------------------------------------------------
 * Function:    Get_data
 * Purpose:     Read in the data on process 0 and send to other processes
 * Input args:  p, my_rank
 * Output args: a_p, b_p, n_p
 */
void Get_data(int p, int my_rank, double* a_p, double* b_p, int* n_p) {
int q;
MPI_Status status;
if (my_rank == 0) {
printf("Enter a, b, and n\n");
scanf("%lf %lf %d", a_p, b_p, n_p);
for (q = 1; q < p; q++) {
MPI_Send(a_p, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD);
MPI_Send(b_p, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD);
MPI_Send(n_p, 1, MPI_INT, q, 0, MPI_COMM_WORLD);
}
} else {
MPI_Recv(a_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
MPI_Recv(b_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
MPI_Recv(n_p, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
} /* Get_data */
/*------------------------------------------------------------------
 * Function:    Trap
 * Purpose:     Estimate a definite area using the trapezoidal rule
 * Input args:  local_a (my left endpoint)
 *              local_b (my right endpoint)
 *              local_n (my number of trapezoids)
 *              h (stepsize = length of base of trapezoids)
 * Return val:  Trapezoidal rule estimate of area from local_a to local_b
 */
double Trap(
double local_a /* in */,
double local_b /* in */,
int local_n /* in */,
double h       /* in */) {
double my_area; /* Store my result in my_area */
double x;
int i;
my_area = (f(local_a) + f(local_b))/2.0;
x = local_a;
for (i = 1; i <= local_n-1; i++) {
x = local_a + i*h;
my_area = my_area + f(x);
}
my_area = my_area*h;
return my_area;
} /* Trap */
/*------------------------------------------------------------------
 * Function:    f
 * Purpose:     Compute value of function to be integrated
 * Input args:  x
 */
double f(double x) {
double return_val;
return_val = x*x + 1.0;
return return_val;
} /* f */


Output:

Conclusion: Thus, I have implemented an MPI program for the trapezoidal rule
(numerical computing problem).


Practical 5(b)
Aim:
To write a numerical computing program implementing a Gaussian filter with MPI.
Theory:
In electronics and signal processing, a Gaussian filter is a filter whose impulse
response is a Gaussian function (or an approximation to it). Gaussian filters have the
properties of having no overshoot to a step function input while minimizing the rise
and fall time. This behavior is closely connected to the fact that the Gaussian filter
has the minimum possible group delay. It is considered the ideal time domain filter,
just as the sinc is the ideal frequency domain filter. These properties are important in
areas such as oscilloscopes and digital telecommunication systems.
In two dimensions, it is the product of two such Gaussians, one per direction:

G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))

where x is the distance from the origin in the horizontal axis, y is the distance from
the origin in the vertical axis, and σ is the standard deviation of the Gaussian
distribution.
(Figure: shape of the impulse response of a typical Gaussian filter)
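The record does not include a program listing for this practical. As a minimal serial sketch of the core computation only (building and normalizing the 2D Gaussian mask; the kernel size and sigma below are arbitrary illustrative choices, not values from the original report), one could write:

#include <stdio.h>
#include <math.h>
#define KSIZE 5   /* kernel width, assumed odd */
#define SIGMA 1.0 /* standard deviation, assumed */
#define PI 3.14159265358979323846
int main(void){
double kernel[KSIZE][KSIZE], sum = 0.0;
int r = KSIZE/2, x, y;
/* G(x, y) = (1 / (2*pi*sigma^2)) * exp(-(x^2 + y^2) / (2*sigma^2)) */
for(x = -r; x <= r; x++)
for(y = -r; y <= r; y++){
kernel[x+r][y+r] = exp(-(x*x + y*y)/(2.0*SIGMA*SIGMA)) / (2.0*PI*SIGMA*SIGMA);
sum += kernel[x+r][y+r];
}
/* normalize so that the weights add up to 1, then print the mask */
for(x = 0; x < KSIZE; x++){
for(y = 0; y < KSIZE; y++){
kernel[x][y] /= sum;
printf("%0.6f ", kernel[x][y]);
}
printf("\n");
}
return 0;
}

Applying the mask to an image with MPI would typically distribute blocks of image rows among the processes, with each process convolving its own block, along the lines of the send/receive pattern used in Practical 5(a).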

Conclusion: Thus, I have studied the Gaussian filter and its implementation details.


CUDA
Programming


Practical 6(a)
Aim:
To write a simple CUDA Program for Hello World.
Theory:
CUDA stands for Compute Unified Device Architecture; it is a parallel computing
platform and programming model created by NVIDIA and implemented by the
graphics processing units (GPUs) that they produce. CUDA gives developers direct
access to the virtual instruction set and memory of the parallel computational
elements in CUDA GPUs.
Using CUDA, the GPUs can be used for general purpose processing (i.e., not
exclusively graphics); this approach is known as GPGPU. Unlike CPUs, however,
GPUs have a parallel throughput architecture that emphasizes executing many
concurrent threads slowly, rather than executing a single thread very quickly.
Program:
#include <stdio.h>
#include <cuda.h>
#include <sys/time.h>
#include <assert.h>
__global__ void kernel (void)
{
}
int main(void){
kernel<<<1,1>>>(); /* launch the empty kernel with 1 block of 1 thread */
printf("Hello, World\n");
return 0;
}
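Assuming the source file is saved as hello.cu and the CUDA toolkit is installed, the program can be compiled and run with:
nvcc hello.cu -o hello
./hello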
Conclusion: Thus, I have studied the basics of CUDA programming and implemented a
Hello World program.


Practical 6(b)
Aim:
To write a CUDA program for Matrix addition
Program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <string.h>
#include <cuda.h>
#include <assert.h>
const int N = 4;
const int blocksize = 2;
__global__ void add_matrix_on_gpu( float* a, float *b, float *c, int N )
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int index = i + j*N;
if ( i < N && j < N )
c[index] = a[index] + b[index];
}
void add_matrix_on_cpu(float *a, float *b, float *d)
{
int i;
for(i = 0; i < N*N; i++)
d[i] = a[i]+b[i];
}
int main()
{
float *a = new float[N*N];
float *b = new float[N*N];
float *c = new float[N*N];
float *d = new float[N*N];
for ( int i = 0; i < N*N; ++i ) {
a[i] = 1.0f; b[i] = 3.5f; }
/*

printf("Matrix A:\n");
for(int i=0; i<N*N; i++)
{
printf("\t%f",a[i]);
if((i+1)%N==0)
printf("\n");
}
printf("Matrix B:\n");
for(int i=0; i<N*N; i++)
{
printf("\t%f",b[i]);
if((i+1)%N==0)
printf("\n");
}

*/
struct timeval TimeValue_Start;
struct timezone TimeZone_Start;
struct timeval TimeValue_Final;
struct timezone TimeZone_Final;
long time_start, time_end;
double time_overhead;
float *ad, *bd, *cd;
const int size = N*N*sizeof(float);
cudaMalloc( (void**)&ad, size );
cudaMalloc( (void**)&bd, size );
cudaMalloc( (void**)&cd, size );
cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );
dim3 dimBlock( blocksize, blocksize );
dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
gettimeofday(&TimeValue_Start, &TimeZone_Start);
add_matrix_on_gpu<<<dimGrid, dimBlock>>>( ad, bd, cd, N );
gettimeofday(&TimeValue_Final, &TimeZone_Final);
cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );
add_matrix_on_cpu(a,b,d);
time_end = TimeValue_Final.tv_sec * 1000000 + TimeValue_Final.tv_usec;
time_start = TimeValue_Start.tv_sec * 1000000 + TimeValue_Start.tv_usec;
time_overhead = (time_end - time_start)/1000000.0;
/*

printf("result is:\n");
for(int i=0; i<N*N; i++)
{
printf("\t%f%f",c[i],d[i]);
if((i+1)%N==0)
printf("\n");
}

*/
for(int i=0; i<N*N; i++)
assert(c[i]==d[i]);
printf("\n\t\t Time in Seconds (T)

: %lf\n\n",time_overhead);

cudaFree( ad ); cudaFree( bd ); cudaFree( cd );


delete[] a; delete[] b; delete[] c; delete[] d;
return EXIT_SUCCESS;
}
Conclusion: Thus, I have implemented a CUDA program for matrix addition.


Practical 6(c)
Aim:
To write a CUDA program for prefix Sum
Program:
#include<stdio.h>
#include<cuda.h>
#include <assert.h>
#include<sys/time.h>
#define N 5
#define BLOCKSIZE 5
__global__ void PrefixSum(float *dInArray, float *dOutArray, int arrayLen, int
threadDim)
{
int tidx = threadIdx.x;
int tidy = threadIdx.y;
int tindex = (threadDim * tidx) + tidy;
int maxNumThread = threadDim * threadDim;
int pass = 0;
int count ;
int curEleInd;
float tempResult = 0.0;
while( (curEleInd = (tindex + maxNumThread * pass)) < arrayLen )
{
tempResult = 0.0f;
for( count = 0; count <= curEleInd; count++)
tempResult += dInArray[count];
dOutArray[curEleInd] = tempResult;
pass++;
}
__syncthreads();
}//end of Prefix sum function
void PrefixSum_cpu(float *x_h, float *z_h)
{
int i;
for(i=0; i<N; i++)
{
if(i==0)
z_h[i]=x_h[i];
else
z_h[i]=z_h[i-1]+x_h[i];
}
}
int main()
{
float *x_h, *y_h, *z_h;
float *x_d, *y_d;
int i;
struct timeval TimeValue_Start;
struct timezone TimeZone_Start;
struct timeval TimeValue_Final;
struct timezone TimeZone_Final;
long time_start, time_end;
double time_overhead;
size_t size = N*sizeof(float);
x_h = (float *)malloc(size);
y_h = (float *)malloc(size);
z_h = (float *)malloc(size);
cudaMalloc((void **)&x_d,size);
cudaMalloc((void **)&y_d,size);
for(i=0; i<N; i++)
{
x_h[i] = (float) i+1;
}
cudaMemcpy(x_d,x_h,size,cudaMemcpyHostToDevice);
dim3 dimBlock(BLOCKSIZE,BLOCKSIZE);
dim3 dimGrid(1,1);
gettimeofday(&TimeValue_Start, &TimeZone_Start);
PrefixSum<<<dimGrid, dimBlock>>>(x_d, y_d, N, BLOCKSIZE);
gettimeofday(&TimeValue_Final, &TimeZone_Final);
cudaMemcpy(y_h,y_d,size,cudaMemcpyDeviceToHost);
PrefixSum_cpu(x_h,z_h);
for(i = 0; i < N; i++)
assert(y_h[i]==z_h[i]);
time_end = TimeValue_Final.tv_sec * 1000000 + TimeValue_Final.tv_usec;
time_start = TimeValue_Start.tv_sec * 1000000 + TimeValue_Start.tv_usec;
time_overhead = (time_end - time_start)/1000000.0;
printf("\n\t\t Time in Seconds (T)

: %lf\n\n",time_overhead);

free(x_h);
free(y_h);
free(z_h);
cudaFree(x_d);
cudaFree(y_d);
return 0;
}
Conclusion: Thus, I have implemented a CUDA program for prefix sum.


Practical 6(d)
Aim:
To write a CUDA program for Matrix Transpose
Program:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <cuda.h>
#include <assert.h>
#include <sys/time.h>
const int N = 8;
const int blocksize = 4;
__global__ void transpose_naive( float *out, float *in, const int N ) {
unsigned int xIdx = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int yIdx = blockDim.y * blockIdx.y + threadIdx.y;
if ( xIdx < N && yIdx < N ) {
unsigned int idx_in = xIdx + N * yIdx;
unsigned int idx_out = yIdx + N * xIdx;
out[idx_out] = in[idx_in];
}
}
void mat_trans_cpu(float *a, float *c)
{
int mn = N*N;
/* N rows and N columns */
int q = mn - 1;
int i = 0;
/* Index of 1D array that represents the matrix */
do
{
int k = (i*N) % q;
while (k>i)
k = (N*k) % q;
if (k!=i)
{
c[k] = a[i];
c[i] = a[k];
}
else
c[i] = a[i];
} while ( ++i <= (mn -2) );
c[i]=a[i];
/* Update row and column */
/*
matrix.M = N;
matrix.N = M;*/
}
int main()
{
float *a = new float[N*N];
float *b = new float[N*N];
float *c = new float[N*N];
int i;
for ( i = 0; i < N*N; ++i ) {
a[i] = drand48();
}
struct timeval TimeValue_Start;
struct timezone TimeZone_Start;
struct timeval TimeValue_Final;
struct timezone TimeZone_Final;
long time_start, time_end;
double time_overhead;
/*
for ( i = 0; i < N*N; i++)
{
printf("\t%f",a[i]);
if(((i+1)%N == 0))
printf("\n");
}
*/
float *ad, *bd ;
const int size = N*N*sizeof(float);
cudaMalloc( (void**)&ad, size );
cudaMalloc( (void**)&bd, size );
cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
dim3 dimBlock( blocksize, blocksize );


dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
gettimeofday(&TimeValue_Start, &TimeZone_Start);
transpose_naive<<<dimGrid, dimBlock>>>( bd, ad, N );
gettimeofday(&TimeValue_Final, &TimeZone_Final);
cudaMemcpy( b, bd, size, cudaMemcpyDeviceToHost );
mat_trans_cpu(a,c);
/* To display uncomment this section */
/*
printf("result matrix\n");
for ( i = 0; i < N*N; ++i ){
printf("\t%f%f",b[i],c[i]);
if( ((i+1)%N == 0))
printf("\n");
}
*/
time_end = TimeValue_Final.tv_sec * 1000000 + TimeValue_Final.tv_usec;
time_start = TimeValue_Start.tv_sec * 1000000 + TimeValue_Start.tv_usec;
time_overhead = (time_end - time_start)/1000000.0;
for(i=0; i<N*N; i++)
assert(b[i]==c[i]);
printf("\n\t\t Time in Seconds (T)

: %lf\n\n",time_overhead);

cudaFree( ad ); cudaFree( bd );
delete[] a; delete[] b; delete[] c;
return EXIT_SUCCESS;
}
Conclusion: Thus, I have implemented a CUDA program for matrix transpose.


Practical 6(e)
Aim:
To write a CUDA program for vector addition
Program:
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>
#define N 4096 // size of array
__global__ void vectorAdd(int *a,int *b, int *c)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if(tid < N){
c[tid] = a[tid]+b[tid];
}
}
int main(int argc, char *argv[])
{
int T = 10, B = 1; // threads per block and blocks per grid
int a[N],b[N],c[N]; // vectors, statically declared
int *dev_a, *dev_b, *dev_c;
printf("Size of array = %d\n", N);
do {
printf("Enter number of threads per block (1024 max, comp. cap. 2.x ");
scanf("%d",&T);
printf("\nEnter number of blocks per grid: ");
scanf("%d",&B);
if (T * B < N) printf("Error T x B < N, try again\n");
} while (T * B < N);
cudaEvent_t start, stop; // using cuda events to measure time
float elapsed_time_ms;
cudaMalloc((void**)&dev_a,N * sizeof(int));
cudaMalloc((void**)&dev_b,N * sizeof(int));
cudaMalloc((void**)&dev_c,N * sizeof(int));
for(int i=0;i<N;i++) { // load arrays with some numbers
a[i] = i;
b[i] = i*2;
}
cudaMemcpy(dev_a, a , N*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b , N*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(dev_c, c , N*sizeof(int),cudaMemcpyHostToDevice);
cudaEventCreate( &start ); // instrument code to measure start time
cudaEventCreate( &stop );
cudaEventRecord( start, 0 );
vectorAdd<<<B,T>>>(dev_a,dev_b,dev_c);
cudaMemcpy(c,dev_c,N*sizeof(int),cudaMemcpyDeviceToHost);
cudaEventRecord( stop, 0 ); // instrument code to measure end time
cudaEventSynchronize( stop );
cudaEventElapsedTime( &elapsed_time_ms, start, stop );
for(int i=0;i<N;i++) {
// printf("%d+%d=%d\n",a[i],b[i],c[i]); // uncomment to print every element
assert(c[i]==(a[i]+b[i]));
}
printf("Time to calculate results: %f ms.\n", elapsed_time_ms);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;

}
Conclusion: Thus, I have implemented a CUDA program for vector addition.


Practical 6(f)
Aim:
To write a CUDA program for vector multiplication
Program:
#include <stdio.h>
#include <cuda.h>
#include <sys/time.h>
#include <assert.h>
__global__ void mult_vect(float * x, float * y, float * z, int n)
{
int idx= blockIdx.x * blockDim.x + threadIdx.x;
if(idx < n)
{
z[idx] = x[idx] * y[idx];
}
}
int main()
{
float *x_h, *y_h, *z_h;
float *x_d, *y_d, *z_d;
int n= 20,i;
size_t size= n * sizeof(float);
struct timeval TimeValue_Start;
struct timezone TimeZone_Start;
struct timeval TimeValue_Final;
struct timezone TimeZone_Final;
long time_start, time_end;
double time_overhead;
/* allocating memory on CPU */
x_h= (float *)malloc(size);
y_h= (float *)malloc(size);
z_h= (float *)malloc(size);
/* allocating memory on Device */


cudaMalloc( (void**)&x_d, size );
cudaMalloc( (void**)&y_d, size );
cudaMalloc( (void**)&z_d, size );
/* Initialization of Vectors */
for(i=0; i < n; i++)
{
x_h[i]= (float) i;
y_h[i]= (float) i;
}
/* Copying from Host to Device */
cudaMemcpy(x_d, x_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(y_d, y_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(z_d, z_h, size, cudaMemcpyHostToDevice);
int block_size= 4;
int num_blocks= (n + block_size - 1) / block_size;
gettimeofday(&TimeValue_Start, &TimeZone_Start);
/* kernel launching */
mult_vect <<<num_blocks, block_size>>> (x_d, y_d, z_d, n);
gettimeofday(&TimeValue_Final, &TimeZone_Final);
/* Copying from Device to Host */
cudaMemcpy(x_h, x_d, size, cudaMemcpyDeviceToHost);
cudaMemcpy(y_h, y_d, size, cudaMemcpyDeviceToHost);
cudaMemcpy(z_h, z_d, size, cudaMemcpyDeviceToHost);
time_end = TimeValue_Final.tv_sec * 1000000 + TimeValue_Final.tv_usec;
time_start = TimeValue_Start.tv_sec * 1000000 + TimeValue_Start.tv_usec;
time_overhead = (time_end - time_start)/1000000.0;
/* Checking whether the result is correct or not */
for(i = 0; i < n ; i++)
assert(z_h[i]==(x_h[i] * y_h[i]));
printf("\n\t\t Time in Seconds (T)

: %lf\n\n",time_overhead);

free(x_h);
free(y_h);
free(z_h);
cudaFree(x_d);
cudaFree(y_d);
cudaFree(z_d);
return 0;
}
Conclusion: Thus, I have implemented a CUDA program for vector multiplication.


Practical 7
Case Study: Differentiating between CUDA
Programming and OpenCL Programming
Performance
The first feature is Performance. Both CUDA and OpenCL are fast, and on GPU
devices they are much faster than the CPU for data-parallel codes, with 10X speedups
commonly seen on data-parallel problems.
Both CUDA and OpenCL can fully utilize the hardware. Performance depends upon a
slew of variables, including hardware type, algorithm type, and code quality.
Scalability
With respect to scalability, there are some other interesting developments of note. The
first is that there is new technology in CUDA called GPUDirect that is aimed at
reducing memory transfer overheads when communicating between multiple GPUs.
It has optimizations to reduce overhead by allowing peer-to-peer memory transfers
between GPUs on the same PCI express bus. It also has optimization to reduce the
overhead of moving data from GPU memory to a network interface card. This is
certainly an interesting development, but it is too new for us to say if it offers enough
benefit to be an important technology.
The second interesting development is in mobile GPU computing. OpenCL has
quickly become the most pervasive way to do GPU computing on mobile devices,
including smartphones and tablets. Companies like ARM, Imagination Technologies,
Freescale, Qualcomm, Samsung, and others are all enabling their mobile GPUs to run
OpenCL codes. There are more mobile devices sold each year than there are PCs, so
this is a huge community that is beginning to put its support behind OpenCL. At
AccelerEyes, we have done several GPU consulting projects on mobile GPUs and are
believers that there is big benefit to accelerating apps, especially computer vision and
video processing apps, directly on the phone or tablet.
Portability
The third feature is Portability. This is perhaps the most recognizable difference
between CUDA and OpenCL. CUDA only runs on NVIDIA GPUs, while OpenCL is
the open industry standard and runs on AMD, Intel, NVIDIA, and other hardware
devices.
Also, with respect to portability, CUDA does not provide CPU fallback. Currently,
developers using CUDA typically put if-statements in their code that distinguish
between the presence or absence of a GPU device at runtime (see the sketch below). In contrast, with
OpenCL, CPU fallback is supported and makes code maintenance much easier.
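As an illustration of that runtime check, a minimal sketch (not taken from the case study; the two path functions are hypothetical placeholders):

#include <stdio.h>
#include <cuda_runtime.h>
void run_on_gpu(void){ printf("GPU path: kernels would be launched here\n"); }
void run_on_cpu(void){ printf("CPU fallback path: host code runs here\n"); }
int main(void){
int deviceCount = 0;
cudaError_t err = cudaGetDeviceCount(&deviceCount);
/* branch at runtime on the presence or absence of a CUDA device */
if(err == cudaSuccess && deviceCount > 0)
run_on_gpu();
else
run_on_cpu();
return 0;
}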
Community
The fourth feature is Community. This is the feature that encompasses support,
longevity, commitment, etc. As those things are hard to measure, we put together a
proxy. It is interesting to look at the number of forum topics on NVIDIA's CUDA
forums (nearly 27,000) and AMD's OpenCL forums (around 4,000). Also, the neutral
third-party site Stack Overflow has tags for CUDA and OpenCL, with the number of CUDA
tags being over 3X the number of OpenCL tags. As you would expect, there are many
more people doing CUDA programming today due to the great investment NVIDIA
has put into building the ecosystem for GPU computing.
Programmability
The fifth and final feature is Programmability. Both CUDA and OpenCL are low-level. It is time-consuming to do GPU kernel development in either of those
platforms. The bulk of that time is often spent in redesigning algorithms to exploit
data-parallelism.
Libraries really make all the difference in GPU computing. To compare and contrast
CUDA with OpenCL, it is important to look at the libraries available for each;
CUDA currently has broader library support than OpenCL.


Mini Project: Dijkstrer



Project Name:
Implementation of Dijkstra's Algorithm in MPI
Theory:
Dijkstrer is a C application which calculates the shortest path in a graph using
Dijkstra's algorithm. Dijkstrer uses the MPI library to implement cluster functionality,
where every member of the cluster calculates one of a node's children. Dijkstrer uses
each process to calculate one child of every node. This means that when a node has
fewer children connected than the number of processes, some processes (the
highest-ranked ones) will do no work. Use a number of processes equal to the maximum
number of children connected to any node.
There are many ways of implementing graphs. Dijkstrer uses two structures to
represent nodes and arcs (named 'node' and 'acne' in the code). Each node keeps a list
with references to every arc connected to it. Each arc has a node member which points
to the connected node. In this way we are able to represent a path. Using recursion we
travel along the shortest path and calculate the children of each node.
Compile & Run:
Dijkstrer requires MPICH2 library to be installed. Once you have installed it you can
compile Dijkstrer using the C compiler of your choice.
Example of compiling under Windows using gcc:
gcc -o dijkstrer.exe -I "C:\path\to\MPICH2\include" -L
"C:\path\to\MPICH2\lib" dijkstrer.c "C:\path\to\MPICH2\lib\mpi.lib"
Then, once you have your cluster up and running, copy the executable to each of the
cluster's members and run it. Dijkstrer will read the graph from a file given on the
command line, or from the default one, which is dijkstra.dat.
The file's content should be in the format:
n N
startNode endNode distance
startNode endNode distance
.
.
.
startNode endNode distance
where n is the total number of nodes and N is the total number of arcs (acnes).
The file must also be copied to every cluster member at the same path.
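For illustration, a hypothetical dijkstra.dat describing a graph with 4 nodes and 5 arcs (not taken from the original report) could look like this:
4 5
0 1 10
0 2 5
1 3 1
2 1 3
2 3 9
With this file, nodes 0 and 2 each have two outgoing arcs, so the program would have to be run with at least 2 processes.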
Example of running Dijkstrer under Windows using MPICH2's mpiexec:


mpiexec -hosts 2 <host1> 1 <host2> 2 c:\dijkstrer.exe c:\dijkstra.dat
Don't forget to copy the executables and the data file.
Each of your cluster's members will create a process which will calculate one child of
every node.
Program:
#include "mpich2/mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct _node node;
typedef struct _acne acne;
int myid, numprocs;
/*
* There are two structures.
* 'node' represents each node; its children acnes reference the nodes connected to it.
* 'acne' represents each arc connecting two nodes.
*/
struct _node {
int distance; // distance from the first node
struct _acne *children[100]; // going to nodes
int edgeno; // how many nodes are connected to
int name;
};
struct _acne {
int value; // distance between these edges
int from; // acne connecting from
int to; // to
struct _node *edge;
};
// Get the node with the minimum distance so we can go on #Dijkstra's Algorithm
int getMinDistanceNode(struct _acne *array[], int size){
int minnum=array[0]->edge->distance, minnode=0, j;
for(j=0; j< size;j++){
if(array[j]->edge->distance<minnum){
minnode = j;
minnum = array[j]->edge->distance;
}
}
return minnode;
}
// Get the maximum number of children connected to one node
int getMaxEdgeNo(struct _node array[], int size){
int maxnum=array[0].edgeno, j;
for(j=0; j< size;j++){
if(array[j].edgeno > maxnum){
maxnum = array[j].edgeno;
}
}
return maxnum;
}
/*
* For every node we enter (the parameter), we search its connected child nodes and
* calculate their distances. Then we recurse into the node with the minimum distance.
*/
int traverse(struct _node *node){
// the node with edgeno=0 is the last node
if(node->edgeno<=0)
return 0;
if (myid == 0) {
printf("start of node %d\n", node->name);
fflush(stdout);
printf("\tchildrens: %d\n", node->edgeno);
fflush(stdout);
}
int i;
// Since for every node every process calculates one child, we don't want more
// processes than the number of children of each node to be executed
if(myid<node->edgeno){
// Dijkstra's algorithm: relax the edge to my child if it gives a shorter distance
if(node->distance + node->children[myid]->value < node->children[myid]->edge->distance ){
node->children[myid]->edge->distance = node->distance + node->children[myid]->value;
}
// Every process should broadcast its results to the others, so they can collaborate
// Let the world know my results
for (i = 0; i < node->edgeno; i++) {
MPI_Bcast( &node->children[i]->edge->distance, 1, MPI_INT, i,
MPI_COMM_WORLD );
}
printf("\tprocess id %d calculates node %d with distance %d\n", myid,
node->children[myid]->edge->name, node->children[myid]->edge->distance);
fflush(stdout);
}
if (myid == 0) {
printf("end of node\n-----------------------\n");
fflush(stdout);
printf("\tminnode :%d\n", node->children[getMinDistanceNode(node>children, node->edgeno)]->edge->name);
fflush(stdout);
}
// Recurse into the node with the minimum distance
traverse(node->children[getMinDistanceNode(node->children, node->edgeno)]->edge);
return 0;
}
int main(int argc, char *argv[]){
// MPI initialization
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
FILE *ifp;
int from, to, distance;
int nodesNo = 0;
int edgesNo = 0;
int acnesCtr = 0; // count the acnes
int previous = 0;
int childsCtr = 0; // count childs for every node
char *file;
if(argc==2)
file = argv[argc-1];
else
file = "dijkstrer.dat";
ifp = fopen(file, "r");
if (ifp == NULL) {
fprintf(stderr, "Can't open input file!\n");
MPI_Finalize();
exit(1);
}
/*
* Files content should be in the format
* n N
* startNode endNode distance
* startNode endNode distance
*.
*.
*.
* startNode endNode distance
*
* where n is the total number of Nodes and N is the total number of Acnes
*
*/
// read the first line
fscanf(ifp, "%d %d", &nodesNo, &edgesNo);
struct _node nodes[nodesNo];
struct _acne acnes[edgesNo];
while (fscanf(ifp, "%d %d %d", &from, &to, &distance) == 3) { /* read from
file */
/*
* Then for every line we create a node and also
* an acne to place the connected edges.
*
*/
if (previous != from)
childsCtr = 0;
nodes[from].distance = 65000; // infinite
nodes[from].name = from;
acnes[acnesCtr].from = from;
acnes[acnesCtr].to = to;
acnes[acnesCtr].value = distance;
acnes[acnesCtr].edge = &nodes[to];
nodes[from].children[childsCtr] = &acnes[acnesCtr];
nodes[from].edgeno = childsCtr + 1;
acnesCtr++;
childsCtr++;
previous = from;
}
// set first node's distance
nodes[0].distance = 0;
// set last node
nodes[nodesNo - 1].distance = 65000;
nodes[nodesNo - 1].edgeno = 0;
nodes[nodesNo - 1].name = nodesNo - 1;
//
int maxnum = getMaxEdgeNo(nodes, sizeof(nodes)/sizeof(*nodes));
if(numprocs<maxnum){
fprintf(stderr, "Number of processes should be greater that the maximum
number of edges: %d!\n", maxnum);
MPI_Finalize();
exit(1);
}
// get in calculation
traverse(&nodes[0]);
// print out the results
if(myid == 0){
int j;
for(j=0; j< (sizeof(nodes)/sizeof(*nodes));j++){
printf("Node: %d, Distance from the root node: %d, Number of
edges: %d\n", nodes[j].name, nodes[j].distance, nodes[j].edgeno);
fflush(stdout);
}
}
MPI_Finalize();
return 0;
}
Conclusion:
Thus, the mini project Dijkstrer has been implemented using MPI.
