Assignment 1 - Big Data System
CC ZG522 –
Big Data Systems
Submitted By:
Bhagwan Singh
2023MT03527@wilp.bits-pilani.ac.in
Assignment-I
Scenario: You are working with a dataset that is too large to process on a single machine.
You decide to simulate a parallel data processing environment using Python's
multiprocessing library.
Requirements: Write a Python script that simulates the partitioning of data and its
parallel processing. The script should divide a large array of numbers into smaller chunks,
distribute these chunks across multiple processes for summing the numbers, and then
aggregate the results. Discuss how this simulation relates to the concepts of Shared Memory
and Message Passing.
Submission Guidelines
Deadline: 14th March 12.00 am
Format: Submit your assignment as a compressed archive (.zip) containing all your source
code, reports, and documentation. Each sub task should be clearly labeled and contain its
own README file with instructions on running the code and understanding the output.
Submission Portal:
Upload your assignment through the designated online learning platform.
Plagiarism Policy:
Originality of your work is crucial. Plagiarism will result in a zero mark for the entire
assignment. You may discuss concepts with peers, but the final submission must be your own
work.
Late Submission: Late submissions may be accepted with a penalty, as specified in the
course syllabus.
Assignment Overview:
This Python script uses the multiprocessing package to simulate parallel data processing. The code divides a large array of numbers into smaller chunks, distributes those chunks across several processes for parallel summation, and then aggregates the partial results to obtain the total sum.
README
Python script that simulates parallel data processing using the multiprocessing library
Solution Code:-
Below is a Python script that simulates parallel data processing using the multiprocessing
library. The script divides a large array of numbers into smaller chunks, distributes these
chunks across multiple processes for summing the numbers, and then aggregates the results.
Assignment-1_BDS.py
Part-1 : Data Partitioning

import multiprocessing
import numpy as np

def chunk_sum(chunk):
    """Evaluate the sum of a chunk of numbers and report the process ID."""
    processID = multiprocessing.current_process().name
    print(f'Processing chunk in {processID}')
    return sum(chunk)
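The partitioning step itself can be sketched with NumPy's array_split, which divides an array into roughly equal chunks, one per worker. This is an illustrative fragment; the chunk count of 4 is an assumed example, not a value fixed by the assignment:

```python
import numpy as np

# Illustrative partitioning: split a large array into roughly equal chunks.
large_array = np.random.randint(1, 100, 1000000)
num_processes = 4  # assumed number of worker processes for this sketch
chunks = np.array_split(large_array, num_processes)

print(len(chunks))                  # one chunk per process
print(sum(len(c) for c in chunks))  # every element is in exactly one chunk
```

array_split (unlike split) tolerates arrays whose length is not an exact multiple of the chunk count, which makes it a convenient choice for partitioning arbitrary data sizes.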
Explanation:
The chunk_sum function computes the sum of a single chunk and prints the name of the worker process that handled it, which makes the parallel distribution of work visible in the output.
Final Code:
Using the multiprocessing package, the Python program simulates parallel data processing. The code below was run in the GDB online Python compiler.
import multiprocessing
import numpy as np

def chunk_sum(chunk):
    """Evaluate the sum of a chunk of numbers and report the process ID."""
    processID = multiprocessing.current_process().name
    print(f'Processing chunk in {processID}')
    return sum(chunk)

if __name__ == "__main__":
    # Example data: a large array of random numbers
    large_array = np.random.randint(1, 100, 1000000)

    # Partition the data into one chunk per process
    num_processes = 4
    chunks = np.array_split(large_array, num_processes)

    # Distribute the chunks across a pool of worker processes
    with multiprocessing.Pool(processes=num_processes) as pool:
        partial_sums = pool.map(chunk_sum, chunks)

    # Aggregate the partial results into the total sum
    total_sum = sum(partial_sums)
    print(f'Total sum: {total_sum}')
Output:
Insight on the Output
The script generates a large array of random numbers and simulates parallel processing with the
specified number of processes. The final output displays the total sum obtained through parallel
processing.