Assignment 1 - Big Data System
CC ZG522 –
Big Data Systems
Submitted By:
Bhagwan Singh
2023MT03527@wilp.bits-pilani.ac.in
Assignment-I
Scenario: You are working with a dataset that is too large to process on a single machine.
You decide to simulate a parallel data processing environment using Python's
multiprocessing library.
Requirements: Write a Python script that simulates the partitioning of data and its
parallel processing. The script should divide a large array of numbers into smaller chunks,
distribute these chunks across multiple processes for summing the numbers, and then
aggregate the results. Discuss how this simulation relates to the concepts of Shared Memory
and Message Passing.
Submission Guidelines
Deadline: 14th March 12.00 am
Format: Submit your assignment as a compressed archive (.zip) containing all your source
code, reports, and documentation. Each sub task should be clearly labeled and contain its
own README file with instructions on running the code and understanding the output.
Submission Portal:
Upload your assignment through the designated online learning platform.
Plagiarism Policy:
Originality of your work is crucial. Plagiarism will result in a zero mark for the entire
assignment. You may discuss concepts with peers, but the final submission must be your own
work.
Late Submission: Late submissions may be accepted with a penalty, as specified in the
course syllabus.
Assignment Overview:
This Python script uses the multiprocessing package to simulate parallel data processing. The code divides a large array of numbers into smaller chunks, distributes those chunks across several processes for parallel summation, and then aggregates the partial results to obtain the total sum.
README
Python script that simulates parallel data processing using the multiprocessing library
Solution Code:-
Below is a Python script that simulates parallel data processing using the multiprocessing
library. The script divides a large array of numbers into smaller chunks, distributes these
chunks across multiple processes for summing the numbers, and then aggregates the results.
Assignment-1_BDS.py
Part-1 : Data Partitioning

import multiprocessing
import numpy as np

def chunk_sum(chunk):
    """Evaluate the sum of a chunk of numbers and report the process ID."""
    processID = multiprocessing.current_process().name
    print(f'Processing chunk in {processID}')
    return sum(chunk)
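The partitioning step itself can be sketched with NumPy's array_split, which divides an array into roughly equal chunks, one per worker. This is an illustrative fragment; the chunk count of 4 is an assumed example, not a value fixed by the assignment:

```python
import numpy as np

# Illustrative partitioning: split a large array into roughly equal chunks.
large_array = np.random.randint(1, 100, 1000000)
num_processes = 4  # assumed number of worker processes for this sketch
chunks = np.array_split(large_array, num_processes)

print(len(chunks))                  # one chunk per process
print(sum(len(c) for c in chunks))  # every element is in exactly one chunk
```

array_split (unlike split) tolerates arrays whose length is not an exact multiple of the chunk count, which makes it a convenient choice for partitioning arbitrary data sizes.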
Explanation:
The chunk_sum function computes the sum of a single chunk and prints the name of the worker process that handled it, which makes the parallel distribution of work visible in the output.
Final Code:
Using the multiprocessing package, the Python program simulates parallel data processing. The code below was run in the GDB online Python compiler.
import multiprocessing
import numpy as np

def chunk_sum(chunk):
    """Evaluate the sum of a chunk of numbers and report the process ID."""
    processID = multiprocessing.current_process().name
    print(f'Processing chunk in {processID}')
    return sum(chunk)

if __name__ == "__main__":
    # Example data: a large array of random numbers
    large_array = np.random.randint(1, 100, 1000000)

    # Partition the data into one chunk per process
    num_processes = 4
    chunks = np.array_split(large_array, num_processes)

    # Distribute the chunks across a pool of worker processes
    with multiprocessing.Pool(processes=num_processes) as pool:
        partial_sums = pool.map(chunk_sum, chunks)

    # Aggregate the partial results into the total sum
    total_sum = sum(partial_sums)
    print(f'Total sum: {total_sum}')
Output:
Insight on the Output
The script generates a large array of random numbers and simulates parallel processing with the
specified number of processes. The final output displays the total sum obtained through parallel
processing.