Parallel Computing
Parallel processing is a mode of operation in which a task is executed
simultaneously on multiple processors of the same computer. It is meant to
reduce the overall processing time. In this tutorial, you’ll learn how to
parallelize typical logic using Python’s multiprocessing module.
Contents
1. Introduction
2. How many maximum parallel processes can you run?
3. What is Synchronous and Asynchronous execution?
4. Problem Statement: Count how many numbers exist between a given range in
each row
Solution without parallelization
5. How to parallelize any function?
6. Asynchronous Parallel Processing
7. How to Parallelize a Pandas DataFrame?
8. Exercises
9. Conclusion
1. Introduction
Parallel processing is a mode of operation in which a task is executed
simultaneously on multiple processors of the same computer, with the aim of
reducing the overall processing time.
However, communicating between processes carries some overhead, which for
small tasks can increase the overall time taken instead of decreasing it.
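To see this overhead in practice, here is a minimal sketch that times the same trivial task run serially and through a pool. The function `square`, the input size, and the pool size of 2 are illustrative assumptions and not part of the tutorial's own example. Timings vary by machine, but for a task this small the pooled run is often the slower one.

```python
import multiprocessing as mp
import time


def square(x):
    """A deliberately tiny task: computing it is cheaper than shipping it to a worker."""
    return x * x


def run_serial(values):
    return [square(v) for v in values]


def run_parallel(values, processes=2):
    # Each chunk of inputs is pickled, sent to a worker process, and the results sent back.
    with mp.Pool(processes) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    values = list(range(10_000))

    start = time.perf_counter()
    serial = run_serial(values)
    serial_time = time.perf_counter() - start

    start = time.perf_counter()
    parallel = run_parallel(values)
    parallel_time = time.perf_counter() - start

    assert serial == parallel  # same answers either way
    print(f"serial:   {serial_time:.4f}s")
    print(f"parallel: {parallel_time:.4f}s")
```

The `if __name__ == "__main__":` guard keeps worker processes from re-running the script body on platforms that start workers by importing the main module, such as Windows.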
In Python, the multiprocessing module runs independent parallel processes
by using subprocesses instead of threads. It lets you leverage multiple
processors on a machine (both Windows and Unix), and because each worker is a
separate process, the processes run in completely separate memory locations.
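The separate memory spaces can be illustrated with a small sketch. The names `counter`, `increment_and_report`, and `demo` are hypothetical, chosen only for this illustration. A child process changes a module-level variable, and the parent's copy stays untouched.

```python
import multiprocessing as mp

counter = 0  # module-level state; every process works on its own copy


def increment_and_report(queue):
    """Runs in the child: changes the child's copy of `counter` and reports it."""
    global counter
    counter += 100
    queue.put(counter)


def demo():
    queue = mp.Queue()
    child = mp.Process(target=increment_and_report, args=(queue,))
    child.start()
    child_value = queue.get()  # the value as seen inside the child
    child.join()
    return child_value, counter  # the parent's `counter` was never modified


if __name__ == "__main__":
    child_value, parent_value = demo()
    print("child saw:", child_value)     # 100
    print("parent sees:", parent_value)  # 0
```

Because nothing is shared implicitly, data has to travel between processes explicitly, here through a `Queue`.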
By the end of this tutorial you will know:
1. How to structure the code and use the syntax that enables parallel
processing with multiprocessing.
2. How to implement synchronous and asynchronous parallel processing.
3. How to parallelize a Pandas DataFrame.
4. How to solve 3 different use cases with the multiprocessing.Pool() interface.
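# Prepare data
import numpy as np
rng = np.random.RandomState(100)  # keep the seeded generator and draw from it
# Assumed: array shape and value range
arr = rng.randint(0, 10, size=[200000, 5])
data = arr.tolist()
data[:5]

# Solution without parallelization
# Assumed: function name and signature
def howmany_within_range(row, minimum, maximum):
    """Returns how many numbers lie between `minimum` and `maximum` in a given
    `row`"""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count

results = []
# Assumed: loop over the rows with bounds 4 and 8
for row in data:
    results.append(howmany_within_range(row, minimum=4, maximum=8))
print(results[:10])
#> [3, 1, 4, 4, 4, 2, 1, 1, 3, 3]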
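# Parallelize with Pool.apply()
import multiprocessing as mp
pool = mp.Pool(mp.cpu_count())
# Assumed call: `howmany_within_range` (assumed name) is the counting function defined earlier; bounds 4 and 8 are assumed
results = [pool.apply(howmany_within_range, args=(row, 4, 8)) for row in data]
pool.close()
print(results[:10])
#> [3, 1, 4, 4, 4, 2, 1, 1, 3, 3]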
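# Parallelize with Pool.map(), which passes one item of the iterable to each call
import multiprocessing as mp

# Assumed: function name and default bounds (4 and 8)
def howmany_within_range_rowonly(row, minimum=4, maximum=8):
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count

pool = mp.Pool(mp.cpu_count())
results = pool.map(howmany_within_range_rowonly, data)  # assumed call
pool.close()
print(results[:10])
#> [3, 1, 4, 4, 4, 2, 1, 1, 3, 3]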
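# Parallelize with Pool.starmap(), which unpacks each tuple into the function's arguments
import multiprocessing as mp
pool = mp.Pool(mp.cpu_count())
# Assumed call: `howmany_within_range` (assumed name) is the counting function defined earlier; bounds 4 and 8 are assumed
results = pool.starmap(howmany_within_range, [(row, 4, 8) for row in data])
pool.close()
print(results[:10])
#> [3, 1, 4, 4, 4, 2, 1, 1, 3, 3]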
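# Asynchronous processing with Pool.apply_async() and a callback
import multiprocessing as mp
results = []

# Assumed: function name and signature
def howmany_within_range2(i, row, minimum, maximum):
    """Returns how many numbers lie between `minimum` and `maximum` in a given
    `row`"""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return (i, count)  # assumed return value: the row index lets results be re-ordered

def collect_result(result):
    global results
    results.append(result)

# Create the pool after the functions are defined so the workers can find them
pool = mp.Pool(mp.cpu_count())
# Assumed: submission loop with bounds 4 and 8
for i, row in enumerate(data):
    pool.apply_async(howmany_within_range2, args=(i, row, 4, 8), callback=collect_result)
pool.close()
pool.join()  # wait for all tasks to finish before reading `results`
results.sort(key=lambda x: x[0])
results_final = [r for i, r in results]
print(results_final[:10])
#> [3, 1, 4, 4, 4, 2, 1, 1, 3, 3]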
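# Pool.apply_async() without a callback
import multiprocessing as mp
pool = mp.Pool(mp.cpu_count())
results = []
# Assumed calls: `howmany_within_range2` (assumed name) is a variant of the counting
# function that returns (row index, count); bounds 4 and 8 are assumed
result_objects = [pool.apply_async(howmany_within_range2, args=(i, row, 4, 8)) for i, row in enumerate(data)]
# .get() blocks until each result is ready; element 1 of each tuple is the count
results = [r.get()[1] for r in result_objects]
pool.close()
pool.join()
print(results[:10])
#> [3, 1, 4, 4, 4, 2, 1, 1, 3, 3]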
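# Pool.starmap_async()
import multiprocessing as mp
pool = mp.Pool(mp.cpu_count())
# Assumed call: `howmany_within_range` (assumed name) is the counting function defined earlier; .get() waits for the full list
results = pool.starmap_async(howmany_within_range, [(row, 4, 8) for row in data]).get()
pool.close()
print(results[:10])
#> [3, 1, 4, 4, 4, 2, 1, 1, 3, 3]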
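import numpy as np
import pandas as pd
import multiprocessing as mp
# Assumed construction: random integers from 3 to 9 in 5 rows and 2 columns
df = pd.DataFrame(np.random.randint(3, 10, size=[5, 2]))
print(df.head())
#>    0  1
#> 0  8  5
#> 1  5  3
#> 2  3  4
#> 3  4  4
#> 4  7  9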
We have a DataFrame. Let’s apply the hypotenuse function to each row,
running 4 processes at a time.
To do this, we use df.itertuples(name=False). Setting name=False passes each
row of the DataFrame to the hypotenuse function as a plain tuple, with the row
index as its first element.
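# Row wise Operation
def hypotenuse(row):
    return round((row[1]**2 + row[2]**2)**0.5, 2)  # assumed body; row = (index, col0, col1)
with mp.Pool(4) as pool:
    output = pool.map(hypotenuse, df.itertuples(name=False))  # assumed call
print(output)
def sum_of_squares(column):
    return sum(i**2 for i in column[1])  # assumed body; column = (name, Series) as yielded by df.items()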
Now comes the third part: parallelizing a function that accepts a Pandas
DataFrame, NumPy array, etc. The pathos package follows the multiprocessing
style of Pool > Map > Close > Join > Clear. Check out the pathos docs for more info.
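import numpy as np
import pandas as pd
import multiprocessing as mp
from pathos.multiprocessing import ProcessingPool as Pool  # assumed: `Pool` below is the pathos pool

def func(df):
    return df.shape

cores = mp.cpu_count()
# Assumed data and split: one chunk of rows per core
df = pd.DataFrame(np.random.randint(3, 10, size=[500, 2]))
df_split = [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), cores)]
pool = Pool(cores)
# Assumed call: map `func` over the chunks and stack the resulting shapes
df_out = np.vstack(pool.map(func, df_split))
pool.close()
pool.join()
pool.clear()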
8. Exercises
Problem 1: Use Pool.apply() to get the row-wise common items in list_a and list_b.
list_a = [[1, 2, 3], [5, 6, 7, 8], [10, 11, 12], [20, 21]]
list_b = [[2, 3, 4, 5], [6, 9, 10], [11, 12, 13, 14], [21, 24, 25]]
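One possible solution is sketched below. It assumes that "common items" means the set intersection of each pair of rows, and the helper name `common_items` is hypothetical.

```python
import multiprocessing as mp

list_a = [[1, 2, 3], [5, 6, 7, 8], [10, 11, 12], [20, 21]]
list_b = [[2, 3, 4, 5], [6, 9, 10], [11, 12, 13, 14], [21, 24, 25]]


def common_items(row_a, row_b):
    """Return the sorted items that appear in both rows."""
    return sorted(set(row_a) & set(row_b))


if __name__ == "__main__":
    with mp.Pool(mp.cpu_count()) as pool:
        # Pool.apply() blocks until each call returns, so rows are handled one after another.
        results = [pool.apply(common_items, args=(a, b)) for a, b in zip(list_a, list_b)]
    print(results)
    #> [[2, 3], [6], [11, 12], [21]]
```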
Problem 2: Use Pool.map() to run the following Python scripts in parallel.
Script names: ‘script1.py’, ‘script2.py’, ‘script3.py’
Problem 3: Normalize each row of a 2D array (list) to vary between 0 and 1.
list_a = [[2, 3, 4, 5], [6, 9, 10, 12], [11, 12, 13, 14], [21, 24, 25, 26]]
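A possible approach is sketched here, assuming min-max scaling is what is intended and that no row is constant, since a constant row would divide by zero. The helper name `normalize` is hypothetical.

```python
import multiprocessing as mp

list_a = [[2, 3, 4, 5], [6, 9, 10, 12], [11, 12, 13, 14], [21, 24, 25, 26]]


def normalize(row):
    """Min-max scale a row so its smallest value maps to 0 and its largest to 1."""
    low, high = min(row), max(row)
    return [(x - low) / (high - low) for x in row]


if __name__ == "__main__":
    with mp.Pool(mp.cpu_count()) as pool:
        # Pool.map() sends one row to each call and keeps the results in input order.
        results = pool.map(normalize, list_a)
    for row in results:
        print([round(x, 2) for x in row])
```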
Parallel processing lets your program work on several tasks at the same time, which reduces the
overall processing time and helps when handling large-scale problems.
In this section we will cover the following topics:
Introduction to parallel processing
The multiprocessing Python library for parallel processing
IPython parallel framework
Using the standard multiprocessing module, we can efficiently parallelize simple tasks by
creating child processes. This module provides an easy-to-use interface and contains a set of
utilities to handle task submission and synchronization.
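As a rough illustration of task submission and synchronization with the standard module, the sketch below starts four child processes that update one shared counter. The function name `add_many`, the worker count, and the iteration count are illustrative assumptions.

```python
import multiprocessing as mp


def add_many(counter, lock, times):
    """Increment a shared counter `times` times, holding the lock for each update."""
    for _ in range(times):
        with lock:
            counter.value += 1


if __name__ == "__main__":
    counter = mp.Value("i", 0)  # shared integer, visible to every child process
    lock = mp.Lock()

    # Task submission: start four child processes running the same function.
    workers = [mp.Process(target=add_many, args=(counter, lock, 1000)) for _ in range(4)]
    for w in workers:
        w.start()
    # Synchronization: wait for every child to finish before reading the total.
    for w in workers:
        w.join()

    print(counter.value)  # 4000
```

The lock matters because `counter.value += 1` is a read followed by a write; without it, two processes could read the same value and one update would be lost.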
Process
By subclassing multiprocessing.Process, you can create a process that runs independently. Extend
the __init__ method to initialize resources, and implement the Process.run() method to hold the
code that the subprocess executes. A simple example is a process that prints the id assigned to
it.
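A minimal sketch of such a subclass could look like this. The class name `PrintIdProcess`, the parameter `proc_id`, and the message text are assumptions made for illustration.

```python
import multiprocessing as mp


class PrintIdProcess(mp.Process):
    """A process that prints the id it was given."""

    def __init__(self, proc_id):
        super().__init__()  # let Process set itself up before adding attributes
        self.id = proc_id

    def run(self):
        # Executed inside the child process once start() has been called.
        print(f"I'm the process with id: {self.id}")


if __name__ == "__main__":
    p = PrintIdProcess(0)
    print(p.is_alive())  # False: creating the object does not start a process
```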
To spawn the process, we initialize our Process object and invoke its Process.start() method.
Process.start() creates a new process, which in turn invokes the Process.run() method.
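The spawning step can be sketched as follows, assuming a small hypothetical `Worker` subclass and a helper `make_workers`; neither name comes from the original text.

```python
import multiprocessing as mp


class Worker(mp.Process):
    def __init__(self, worker_id):
        super().__init__()
        self.worker_id = worker_id

    def run(self):
        print(f"process {self.worker_id} running")


def make_workers(n):
    """Create (but do not start) n Worker objects."""
    return [Worker(i) for i in range(n)]


if __name__ == "__main__":
    workers = make_workers(3)
    for w in workers:
        w.start()  # creates the child process, which then calls run()
    for w in workers:
        w.join()   # wait for the child process to exit
    print([w.exitcode for w in workers])  # [0, 0, 0] when every run() returned normally
```

Calling `run()` directly would execute the code in the current process; only `start()` creates a new one.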