0% found this document useful (0 votes)
80 views

08 - Mixedprogramming: 1 Mixed Programming

The document discusses the performance of Python code compared to compiled languages like C and Fortran. It shows examples of summing integers, string concatenation, and type conversions in Python, NumPy, and with generators. Compiled C and Fortran versions are shown to be faster, with C being the fastest. NumPy can sometimes approach compiled speeds. The reasons Python is generally slower are explained as its dynamic typing requiring more operations during execution compared to static typed languages like C.

Uploaded by

giordano mancini
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
80 views

08 - Mixedprogramming: 1 Mixed Programming

The document discusses the performance of Python code compared to compiled languages like C and Fortran. It shows examples of summing integers, string concatenation, and type conversions in Python, NumPy, and with generators. Compiled C and Fortran versions are shown to be faster, with C being the fastest. NumPy can sometimes approach compiled speeds. The reasons Python is generally slower are explained as its dynamic typing requiring more operations during execution compared to static typed languages like C.

Uploaded by

giordano mancini
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 41

08_MixedProgramming

May 12, 2017

1 Mixed Programming
Section ??
Section ??
Section ??
Section ??
Section ??
Section 7
Section ??
Section ??

2 The zeroth law of optimization


2.0.1 "Premature optmization is the root of all evil" - Donald Knuth

3 How much time it takes?


The high-level nature of Python makes it very easy to program, read, and reason about code.
Many programmers report being more productive in Python.
Sometimes, however, we need to do many calculations over a lot of data. No matter how fast
computers are, or will get shortly, there will always be cases where you still need the code to be as
fast as you can get it.
Even with NumPy’s fast vectorized calculations, there are still times when either the vector-
ization is too complex, or it uses too much memory. It is also sometimes just easier to express the
calculation with a simple loop.

In [1]: %load_ext cython

In [2]: import numpy as np


import scipy as sp
import re
import matplotlib.pyplot as plt
import random
import cython
import math

%matplotlib inline

1
from IPython.display import Image
from re import search

But, how we do measure the time it takes to complete a set of commands?

In [3]: bigN = int(1e6)


global bigN

In [4]: %time sum(list(range(bigN)))

CPU times: user 72 ms, sys: 24 ms, total: 96 ms


Wall time: 94.2 ms

Out[4]: 499999500000

We can see a breakdown of the CPU time in user and sys and the also the Wall time needed
to complete the operation. You can see by yourself why wall times are different between two
measurements of the same operations.
By the way, the time Python module can be used in the same way:

import time
start_time = time.time()
def somefunc(bla,bla):
for i in range(iterations):
#do stuff
return result
elapsed = time.time() - start_time
per_iter = elapsed / iterations

Since this is a common operation, IPython provides the %time magic function:
but we have already shown that %timeit is a better tool to make estimates.

3.0.1 Summing integers


let’s see how much it takes to compute the sum of the first n integers in Python:

In [5]: %%timeit
j = 0
for i in range(bigN): j += i

10 loops, best of 3: 59.7 ms per loop

In [6]: %timeit sum(list(range(bigN)))

10 loops, best of 3: 34 ms per loop

2
In [7]: %%timeit
NN = np.arange(bigN)
np.sum(NN)

1000 loops, best of 3: 1.4 ms per loop

using built-in iterators is better, but numpy is much faster

3.0.2 What about string concatenation?


In [8]: mystring = 'qwertyuiopasdfghjklzxcvbnm'

In [9]: %%timeit
s = ''
for i in range(int(1e4)):
s += mystring

1000 loops, best of 3: 1.02 ms per loop

In [10]: %%timeit
s = ''
for i in range(int(1e4)):
s = "".join((s, mystring))

10 loops, best of 3: 84.5 ms per loop

3.0.3 Conversion between types


In [11]: mylist = [random.randint(99,120) for i in range(int(1e5))]

In [12]: %%timeit
s = list()
for i in mylist:
s.append(chr(i))

100 loops, best of 3: 14 ms per loop

In [13]: %%timeit
s = list()
s = [chr(i) for i in mylist]

100 loops, best of 3: 9.16 ms per loop

If we do not need every element at the same time ...

3
In [14]: %%timeit
s = (chr(i) for i in mylist)

The slowest run took 7.07 times longer than the fastest. This could mean that an intermediate re
1000000 loops, best of 3: 347 ns per loop

... generators can be a good idea

3.1 Exercises
1. Find another way to join the strings and compare the three methods (may be you can make
a graph).
2. Find another way of converting integers to ASCII characters.

3.1.1 Solutions
In [15]: %%timeit
#solution 1
r = list(mystring)
for i in range(int(1e4)):
r.append(mystring)
s = ''.join(r)

1000 loops, best of 3: 990 µs per loop

In [16]: %%timeit
#solution 2
s = map(chr,mylist)

The slowest run took 6.34 times longer than the fastest. This could mean that an intermediate re
10000000 loops, best of 3: 188 ns per loop

In [17]: %%timeit
#solution 3
s = list(map(chr,mylist))

100 loops, best of 3: 7.44 ms per loop

generator expression is fast but builtin map is still better

3.2 what about compiled languages?


Let’s write a simple Python script to estimate π and implement it in C and Fortran. We estimate
π as: ∫ 1
1
2
= arctg(1) − arctg(0) = π/4
0 1+x

4
In [18]: def simple_pi(num_iter):
mysum = .0
step = 1./num_iter

for i in range(num_iter):
x = (i+0.5)*step
mysum += 4./(x*x + 1)

#print(mysum*step)
return mysum*step

In [19]: simple_pi(bigN)

Out[19]: 3.1415926535897643

In [20]: %%timeit
simple_pi(bigN)

1 loop, best of 3: 273 ms per loop

In [21]: def numpy_pi(num_iter):


grid = np.linspace(0,1,num=num_iter,endpoint=False)
x = 4./(1. + grid**2)
pi = (1./num_iter)*np.sum(x)
return pi

In [22]: numpy_pi(bigN)

Out[22]: 3.1415936535896263

In [23]: %timeit numpy_pi(bigN)

100 loops, best of 3: 10.7 ms per loop

Let us see a Fortran version:

In [24]: !cat simple_pi.f

program simple_pi
implicit none
real*8 mysum, step, x
integer i, numiter
character*10 l

call getarg(1,l)
read (l,'(I10)') numiter

mysum = 0.d0

5
step = 1.d0 / dble(numiter)

Do 10 i = 0, numiter-1
x = (dble(i)+0.5d0)*step
mysum = mysum + 4.d0/(x*x + 1.d0)
10 Continue

write(6,*) mysum*step

end program simple_pi

In [25]: !gfortran -Wall -o fpi simple_pi.f

In [26]: %%timeit
%%bash -s $bigN
./fpi $1 > /dev/null

100 loops, best of 3: 10.8 ms per loop

Note the user and sys values and the cost of opening a BASH subprocess. By the way, notice
how we can pass a variable as a BASH positional argument...

In [27]: %%capture fbench


%%timeit
%%bash -s $bigN
./fpi $1 >& /dev/null

In [28]: print(fbench.stdout)

100 loops, best of 3: 10.3 ms per loop

What about C?

In [29]: !cat simple_pi.c

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int main (int argc, char **argv) {

int num_steps = argv[1] != NULL ? atoi(argv[1]) : 1E6;


int i;

double step, x, PI, sum=.0;

6
step = 1.0/(double) num_steps;

for (i=0;i< num_steps; i++){


x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
PI = step * sum;

printf("PI: %f\n",PI);
return 0;
}

In [30]: %%bash
gcc -Wall -o cpi simple_pi.c
In [31]: %%bash -s $bigN
./cpi $1
PI: 3.141593

In [32]: %%capture cbench


%%timeit
%%bash -s $bigN
./cpi $1 >& /dev/null
In [33]: print(cbench)
100 loops, best of 3: 9.81 ms per loop

So, it’s not a matter or language (or compiler). [Note the %capture IPython magic]
To wrap up:
1. Python is slower than a compiled language
2. Numpy (when we can use it) can sometimes perform approximately like compiled code (if
take into account the overhead of a BASH subprocess)
We have also shown that is fundamental to choose the proper data structure (e. g. map vs
generator vs list) to do a certain job, to avoid a catastrophic performance. More tips available
Section ??.
Finally, using a compiler requires some learning. More information about the GNU compiler
collection can be found here: https://gcc.gnu.org/
and a quick tutorial is found in this Section ??

4 Why Python is slow?


If you like to be pendantic, we should remember that Python is a standard with different implemen-
tations as the Section ?? remembers us.
But then, why running a for loop in CPython is slower than C?

7
4.0.1 Duck typing
Take the difference of C variable vs a Python one at runtime:

• the C variable is just a pointer to some array location with few definite properties
• the Python variable is a "box" containing an instance of a subclass of object

when the code executes, the Python interpreter decides what properties the variable may have
depending on context. The following C code:

int i,j,k;
i = 2;
j = 2;
k = i*j;
return k

means the CPU will use approximately seven cycles.


The Python equivalent:

i = 2
j = 2
k = i*j
return k

involves the following operations on the PyObject (see below) C structures:

• creating the objects i, j and k


• assigning the appropriate attibutes
• selecting the method for the ’*’ operator
• calculating the result and the appropriate type for k

and each of these operations means that the interpreter creates or uses something like that:

typedef struct _object {


_PyObject_HEAD_EXTRA
Py_ssize_t ob_refcnt;
struct _typeobject *ob_type;
} PyObject;

In [34]: Image(filename="box.png")

Out[34]:

8
4.0.2 Bytecode
From Wikipedia:
Bytecode, also known as p-code (portable code), is a form of instruction set designed for ef-
ficient execution by a software interpreter. Unlike human-readable source code, bytecodes are
compact numeric codes, constants, and references (normally numeric addresses) which encode
the result of parsing and semantic analysis of things like type, scope, and nesting depths of pro-
gram objects. They therefore allow much better performance than direct interpretation of source
code.

4.0.3 Interpreted language


Again, from Wikipedia, an interpreter may be a program that either 1. executes the source code
directly 2. translates source code into some efficient intermediate representation (code) and im-
mediately executes this 3. explicitly executes stored precompiled code made by a compiler which
is part of the interpreter system
Note that, in principle, almost any language can be compiled or interpreted, it depends on
whether you have (or write) a compiler program or an interpreter one. What does the python
(using CPython on Linux) command do when you execute python somescript.py ?

In [35]: %%bash
cat hello_mod.py
echo "++++++"
cat hello_script.py

def hello():
print("Hello, World!")
++++++
import hello_mod

9
hello_mod.hello()

quit()

In [36]: !python3 ./hello_script.py

Hello, World!

In [37]: !ls */*.pyc

__pycache__/hello_mod.cpython-35.pyc swig/swigmc.pyc

In [140]: #!head __pycache__/*.pyc

the .pyc is a bytecode, so python is in case 2 above. The Python interpreter loads .pyc files before
.py files, so if they’re present, it can save some time by not having to re-compile the Python source
code.
But how this affects execution speed?
Compared to an interpreter, a good compiler can look ahead and optimize the code (remove re-
dundant operation, unrolling small loops). This may be guided by user selected switches, yielding
a significant speed-up. However, compilation may be a difficult task by itself, requiring knowl-
edge of the platform and compiler being used.
The Section ?? module allows you to dissassemble your Python bytecode.

4.0.4 Locality
One of the key features of Numpy is the locality of data that allows to access to rows and columns
as C (or Fortran) plus the decoration that make possible all the fancy stuff. The values in the array
are contiguous and have a common size.

In [39]: Image(filename="array.png")

Out[39]:

10
A list object in CPython is represented by the following C structure. ob_item is an array of
pointers to the list elements. allocated is the number of slots allocated in memory

typedef struct {
PyObject_VAR_HEAD
PyObject **ob_item;
Py_ssize_t allocated;
} PyListObject;

each element in the Python list is found by a pointer to a buffer of pointers, each of which
points to a Python object which encapsulates its own data. Thus, operations like append or pop are
cheap but running over all the elements in the list can be costly since it involves a great deal of
referencing operations and (likely) a lot of copies to and from the memory.

5 Profiling Python
In software engineering, program profiling, software profiling or simply profiling, a form of dy-
namic program analysis (as opposed to static code analysis), is the investigation of a program’s
behavior using information gathered as the program executes. The usual purpose of this analysis
is to determine which sections of a program to optimize - to increase its overall speed, decrease its
memory requirement or sometimes both.
IPython provides an interface to the cProfile Python module, using the %prun magic com-
mand which analyses the time spent in each call of Python block:

In [40]: from numpy.linalg import eigvals


def run_experiment(niter=100):
K = 100

11
results = []
for _ in range(niter):
mat = np.random.randn(K, K)
max_eigenvalue = np.abs(eigvals(mat)).max()
results.append(max_eigenvalue)
return results

In [41]: %prun -l 5 run_experiment()

4004 function calls in 0.684 seconds

Ordered by: internal time List reduced from 33 to 5 due to restriction <5>
ncalls tottime percall cumtime percall filename:lineno(function) 100 0.629 0.006 0.635 0.006
linalg.py:832(eigvals) 100 0.047 0.000 0.047 0.000 {method ’randn’ of ’mtrand.RandomState’ ob-
jects} 300 0.002 0.000 0.002 0.000 {method ’reduce’ of ’numpy.ufunc’ objects} 100 0.002 0.000 0.003
0.000 linalg.py:214(_assertFinite) 1 0.001 0.001 0.684 0.684 :2(run_experiment)
where the various lines are:

• ncalls: number of calls

• tottime: total time spent in a given function (excluding calls to sub-functions)

• percall: tottime/ncalls

• cumtime: total time spent, including calls

• percall: cumtime/primitive calls

Note the use of _ in the above loop. It is a convention for a throwaway variable that can be
discarded as in

In [42]: var1, var2, _ = ("useful", "veryuseful","pointless")

6 Using all those cores


We speak of parallel computing whenever a number of compute elements (cores or CPUs) solve a
problem in a cooperative way. All current computers, from laptops to supercomputer architec-
tures are now based on multicore architectures, meaning that they have several to a many compute
elements whose capability can be partially shared to solve a task.
There are two fundamental reasons that may justify an effort in learning parallel computing:

1. Using a single core makes your calculation too slow; here slow may vary from “over lunch” to
"before dissertation". Using multiple cores speeds up your calculation (we’ll see an example
of that).

2. Your main memory per core/CPU is not enough: bigger problems (with more complicated
physics, more particles, etc.) may be solved using memory from multiple CPUs (not covered
here).

12
Example: matrix-matrix multiplication is time intensive for big (dense) matrices. However,
each row-column dot product is independent from each other and so can be given to a core with-
out the need to communicate between cores mid-task. This can be done by generating multiple
processes (as the MPI libraries do) or multiple threads (as pthreads in Unix does). In the following
we will take just a glimpse of threads. But what is a process or thread?

Processes A process is a set of independent executions that run in a memory space separated
from other processes. It has a private virtual address space, environment variables and OS identi-
fiers. A process may split its sequence of executions in one or more threads.

Threads A thread is a sequence of executions that be scheduled in a process. Multiple threads


belonging to the same process will share the same environment variables, OS indentifier (PID in
GNU/Linux) and virtual address space.

6.0.1 Race conditions


When you execute a script, your computer will allocate some memory for the process (see top
command). Each program is isolated from the others (normally). Inside a process there can be
multiple "threads" of execution. These threads share the underlying memory of the process, and
can each be assigned to different cores.
However, having shared memory can cause problems. For example, imagine trying to sum a
trillion numbers and saving it to a local variable "x". At some point, one of your lines of code will
look something like x = x + item. However, if multiple threads grab the old value of "x" at the
same time, add some number to it, and try to update the variable, you will get mistakes called
"race conditions" which will cause you to get an incorrect sum. What’s even worse is that the
simpler the task, the higher the probability of these happening due to the frequency that these
variables are being addressed or referenced.

In [43]: Image(filename="thread_en.png")

Out[43]:

13
6.0.2 OpenMP
From wikipedia:
Section ?? is an application programming interface (API) that supports multi-platform shared
memory multiprocessing programming in C, C++, and Fortran, on most platforms, processor ar-
chitectures and operating systems, including Solaris, AIX, HP-UX, Linux, OS X, and Windows. It
consists of a set of compiler directives, library routines, and environment variables that influence
run-time behavior.
GCC supports OpenMP with the -fopenmp switch. Note that many versions of
BLAS/LAPACK use OpenMP multithreading and should do Numpy/Scipy wrappers when com-
piled against OpenMP enabled libraries.
OpenMP allows to parallelize CPU intensive blocks of code (for loops) with very little effort
using pre-processor directives; the do/while loop in calcPI above will be split in sub-loops on each
core that will hold a fraction of the estimation of π:

#pragma omp parallel for private(ix,iy,x,y,r2) reduction(+:inside)


for(i=0;i<iter;i++){
...
}

but how many cores do we have?

In [44]: !ls /sys/devices/system/cpu/

cpu0 cpu3 cpu6 cpuidle kernel_max offline power


cpu1 cpu4 cpu7 intel_pstate microcode online present

14
cpu2 cpu5 cpufreq isolated modalias possible uevent

You can check the load on each core with htop (or using a desktop applet or a utility such as
conky)

In [45]: !cat pi_omp.c

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>

int main (int argc, char **argv) {

int num_steps = argv[1] != NULL ? atoi(argv[1]) : 1E6;


int i;

double step, x, sum=.0;

step = 1.0/(double) num_steps;

#pragma omp parallel for private(x) reduction(+:sum)


for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}

printf("PI: %f\n",step * sum);

return 0;
}

In [46]: %%bash
gcc -Wall -O3 -fopenmp -o pi_omp pi_omp.c

In [47]: bigN = int(1e8)


global bigN

In [48]: %%bash -s $bigN


export OMP_NUM_THREADS=1
./pi_omp $1

PI: 3.141593

The OMP_NUM_THREADS environment variable sets the number of threads to use. We can
see how the speed up scales with the number of threads.

15
In [49]: %%capture t1
%%timeit
%%bash -s $bigN
export OMP_NUM_THREADS=1
./pi_omp $1 >& /dev/null

In [50]: %%capture t2
%%timeit
%%bash -s $bigN
export OMP_NUM_THREADS=2
./pi_omp $1 >& /dev/null

In [51]: %%capture t4
%%timeit
%%bash -s $bigN
export OMP_NUM_THREADS=4
./pi_omp $1 >& /dev/null

In [52]: %%capture t6
%%timeit
%%bash -s $bigN
export OMP_NUM_THREADS=6
./pi_omp $1 >& /dev/null

In [53]: %%capture t8
%%timeit
%%bash -s $bigN
export OMP_NUM_THREADS=8
./pi_omp $1 >& /dev/null

In [54]: pat = r'(.*):\s(.*)\sms'


results = [search(pat,str(i)).group(2) for i in (t1,t2,t4,t6,t8)]
x = np.array((1,2,4,6,8))
y = np.array(list(map(float,results)))
plt.plot(x,y,marker='s',color='k',ls='-',label="OpenMP test")
plt.xlabel("Number of cores")
plt.ylabel("Elapsed time (ms)")

Out[54]: <matplotlib.text.Text at 0x7f23db584f28>

16
between 6 and 8 cores the speed-up is very low and the code is hitting a barrier (at least, where
I run the nb)

7 Cython
In [55]: Image(filename="cy_logo.png")

Out[55]:

The fundamental nature of Cython can be summed up as follows: Cython is Python with C
data types. This means that Cython can handle at the same type:

• almost any native Python code

17
• intermixed C and Python variable and commands

what happens: the Cython compiler reads in a .pxy file and produces a .c file; the C file is
compiled; the resulting module is linked against the CPython library, and used by the interpreter
More in detail (from the documentation):
Cython is an optimising static compiler for both the Python programming language and the extended
Cython programming language. The Cython language is a superset of the Python language that addition-
ally supports calling C functions and declaring C types on variables and class attributes. This allows the
compiler to generate very efficient C code from Cython code. The C code is generated once and then compiles
with all major C/C++ compilers.
So, that’s your world without Cython’

In [56]: Image(filename="wo_cython.png")

Out[56]:

With Cython:

In [57]: Image(filename="wcy.png")

Out[57]:

18
7.0.1 Summing integers
What if we want to use Cython? Generally speaking, this should include a compilation step,
converting the Cython code to C. In the following, we will take advantage of the %%cython cell
magic to pass compilation options to cython.

In [58]: %%cython
def mymultiply(int a,int b):
cdef int c = a*b
return c
try:
mymultiply("w","q")
except TypeError as e:
print(e)
finally:
print(mymultiply(3.,4.))

an integer is required
12

cdef may be used to declare C types and structures (and also union and enum types). cdef
may also be used to declare functions and classes (known as extensions, behave like a builtin) with

19
C attributes which are then callable from C. cpdef puts a Python wrapper around a C function
definition and makes it callable from C and Python.

• def func(int x):


caller passes Python objects for x
function converts to C int on entry
implicit return type always object

• cdef int func(int x):


caller converts arguments as required
function receives C int for x
arbitrary return type, defaults to object

• cpdef int func(int x):


a C version of the function is generated as with cdef
a Python version is also generated and available to the interpreter (converting Ctypes to objects when
returning functions)

In [59]: bigN = int(1e6)

In [60]: %%timeit
%%cython
n = int(1e6)
cdef int j = 0
cdef int i
for i in range(n):
j += i

The slowest run took 27.66 times longer than the fastest. This could mean that an intermediate r
1000 loops, best of 3: 171 µs per loop

In [61]: %%timeit
j = 0
for i in range(bigN): j += i

10 loops, best of 3: 60.3 ms per loop

Note that C variables are not python objects, i. e. they are typed:

In [62]: %%cython
cdef int n
n = 10.

Error compiling Cython file:


------------------------------------------------------------
...
cdef int n

20
n = 10.

------------------------------------------------------------

/home/gmancini/.cache/ipython/cython/_cython_magic_45136ee03b41a20bcca5f6fda90449af.pyx:2:4: Can

raises an error, while

In [63]: %%cython
cdef double d
d = 10
does an automatic casting

7.0.2 Calculate π, Cython version(s)


In [64]: %%cython
def cy_simple_pi(int niter=int(1e6)):
"""
another version of arctg integration
using Cython
"""

cdef double s, mysum=.0, step=1./niter


cdef int i=0

#everything down here is a C variable


for i in range(niter):
x = (i+0.5)*step
mysum += 4./(x*x + 1.)

return step*mysum
In [65]: %%bash
gcc -O3 -ffast-math -pipe simple_pi.c -o cpi
gfortran -O3 -ffast-math -pipe simple_pi.f -o fpi
In [66]: %%capture cypi
%timeit cy_simple_pi(bigN)
In [67]: %%capture simplepi
%timeit simple_pi(bigN)
In [68]: %%capture num_pi
%timeit numpy_pi(bigN)
In [69]: %%capture f95_pi
%%timeit
%%bash -s $bigN
./fpi $1 >& /dev/null

21
In [70]: %%capture c_pi
%%timeit
%%bash -s $bigN
./cpi $1 >& /dev/null

In [71]: pat = r'(.*):\s(.*)\sms'


times = [search(pat,str(i)).group(2) for i in (simplepi,num_pi,f95_pi,c_pi,cypi)]
times = list(map(float,times))
scale = 200
coloUr = ("k","b","r","g","m")
for point,method in enumerate(("Pure Python","Numpy","F77","C","Cython")):
plt.scatter(point+1,times[point],label=method,c=coloUr[point],s=scale,edgecolor=Non
plt.legend()
plt.ylabel("Elapsed time (ms)")

Out[71]: <matplotlib.text.Text at 0x7f23b4e90240>

Which seems to imply that if i is declared as a cdef integer type, it will optimise this into a pure C
loop. Remember the cost of opening a BASH subprocess.

In [74]: %%cython -a
def cy_simple_pi(int niter=int(1e6)):
"""
another version of arctg integration
using Cython
"""

22
cdef double s, mysum=.0, step=1./niter
cdef int i=0

#everything down here is a C variable


for i in range(niter):
x = (i+0.5)*step
mysum += 4./(x*x + 1.)

return step*mysum

Out[74]: <IPython.core.display.HTML object>

Invoking Cython (had we built a module using python setup.py ...) generates about 1000 lines
of C code which is then compiled and used when running cpi.cpi(). I.e. Cython has generate a
valid C code which exposes itself to Python and is not so bad compared to C.
Let’s see what happens with MonteCarlo

In [75]: def pi_mc_py(niter):

i = 0
inside = 0
for i in range(niter):
x = 2.*random.random() - 1
y = 2.*random.random() - 1
r = x*x + y*y
if r<=1:
inside+=1
i+=1

return 4.0*inside / niter

In [76]: %%capture mcpy


%timeit pi_mc_py(bigN)

In [77]: def pi_mc_numpy(npoints):


x = 2.*np.random.rand(npoints)-1
y = 2.*np.random.rand(npoints)-1
r = x**2+y**2
return 4.*(r[r<=1]).shape[0]/npoints

In [78]: %%capture npmc


%timeit pi_mc_numpy(bigN)

In [79]: %%bash
gcc -O3 -Wall -ffast-math -pipe -o mc_cpi pimc.c

In [80]: %%capture mcc


%%timeit
%%bash -s $bigN
./mc_cpi $1 >& /dev/null

23
In [81]: %%cython
import random
def pi_mc_cy(int niter=int(1e6)):
"""
another version of MonteCarlo integration using Cython
"""

cdef double x, y, r, PI
cdef int i=0, inside = 0

#everything down here is a C variable


for i in range(niter):
x = 2.*random.random() - 1
y = 2.*random.random() - 1
r = x*x + y*y
if r<=1:
inside+=1
i+=1

return 4.0*inside / niter

In [82]: %%capture mccy


%timeit pi_mc_cy(bigN)

In [83]: times = [search(pat,str(i)).group(2) for i in (mcpy,npmc,mcc,mccy)]


times = list(map(float,times))
scale = 200
coloUr = ("k","b","r","g")
for point,method in enumerate(("Pure Python","Numpy","C","Cython")):
plt.scatter(point+1,times[point],label=method,c=coloUr[point],s=scale,edgecolor=Non
plt.legend()
plt.ylabel("Elapsed time (ms)")

Out[83]: <matplotlib.text.Text at 0x7f23b4e07e80>

24
To sum up, just using %%cython and cdef yields performance an order of magnitude less than
C, but we have Numpy not pure Python. It is possible to use Numpy arrays and Cython?

Numpy arrays and memoryviews

In [84]: %%cython
import numpy as np
def pi_mc_np_cy(int niter=10000000):
"""
another version of MonteCarlo integration using Cython
"""
cdef double PI,inside

x = 2.*np.random.rand(niter)-1
y = 2.*np.random.rand(niter)-1
r = x*x + y*y
r = r[r<=1.]
inside = r.shape[0]

return 4.0*inside/niter

In [85]: %timeit pi_mc_np_cy()

1 loop, best of 3: 388 ms per loop

25
Why is that? The reason is that working with NumPy arrays incurs substantial Python over-
heads. We can do better by using Cython’s typed memoryviews, which provide more direct access
to arrays in memory. Like a standard numpy array view (e. g. a slice object) a memoryview stores
information on a memory location without holding it. Being typed, the cython compiler can access
it as it would with a standrd C array, avoiding interpreter overhead.
When using them, the first step is to create a NumPy array and then declare a memoryview
and bind it to the NumPy array.

In [86]: %%cython
import numpy as np
from numpy cimport float_t
def pi_mc_np_cy_mv(int niter=10000000):
"""
another version of MonteCarlo integration
using Cython and typed memoryviews
"""
cdef double PI, inside

x = 2.*np.random.rand(niter)-1
y = 2.*np.random.rand(niter)-1
cdef float_t [:] X = x
cdef float_t [:] Y = y

for i in range(niter):
if X[i]*X[i] + Y[i]*Y[i]<=1.:
inside += 1.

return 4.*inside/niter

In [87]: %%capture mv
%timeit pi_mc_np_cy_mv(bigN)

In [88]: times.append(search(pat,str(mv)).group(2))
print(times)
scale = 200
colour = ("k","b","r","g","c")
for point,method in enumerate(("Pure Python","Numpy","C","Cython","MemoryViews")):
plt.scatter(point+1,times[point],label=method,c=colour[point],s=scale,edgecolor=Non
plt.legend()
plt.ylabel("Elapsed time (ms)")

[644.0, 39.9, 27.8, 221.0, '34.1']

Out[88]: <matplotlib.text.Text at 0x7f23b4985588>

26
Note that memoryviews supports a number of operations, including copy, in analogy with
arrays: new_mv[:] = old_mv
or new_mv[...] = old_mv for all dimensions

In [94]: %%cython
cdef extern from "string.h":
int strlen(char *c)
def get_len(char *message):
return strlen(message)

In [95]: get_len(bytearray("www",encoding="ascii"))

Out[95]: 3

7.0.3 The Global Interpreter Lock


Python can use threads, but they can’t be used to increase performance: only one Python instruc-
tion is allowed to run at any time. This is the Global Interpreter Lock (GIL). Each time a thread
executes a Python statement, a lock is acquired which prevents other threads to run until it is re-
leased. The GIL avoids conflicts between threads, simplifying the implementation of the CPython
(Jython and IronPython have no GIL) interpreter. Despite this limitation, threads can still be used
to provide concurrency in situations where the lock can be released, such as in time-consuming
I/O operations or in C extensions.

Release the GIL and using OpenMP Cython supports native parallelism with OpenMP; it is
also possible to MPI (e.g. with mpi4py) and it should be possible to exploit Cython to interface

27
with C code using MPI or pthreads (not easy). To use this kind of parallelism, we must release the
GIL. Note that the GIL is released whenever using low-level C-code (Numpy comes to mind).
To use OpenMP within Cython you have to:

1. Look into the cython.parallel module

2. Release the GIL before a block of code: with nogil: # This block of code is executed after
releasing the GIL

with gil:
\#this block of code will enable the GIL in a no-GIL context (kind of omp atomic or serial)

3. Parallelize for-loops with prange: from cython.parallel import prange

Prange is a (sort of) short-hand Cython equivalent for:

#pragma omp parallel for private(i)


for(i=1;i<N;i++){
...
}

Note that any in-place operation is automatically taken to be a reduction variable, which means
that the thread local values are combined after all threads have completed. Further, the index
variable is always lastprivate, i.e. it will hold the value of the last.

In [96]: %%cython --compile-args=-fopenmp --compile-args=-O3 --compile-args=-pipe --compile-args

from cython.parallel import prange

def pi_cy_omp(int niter=int(1e8),int NT=1):


"""
another version of arctg integration using Cython and realising the GIL
"""
cdef double x, mysum=.0, step=1./niter
cdef int i
# this is the equivalent of #pragma omp or !$OMP
for i in prange(niter,nogil=True,schedule='static',num_threads=NT):
x = (i+0.5)*step
mysum += 4.0/(x*x + 1.0)

return mysum*step

In [97]: pi_cy_omp(int(1e8))

Out[97]: 3.1415926535904264

In [98]: %%capture t1
%timeit pi_cy_omp(int(1e8),1)

28
In [99]: %%capture t2
%timeit pi_cy_omp(int(1e8),2)
In [100]: %%capture t4
%timeit pi_cy_omp(int(1e8),4)
In [101]: %%capture t6
%timeit pi_cy_omp(int(1e8),6)
In [102]: %%capture t8
%timeit pi_cy_omp(int(1e8),8)
In [104]: x = np.array((1,2,4,6,8))
pat = r'(.*):\s(.*)\s(s|ms)'
cy = list()
for i in (t1,t2,t4,t6,t8):
res = search(pat,str(i))
cy.append(float(res.group(2)))
if res.group(3) is "s":
cy[-1] = cy[-1]*1e3
cy = list(map(float,cy))
plt.plot(x,results,marker='s',color='k',ls='-',label="OpenMP test")
plt.plot(x,cy,marker='o',color='r',ls=':',label="Cython OpenMP test")
plt.legend()
plt.xlabel("Number of cores")
plt.ylabel("Elapsed time (ms)")
cy
Out[104]: [2660.0, 1370.0, 775.0, 529.0, 399.0]

29
8 Exercise
Section ??: given a diagonally dominant matrix A and a vector b, solve:

Ax = b
by iterating:
(k)
( k +1)
bi − ∑i̸= j Aij x j
xi =
Aii
.
Use Python/Numpy to create a random matrix A such as that Aii >= 2. ∗ ∑ j̸=i A j and a ran-
dom vector b; then implement the Jacobi solver using:

1. a pure CPython solution


2. a Numpy one
3. a Cython one
4. a Cython + Numpy (+ OpenMP optionally) one

You can test that the algorithm converges by calculating every (nth) step:

∑ ( xi
( k +1) (k)
conv = − x i )2
i

and comparing against a preset tolerance; also you may test if it yields the correct result at the
end of the loop by computing the error as:

erri = bi − ∑ Aij
j

and

∑i erri2
err =
n
Possible testing parameters can be:

• dimension of x and b 1000


• tolerance int(1e5)

8.0.1 Solutions
Pure Python Version

In [105]: def gen_jac(dim=1000,tol=1e-10):


"""
generate starting data for Jacobi solver exercise
"""

30
A = np.random.rand(dim*dim) + tol
A.shape = (dim,dim)
b = np.random.rand(dim) + tol
x_0 = np.random.rand(dim) + tol
for i in range(dim):
A[i,i] = 1.5*(np.sum(A[i,:i])+np.sum(A[i,i:]))
return A,b,x_0

def solv_jac(A,b,xold,tol=1e-8,kmax=100):
"""
Jacobi solver
"""
dim = A.shape[0]
xnew = np.ones(dim)
k = 0
conv = kmax

while (k<kmax and conv>tol):


for i in range(dim):
xtmp = .0
for j in range(dim):
if i != j:
xtmp += xold[j] * A[i,j]
xnew[i] = (b[i] - xtmp)/A[i,i]
conv = .0
for i in range(dim):
conv += (xnew[i] - xold[i])**2

xold = np.copy(xnew)
k += 1

return k,conv,xnew

def test_jac(A,b,x,debug=False):
"""
test Jacobi solver
"""
dim = A.shape[0]

mysum = np.zeros(dim)
for i in range(dim):
mysum[i] = b[i] - np.sum(A[i,:]*x)

err = math.sqrt(np.sum(mysum**2)/dim)
if not debug:
return err
else:
return err,mysum

31
In [106]: A,b,x_0 = gen_jac(1000)

In [107]: nk, conv, xnew = solv_jac(A,b,x_0)


err = test_jac(A,b,xnew)
print(nk,err,conv)

32 0.0013954521625608946 7.78475077906e-09

In [108]: %timeit solv_jac(A,b,x_0)

1 loop, best of 3: 19.2 s per loop

Numpy Version

In [109]: def solv_jac_np(A,b,xold,tol=1e-8,kmax=100):


"""
Numpy Jacobi solver
"""
dim = A.shape[0]
xnew = np.empty(dim)

k = 0
conv = kmax

D = np.zeros((dim,dim))
d = np.diag(A)
np.fill_diagonal(D,d)
R = A - D

while (k<kmax and conv>tol):

xnew = (b - np.dot(xold,R))/d
conv = np.sum((xnew - xold)**2)

xold = np.copy(xnew)
k += 1

return k,conv,xnew

In [110]: A,b,x_0 = gen_jac(10000)

In [111]: nk, conv, xnew = solv_jac_np(A,b,x_0)


err = test_jac(A,b,xnew)
print(nk,err,conv)

35 0.004884897156685974 7.44462729776e-09

32
In [112]: %timeit solv_jac_np(A,b,x_0)
1 loop, best of 3: 3.16 s per loop

Cython version
In [113]: %%cython
import numpy as np
from numpy cimport float_t
def solv_jac_cy(A,b,xinit,double tol=1e-8,int kmax=100):
"""
Cython Jacobi solver
"""
xold = np.copy(xinit)

cdef int dim = A.shape[0]


cdef int k = 0, i, j
cdef double conv = kmax, xtmp

cdef float_t [:] bb = b


cdef float_t [:] Xold = xold
cdef float_t [:] Xnew = np.ones(dim)
cdef float_t [:,:] AA = A

while (k<kmax and conv>tol):


conv = .0
for i in range(dim):
xtmp = 0.
for j in range(i):
xtmp += Xold[j] * AA[i,j]
for j in range(i+1,dim):
xtmp += Xold[j] * AA[i,j]
Xnew[i] = (bb[i] - xtmp)/AA[i,i]
conv += (Xnew[i] - Xold[i])*(Xnew[i] - Xold[i])

Xold[...] = Xnew
k += 1
return k,conv,np.array(Xnew)
In [114]: nk, conv, xnew = solv_jac_cy(A,b,x_0)
err = test_jac(A,b,xnew)
print(nk,err,conv)
35 0.004313748271923786 7.444846214863868e-09

In [115]: %timeit solv_jac_np(A,b,x_0)


1 loop, best of 3: 3.11 s per loop

33
Cython/Numpy version
In [116]: %%cython --compile-args=-fopenmp --link-args=-fopenmp

import numpy as np
from cython.parallel import prange
import cython
from numpy cimport float_t

@cython.cdivision(True)
@cython.boundscheck(False)
def solv_jac_cy_np(A,b,xinit,int NT,double tol=1e-8,int kmax=100):
"""
Cython/Numpy Jacobi solver using OpenMP
"""
xold = np.copy(xinit)

cdef int dim = A.shape[0]


cdef int k = 0, i, j
cdef double conv = kmax, xtmp

cdef float_t [:] bb = b


cdef float_t [:] Xold = xold
cdef float_t [:] Xnew = np.ones(dim)
cdef float_t [:,:] AA = A

while (k<kmax and conv>tol):


conv = .0
for i in prange(dim,nogil=True,num_threads=NT,schedule="static"):
xtmp = 0.
for j in range(i):
xtmp = xtmp + Xold[j] * AA[i,j]
for j in range(i+1,dim):
xtmp = xtmp + Xold[j] * AA[i,j]
Xnew[i] = (bb[i] - xtmp)/AA[i,i]
conv += (Xnew[i] - Xold[i])*(Xnew[i] - Xold[i])
Xold[...] = Xnew
k += 1

return k,conv,np.array(Xnew)

In [117]: nk, conv, xnew = solv_jac_cy_np(A,b,x_0,2)


err = test_jac(A,b,xnew)
print(nk,err,conv)

35 0.004313748271923786 7.444846214863865e-09

In [118]: %timeit solv_jac_np(A,b,x_0,2)

34
1 loop, best of 3: 1.25 s per loop

In [119]: %timeit solv_jac_np(A,b,x_0,4)

1 loop, best of 3: 1.16 s per loop

In [120]: %timeit solv_jac_np(A,b,x_0,6)

1 loop, best of 3: 1.09 s per loop

In [121]: %timeit solv_jac_np(A,b,x_0,8)

1 loop, best of 3: 1.08 s per loop

8.0.2 Other relevant topics (not covered)


1. Profile code line by line
2. Dissect a code that is not your own
3. Limit memory footprint
4. How to build a module
5. numexpr
6. Pyximport
7. Using classes with Cython
8. Cython and C++
9. You can write Python extensions directly in C using the API

9 For those about Fortran: F2PY


The purpose of the Section ?? –Fortran to Python interface generator– project is to provide a con-
nection between Python and Fortran languages. F2PY is a Python package (with a command
line tool, f2py and a module, f2py2e that facilitates creating/building Python C/API extension
modules.
Let’s write a Fortran subroutine

In [122]: !cat jac_solv.f95

module fjac
contains
subroutine jacsolv(kmax,A,b,xold,conv,xnew,order)
implicit none

integer :: order
real(8), dimension(0:order-1,0:order-1) :: A
real(8), dimension(0:order-1) :: xold,b
real(8), dimension(0:order-1) :: xnew

35
!f2py intent(in) order
!f2py intent(out) xnew
!f2py depend(order) xnew

integer :: i, k, kmax
!f2py intent(in,out) kmax
real(8) :: conv, tol
!f2py intent(out) conv
real(8), dimension(0:order-1) :: dd, xtmp
real(8), dimension(0:order-1, 0:order-1) :: R, D

conv = kmax
tol = 1e-8

xtmp = xold
xnew = 1.
forall (i=0:size(dd)-1) D(i,i) = A(i,i)
forall (i=0:size(dd)-1) dd(i) = 1./A(i,i)
R = A-D

do k=1,kmax
conv = 0.
do i = 0,order-1
xnew(i) = (b(i) - DOT_PRODUCT(xtmp,R(i,:)) )*dd(i)
end do
conv = SUM((xnew-xtmp)*(xnew-xtmp))

if (conv <= tol) then


exit
end if

xtmp = xnew

end do

kmax = k

end subroutine
end module fjac

This is quite standard Fortran 95 code, with the exception of the !f2py lines which are prepro-
cessor directives (like the #pragma ones).
Instead of directly calling gfortran we use f2py3 to generate the appropriate wrapping code
and then compile:

In [123]: %%capture f2py_out


! f2py3.5 -c --fcompiler=gnu95 --compiler=unix jac_solv.f95 -m fjac

36
In [124]: !ls -rt

hello_script.py __pycache__
hello_mod.py jac_solv.f95
thread_en.png swig
simple_pi.f 08_MixedProgramming.slides.html
simple_pi.c pi_omp
pi_omp.c cpi
wcy.png fpi
wo_cython.png mc_cpi
cy_logo.png Untitled.ipynb
box.png 08_MixedProgramming.ipynb
array.png fjac.cpython-35m-x86_64-linux-gnu.so
pimc.c

In [125]: import fjac

In [126]: fjac.__doc__

Out[126]: "This module 'fjac' is auto-generated with f2py (version:2).\nFunctions:\nFortran 90/9

In [127]: A.shape[0]

Out[127]: 10000

In [128]: nk, conv, xnew = fjac.fjac(100,A,b,x_0,A.shape[0])


print(nk,conv)

35 7.444846214863856e-09

In [129]: %timeit nk, conv, xnew = fjac.fjac(100,A,b,x_0,A.shape[0])

1 loop, best of 3: 20.4 s per loop

10 The simple Wrapper Interface Generator (SWIG)


From the documentation:
Section ?? is a software development tool that connects programs written in C and C++ with
a variety of high-level programming languages. SWIG is used with different types of target lan-
guages including common scripting languages such as Javascript, Perl, PHP, Python, Tcl and Ruby.
When using SWIG you have to create an interface file (.i) which is similar to a header file and
tells to the swig program how to relate Python objects to C/C++ types in the module. The .i file is
used in the compilation step to produce C/C++ files encapsulating C/C++ files for the high level
language.
Even if "less flexible" than Cython or F2py is also a more general and sometimes easier tool
since to use and also let you to interface the same low-level code with multiple interfaces.
A simple example:

37
%module testswig
%{
#include "test.h"
%}
%include "test.h"

the %module specifies the name of the module to be generated from this wrapper file. The
code between the %{ %} is placed, verbatim, in the C output file.
An interface for SWIG and Numpy Section ?? and can be now found Section ??. To estimate π
using MonteCarlo by passing Numpy arrays to an underlying C function we need:

1. a SWIG interface file (.i)


2. a C module
3. a Makefile

In [130]: %cd swig

/home/gmancini/Dropbox/Calcolo/08_Mixed/swig

In [131]: !cat swigmc.c

#include <stdio.h>
#include <stdlib.h>
#include "swigmc.h"

double calcpi(double *x, double *y, int npoints){

int i, inside=0;
double r2, pi;

for(i=0;i<npoints;i++){
r2 = x[i]*x[i] + y[i]*y[i];

if(r2<=1.0){
inside++;
}
}

pi = (4.0* ((double) inside))/ ((double) npoints);


printf("%f %f\n",x[0],y[0]);
return pi;
}

In [134]: !cat swigmc.h

double calcpi(double *x, double *y, int npoints);

38
In [135]: !cat swigmc.i

%module swigmc
%{
#define SWIG_FILE_WITH_INIT
#include <stdio.h>
#include <stdlib.h>
#include "swigmc.h"
%}

%include "numpy.i"

%init %{
import_array();
%}

%apply (int DIM1, double* IN_ARRAY1) {(int npx, double* x)};


%apply (int DIM1, double* IN_ARRAY1) {(int npy, double* y)};

%inline %{
double swigpi(int npx, double* x, int npy, double* y){
printf("%f %f\n",x[0],y[0]);
double PI;
PI = calcpi(x,y,npx);
return PI;
}
%}

In [136]: !cat Makefile

#Makefile for xdrlib with python SWIG interface


#Giordano Mancini Sept 2012

# Set compiler flags


CC = gcc
CFLAGS = -O3 -ffast-math -pipe -fPIC
F77 = gfortran
FFLAGS = $(CFLAGS)

SHELL = /bin/sh

OBJECTS = swigmc.o
INCLUDE = swigmc.h

CPPFLAGS = -I/usr/include/python2.7 -I/usr/lib/python2.7/site-packages/numpy/core/include/

SWIG = swig

39
SWIGOPT = -python -Wall
SWIGOBJS = swigmc_wrap.o

all: clean swigmc

swigmc: $(OBJECTS) $(SWIGOBJS)


$(CC) -shared -o _swigmc.so $(OBJECTS) $(SWIGOBJS)

swigmc_wrap.c: swigmc.i
$(SWIG) $(SWIGOPT) swigmc.i

swigmc_wrap.o: swigmc_wrap.c
$(CC) $(CFLAGS) $(CPPFLAGS) -c swigmc.c swigmc_wrap.c

clean:
rm -f *.so *.o *.pyc swigmc_wrap.c *py *gch

.SUFFIXES : .c .h .o

.c.o:
$(CC) $(INCLUDE) $(CFLAGS) -c $*.c

In [137]: !make clean && make

rm -f *.so *.o *.pyc swigmc_wrap.c *py *gch


rm -f *.so *.o *.pyc swigmc_wrap.c *py *gch
gcc swigmc.h -O3 -ffast-math -pipe -fPIC -c swigmc.c
swig -python -Wall swigmc.i
gcc -O3 -ffast-math -pipe -fPIC -I/usr/include/python2.7 -I/usr/lib/python2.7/site-packages/nump
In file included from /usr/include/python2.7/numpy/ndarraytypes.h:1777:0,
from /usr/include/python2.7/numpy/ndarrayobject.h:18,
from /usr/include/python2.7/numpy/arrayobject.h:4,
from swigmc_wrap.c:3029:
/usr/include/python2.7/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated
#warning "Using deprecated NumPy API, disable it by " \

gcc -shared -o _swigmc.so swigmc.o swigmc_wrap.o

In [138]: !cat test.PY

import numpy as np
import swigmc

npoints = int(1e7)
x = 2.*np.random.rand(npoints)-1
y = 2.*np.random.rand(npoints)-1
print(x[0],y[0])

40
pi = swigmc.swigpi(x,y)
print(pi)

In [139]: %%bash
#module load python-2.7
python test.PY

(-0.89016441899965737, -0.060196036224869243)
-0.890164 -0.060196
-0.890164 -0.060196
3.1418348

41

You might also like