## Jupyter Snippet P4M TheNeedForSpeed

Jupyter Snippet P4M TheNeedForSpeed

# Python - The Need for Speed

Python was originally developed by Guido van Rossum as a high level general purpose scripting language in the 1990s. It’s strength has been in particular that it is easy to learn and that algorithms can be coded quickly (python has sometimes been described as executable pseudo-code). In recent years it has been increasingly used also for scientific computing and mathematics. Here one quickly reaches a point where the time spent waiting for a program to run becomes much longer than the time spent writing the code. Hence the need for speed.

In this tutorial a number of ways of speeding python up will be explored. These will be tried in the context of a simple numerical algorithm. We will start with a simple but slow implementation and then look at what makes it slow and how to speed things up.

## Method 0: Gaussian Elimination - A simplistic implementation

Consider one of the basic mathematical algorithms that most students learn in their first year of university mathematics: solving a set of simultaneous linear equations using the Gaussian Elimination algorithm. Below is the framework for a function to implement this method. Note the use of comments to describe the intent and usage of the function. Preconditions and post-conditions, in the form of assert statements, are used to help with debugging. Including such comments and conditions as a matter of course is a good practice, even for code that you don’t intend to maintain long term, as it can save you a lot of time hunting for bugs or when you invariably end up re-using code for other purposes than originally planned.

While this is not the main point of the exercise, you may also want to think about things such as numerical stability and how to deal with any errors. To keep things simple here we are going to assume that the input arguments are a dense matrix $A$ in the form of a dictionary, right hand side vector $b$ and we return the solution to $A x = b$ as a simple list.

Hint: Here is a pseudo-code of a very basic Gaussian elimination algorithm

INPUT: A,b
U := A  # so the input is not modified
x := b
for i = 1,...,n:  # using 1 indexing here as in typical math. notation
# assuming U_ii != 0 here, could add code to cater for this exception
Divide row i of U by U_ii  (this makes U_ii==1)
for k = i+1,...,n:
subtract U_ki times row i from row k
x_k := x_k - U_ki * x_i
U is now upper triangular
for i = n,n-1,...,1: # back substitution
x_i := (x_i - sum( U_ik * x_k for k=i+1,...,n) ) / U_ii
OUTPUT: x

def gaussSimple(A,b):
"""Solve the set of equations A x = b and return x
Input: b = list of length n, and A = dictionary, with A[i,j] being the entry for every row i, column j
Output: A list of length n giving the solution. (This version throws an assertion failure if there is no solution)
Note: The inputs are not modified by this function."""
n = len(b)
N = range(n)
assert (len(A) == n*n), "Matrix A has wrong size"  # simple check of inputs
assert all( (i,j) in A for i in N for j in N), "Cannot handle sparse matrix A"
U = dict(A.items()) # make a copy of A before we transform it to upper triangular form
x = b[:]  # copy so we don't modify the orignal

## for the most basic version we want to:
## For every row i
##    Eliminate all entries in column i below i by subtracting some multiple of row i
##    Update the right hand side (in x) accordingly
##    return [] if A does not have full rank (we come across a coefficient of zero)
## Back-substitute to replace x with the actual solution

error = max( abs(sum(A[i,j]*x[j] for j in N)-b[i]) for i in N)
assert (error < 1e-5 ), f"Incorrect solution: out by {error}" # check that we have reasonable accuracy
return x



To test this we are going to generate some random data. This will also allow us to see how fast it runs as the size of the problem increases

from random import random
def randProb(n):
"Returns a randomly generated n x n matrix A (as a dictionary) and right hand side vector b (as a list)."
n = int(n)
assert n > 0
N = range(n)
return dict( ((i,j), random()) for i in N for j in N), [random() for i in N]

A,b = randProb(3)
gaussSimple(A,b)


## Execution timing

Now lets see how fast this thing goes. There are a couple of things to think about when we talk about timing:

• Is it elapsed time (wall-clock) time or CPU time we are measuring? The two should be nearly the same if we are using a single threaded program on a computer that is not fully loaded. However, once we have multiple threads in our program, or multiple other programs competing for a limited number of CPUs, the results could be quite different
• Random variation - running the same code several times can result in small random variations in run-time due to a range of factors (cache misses, variable CPU clock speed, computer load, etc) even when the functiuon we are testing is entirely deterministic and run with the same inputs
• Garbage collection: One of the things that make Python convenient is that we generally don’t have to worry about memory management. However if the automated garbage collection happens to cut in during the function we are testing this can make a big difference.

For our tests here we are going to use the timeit module that takes care of the last two points by turning of garbage collection and making repeated tests easy. We will specifically ask it to measure CPU time using the time.process_time() function but you can experiment with using time.perf_counter() function which is the default.

In addition, to get a feeling for how the runtime increases as we double the size of the data, we also print the ratio between successive tests with increasing data. We reduce the random variation between tests of different methods by always generating data with the same random number seed.

Since we are going to make use of the speed test functions repeatedly we are going to write this out as a separate module that we can import into any notebook or python program.

%%writefile 'gausstest.py'
"""This module contains a function for testing the speed of a function which solve Ax=b
usage:   import gausstest
gausstest.speed(myfunction) # optional second argument, maximum size n
Here myfunction takes arguments A,b,"""
import timeit,time,gc
from statistics import mean,stdev
from random import seed,random

def randProb(n):
"Returns a randomly generated n x n matrix A (as a dictionary) and right hand side vector b (as a list)."
n = int(n)
assert n > 0
N = range(n)
return dict( ((i,j), random()) for i in N for j in N), [random() for i in N]

def speed(method,maxSize=400):
seed(123456) # fix some arbitrary seed so we keep generating the same data
randP = lambda n : randProb(n) # add randProb to locals() namespace
prev,n = 0.0, 50
gc.disable()
while n <= maxSize:
gc.collect() # manual cleanout of garbage before each repeat
t = timeit.repeat(stmt="method(A,b)",setup=f"A,b=randP({n})",
timer=time.process_time,repeat=5,number=1,globals=locals())
print("%4d %10.4f σ=%.2f sec" % (n,mean(t),stdev(t)),"(x %.2f)"%(mean(t)/prev) if prev > 0 else "")
prev = mean(t)
n *= 2
gc.enable()


Now lets test our function systematically.

import gausstest
print("Simple test of a single 3-dimensional instance: ",
gaussSimple(*gausstest.randProb(3)),"\n")
# Note that in the above the '*' means we are passing the tuple of outputs
# from randProb as successive arguments to gaussSimple
gausstest.speed(gaussSimple) # full test


### Discussion - Complexity

• What is the theoretical complexity of this algorithm?
• Does the practical performance match the theoretical expectation?
• What can be done to make this implementation better?
• More robust numerically? - (Left as exercise to interested students)
• Faster? - What makes it slow?

## Method 1: changing the data structures

As discussed, part off the reason for the slow time is that python treates every variable and every element of generic data structures like lists and dictionaries, as an “Any” type that could contain anything. That means that in executing every step of the algorithm, the python interpreter has to got through a vast list of “if” statements to check every possible type to determine what actually needs to be done at this point of the code.

In our next version, we are going to replace the dictionary and list structures with array data structures. These are much more like the arrays found in Java/C/Fortran/… just a big chunk of memory with each element containing the same type. The basic usage for this data type is

from array import array
x = array('i',range(10))


This would initialise an array of integers containing the numbers 0,…,9. An array behaves much like a list in python but contains only elements of one basic type (integer, char, float). For our purposes we will want to create array('d',[]) where ’d' stands for double (C style 64-bit floating point number) and [] should be replaced by a suitable initialiser rather than the empty list. A

Write a function gaussArray that uses array data structures for all data types (matrix, right hand side and x). To fit the matrix into a single dimensional array, you need to do a bit of index arithmetic (and use range with appropriate step size). Alternatively (if you prefer) you could use the numpy version of array.

from array import array
#Solution algorithm
def gaussArray(A,b):
"""Solve the set of equations A x = b and return x
Input: b = list of length n, and A = dictionary, with A[i,j] being the entry for every row i, column j
Output: An array of length n giving the solution.
Note: The inputs are not modified by this function."""


import gausstest
gausstest.speed(gaussArray)


### Discussion Question:

Where does the speed-up of this method over the previous one come from? What makes this method quite slow still?

## Method 2: Using numpy

For numeric computation, a “standard” set of libraries is the numpy module and friends. This provides vectors and matrices together with basic linear algebra routines implented largely in C++ and Fortran. These laregy mimick the type of basic functionality built into matlab. Many other python packages are built on top of this so that the numpy data types have become a defacto standard for any kind of scientific computing with python.

We could use numpy to solve our equations directly using the numpy.linalg.solve routine. You may want to try this briefly. However, here we are mainly interested here in seeing whether writing our own algorithm on top of the basic matrix methods is going to be faster.

Basic usage for numpy: For convenience we import numpy as np (a fairly common practice in the python community)

• Create matrices using something like A = np.array( [ [1,2,3],[4,5,6] ] ) to create a 2 x 3 matrix. If all elements are integer this will be int64 type otherwise float64. You can query the dimensions with A.shape
• Matrices can be indexed using A[0,2] == 3 or using slices such as A[0,1:3] == np.array([2,3.0])
• Arithmetic operations are overloaded to do mostly do what you would expect. The only significant exception is multiplication. When multiplying two arrays or matrices together we get the Schur product, that is the result has each of the corresponding elements from the input multiplied together (the operands must have the same dimensions). To get the normal inner product or matrix procut use np.matmul. E.g. np.matmul( np.array([[3,0],[0,2]]), A) or A.dot(x) to get an inner product that is effectively the same as matrix multiply if A is a matrix and x is a vector or matrix (but behaves slightly differently for other types of A & x).
• Matrices can be built up using np.hstack, np.vstack (horizontal & vertical stacking)
• To reshape a numpy array use np.reshape, this is particularly useful to turn 1 dimensional matrices into 2 dimensional matrices. V = np.reshape(v,(len(v),1)) will turn an array v of length n into a n x 1 matrix V.

The task for this method is to write a Gaussian elimination method using matrix multiplications. In a first year maths course you have probably seen that the elementary row operations of Gaussian elimination can be represented by pre-multiplying the matrix we are reducing with a suitable elementary matrix (an identity matrix with one off-diagonal element). So we can rewrite the algorithm to set up a suitable elementary matrix for each row reduction and pre-multiplying our matrix with this. For example for a 5 x 5 matrix $A$, to subtract 3 times the 2nd row from the fourth row we would pre-multiply $A$ by $$E_{r4-3\times r2}=\begin{bmatrix}1& & & &\& 1 & & &\ &&1&&\&-3&&1&\&&&&1\end{bmatrix}$$

There is only one problem: Naive matrix multiplication requires $O(n^3)$ operations. If we just replace out inner loop with such a matrix multiplication, the overal complexity will be $O(n^5)$ - clearly that would be a bad idea. To fix this we are going to use two “tricks”

1. Collapsing all of the elementary matrices into a single square matrix that carries out all of the row reductions below the current element. For “zeroing” out the column below element $(i,i)$ this looks like an identity matrix with non-zero elements only below element $(i,i)$. This means we are only carrying out $O(n)$ matrix multiplications.
2. Using sparse matrices. Since the matrices we are using are mostly zero, we only want to store, and more importantly multipy with, the non-zero entries of the matrix. This reduces the cost of matrix multiplications from $O(n^3)$ to $O(n^2)$, as the sparse matrix only has at most 2 non-zero entries per row.

Note: sparse matrices are found not in numpy itself but in scipy.sparse where there are multiple formats, depending on whether we are storing matrices by row or column or with some (block) diagonal structure. Here it makes sense to use the row based format (as when we are pre-multiplying with our special matrix, each entry in the answer is the inner product of a column of the right hand side with a row of our special matrix). For a compressed row sparse matrix the internal storage representation is:

• An array start of length number of rows + 1, with start[i] containing the first index of non-zero elements of row i (and start[i+1] giving the end)
• An array col of length number of non-zeros where col[j] is the column of the j’th non-zero entry in the matrix
• An array data of length number of non-zeros so that A[i,j] == data[k] if col[k]==j and start[i]<=k<start[i+1]

We can set such a sparse matrix up by either providing the start, col and data arrays explicitly, or in various other formats as described here.

import numpy as np
def npSolve(A,b):
"just to check that numpy really is faster"
# write an implementation that calls np.linalg.solve
gausstest.speed(npSolve,800)

import numpy as np
from scipy.sparse import csr_matrix
def gaussSparse(A,b):
"Implementation of Gaussian elimination with numpy and sparse matrices"
# write an implementation that uses csr_matrix
gausstest.speed(gaussSparse,1600) # this should be fast enough to allow running up to 1600x1600


## Method 3 - Call an external library

At this point you might think - for this task Python is more trouble than it’s worth. Why not just write the critical parts in a real programming language (say C++)? Just use Python to do what it’s good at: pre-processing data, reporting, quick-and-dirty hacking etc. Fortunately, calling an external library from Python is not hard. Here I will provide you a simple C callable library that implements the basic gaussian elimination and it will be your task to interface to this. Similar patterns are also useful if you are using a commercial library (no source code available) or if you inhert some code from a colleague who still codes in Fortran.

For starters here is the C++ code

%%writefile gauss.cpp

extern "C" {  // initial declaration as C style function (don't want C++ function name mangling)
const char *hello(const char *world); // simple test function
double check(const double **A,int i,int j); // returns A[i][j], check if arguments are passed correctly
double gauss(int n,const double **A,const double *b,double *x); // compute x : A x = b, for n x n matrix A
// assumes that x is already allocated in the right size - returns the maximum error
}
#include <stdio.h>
#include <vector>
#include <math.h>
const char*hello(const char *world) { static char buf[1028]; sprintf(buf,"Hello %s",world); return buf; }

double check(const double **A,int i,int j) { // check if arguments are passed correctly
return A[i][j];
}

double gauss(int n,const double **A,const double *b,double *x) // compute x : A x = b, for n x n matrix A & return max error
{
std::vector<std::vector<double> > U(n);
for(int i=0; i<n; ++i){   // copy input data into U
U[i].resize(n+1);
for(int j=0; j<n; ++j) U[i][j]=A[i][j];
U[i][n]=b[i];
}
for(int i=0; i<n; ++i)  // do the row reduction
for(int j=i+1; j<n; ++j){
const double mult = U[j][i]/U[i][i];
U[j][i] = 0;
for(int k=i+1; k<=n; ++k)
U[j][k] -= mult * U[i][k];
}
for(int i=n-1; i>=0; --i){ // back-substitution
x[i] = U[i][n];
for(int j=i+1; j<n; ++j) x[i] -= U[i][j]*x[j];
x[i] /= U[i][i];
}
double error=0;
for(int i=0; i<n; ++i){
double sum=-b[i];
for(int j=0; j<n; ++j) sum += A[i][j]*x[j];
if(fabs(sum)>error) error = fabs(sum);
}
return error;
} // end function gauss()


Lets compile this code. If you are under windows and are having trouble with this, there is a pre-compiled .dll on the website (just put it in the same folder as this notebook)

import subprocess
# run: a simple function to execute a command (like os.system) and capture the output
run = lambda cmd: subprocess.run(cmd.split(),stdout=subprocess.PIPE,stderr=subprocess.STDOUT).stdout.decode("utf-8")
print(
run("g++ -o gauss.so -fPIC -shared -O3 -Wall gauss.cpp -lm"), # compile
run("ls -l gauss.so")) # check the library exists

#### Note: in Jupyter notebooks we could just as easily do this with the 'magic' ! commands below
# This is just a bit of magic go compile the C++ code. If you are doing this on windows you would get a DLL (assuming
# you have a proper windows installation that includes enough of cygwin to make windows look like a proper operating system :-)
!g++ -o gauss.so  -fPIC -shared -O3 -Wall gauss.cpp -lm
# you should now have a file gauss.so in the same directory as the notebook
!ls -l gauss.so  # just checking that the file is here and has the correct timestamp


### Calling an external library

Most of the magic required for interacting with the compiled library can be found in the ctypes module. The key elements are:

• cdll.LoadLibrary("./gauss.so") allows you to load a library. There are some subtle differences between windows (loading a .dll library) and Linux (calling a .so library) but it should work on any system (also Macs). Note: a library can only be loaded once. A second call to LoadLibrary with the same filename will return a link to the same (previously loaded) library. If you change the C++ source code an recompile you need to restart the kernel (or change the filename
• Converting to the correct types: ctypes defines standard ctypes. Many things will convert automatically for example python bytes string to const char *
• When working with C types explicitly you may need to convert. E.g. i=cint(3) and i.value (to go from python int to cint and back again)
• Sometimes we need to specifically set things up correctly. We can use ctypes.c_void_p to get a generic pointer type (void *) or use POINTER(c_int) to create a pointer to an integer (in this case).
• Arrays can be declared directly by multiplying a type by an integer size and using an initaliser list. For example c_int*3 is the same type as int[3] in C/C++. So (c_int*3)() constructs an unintialised array and (c_int * 3)([0,1,2]) will create an initialised array. Just like in C, you can call a function expecting POINTER(c_int) with a an argument of type c_int*3, for example.
• You may need to declare functions including their return type so that python needs how to call the function. For example if you had loaded a library lib containing a function called func then
• lib.func.argtypes = [c_char_p, c_double] says that func(char *, double) is the signature of the arguments
• lib.func.restype = c_char says func returns a char (rather than the default int)
• Alternatively numpy provides a simple way to access the pointer to the underlying memory buffer. If x is a numpy array then x.ctypes.data contains the pointer. See numpy.ndarray.ctypes for more information
• Yet another option is to modify the C code to have a set of simple helper functions for setting up the inputs and querying the outputs with all helper functions just receiving or returning basic values (int or double) that can be passed easily.

Have a go at calling the compiled library. To get started you might want to test that you can load and call simple functions. For example try calling the hello(b"world") function from the library or use the check(A,i,j) function to see if you can pass a matrix and return element A[i][j].

# insert your code here


Now try writing a small wrapper to pass the standard equation data into the C program and return the result as a list

# solution
from ctypes import *

def gaussC(A,b):
"Solve by using the gauss() function from gauss.so/gauss.dll"


import gausstest
gausstest.speed(gaussC,1600) # how fast will this go?


### Discussion

Why is C fast? Can python ever be as fast as C?

## Method 4 - Using Numba

What about compiling Python to give it the speed of C/C++/Fortran or at least something close to it? There are a number of projects that are working towards this goal. Some examples of this include PyPy, PyRex (no longer supported?) and Numba. Each of these make some attempt to compile python to native machine code. Note that there are also some other python implementation such as Jython and IronPython that target the Java Virtual Machine and .NET, but this is more about compatibility with a different software ecosystem than speed. Another issue that can cause confusion is that Python “byte compiles” modules - this is quite different to compiling to machine code with a language such as C or Fortran. It really just turns verbose ASCII source code to a much denser binary format that is still interpreted but much quicker for the interpreter to read (for example it doesn’t contain any of the comments from the original source).

The difficulty with trying to compile Python, rather than interpreting it, is that Python was not designed for this. Hence most attempts at producing a fast, compiled version of Python tends to only be compatible with a subset of python. Here we will use the Numba package, because it is in very active development, can be mixed seamlessly with interpreted Python code, and particularly targets scientific computing applications.

How does Numba work? It does a just-in-time (JIT) compilation of marked functions into machine code. That is we include a special library and “tag” functions to be compiled to machine code the first time that they are run. Such tagged functions can only use a subset of the python language (or else the numba package will fall back to running same code as the interpreter)

Basic usage:

from numba import * #or just import jit, int32,float64

@jit( float64(int32,float64[:,:]), nopython=True)
def f(n,M):
return M[n,n]


What does all that mean?

• @jit is a special decorator for the function that says this function should be just-in-time compiled by numba. In fact this is the only part that is really necessay, the rest of the lines in paranthesis could be left out.
• The first optional argument to @jit provides a type specification for the function. In this case it says that the function returns a float64 (64-bit floating point number or double in C). It takes two arguments the first of which is a 32-bit signed integer (as opposed to say a uint8), the second argument is a numpy matrix. If this is not specified, numba will “guess” the types based on the arguments to the function the first time it is called.
• The same function may be compiled multiple times with different types (either by being called with different arguments or by specifying a list of type declarations). Note that the compile time makes the first time that the function is called relatively slow compared to subsequent calls
• nopython=True tells the compiler that it should not accept any python code that it cannot compile (where it would need to fall back to the python interpreter). Without this @jit will always succeed in “compiling” the function but will potentially produce a function that is no faster than the standard interpreted python code. With the flag set, the compiler will raise an error if the function includes elements of python that it can’t handle.

That should be enough to get you started. See the Numba manual for more information. Go ahead and write a version of the Gaussian elimination solver using Numba - not that you will need a wrapper function to translate the dictionary into numpy arrays first.

import numba
from numba import jit,float64
import numpy as np

def numbaSolve(A,b):
"""Just-in-time compiled gauss-elimination function"""
# write your code here to compute x
return x

def gaussNumba(A,b):
"wrapper to call the function compiled with numba"
# convert argument types and call numbaSolve
return x


gausstest.speed(gaussNumba,1600) # how does this compare with the C++ version?


The strategy employed by Numba is virtually identical to that used in Julia to produce fast code. If you have a project that is entirely focussed on heavy-duty computational work, you may want to give this ago. For small amounts of numerical computation in a larger python project, Numba is entirely adequate though (and still improving rapidly due to ongoing development). For algorithms that are based less on numerical computations with vectors and matrices, you may also run into the limits of what Numba can’t compile - this may require rewriting an algorithm in a slightly less natural (more fortran-ish) way.

Note also that Numba does not compile all of the libraries that you import (though it supports a reasonable subset of the numpy libraries). So if you import some module X and call X.foo() this function will still run slowly.

## Method 5 - Two cores are better than one

Computers are getting more and more cores with every generation (the maxima server has 32, the mathprog group’s server has 72, even your phone probably has 4). So why are we only using 1 core in our python program so far?

### Discussion questions:

• What’s the difference between Multi-threading vs Multi-processing?
• What is Python’s GIL?
• GPU based parallelisation

### Suggested approach

Use Numba library with @jit(parallel=True,nopython=True) (can also explicity set nogil=True for any function to be called in a parallel thread. Indicate that loops are to be done in parallel by using numba.prange(n) instead of range(n): that is a loop for i in prange(n) will be carried out automaticallly by multiple threads. They automatically synchronise (wait for all of the threads to finish) at the end of the for-loop.

When doing the parallisation consider:

• Want reasonable chunks of work in each thread - synchronisation costs time
• Need to avoid conflicts where multiple threads want to write to the same memory: simultaneous read is fine. Avoid needing to use locks
• Limit to 2 threads - partly because of the number of people on the server, and because unless we significantly increase the size of matrices there will be very limited benefit from using many more. To enforce this use:
import os
import numba


Note that this has to be done before you load numba for the first time. If you are working in a notebook where you have used numba previously, please re-start the kernel. If you are working on a command line version of python you could also set the NUMBA_NUM_THREADS environmental variable before you even start python.

# please restart the kernel before executing this for the first time
# (the thread limit is only enforced when numba is first loaded)
import os
import numba
from numba import jit,float64,prange
import numpy as np

def parSolve(A,b):
"Parallel gaussian elimination using numba"
# write parallel version of numbaSolve() here
return x

def gaussPar(A,b):
"Call the parallel version compiled with numba"
# write wrapper function to call parSolve()
return x

import gausstest
gaussPar(*gausstest.randProb(5)) # compile & test code


#### Measuring performance of parallel code

The default speed test will only report CPU time used. This doesn’t tell you whether it has been used by 1 CPU or several. In fact, due to the overhead of parallelisation we expect the CPU time to go up - though hopefully not by very much. At the same time the elapsed time should go down.

gausstest.speed(gaussPar,1600)  # by default measures CPU need to test ellapsed time as well

## measure the cost of parallel threads
import numpy as np
import timeit,time,gc
from statistics import mean,stdev
from random import seed,random

def parspeed(method,maxSize=400):
from gausstest import randProb
seed(123456) # fix some arbitrary seed so we keep generating the same data
prev,n = 0.0, 50
gc.disable()
while n <= maxSize:
gc.collect() # manual cleanout of garbage before each repeat
#CPU = -time.process_time()
T = timeit.repeat(stmt="method(A,b)",setup=f"A,b=randProb({n})",
# need timer to be subtractable so make it a vector
timer=lambda : np.array([time.perf_counter(),time.process_time()]),
repeat=5,number=1,globals=locals())
#CPU += time.process_time()
CPU = [ x[1] for x in T]
t = wall = [x[0] for x in T]
nThread = [ c/s for c,s in zip(CPU,wall)]
print("%4d elapsed %10.4f σ=%.2f sec" % (
n,mean(t),stdev(t)),"(x %.2f)"%(mean(t)/prev) if prev > 0 else " "*8,

parspeed(gaussPar,1600)  # need to test ellapsed time as well