More on Numba

Following my recent post Making Finite Element Analysis go faster … I have been having a closer look at the options in the Numba just-in-time compiler for improving the performance of Python code.

The Numba docs include a series of short code examples illustrating the main options. I have re-run these with PyXLL-based interface functions, so that the routines can be called from Excel as user-defined functions (UDFs) and return the execution time for any specified number of iterations.

Typical code for timing a function is:

def time_ident_npj(n):
    x = np.arange(n)
    stime = time.perf_counter()
    y = ident_npj(x)
    return time.perf_counter()-stime

Note that the calls to the time function must be outside the function being timed, since Numba does not support the time function.
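The same timing pattern can be sketched as a general helper. This is only an illustration: time_call and its repeats argument are hypothetical names, not part of the original code. It assumes the default Numba behaviour of compiling a function the first time it is called, so one untimed warm-up call is made to keep compilation time out of the result.

```python
import time
import numpy as np

def time_call(func, x, repeats=3):
    """Return the best wall-clock time (seconds) of func(x) over `repeats` runs."""
    func(x)  # untimed warm-up; for a Numba function this is where compilation happens
    best = float("inf")
    for _ in range(repeats):
        stime = time.perf_counter()  # timer calls stay outside the timed function
        func(x)
        best = min(best, time.perf_counter() - stime)
    return best

def ident_np(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

elapsed = time_call(ident_np, np.arange(100_000, dtype=np.float64))
print(f"ident_np: {elapsed:.6f} s")
```

Taking the minimum of several runs is a design choice to reduce noise from other processes; a single run, as in the code above, is simpler when the function takes a long time.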

The first example from the Numba docs compares evaluation of an array function using Numpy arrays and Python loops, with or without Numba:

def ident_np(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

def ident_loops(x):
    r = np.empty_like(x)
    n = len(x)
    for i in range(n):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r
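The timing code calls ident_npj, which is not listed. A minimal sketch of how the compiled counterparts might be created is given below, assuming the "j" suffix simply marks the Numba version of each function (ident_loopsj is a hypothetical name following that pattern). The fallback used when Numba is not installed is also an assumption, added only so the sketch runs anywhere.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: fall back to plain Python so the sketch still runs
    def njit(func):
        return func

def ident_np(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

def ident_loops(x):
    r = np.empty_like(x)  # same dtype as x, so x should be a float array
    n = len(x)
    for i in range(n):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r

# Compiled counterparts; njit(f) is equivalent to decorating f with @njit.
ident_npj = njit(ident_np)
ident_loopsj = njit(ident_loops)

x = np.arange(1000, dtype=np.float64)
# cos^2 + sin^2 is identically 1, which gives a simple correctness check.
all_ok = all(np.allclose(f(x), 1.0)
             for f in (ident_np, ident_loops, ident_npj, ident_loopsj))
print(all_ok)
```

Note that the input is created as a float array here. With an integer array, np.empty_like would return an integer result array and the loop version would truncate the values it stores.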

Results from these functions are shown below, with times as reported in the Numba article, and as found with my code:

With Numpy arrays, the Numba function was only slightly faster in my tests, and slightly slower in the results reported in the Numba article. This is not surprising, since the Python code consists only of calls to Numpy functions, which are already compiled C code, so there is little scope for improving performance.

The function using Python loops was very much slower, and my results were slightly slower than the time in the Numba article. Presumably this is related to using different versions of Python. Adding the Numba decorator reduced the execution time for my code by a factor of 170, and the result was slightly faster than the Numpy function with the Numba decorator.

The next examples look at the effect of the Numba “fastmath = True” and “parallel = True” options:

def do_sum(A):
    acc = 0.
    # without fastmath, this loop must accumulate in strict order
    for x in A:
        acc += np.sqrt(x)
    return acc

@njit(fastmath=True)
def do_sum_fast(A):
    acc = 0.
    # with fastmath, the reduction can be vectorized as floating point
    # reassociation is permitted.
    for x in A:
        acc += np.sqrt(x)
    return acc

@njit(parallel=True)
def do_sum_parallel(A):
    # each thread can accumulate its own partial sum, and then a cross
    # thread reduction is performed to obtain the result to return
    n = len(A)
    acc = 0.
    for i in prange(n):
        acc += np.sqrt(A[i])
    return acc

@njit(parallel=True, fastmath=True)
def do_sum_parallel_fast(A):
    n = len(A)
    acc = 0.
    for i in prange(n):
        acc += np.sqrt(A[i])
    return acc
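The comments in these functions refer to strict-order accumulation and floating point reassociation. The plain Python sketch below (no Numba involved) illustrates why the order matters: floating point addition is not associative, so regrouping a sum can change the last digits of the result. The functions sum_sqrt_ordered and sum_sqrt_chunked are hypothetical names; the chunked version only mimics the per-thread partial sums described above and runs serially.

```python
import math

def sum_sqrt_ordered(values):
    """Accumulate in strict left-to-right order."""
    acc = 0.0
    for v in values:
        acc += math.sqrt(v)
    return acc

def sum_sqrt_chunked(values, n_chunks=4):
    """Sum each chunk separately, then combine the partial sums."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division, at least 1
    partials = [sum_sqrt_ordered(values[s:s + size])
                for s in range(0, len(values), size)]
    return sum(partials)

# Regrouping three additions already changes the result in the last bit.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

data = [float(i) for i in range(10_000)]
ordered = sum_sqrt_ordered(data)
chunked = sum_sqrt_chunked(data)
# The two totals agree closely, but need not be bit-for-bit identical.
print(math.isclose(ordered, chunked, rel_tol=1e-9))
```

This is why the reordering has to be requested explicitly: the results with and without it are equally valid approximations, but they are not guaranteed to be identical.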

Results for these functions were:

For this case the Numba compiled code was over 300 times faster than plain Python. The “fastmath = True” option was of only limited benefit in my case, although the results in the Numba article show a speed-up of more than two times. Setting “parallel = True” increased performance by more than 10 times, with “fastmath = True” again providing only a small further gain. With both options applied, the Numba compiled code was almost 4000 times faster than plain Python.

This raises the question of why the speed gain from Numba is often much smaller with more complex code. This will be examined in more detail in a later post, but the main reason is that if Numba is allowed to revert to Python mode when it finds code it cannot compile (nopython = False), the resulting code can easily end up almost entirely Python based. The same choice is made by using the alternative @jit or @njit decorators. The @njit decorator results in all the Python code being compiled, and raises an error if any of the code cannot be compiled by Numba. The alternative @jit decorator switches to Python mode if any code cannot be compiled, but with much reduced (if any) speed improvement. Examples from my own code that raise an error with @njit, or are not fully compiled with @jit, include:

  • Use of the time function
  • Checking the data type of a variable, such as: if type(x) == tuple: …
  • Statements such as “StartSlope = EndSlope”, where EndSlope has not yet been defined at compile time.
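
One possible way of dealing with the first two items is to keep the timing and the type check in a plain Python wrapper, and pass only a Numpy array to the compiled function. The sketch below is an illustration of that idea, not code from my own routines: sum_sqrt_kernel and timed_sum_sqrt are hypothetical names, the input is assumed to be one-dimensional, and the fallback used when Numba is not installed is there only so the example runs anywhere.

```python
import time
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: fall back to plain Python so the sketch still runs
    def njit(func):
        return func

@njit
def sum_sqrt_kernel(a):
    # Numeric loop only: no timing calls and no type checks in here.
    acc = 0.0
    for i in range(a.shape[0]):
        acc += np.sqrt(a[i])
    return acc

def timed_sum_sqrt(x):
    # The type check is done in plain Python, before the compiled code is called.
    if isinstance(x, tuple):
        x = np.array(x, dtype=np.float64)
    else:
        x = np.asarray(x, dtype=np.float64)
    stime = time.perf_counter()  # timing also stays outside the compiled function
    total = sum_sqrt_kernel(x)
    return total, time.perf_counter() - stime

total, elapsed = timed_sum_sqrt((1.0, 4.0, 9.0))
print(total)  # 6.0
```

The first call to timed_sum_sqrt includes the compilation time if Numba is installed, so it would need to be called a second time to get a representative figure.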