… and a Numpy trap.

I last looked at using the just-in-time compiler Numba with Python some time ago (Two Timers), but I was prompted to take another look by an article in Medium that claimed Python Can Be Faster Than C++.

The article compared times for a short function finding all the prime numbers between 1 and 10,000,000, using Python, C++, and Python with Numba, and finding that the Numba function was almost 60 times faster than pure Python, and about 3 times faster than C++. Using similar code, modified to be called from Excel, I compared the pure Python code below with two variants using Numba.

```
@xl_func()
@xl_arg('num', 'int')
def is_prime(num):
if num == 2: return True
if num <= 1 or not num % 2: return False
for div in range(3,int(math.sqrt(num)+1),2):
if not num % div: return False
return True
```

This was modified to use Numba:

```
@xl_func()
@xl_arg('num', 'int')
@njit(fastmath=True, cache=True)
def is_prime2(num):
if num == 2: return True
if num <= 1 or not num % 2: return False
for div in range(3,int(math.sqrt(num)+1),2):
if not num % div: return False
return True
```

And two alternatives were used to call the is_prime2 function:

```
def Call_is_prime2(N):
for i in range(N):
is_prime2(i)
return
@njit(fastmath=True, cache=True, parallel=True)
def Call_is_prime2N(N):
for i in prange(N):
is_prime2(i)
return
```

Calling Numba at the higher level (with the line @njit(fastmath=True, cache=True, parallel=True)) allows the parallel=True option to be set.

Results from these functions were:

- Pure Python: 58.6 sec
- Numba applied to low level function: 4.2 sec
- Numba with parallel=True: 0.58 sec

so the Numba code with parallel=True was 100 x faster than the pure Python code.

The next example compares the efficiency of Numba with using Numpy “ufuncs” that automatically operate on an entire array, rather than requiring a loop to operate on each individual element. The code below generates an array of random integers, then finds the inverse of each value:

```
@xl_func
@xl_return('numpy_array')
def timeinv(its = 1000000):
its = np.int(its)
res = np.zeros((2,3))
res2 = np.zeros((its,4))
np.random.seed(0)
vals = np.random.randint(1, 100, size = its)
stime = time.perf_counter()
inv = np.zeros(its)
for i in range(its):
inv[i] = 1.0/vals[i]
res[0,0] = time.perf_counter()-stime
```

Because “vals” is a Numpy array, the loop calculating the inverse values can be replaced with a single statement:

```
stime = time.perf_counter()
npinv = 1.0/vals
res[0,1] = time.perf_counter()-stime
```

This is compared with calling a function with the Numba @njit decorater:

```
stime = time.perf_counter()
jitinv = inv2(vals, its)
res[0,2] = time.perf_counter()-stime
return res
@xl_func
@xl_arg('its', 'int')
@njit(fastmath=True, cache=True)
def inv2(vals, its):
inv = np.zeros(its)
for i in range(its):
inv[i] = 1.0/vals[i]
return inv
```

In this case using the Numpy ufunc and the Numba function gave similar results:

- On an array of 100,000 integers the Numpy function reduced the computation time from 0.16 seconds to 0.12 milliseconds, about 1300 times faster. Using the Numba function was about 1100 times faster than plain Python.
- The difference was slightly reduced with larger arrays, with an array of 1 million integers being about 800 times faster with Numpy and 750 times faster with Numba.

As a further check, I reviewed the results of my previous benchmarks with Numpy and Numba:

The table shows times for VBA and Python functions to sum 7 arrays of sequential integers, ranging from 10 to 10 million values. As a check that the functions were working correctly the Python functions were modified to return the sum of the largest array in the first row, revealing that the Numpy code was returning the wrong results! The code with the problem was:

```
@xl_func
def numpysumn(k):
a = np.arange(k)
return a.sum()
```

It seems that the Numpy arange function uses 32 bit integers, even if the range extends outside the 32 bit range! Note that the Numba code calls the same Numpy function, but apparently changes the data type to at least 64 bit. To fix the Numpy function, the data type must be specified:

```
@xl_func
def numpysumn(k):
a = np.arange(k, dtype=np.int64)
return a.sum()
```

Results with the corrected Numpy function are shown below:

The bottom two rows of the table show the performance of each function (in iterations/sec) divided by the VBA function performance in the first row, and divided by the pure Python performance in the second row. The results show considerable differences compared with the Inverse function:

- The function using the Numba loop shows more than 3 times better performance than the Numpy function, or the Numpy+Numba code with parallel set to False.
- The Numba code using the Numpy arange function with parallel=True (last column) shows more than 35,000 times better performance than the pure Python code, and the value in the first row indicates it is returning the correct value for the sum!

It can be seen that the calculation time for the last column is almost the same for all iterations, suggesting that the Numba code recognises that the Numpy arange function returns a sequence of integers with uniform spacing, and applies a simple formula to find the sum, rather than looping through the entire range. If the array returned by arange is modified in any way, such as:

```
b = np.arange(0,k)
b[0] = 0
return np.sum(b)
```

The Numba code recognises that the array may not still have uniform spacing (even though in this case the first value is unchanged), and returns to looping through the array to find the sum:

The next post in this series will look at using Numba in more realistic tasks, such as forming stiffness matrices for frame analysis programs.

Pingback: Making Finite Element Analysis go faster … | Newton Excel Bach, not (just) an Excel Blog

Pingback: A Numpy trap – correction | Newton Excel Bach, not (just) an Excel Blog