PyXLL is a commercial Excel add-in that links Excel and Python. The latest version offers greatly improved performance when transferring data between Excel and Python as Numpy arrays, amongst many other new features and improvements.
To see what this means in practice, I have timed the inversion of a large matrix (2000 x 2000 cells) in Python using various options:
- Calling a Python macro from VBA. This passes data via Microsoft COM, which is inherently slower than the other methods.
- Calling the Python function as a user defined function (UDF), with five different data types (see the code sketch after this list):
  - A Python tuple of variant data type.
  - A tuple of float type.
  - A Numpy array defined as floats.
  - A Numpy array of undefined data type.
  - Input as a Numpy array, with output as an object cache.
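As an illustration of how these options can be declared, a minimal sketch of the five UDF variants is given below. The function names are my own, and the exact PyXLL type strings should be checked against the PyXLL documentation; the object-cache version returns a handle rather than the full array.

```python
from pyxll import xl_func
import numpy as np

# 1. Range passed as nested variants; converted to float before inverting.
@xl_func("var[] m: var[]")
def minv_var(m):
    return np.linalg.inv(np.array(m, dtype=float)).tolist()

# 2. Range converted to floats by PyXLL before the function is called.
@xl_func("float[] m: float[]")
def minv_float(m):
    return np.linalg.inv(np.array(m)).tolist()

# 3. Numpy array with an explicit float dtype.
@xl_func("numpy_array<float> m: numpy_array<float>")
def minv_np_float(m):
    return np.linalg.inv(m)

# 4. Numpy array with the dtype left unspecified (defaults to float).
@xl_func("numpy_array m: numpy_array")
def minv_np(m):
    return np.linalg.inv(m)

# 5. Numpy array in, object handle out: the inverted matrix stays in
#    PyXLL's object cache and only a reference (e.g. ndarray@0) is
#    written to the calling cell.
@xl_func("numpy_array<float> m: object")
def minv_cached(m):
    return np.linalg.inv(m)
```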
Parts of the input and output arrays, and the times for these six options, are shown in the screenshot below:
Performance for the first option is very slow, with the data transfer taking about 20 times longer than the much more complex and computationally intensive task of inverting the matrix.
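For reference, one way the macro option can be set up is sketched below (the range names are hypothetical): a PyXLL macro, callable from VBA with Application.Run, reads the input range through the Excel COM object, inverts it, and writes the result back, so both data transfers go through COM.

```python
from pyxll import xl_macro, xl_app
import numpy as np

@xl_macro
def invert_matrix_macro():
    # Callable from VBA with: Application.Run "invert_matrix_macro"
    xl = xl_app()  # the Excel Application COM object
    # Reading the 2000 x 2000 range goes through COM - the slow step.
    data = np.array(xl.Range("MatrixIn").Value, dtype=float)
    result = np.linalg.inv(data)
    # Writing the result back is also a COM transfer.
    xl.Range("MatrixOut").Value = result.tolist()
```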
The first two UDF options, passing the data as Python tuples, show a great improvement, with the total execution time reduced to less than 2 seconds. The UDF passing the data as floats rather than variants was slightly faster, as would be expected.
The two UDFs passing data as Numpy arrays in both directions showed a further significant improvement, with the data transfer time reduced by about half, for a total execution time of about 1 second. There was no significant difference between these two UDFs, as expected, since the default data type for Numpy arrays is float.
The final UDF reduced the data return time to close to zero, with a total execution time of just over half a second. The data returned to the spreadsheet is shown in the screenshot below:
The return value of the UDF displays as ndarray@0, which is a pointer to the full 2000 x 2000 array. Results from the cached array may be displayed with a second UDF, which in this case has been set to display the first 2 rows. The data return time is proportional to the number of rows displayed, with the full array taking about the same time as the Numpy array results.
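A sketch of the second, display-side UDF is shown below; it takes the cached handle as an object argument and returns only the requested number of rows (the function name is illustrative):

```python
from pyxll import xl_func

@xl_func("object arr, int nrows: numpy_array<float>")
def show_rows(arr, nrows):
    # 'arr' is the ndarray retrieved from PyXLL's object cache via the
    # handle (e.g. ndarray@0) passed in from the spreadsheet.
    return arr[:nrows, :]
```

In the spreadsheet this would be entered as, for example, =show_rows(A1, 2), where A1 holds the ndarray@0 handle.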
In addition to the greatly improved performance when not all results are required in the spreadsheet, the cache object provides a simple and effective means of dealing with large data sets that exceed the Excel worksheet limit of just over 1 million rows.
I also started timing the built-in Excel MInverse function on the same array used for the Python functions. After six minutes it was still calculating (using just one processor), and I gave up.
In practical engineering analysis, the matrices in linear algebra problems may be much larger than 2000 x 2000, but they will normally be sparse, with most of the elements being zero. An example (taken from a 3D frame analysis of a large building structure) is shown below, with a matrix size of 14743 x 14743:
In spite of the much larger overall matrix size, the Scipy iterative sparse solver reduces the solution time to about 0.15 seconds, and the sparse input and single-column output greatly reduce the data transfer time, with the Numpy array UDF again giving by far the best results.
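A minimal sketch of the sparse approach is given below, assuming the matrix is passed in coordinate (row, column, value) form together with a load vector, and solved with SciPy's conjugate gradient routine; the function name, input layout, and choice of iterative solver are assumptions for illustration, not necessarily those used in the example above.

```python
from pyxll import xl_func
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import cg

@xl_func("numpy_array<float> coo, numpy_array<float> rhs: numpy_array<float>")
def sparse_solve(coo, rhs):
    # Assemble the sparse matrix from (row, col, value) triplets.
    rows = coo[:, 0].astype(int)
    cols = coo[:, 1].astype(int)
    vals = coo[:, 2]
    n = rhs.shape[0]
    k = sparse.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
    # Conjugate gradient: an iterative solver suited to the symmetric,
    # positive definite stiffness matrices produced by frame analysis.
    x, info = cg(k, rhs.ravel())
    if info != 0:
        raise RuntimeError("CG solver did not converge (info=%d)" % info)
    return x.reshape(-1, 1)  # single column returned to the sheet
```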