Matrix multiplications in NumPy are reasonably fast without the need for optimization. However, if every second counts, it is possible to significantly improve performance (even without a GPU).

Below is a collection of small tricks that can help with large (~4000x4000) matrix multiplications. I have used them to reduce inference time in a deep neural network from 24 seconds to less than one second. In fact, in one case, my optimized code on a CPU turned out to run faster than TensorFlow using a GPU (1 second vs 7 seconds).

Measure First

The first step is to measure everything. On Linux, Python’s time.time() provides a high-resolution timer. Timing information will help direct your efforts.

In a neural network, most of the time will be spent on large matrix multiplications.

Extremely complex element-wise operations (such as chains of sigmoids) may have a negligible performance impact when compared to a slow matrix multiplication. Until you measure the performance of each step in your algorithm, you don’t know what is affecting performance.
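For example, here is a minimal timing sketch (the 4000x4000 size is illustrative, following the sizes mentioned above):

import time
import numpy as np

A = np.random.rand(4000, 4000).astype(np.float32)
x = np.random.rand(4000).astype(np.float32)

start = time.time()
y = np.matmul(A, x)
print('matmul took %.4f seconds' % (time.time() - start))

Wrap each major step of your algorithm in a similar pair of calls and compare the numbers.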

Reduce precision

Ensure your arrays have a dtype of numpy.float32, rather than the default numpy.float64. I’ve seen this yield a four-fold improvement or more.

(It may be tempting to try further reductions to numpy.float16, but this backfires because the CPU and BLAS libraries do not work natively at this precision.)
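For instance, a minimal sketch of the cast (the exact speed-up will depend on your hardware and BLAS):

import numpy as np

A = np.random.rand(4000, 4000)      # float64 by default
B = np.random.rand(4000, 4000)

A32 = A.astype(np.float32)          # cast once, up front
B32 = B.astype(np.float32)

C = np.matmul(A32, B32)             # runs in single precision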

Use BLAS directly

BLAS is a high-performance matrix library. Even though NumPy uses BLAS, I've noticed that performance can be improved by calling BLAS directly. Perhaps this is simply because direct calls to BLAS force you to shape your data so that it is ready for BLAS.

Replace numpy.matmul with scipy.linalg.blas.sgemm(...) for float32 matrix-matrix multiplication and scipy.linalg.blas.sgemv(...) for float32 matrix-vector multiplication.
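As a rough sketch (sgemm and sgemv take a scaling factor alpha as their first argument; the Fortran-ordered copies anticipate the data-order advice below):

import numpy as np
from scipy.linalg import blas

A = np.asfortranarray(np.random.rand(4000, 4000).astype(np.float32))
B = np.asfortranarray(np.random.rand(4000, 4000).astype(np.float32))
x = np.random.rand(4000).astype(np.float32)

C = blas.sgemm(1.0, A, B)   # matrix-matrix: C = 1.0 * A @ B
y = blas.sgemv(1.0, A, x)   # matrix-vector: y = 1.0 * A @ x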

Use a faster BLAS

Check that you’re using OpenBLAS or Intel MKL.

An easy way to check is to look at your CPU usage (e.g., with top). If your matrix multiplications are using a single core, then you may be using a slow BLAS. You can get a more than 2x performance improvement by using a multi-core BLAS library such as OpenBLAS or Intel MKL.
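Another way to check which BLAS NumPy was built against (the exact output varies between installations):

import numpy as np

# Look for 'openblas' or 'mkl' in the library names printed here.
np.show_config()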

Check data order

Check whether values in your matrices are stored in column-major order (order='F') or row-major order (order='C').

You can check the order by examining the .flags property of your matrix (pay attention to the C_CONTIGUOUS and F_CONTIGUOUS flags).

For example:

>>> import numpy as np
>>> a = np.array([[1, 2], [3, 4]])
>>> a.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False

You can change the order by copying your matrix (numpy.copy(..., order='F')) or by a transpose.

Changing the order of your matrices can improve performance (BLAS typically works better with column-major order).

Hint: C and F stand for the orders used in the C and Fortran programming languages. BLAS prefers F because BLAS is written in Fortran.

The data order can have a dramatic impact on performance.
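A minimal sketch of converting a matrix up front (this assumes the matrix is reused many times, so the one-off cost of the copy is amortised):

import numpy as np

A = np.random.rand(4000, 4000).astype(np.float32)   # C-contiguous by default
A_f = np.asfortranarray(A)      # column-major copy of the same values
# equivalently: A_f = np.copy(A, order='F')

print(A.flags['C_CONTIGUOUS'], A_f.flags['F_CONTIGUOUS'])   # True True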

Factor out common subexpressions

Each iteration in a recurrent neural network (including LSTMs) combines the input with the output state of the previous iteration.

The input processing can be partially factored out.

i.e., rather than computing matmul(M, concatenate(x[i], h)), the matrix M can be split into M_x and M_h.

This means that temp = matmul(M_x, x[:]) can be precomputed in a single large batch, then the iterative computation becomes temp[i] + matmul(M_h, h).

This results in (small) benefits.
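A sketch of the idea (the sizes, and the absence of any activation function, are purely illustrative):

import numpy as np

T, d_x, d_h = 100, 512, 512      # timesteps and layer sizes (made up)
M = np.random.rand(d_h, d_x + d_h).astype(np.float32)
x = np.random.rand(T, d_x).astype(np.float32)
h = np.zeros(d_h, dtype=np.float32)

# Split M into the block that multiplies the input and the block
# that multiplies the recurrent state.
M_x, M_h = M[:, :d_x], M[:, d_x:]

# Precompute the input contribution for every timestep in one large matmul.
temp = np.matmul(x, M_x.T)       # shape (T, d_h)

for i in range(T):
    h = temp[i] + np.matmul(M_h, h)   # only the small recurrent matmul remains per step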

Sparse vectors

The inference vectors in a neural network trained with ReLU activations may be sparse.

This can be exploited, for example by using the sparse vector v as a filter for the columns of a matrix: M[:,v > 0.001].

I found that this sometimes resulted in a small improvement in run time.
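A sketch of the column-filtering trick (the 0.001 threshold follows the example above; whether it helps depends on how sparse v really is):

import numpy as np

M = np.random.rand(4000, 4000).astype(np.float32)
v = np.maximum(np.random.randn(4000).astype(np.float32), 0)   # ReLU-like: roughly half zeros

mask = v > 0.001                          # columns that actually contribute
y_approx = np.matmul(M[:, mask], v[mask])

y_full = np.matmul(M, v)
print(np.abs(y_approx - y_full).max())    # small approximation error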

SVD compression

A large matrix can be approximated by computing the Singular Value Decomposition (SVD). Computing an SVD is too slow to be done online. However, if one of your matrices is constant, then ‘precomputation’ can pay off.

Consider the multiplication y = matmul(A, x).

Such a multiplication can be approximated by two lower-rank multiplications:

U, s, V = numpy.linalg.svd(A)  # Very slow, so precompute!
rank = len(s) // 3             # Compression by a factor of 3
y = matmul(V[:rank, :], x)
y *= s[:rank]
y = matmul(U[:, :rank], y)

Hint: precompute and copy the views V[:rank,:], s[:rank], and U[:,:rank] and don't forget to cast the SVD down to float32.

Reducing a single 2000x2000 matrix multiplication to a 100x2000 followed by a 2000x100 multiplication (for example) can make a big difference!

I've found that reducing the rank of a matrix by a third or more can have a negligible impact on the accuracy of an RNN yet produce dramatic improvements in performance.
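A sketch of the precomputation hinted at above (contiguous float32 copies of the truncated factors, computed once, offline):

import numpy as np

A = np.random.rand(2000, 2000).astype(np.float32)   # the constant matrix

# Offline: compute and truncate the SVD once.
U, s, Vt = np.linalg.svd(A)
rank = len(s) // 3
U_r = np.ascontiguousarray(U[:, :rank]).astype(np.float32)
s_r = s[:rank].astype(np.float32)
Vt_r = np.ascontiguousarray(Vt[:rank, :]).astype(np.float32)

# Online: two small multiplications instead of one large one.
def approx_matmul(x):
    y = np.matmul(Vt_r, x)
    y *= s_r
    return np.matmul(U_r, y)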

Enough for now

Let me know if you have other suggestions/tricks to recommend. Happy optimizing!

Published 21 January 2018 by Benjamin Johnston.