Matrix multiplications in NumPy are reasonably fast without the need for optimization. However, if every second counts, it is possible to significantly improve performance (even without a GPU).
Below is a collection of small tricks that can help with large (~4000x4000) matrix multiplications. I have used them to reduce inference time in a deep neural network from 24 seconds to less than one second. In fact, in one case, my optimized code on a CPU turned out to run faster than TensorFlow on a GPU (1 second vs 7 seconds).
The first step is to measure everything. On Linux, Python's time.time() provides a high-resolution timer. Timing information will help direct your efforts. In a neural network, most of the time will be spent on large matrix multiplications. Extremely complex element-wise operations (such as chains of sigmoids) may have negligible performance impact compared to a slow matrix multiplication. Until you measure the performance of each step in your algorithm, you don't know what is affecting performance.
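For instance, a minimal timing harness (the array sizes here are just illustrative) might look like:
import time
import numpy as np

a = np.random.rand(4000, 4000).astype(np.float32)
b = np.random.rand(4000, 4000).astype(np.float32)

start = time.time()               # high-resolution wall-clock timer on Linux
c = np.matmul(a, b)               # the step being measured
print('matmul took %.3f seconds' % (time.time() - start))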
Ensure your arrays have a dtype of numpy.float32, rather than the default numpy.float64. I've seen this give a four-fold improvement or more. (It may be tempting to reduce further to numpy.float16, but this backfires because the CPU and BLAS libraries do not work natively at this precision.)
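A minimal sketch of the cast (the arrays here are placeholders for your own weights and inputs):
import numpy as np

weights = np.random.rand(4000, 4000)        # float64 by default
weights = weights.astype(np.float32)        # cast once, up front
x = np.random.rand(4000).astype(np.float32)
y = np.matmul(weights, x)                   # the whole computation stays in float32
print(y.dtype)                              # float32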
BLAS is a high-performance matrix library. Even though NumPy uses BLAS internally, I've noticed that performance can be improved by calling BLAS directly. Perhaps this is simply because direct calls to BLAS force you to shape your data so it is ready for BLAS.
Replace numpy.matmul with scipy.linalg.blas.sgemm(...) for float32 matrix-matrix multiplication and scipy.linalg.blas.sgemv(...) for float32 matrix-vector multiplication.
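For example (a sketch; sgemm computes alpha * A @ B and sgemv computes alpha * A @ x):
import numpy as np
from scipy.linalg.blas import sgemm, sgemv

A = np.random.rand(4000, 4000).astype(np.float32)
B = np.random.rand(4000, 4000).astype(np.float32)
x = np.random.rand(4000).astype(np.float32)

C = sgemm(1.0, A, B)   # float32 matrix-matrix multiplication
y = sgemv(1.0, A, x)   # float32 matrix-vector multiplication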
Check that you’re using OpenBLAS or Intel MKL.
An easy way to check is to look at your CPU usage (e.g., with top). If your matrix multiplications are using a single core, then you may be using a slow BLAS. You can get over 2x performance improvements by using a multi-core BLAS library such as OpenBLAS or Intel MKL.
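You can also check which BLAS your NumPy was built against (the exact output varies by installation):
import numpy as np
np.show_config()   # look for 'openblas' or 'mkl' in the output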
Check whether values in your matrices are stored in column-major (order='F') or row-major (order='C') order. You can check the order by examining the .flags property of your matrix (pay attention to the C_CONTIGUOUS and F_CONTIGUOUS flags).
For example:
> a = np.array([[1,2],[3,4]])
> a.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
You can change the order by copying your matrix (numpy.copy(..., order='F')) or by a transpose. Changing the order of your matrices can improve performance (BLAS typically works better with column-major order).
Hint: C and F stand for the orders used in the C and Fortran programming languages. BLAS prefers F because BLAS is written in Fortran.
The data order can have a dramatic impact on performance.
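A minimal sketch of converting and checking the order (the array is a placeholder):
import numpy as np

a = np.random.rand(4000, 4000).astype(np.float32)   # C-contiguous by default
a_f = np.asfortranarray(a)                          # copy into column-major order
# equivalently: a_f = np.copy(a, order='F'); a.T is an F-ordered view without a copy
print(a.flags['C_CONTIGUOUS'], a_f.flags['F_CONTIGUOUS'])   # True True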
Each iteration of a recurrent neural network (including LSTMs) combines the input with the output state of the previous iteration. The input processing can be partially factored out: rather than computing matmul(M, concatenate(x[i], h)), the matrix M can be split into M_x and M_h. This means that temp = matmul(M_x, x[:]) can be precomputed in a single large batch, and the iterative computation becomes temp[i] + matmul(M_h, h). This results in (small) benefits.
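A sketch of the factoring (the shapes and the tanh nonlinearity are illustrative, not the code of any particular network):
import numpy as np

T, d_in, d_h = 100, 512, 512                         # sequence length and sizes (illustrative)
M_x = np.random.rand(d_h, d_in).astype(np.float32)   # input weights
M_h = np.random.rand(d_h, d_h).astype(np.float32)    # recurrent weights
x = np.random.rand(T, d_in).astype(np.float32)
h = np.zeros(d_h, dtype=np.float32)

temp = np.matmul(x, M_x.T)                       # one large batched multiplication, done up front
for i in range(T):
    h = np.tanh(temp[i] + np.matmul(M_h, h))     # only the recurrent part remains in the loop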
The inference vectors in a neural network trained with ReLU activations may be sparse. This can be exploited, for example, by using the sparse vector v as a filter for the columns of a matrix: M[:, v > 0.001]. I found that this sometimes resulted in a small improvement in run time.
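A sketch of the filtering (the 0.001 threshold is the one mentioned above; M and v are placeholders):
import numpy as np

M = np.random.rand(4000, 4000).astype(np.float32)
v = np.maximum(np.random.randn(4000).astype(np.float32), 0)   # ReLU-style output: many zeros

mask = v > 0.001
y_approx = np.matmul(M[:, mask], v[mask])   # multiply only the columns that matter
# compare with the exact result: np.matmul(M, v)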
A large matrix can be approximated by computing the Singular Value Decomposition (SVD). Computing an SVD is too slow to be done online. However, if one of your matrices is constant, then ‘precomputation’ can pay off.
Consider the multiplication y = matmul(A, x).
Such a multiplication can be approximated by two lower rank multiplications:
U, s, V = numpy.linalg.svd(A)    # Very slow, so precompute!
rank = len(s) // 3               # Compression by a factor of 3
y = numpy.matmul(V[:rank, :], x)
y *= s[:rank]
y = numpy.matmul(U[:, :rank], y)
Hint: precompute and copy the views V[:rank, :], s[:rank], and U[:, :rank], and don't forget to cast the SVD results down to float32.
Reducing a single 2000x2000 matrix multiplication to a 100x2000 followed by a 2000x100 multiplication (for example) can make a big difference!
I've found that reducing the rank of a matrix by a third or more can have negligible impact on the accuracy of an RNN yet produce dramatic improvements in performance.
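Putting the hint into practice, here's a sketch of the precomputation (A is assumed constant and the sizes are placeholders; note that numpy.linalg.svd returns V already transposed):
import numpy as np

A = np.random.rand(2000, 2000).astype(np.float32)   # a constant weight matrix (placeholder)

# Offline: slow, done once.
U, s, Vt = np.linalg.svd(A)
rank = len(s) // 3
U_r  = np.ascontiguousarray(U[:, :rank], dtype=np.float32)
s_r  = s[:rank].astype(np.float32)
Vt_r = np.ascontiguousarray(Vt[:rank, :], dtype=np.float32)

# Online: two small multiplications instead of one large one.
def approx_matmul(x):
    y = np.matmul(Vt_r, x)
    y *= s_r
    return np.matmul(U_r, y)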
Let me know if you have other suggestions/tricks to recommend. Happy optimizing!
Published 21 January 2018 by Benjamin Johnston.