A bite of a “free lunch”: optimizing numerical routines in Python
tl;dr: ask yourself these questions before you deadlift-optimize code
I use Python pretty much daily, but I’m definitely not a Python wizard. Now that I’m done converting the Python Spatial Analysis Library to Python 3-compatible idioms, I’ve taken a break from refactoring more generally.
But, in making two small stabs at refactoring for speed and memory usage, I was reminded why optimizing numerical code is so hard. I’ve heard that “there ain’t no such thing as a free lunch,” so I’m usually wary of getting into performance optimization, and these two recent instances really drove that point home for me again.
First off, know I’m not referring to optimization found by using PyPy, Numba, Nuitka, or Cython. I’m not exactly sure where PyPy currently sits on the scale of “optimizing” Python, since the subset of NumPy it covers seems to be continuously fluctuating. But I know that with Numba and Cython, you’re typically looking at a significant amount of type-based refactoring, introducing special knowledge about objects to reduce the indirection and dispatch that make Python so slow.
No, I’m referring to knock-down-drag-out fights with the theory or implementation of an algorithm. I tried (and failed) to do this for one bit of our code, and lent a small hand vetting another attempt. Since I’d talked to a few other grad students recently about how to write fast Python from the ground up, I figured a postmortem might remind me to be more loath to conduct these kinds of deadlifting competitions.
The PR contained two central optimizations, with one big realization. To compute Local Moran scores, we need to randomly reassign the neighbors of observations, and we need to be sure to do this without replacement. That is, we can’t assign duplicate observations, since, in theory, we’re using a random labeling approach to estimate an empirical distribution for the test statistic.
Currently, we use numpy.random.permutation to do this. In the abstract, this never struck me as “slow” code, but I did recognize it as the bottleneck for the technique. And I know a ton of attention has been paid by people far smarter than me to making it fast (@sjsrey, @jlaura, @lixun910 come immediately to mind). So an optimization was probably not going to come from there.
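To make the bottleneck concrete, here’s a minimal sketch of the randomization step, assuming a hypothetical randomize_neighbors helper (this is not the library’s actual code, just the shape of the computation for a single observation):

```python
import numpy as np

def randomize_neighbors(n, k, permutations=999, seed=None):
    """Sketch: for one observation, draw k pseudo-neighbor indices from
    the other n - 1 observations, without replacement, once per
    permutation. The repeated full shuffle is the bottleneck."""
    rng = np.random.RandomState(seed)
    others = np.arange(1, n)  # candidate indices for observation 0
    ids = np.empty((permutations, k), dtype=int)
    for p in range(permutations):
        # shuffle all n - 1 candidates, then keep only the first k
        ids[p] = rng.permutation(others)[:k]
    return ids
```

Each row of the result is a replacement-free draw, so no observation can be paired with a duplicate pseudo-neighbor.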
Hence why the PR was so surprising. Quickly, though, Xun noted that the first speed-boosting change wasn’t sound. It suggested using numpy.random.randint to generate random indexes, rather than running permutation on the possible indices. Since randint returns a vector of potentially-duplicate random draws from a discrete uniform distribution, using them as IDs for the procedure amounts to sampling with replacement, which is statistically unsound here.
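You can see the duplicate problem directly (the sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
# randint draws each index independently from a discrete uniform
# distribution, so repeats are expected: this is sampling WITH
# replacement, not the without-replacement draw the statistic needs.
draws = rng.randint(0, 10, size=1000)
n_unique = len(np.unique(draws))
# 1000 draws over only 10 possible values: duplicates are guaranteed
```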
With that down, the potential optimization rested on the second change. I hoped that maybe using np.random.choice could be faster than using np.random.shuffle. My (and, I’m sure, the submitter’s) intuition was that np.random.choice might select random values from the vector of indices without actually randomizing the entire vector. And, indeed, when you use choice with replacement, it’s two orders of magnitude faster than shuffling the entire vector and drawing the first k required observations. But when choice without replacement is used (again, what the soundness of the technique requires), this extreme gain evaporates.
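A small demo of the two modes (the sizes are made up; in the legacy RandomState API, choice with replace=False effectively permutes all n candidates internally, which is why the speedup disappears):

```python
import numpy as np

rng = np.random.RandomState(42)
n, k = 10_000, 50

# With replacement: k independent draws, cheap regardless of n,
# but duplicates are possible, which the statistic can't tolerate.
with_repl = rng.choice(n, size=k, replace=True)

# Without replacement: what the technique requires, but the legacy
# implementation pays roughly the cost of permuting all n candidates.
without_repl = rng.choice(n, size=k, replace=False)
```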
Computing extra inverses
In a section of our code, we need to construct a symmetric matrix where each term requires us to construct two matrix inverses. The way we do this currently is to construct all of the inverses on demand for a matrix of size t, which means we’re constructing O(t^2) inverses. It runs slowly, but it’s faster than the R implementation the original authors were aiming to beat, so it’s not too concerning.
But, when the author was describing the code to me, it seemed pretty clear that we could get away with only O(t) inversions by constructing all of the inverse matrices ahead of time and then looking them up, rather than recomputing them on demand. Since the matrices weren’t particularly large, this seemed like a very reasonable target for performance optimization.
So, I set off comparing the currently-implemented on-demand method against two alternatives. Both computed only the t distinct inversions up front and stored them. But, even leveraging some rather speedy LU factorization from SciPy for the inversions, I was only seeing a 1.13X speedup. Not quite what I’d hoped.
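Here’s a minimal sketch of the two strategies. The trace-of-product term is a made-up stand-in for the real per-term computation, and np.linalg.inv stands in for the SciPy LU-based route; only the inversion-count difference matters:

```python
import numpy as np

def terms_on_demand(mats):
    # O(t^2) inversions: re-invert both factors for every (i, j) pair
    t = len(mats)
    out = np.empty((t, t))
    for i in range(t):
        for j in range(t):
            out[i, j] = np.trace(np.linalg.inv(mats[i]) @ np.linalg.inv(mats[j]))
    return out

def terms_precomputed(mats):
    # O(t) inversions: invert each matrix once, then just look them up
    invs = [np.linalg.inv(m) for m in mats]
    t = len(mats)
    out = np.empty((t, t))
    for i in range(t):
        for j in range(t):
            out[i, j] = np.trace(invs[i] @ invs[j])
    return out
```

Both produce identical results; when t is small, the saved inversions simply don’t add up to much wall-clock time.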
So, why didn’t the O(t) algorithm outperform the O(t^2) algorithm by more? Because t is small. More importantly, t is frequently small in practice, so the extra inversions go almost unnoticed. Profiling the code more thoroughly wouldn’t really catch this: each avoided inversion is still a savings, it’s just that the savings for our typical use cases is very small. While I was happy to have an implementation that used about as much memory and provided a modest speedup, it wasn’t really worth pushing the merge through when the speedup didn’t make much of a dent in the computation time.
No secs for a byte
In both these cases, there’s this idea that, using some kind of trick, it may be possible to trade memory usage for speed. In the randomization case, it was this idea that it might be more efficient to only sample from a vector, rather than shuffling it entirely. In the estimation case, it was the thought that storing extra data would prevent having to construct more of it on demand.
And, in theory, these seem reasonable. But computers have a very peculiar sense of reasonableness, and I’m beginning to suspect I don’t share it.
So, before I attempt this kind of “deadlift” optimization again, I’ll remind myself of four things:
- Is this something that a user is going to encounter often enough to make it painfully slow? If you can optimize an obscure part of the code, you should probably be doing something more important (like working on your dissertation :)
- Is the suggested optimization large enough for a user to notice? The difference between ~900 seconds and ~800 seconds seems like a lot. But, if you’re getting a coffee while the routine completes anyway, it’s not clear to me that the tradeoff matters. Since you’re more likely to gain significant advantages from any of the type/compile-based optimizers anyway, I wonder whether it’s a waste of time to optimize the pure-Python implementation, unless there’s an indirect benefit (i.e. learning) or the code is so critical that the question answers itself.
- Is the optimization suggested easier to maintain than a static/type-based compiling version in Cython/Numba? In general, it’s undesirable to have separate, faster implementations that rely on extra dependencies. But, before chasing the performance dragon, you should really consider whether a compile-based solution would get you where your operational constraints are pushing you to be.
- Is the optimization potentially fast enough to warrant reimplementation? This is possibly the least important question, and with tools like laboratory, I’m hoping it continues to recede. But, again, you don’t want to spend more time than I did only to realize a speedup as small as the 1.13X I found.
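One concrete way to answer the second question before committing to a rewrite is a quick timeit comparison at realistic problem sizes (the sizes and statements here are made up for illustration):

```python
import timeit

# Measure both routes at a size your users actually hit, and compare
# the ratio rather than raw microseconds.
setup = "import numpy as np; rng = np.random.RandomState(0); x = np.arange(10_000)"
t_perm = timeit.timeit("rng.permutation(x)[:50]", setup=setup, number=200)
t_choice = timeit.timeit("rng.choice(10_000, 50, replace=True)", setup=setup, number=200)
# a ratio near 1.0 means no user will ever notice the difference
ratio = t_perm / t_choice
```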
imported from: yetanothergeographer