A Post-SciPy Chicago Update

After a bit of a whirlwind, going to SciPy and then relocating to Chicago, I figure I’ve collected enough thoughts to post an update on my summer of code project, as well as on some of the discussion we’ve had in the library recently.

I’ve actually gotten a lot of feedback on quite a few of my posts since the one on handling burnout as a graduate student. But I’ve been forgetting to tag posts so that they show up in the GSOC aggregator! Bummer!

The Great Divide

Right before SciPy, a contributor suggested that it might be a reasonable idea to split the library up into independent packages. Ostensibly motivated by this conversation on twitter, the suggestion highlighted a few issues (I think) with how PySAL operates: on a normative level, on a procedural level, and in our code itself. This is an interesting suggestion, and I think it has a few very strong benefits.

Lower Maintenance Surface

Chief among the benefits is that minimizing the maintenance burden makes academic developers much more productive. This is something I’m actually baffled by in our current library. I understand that technical debt is hard to overcome, and that some parts of the library might not exist had we started now rather than five years ago. But it’s so much easier to swap in ecosystem-standard packages than it is to continue maintaining code that few people understand. This is even more true once you recognize that our library, in many places, already makes effective use of duck typing. The barrier to using something like pygeoif or shapely as a computational geometry core is primarily mental, and converting the library to drop or wrap unnecessary code in cg, weights, and core would take less than a week of full-time work. It would also significantly lower the maintenance footprint of the library, which I think is a central benefit of the split-package suggestion.
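To make the duck-typing point concrete, here’s a minimal sketch of what swapping geometry engines can look like when code only relies on an informal interface. The names here (bbox_area, ToyTriangle) are mine for illustration and not part of PySAL, shapely, or pygeoif; the only assumption is that a geometry exposes a bounds tuple, which shapely geometries do.

```python
# A minimal sketch of the duck-typing argument: this helper only assumes its
# argument exposes a `bounds` tuple of (minx, miny, maxx, maxy), which shapely
# geometries provide. Any backing geometry engine with the same attribute
# (a hand-rolled stand-in is shown below) can be dropped in without changes.
from shapely.geometry import Polygon


def bbox_area(geom):
    """Area of the axis-aligned bounding box of any geometry with `.bounds`."""
    minx, miny, maxx, maxy = geom.bounds
    return (maxx - minx) * (maxy - miny)


class ToyTriangle:
    """A stand-in geometry that satisfies the same informal interface."""
    def __init__(self, vertices):
        xs, ys = zip(*vertices)
        self.bounds = (min(xs), min(ys), max(xs), max(ys))


print(bbox_area(Polygon([(0, 0), (2, 0), (2, 1)])))      # shapely-backed: 2.0
print(bbox_area(ToyTriangle([(0, 0), (2, 0), (2, 1)])))  # duck-typed: 2.0
```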

Clearer Academic Crediting

Plus, splitting the library into many, more loosely-coupled packages seems like a step towards the R-style ecosystem, which is exactly what the linked twitter thread suggests. But I think R actually has some comfy structural incentives for the drivers of its ecosystem to do what they do. Since an academic can write a barely-maintained package that does some unique statistical operation and get a Journal of Statistical Software article out of it, the academic-heavy R ecosystem is angled towards this kind of development. And, indeed, with a very small maintenance surface, these tiny packages get shipped, placed on a CV, and occasionally updated. The social incentives align to generate a particular technical structure, something I think Hadley overstates in that brief conversation as a product of object-oriented programming.

While OO isn’t a perfect abstraction, I’m kind of done with blaming OO for everything I don’t like, and the claim that OO encourages monolithic packages is, on its face, not a necessary conclusion. It comes down to defining efficient interfaces between classes and exposing a consistent, formal API. I don’t think it matters much whether that API is populated or driven using functions & immutable data or objects & bound methods; closures and objects are two sides of the same coin, really. Mostly, though, explaining the social & technical differences between R and Python package development through quick recourse to OO vs. FP (when I bet the majority of academic package developers don’t deeply understand either) is flippant at best. I really think more of it is the structure of academic rewards and the predominance of academics in the R ecosystem.

But that’s an aside. More generally, fragmenting the library would make it easier for new contributors to derive academic credit from their contributions.

Cleaner Dependency Logic

I think many of the library developers also feel limited by the strict adherence to a minimal set of dependencies, namely scipy and numpy. By splitting the package up into separate modules with potentially different dependency requirements, we would legitimize contributors who want to build new functionality on top of flashy new packages.

To be clear, I think the way we do this right now is somewhat frustrating. If a contribution uses only SciPy & Numpy and is sufficiently integrated into the rest of the library, it gets merged into “core” pysal. If it uses “extra” libraries but is still relevant to the project, we merge it into a catch-all module, contrib. That module contains some totally complete code from younger contributors, like the spint module for spatial interaction models or my handler module for formula-based spatial regression interfaces, as well as code from long-standing contributors, like the viz module. But it also contains incomplete remnants of prior projects, put in contrib to make sure they weren’t forgotten. And, to make matters worse, none of the code in contrib is exercised by our continuous integration framework. So even if an author writes test suites, they aren’t run routinely, meaning the compatibility clock is ticking every time code is committed to the module. Since contrib isn’t unit tested and its documentation & quality standards aren’t the same as those for core, it’s often easier to rewrite from scratch when something breaks. Fragmenting the package would thus “liberate” the pieces of contrib that meet core’s standards of quality but carry extra dependencies.

But why is this necessary?

Of course, we can provide much of what fragmentation offers, technologically, using soft dependencies. At the module level, it’s actually incredibly easy. I have also built tooling to do this at the class/function level, and it works great. So this particular idea about having multiple packages doesn’t solve what I think is fundamentally a social/human problem.
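For concreteness, here is a sketch of what the two flavors of soft dependency can look like. This is illustrative only, not the tooling mentioned above or anything in PySAL: the names HAS_PANDAS, requires, and to_frame are made up, and pandas just stands in for any optional package.

```python
import functools

# Module-level soft dependency: a guarded import, with a flag recording
# whether the optional package is available.
try:
    import pandas  # optional dependency; the module imports fine without it
    HAS_PANDAS = True
except ImportError:
    HAS_PANDAS = False


def requires(available, name):
    """Function-level soft dependency: fail with an informative error only
    when the decorated function is actually called without its dependency."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not available:
                raise ImportError("{} requires the optional package '{}'"
                                  .format(func.__name__, name))
            return func(*args, **kwargs)
        return wrapper
    return decorator


@requires(HAS_PANDAS, "pandas")
def to_frame(records):
    """Usable only when pandas is installed; importing this module never fails."""
    return pandas.DataFrame(records)
```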

The rules we’ve built around contribution do not actively support using the best tools for the job. Indeed, the social structure of two-tiered contribution, where the second tier has incredibly heterogeneous quality and intent and no support for coverage or continuous integration testing, inhibits code reuse and intensely magnifies not-invented-here syndrome. We can’t exploit great packages like cytools, we have largely avoided merging code that leverages improved computational runtimes (using numba & cython), and we haven’t really (until my GSOC) programmed around pandas as a valid interaction method for the library.
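As a sketch of what I mean by pandas as an interaction method: wrap an array-in/array-out routine so that users can pass a DataFrame and a column name and get a labeled Series back. The functions here (standardize, standardize_frame) are invented stand-ins for illustration, not PySAL’s API.

```python
import numpy as np
import pandas as pd


def standardize(values):
    """Plain numpy routine: z-scores of a 1-d array (a stand-in for any
    array-based computation in the library)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def standardize_frame(df, column):
    """pandas-facing wrapper: same computation, labeled input and output."""
    result = standardize(df[column].values)
    return pd.Series(result, index=df.index, name=column + "_z")


df = pd.DataFrame({"income": [10.0, 20.0, 30.0, 40.0]})
print(standardize_frame(df, "income"))
```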

Most of the barriers to this are, as I mentioned above, mental and social, not technical. Our code can be well-architected, even though we’ve implemented special structures to do things that are more commonly (and sometimes more efficiently) solved in other packages or using other techniques.

And there’s some freaking cool stuff going on involving PySAL. Namely, the thing that’s been animating me is its use in Carto’s Crankshaft, which integrates some PySAL tooling into a PL/Python plugin for Postgres. They’ll be exposing our API (or a subset of it) to users through this wrapper, and that feels super cool! So we’ve got good things going for our library. But I think continued progress needs to address these primarily social concerns, because, technologically, the code is more sound than one could expect from full-time academic authors.
