tldr: the social and technical architecture of object-oriented programming encourages us to centralize our focus onto a few core implementations, objects, of the main concepts we use. This creates coordination problems when we want to extend or modify them; we have to work together and agree on how an object in the commons is modified. Alternatively, dynamic dispatch, a programming strategy that makes the procedures we execute depend on the types of their input, solves these social and technical challenges with very few tradeoffs in general. I recommend using it, if you have not, through the functools.singledispatch decorator in Python’s standard library, or the multipledispatch library.
Technical decisions have social consequences
In OOP, the human/social architecture of power inhibits extensibility. In order for the object to “do” a new thing, you have to extend the object, and we’re generally left with a few options on how to do this.
First, we can contribute: provide a patch (or ask the maintainer) to extend the target object. This is hard. It immediately poses a coordination problem between you and the organization that “owns” the object’s library. It’s socially expensive, in that you’re negotiating directly with the “owners” about implementation details, documentation, whether the functionality is “in scope” for the library, and whether the owners feel it is appropriate for their maintenance burden. While contributors can try to enter the owners’ maintenance circuit, most “drive-by” contributions don’t result in long-term participation, as maintaining a library is very different from authoring precise, small, well-defined functionality.
So, this means that there is a bit of an institutional incentive to develop not-invented-here syndrome and extend it yourself. You patch the object in your own package, and keep around your own special, powered-up version of the class. Thinking locally, this makes a lot of sense. But, over time, it presents some pretty severe challenges to an ecosystem. The most significant one, I think, is that in mature ecosystems, a concept has no canonical software implementation as an object. For example, if I’m trying to write a new analytics library using graph algorithms, which of the numerous “graph” class implementations should I extend? networkx, scipy.sparse.csgraph, igraph, or pysal.weights?
The politics of extension
We can pick the most-used one, try to send everything into that canonical representation, and then extend or contribute to it; or we can just pick one of the main representations and assume translators exist elsewhere in the ecosystem. Both options work in the short term, but can incur tons of “translation” costs for shifting data from one representation into another. Also, since your library is now entirely dependent on its parent class implementation, the implementation details of that class may seriously affect the performance and extensibility of your own library. And this can leave you on the hook for maintaining your own Swiss army knives to convert objects to objects in the long term.1 The latter option is generally not a good idea, as it comes with all the costs of the first option but with a side of fragmentation and reduced interoperability.
The politics of dynamic dispatch
One excellent alternative is dynamic dispatch, which R and Julia do by default, and which is available in Python through the functools.singledispatch standard-library decorator or the multipledispatch library. In this case, you have a diverse set of methods that can reason concretely about the types of their input. They know exactly how to access their data efficiently, and there is no cost in fitting every shape of peg into the same round hole: you can deal directly with the data structures themselves. The way I think of it is that you extend functions to deal with objects of different types, rather than having to extend the classes to provide the data (or functions) that you need.
Critically, it removes the social blocker on extension! I don’t have to ask the owner of the object to implement my functionality, or agree upon implementation details of the functionality itself. There are no translation costs to send every object into a single canonical type, and you’re able to focus only on the data you need to compute your final result: no big collection of object-to-object converter functions. And, if newer libraries with better representations come out, you can extend the implementations without having to switch to a new canonical representation or jeopardize older implementations. It’s great!
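For a sense of the mechanics, here is a minimal sketch using functools.singledispatch. The summarize() function and the Graph and Network classes are hypothetical stand-ins; imagine they come from two different third-party libraries you do not control, while the registrations live in your own package.

```python
# A minimal sketch of functools.singledispatch: the implementation that runs
# is chosen by the class of the first argument. Graph, Network, and
# summarize() are hypothetical, not from any particular library.
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class Graph:       # stand-in for a third-party graph object
    edges: list


@dataclass
class Network:     # stand-in for an object from a different library
    nodes: list


@singledispatch
def summarize(obj):
    # fallback for types nobody has registered yet
    raise NotImplementedError(f"no summarize() implementation for {type(obj).__name__}")


@summarize.register
def _(obj: Graph):
    return f"graph with {len(obj.edges)} edges"


@summarize.register
def _(obj: Network):
    return f"network with {len(obj.nodes)} nodes"


print(summarize(Graph(edges=[(0, 1), (1, 2)])))   # graph with 2 edges
print(summarize(Network(nodes=[0, 1, 2])))        # network with 3 nodes
```

Supporting a new library’s objects is just another summarize.register block in your own code, with no patch upstream and no negotiation over the parent class.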
I was first exposed to the concept while learning Julia 0.5 (back in graduate school) and absolutely fell in love with how easy it was to develop new functionality within systems with dynamic dispatch. I used this in the new implementations in pointpats, the PySAL library for statistical analysis of geographic point patterns, and I absolutely love how it works. In pointpats, it allows us to deal with many different representations of input geometries, appropriately computing things like area or point-in-polygon queries using the performant algorithms that the objects themselves provide.
Thus, a function like new_statistic(graph, ...) that works with graph.edges can sit alongside a new_statistic(network, ...) that operates on network.nodes, and both might, at the end, send the final computation off to a third new_statistic() implementation that works on very simple, highly-structured data components, like new_statistic(nodes, edges, ...). This avoids having to translate explicitly into canonical representations, and lets you extend a function’s API endlessly to accommodate new input types or representations. Alternatively, if you need to make new_statistic() faster/distributed/JIT-compiled/etc., you can propagate this to all of the inputs that new_statistic() understands, or just speed up the highly-structured core components in new_statistic(nodes, edges, ...).
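Here is roughly what that layered arrangement can look like, sketched with the multipledispatch library; the Graph and Network classes and the statistic itself are hypothetical stand-ins, and only the dispatch structure is the point.

```python
# A sketch of the layered pattern with multipledispatch: object-specific
# entry points unpack what they know, then hand off to a core implementation
# that works on plain arrays. All names here are hypothetical.
from dataclasses import dataclass

import numpy as np
from multipledispatch import dispatch


@dataclass
class Graph:
    edges: list        # e.g. [(0, 1), (1, 2)]


@dataclass
class Network:
    nodes: list        # e.g. [0, 1, 2]


@dispatch(Graph)
def new_statistic(graph):
    # unpack the pieces this object can hand over cheaply...
    edges = np.asarray(graph.edges)
    nodes = np.unique(edges)
    # ...then send the final computation to the core implementation below
    return new_statistic(nodes, edges)


@dispatch(Network)
def new_statistic(network):
    nodes = np.asarray(network.nodes)
    edges = np.empty((0, 2), dtype=int)   # this toy object carries no edges
    return new_statistic(nodes, edges)


@dispatch(np.ndarray, np.ndarray)
def new_statistic(nodes, edges):
    # the core implementation on simple, highly-structured components;
    # a toy computation stands in for the real statistic
    return len(nodes) + len(edges)


print(new_statistic(Graph(edges=[(0, 1), (1, 2)])))   # Graph entry point -> core
print(new_statistic(Network(nodes=[0, 1, 2])))        # Network entry point -> core
```

Because the core only ever sees plain arrays, speeding it up (or swapping in a compiled version) immediately benefits every registered entry point.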
Method proliferation is a code smell
Some might say that you can get around using a pattern like this with something akin to the endless df.from_XXX methods on pandas.DataFrame: for any input data format, give me back a pandas.DataFrame. This makes a lot of sense for constructing objects from other objects, but it’s hard to recommend the from_XXX strategy when you want to write functions that just implement analytical capability. The from_XXX methods make sense especially when dealing with file input/output, because files generally carry far less metadata than objects. However, for making new_statistic() “know” about both graph and network objects, it seems a bit unnecessary for the user to have to specify new_statistic.from_graph or new_statistic.from_network when the relevant metadata about graph and network objects, such as their class, can be computed quickly in the interpreter.
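To make that concrete, here is a small sketch, again with hypothetical GraphObj and NetworkObj classes: rather than asking the caller to choose between from_graph- and from_network-style entry points, a dispatched function reads the argument’s class and routes on it automatically.

```python
# A small sketch contrasting from_XXX-style entry points with dispatch.
# GraphObj and NetworkObj are hypothetical; the point is that type(obj)
# already carries the routing information, so the caller never spells it out.
from functools import singledispatch


class GraphObj:
    edges = [(0, 1), (1, 2)]


class NetworkObj:
    nodes = [0, 1, 2]


@singledispatch
def new_statistic(obj):
    raise NotImplementedError(f"unsupported type: {type(obj).__name__}")


@new_statistic.register
def _(obj: GraphObj):
    return f"computed from {len(obj.edges)} edges"


@new_statistic.register
def _(obj: NetworkObj):
    return f"computed from {len(obj.nodes)} nodes"


# A from_XXX style would force the caller to pick the right name:
#   new_statistic.from_graph(g), new_statistic.from_network(n)
# With dispatch there is one spelling, and the class does the routing.
print(new_statistic(GraphObj()))            # computed from 2 edges
print(new_statistic(NetworkObj()))          # computed from 3 nodes
print(new_statistic.dispatch(NetworkObj))   # inspect which implementation would run
```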
Thus, I think that the next library I write from scratch will be implemented in this fashion, and I’ll continue to introduce it where it’s useful into libraries I contribute to. Fortunately, functools.singledispatch lives in the standard Python library, so there are no “extra” dependencies for code written in this fashion, and the dispatching itself is very fast.
1. Duck typing, while a neat idea and theoretically very useful, can be very challenging to pull off in a wider sense because names are different for different domains. One person’s graph.edges is another person’s network.nodes, and they may have very different performance properties to construct, translate, or access depending on the internal representations of the objects. ↩︎