GSOC Call Notes, June 6 2016
Today I had my weekly call with my mentors, and we discussed a few things.
Testing & Merging of project code
Since many of the improvements I’m making to the library are module-by-module, I was advised to submit PRs when a logical unit of contribution is ready to ship. I’ve already been trying to use Test-Driven Development principles for my project: I write a test of what I want the API changes to look like, then write code to that specification. That makes this relatively simple: a module is ready to submit when its spec tests pass. So, now that much of the initial foray into the labelled array API is done, I can begin to connect the tests & submit PRs where possible.
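As a rough sketch of that workflow, the spec test below is written first and pins down the call shape that is wanted; the module counts as ready once the spec passes. The names here (`from_table`, `TestFromTableSpec`, `run_spec`) and the toy data are hypothetical stand-ins for illustration, not PySAL's API, and a plain dict of lists stands in for a dataframe.

```python
import unittest


def from_table(table, id_column):
    """Hypothetical constructor, written only after the spec test below.

    `table` is a dict of column name -> list of values, standing in for a
    dataframe. Returns the ids in the order they appear in the column.
    """
    return list(table[id_column])


class TestFromTableSpec(unittest.TestCase):
    """Spec test: describes the wanted API before it is implemented."""

    def setUp(self):
        # Made-up toy data.
        self.table = {"FIPS": ["01003", "01001", "01005"],
                      "value": [1.0, 2.0, 3.0]}

    def test_ids_come_from_named_column(self):
        ids = from_table(self.table, id_column="FIPS")
        self.assertEqual(ids, ["01003", "01001", "01005"])

    def test_unknown_column_is_an_error(self):
        with self.assertRaises(KeyError):
            from_table(self.table, id_column="missing")


def run_spec():
    """Run the spec and report whether the unit is ready to submit."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFromTableSpec)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Calling `run_spec()` returns `True` only when every spec test passes, which is the "ready to ship" signal described above.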
Appropriate Targets for Labelled Array interfaces
We also discussed how to extend the labelled array interface to other
submodules. For example, I have a good idea how a consistent labelled array
interface could look for Map Classifiers in the exploratory spatial data
analysis module. Other elements of that module should also be relatively
straightforward to implement on labelled arrays, since all that’s generally
needed is input interception: a dataframe and column name need to be correctly
parsed into numpy vectors. This is quite simple. The spatial regression module
also seems like a relatively straightforward place to add a labelled array
interface, using a similar strategy to what I’ve already been doing in
pysal.weights. Defining a from_formula classmethod for spatial regression
modules would allow for specification of regressions using patsy & pandas.
But, in other parts of the library, like
less clear as to what the labelled array interface should look like, so
I’ll have to gain some perspective there.
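The input interception idea can be sketched as a decorator that accepts either a raw vector or a table plus a column name, and resolves the latter before the wrapped function runs. This is a minimal sketch under stated assumptions: `accepts_labelled` and `value_range` are hypothetical names, a dict of lists stands in for a pandas dataframe, and plain lists stand in for numpy vectors so the example has no dependencies.

```python
import functools


def accepts_labelled(func):
    """Let `func(y, ...)` also be called as `func(table, 'column', ...)`.

    Assumption: `table` has `.keys()` and supports `table[name]` lookups,
    as a pandas DataFrame or a dict of lists does. Hypothetical helper.
    """
    @functools.wraps(func)
    def wrapper(first, *args, **kwargs):
        # Intercept the (table, column name) calling convention.
        if args and isinstance(args[0], str) and hasattr(first, "keys"):
            column, rest = args[0], args[1:]
            values = list(first[column])  # KeyError if the column is missing
            return func(values, *rest, **kwargs)
        # Otherwise pass the raw vector straight through.
        return func(first, *args, **kwargs)
    return wrapper


@accepts_labelled
def value_range(y, pad=0.0):
    """Toy stand-in for a statistic that only understands plain vectors."""
    return (min(y) - pad, max(y) + pad)


# Made-up toy data.
table = {"income": [3.0, 1.0, 2.0], "FIPS": ["01003", "01001", "01005"]}

via_vector = value_range([3.0, 1.0, 2.0])   # original calling convention
via_table = value_range(table, "income")    # labelled calling convention
```

Both calls reach the same underlying function with the same vector, so existing vector-based code keeps working unchanged.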
Remaining Confusion in weights construction
I’m running into some minor confusion because I’m trying to make a call like
from_dataframe(df, idVariable='FIPS') equivalent to
from_shapefile(path, idVariable='FIPS'), and can’t figure out when PySAL considers things ordered.
For background, a spatial weights object in PySAL encodes the spatial information in a geographic dataset, allowing estimation routines for various spatial statistics or spatial models. In doing this, it relates each observation to every other observation, using information about the spatial relationship between observations. In our library, these are used all over the place.
But, in building a new, abstract interface to the weights constructors, I got
quite confused. Particularly, I was expecting to be able to write a pair of
constructors that have similar signatures and generate similar results. Something like
from_dataframe(df, idVariable='FIPS') being equivalent to
from_shapefile(path, idVariable='FIPS'). Unfortunately, it’s somewhat
confusing to figure out how to make this work correctly, without making the API
inconsistent. This is because PySAL handles ids in weights objects across its
various weights construction functions and classes in different ways. I think,
overall, we expose four different variables or flags at different points in the
API that deal with how observations are indexed in a spatial weights object:
ids - ostensibly, a list of the ids to use corresponding to the input data, considered in almost every weighting function.
idVariable - a column name to get ids from when constructing weights from file, used in existing from_shapefile functions to generate
id_order - a list of indices used to re-index the names contained in ids in an arbitrary order, impossible to set from from_shapefile functions but used in the weights class’s
id_order_set - a boolean property of the weights object denoting whether id_order has been explicitly set.
To me, this is rather confusing, despite some conversation trying to flesh this out.
First, all lists in Python are ordered. So, when a user passes a list of ids in
ids, it’s confusing that the order of this list is silently ignored. Second,
when we construct weights from shapefiles using an idVariable, the resulting
weights object has some peculiar properties: the id_order is set to the file
read order, but the id_order_set flag is always False. This is confusing for
a few reasons. First, shapefiles & dbf files are implicitly ordered, so a column
in the dbf should correspond exactly to the order in which shapes are read,
barring data corruption. So, if I use a column of the dataframe to index the
shapefile, this should be considered ordered. Second, our docstring below seems
to imply that either
id_order defaults to
lexicographic ordering, or
id_order has special
id_order : list
    An ordered list of ids, defines the order of observations when
    iterating over W if not set, lexicographical ordering is used to
    iterate and the id_order_set property will return False. This can
    be set after creation by setting the 'id_order' property.
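Read literally, the docstring describes behavior like the toy model below. It is a sketch of the documented semantics only, not PySAL's implementation: `ToyW` is a hypothetical stand-in whose attribute names simply mirror those in the docstring, and the neighbor data is made up.

```python
class ToyW(object):
    """Toy stand-in for a weights object; NOT the PySAL W class."""

    def __init__(self, neighbors, id_order=None):
        self.neighbors = neighbors      # dict: id -> list of neighbor ids
        self._id_order = None
        self.id_order_set = False
        if id_order is not None:
            self.id_order = id_order    # goes through the setter below

    @property
    def id_order(self):
        # Documented fallback: lexicographic order when nothing was set.
        if self._id_order is None:
            return sorted(self.neighbors)
        return self._id_order

    @id_order.setter
    def id_order(self, order):
        if set(order) != set(self.neighbors):
            raise ValueError("id_order must contain exactly the ids in W")
        self._id_order = list(order)
        self.id_order_set = True        # explicit order flips the flag

    def __iter__(self):
        # Iteration follows id_order, as the docstring describes.
        for i in self.id_order:
            yield i, self.neighbors[i]


neighbors = {"b": ["a"], "a": ["b", "c"], "c": ["a"]}

w_default = ToyW(neighbors)                            # nothing set
w_ordered = ToyW(neighbors, id_order=["c", "a", "b"])  # explicit order
```

Under this reading there are only two states: an explicit order with the flag set, or lexicographic order with the flag unset.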
But, one can easily generate an example where
id_order is not lex ordered and
import pysal as ps
# Queen contiguity weights for the bundled 'south' example, indexed by the FIPS column
Qref = ps.queen_from_shapefile(ps.examples.get_path('south.shp'), idVariable='FIPS')
This is important because, when we construct weights from Dataframes, we need to
make a decision about what gets picked as an index and how to treat that index.
Right now, I’ve made the executive decision to choose consistency in behavior,
from_dataframe(df, idVariable='POLYGON_ID') will consider that column
to be ordered from the start. This means that the resulting weights will have
the same iteration order as weights built with
from_shapefile(path, idVariable='POLYGON_ID'), but the dataframe call will set the
id_order_set flag, while the shapefile classmethod does not.
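The difference in behavior can be sketched with two toy constructors. Everything here is hypothetical: `Weights` is a bare namedtuple, `toy_from_shapefile` takes a list of ids in file read order instead of a path, and a dict of lists stands in for the dataframe. The sketch only shows the flag handling described above, not how PySAL builds weights.

```python
from collections import namedtuple

# Minimal record of the two pieces of state under discussion.
Weights = namedtuple("Weights", ["id_order", "id_order_set"])


def toy_from_shapefile(ids_in_file_order):
    """Mimic the existing behavior: read order is kept, flag stays False."""
    return Weights(id_order=list(ids_in_file_order), id_order_set=False)


def toy_from_dataframe(df, idVariable):
    """Mimic the new choice: the column is treated as ordered, flag is True."""
    return Weights(id_order=list(df[idVariable]), id_order_set=True)


polygon_ids = ["p3", "p1", "p2"]        # made-up ids, in file read order
df = {"POLYGON_ID": polygon_ids}        # dict standing in for a dataframe

w_shp = toy_from_shapefile(polygon_ids)
w_df = toy_from_dataframe(df, idVariable="POLYGON_ID")
```

The two results iterate in the same order and differ only in `id_order_set`, so code that branches on the flag is the only place the difference would surface.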
imported from: yetanothergeographer