GSOC Call Notes, June 6 2016

I’ve had to take a break from the spatial hierarchical linear modeling kick I’ve been on recently to get back to some GSOC work.

Today, on my weekly call, my mentors and I discussed a few things.

Testing & Merging of project code

Since many of the improvements I’m making to the library are module-by-module, I was advised to submit PRs when a logical unit of contribution is ready to ship. I’ve already been trying to use Test-Driven Development principles for my project (writing a test of what I want the API changes to look like, then writing to that specification), so this is relatively simple: a module is ready to submit when its spec tests pass. Now that much of the initial foray into the labelled array API is done, I can begin to connect the tests & submit PRs where possible.

Appropriate Targets for Labelled Array interfaces

We also discussed how to extend the labelled array interface to other submodules. For example, I have a good idea how a consistent labelled array interface could look for Map Classifiers in the exploratory spatial data analysis module. Other elements of that module should also be relatively straightforward to implement on labelled arrays, since all that’s generally needed is input interception: a dataframe plus a column name needs to be parsed correctly into numpy vectors. This is quite simple. The spatial regression module also seems like a relatively straightforward place to add a labelled array interface, using a similar strategy to what I’ve already been doing in pysal.weights. Defining a from_formula classmethod for spatial regression modules would allow for specification of regressions using patsy & pandas dataframes.
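As a rough sketch of what input interception could look like: the helper below, intercept_columns, is a hypothetical name and not part of PySAL. It assumes only that the input supports data[name] lookups, which holds for a pandas dataframe and, as used here to keep the example small, for a plain dict of lists. The column names and values are made up.

```python
import numpy as np


def intercept_columns(data, columns):
    """Pull named columns out of a table-like object as 1-D numpy vectors.

    `data` is anything that supports data[name], such as a pandas
    dataframe or a dict of lists. Hypothetical helper, not PySAL API.
    """
    if isinstance(columns, str):
        columns = [columns]
    return [np.asarray(data[name]).ravel() for name in columns]


# A dict of lists stands in for a dataframe; the values are invented.
table = {"HR90": [1.2, 3.4, 0.5], "FIPS": ["54029", "54009", "54069"]}

(y,) = intercept_columns(table, "HR90")
rates, fips = intercept_columns(table, ["HR90", "FIPS"])
```

A function that only understands numpy vectors could then be wrapped so that it accepts a dataframe and a column name, calls a helper like this first, and hands the resulting vectors to the existing code unchanged.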

But, in other parts of the library, like region or spatial_dynamics, it’s less clear what the labelled array interface should look like, so I’ll have to gain some perspective there.

Remaining Confusion in weights construction

I’m running into some minor confusion because I’m trying to make a call like from_dataframe(df, idVariable='FIPS') equivalent to from_shapefile(path, idVariable='FIPS'), and can’t figure out when PySAL considers things ordered vs. unordered.

For background, a spatial weights object in PySAL encodes the spatial information in a geographic dataset, allowing estimation routines for various spatial statistics or spatial models. In doing this, it relates each observation to every other observation, using information about the spatial relationship between observations. In our library, these are used all over the place.

But, in building a new, abstract interface to the weights constructors, I got quite confused. In particular, I was expecting to be able to write a pair of classmethods, say, Rook.from_shapefile() and Rook.from_dataframe(), that have similar signatures and generate similar results. Something like from_dataframe(df, idVariable='FIPS') being equivalent to from_shapefile(path, idVariable='FIPS'). Unfortunately, it’s somewhat confusing to figure out how to make this work correctly without making the API inconsistent. This is because PySAL handles ids in weights objects in different ways across its various weights construction functions and classes. I think, overall, we expose four different variables or flags at different points in the API that deal with how observations are indexed in a spatial weights object:

  1. ids - ostensibly, a list of the ids to use corresponding to the input data, considered in almost every weighting function.
  2. idVariable - a column name to get ids from when constructing weights from file, used in the existing from_shapefile functions to generate ids.
  3. id_order - a list of indices used to re-index the names contained in ids in an arbitrary order, impossible to set from the from_shapefile functions but used in the weights class’s init.
  4. id_order_set - a boolean property of the weights object denoting whether id_order has been explicitly set.

To me, this is rather confusing, despite some conversation trying to flesh this out.

First, all lists in python are ordered. So, when a user passes a list of ids in as ids, it’s confusing that the order of this list is silently ignored. Second, when we construct weights from shapefiles using an idVariable, the resulting weights object has some peculiar properties: the id_order is set to the file read order, but the id_order_set flag is always False. This is confusing for a few reasons. For one, shapefiles & dbf files are implicitly ordered, so a column in the dbf should correspond exactly to the order in which shapes are read, barring data corruption. So, if I use a column of the dataframe to index the shapefile, this should be considered ordered. For another, our docstring below seems to imply that either id_order_set is False and id_order defaults to lexicographic ordering, or id_order_set is True and id_order has special structure:

id_order : list
An ordered list of ids, defines the order of
observations when iterating over W if not set,
lexicographical ordering is used to iterate and the
id_order_set property will return False.  This can be
set after creation by setting the 'id_order' property.
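Read literally, that docstring describes something like the toy class below. This is only a sketch of the documented contract, not PySAL's actual implementation: ToyW and its neighbors dictionary are invented for illustration.

```python
class ToyW:
    """Toy stand-in for a weights object, modelling only the docstring."""

    def __init__(self, neighbors, id_order=None):
        self.neighbors = neighbors
        self._id_order_set = False
        if id_order is None:
            # Documented default: lexicographic order, flag stays False.
            self._id_order = sorted(neighbors)
        else:
            self.id_order = id_order  # goes through the setter below

    @property
    def id_order(self):
        return self._id_order

    @id_order.setter
    def id_order(self, order):
        if set(order) != set(self.neighbors):
            raise ValueError("id_order must contain exactly the known ids")
        self._id_order = list(order)
        self._id_order_set = True  # explicitly set, so the flag flips

    @property
    def id_order_set(self):
        return self._id_order_set

    def __iter__(self):
        # Iteration follows id_order, whatever it currently is.
        for i in self._id_order:
            yield i, self.neighbors[i]


neighbors = {"b": ["a"], "a": ["b", "c"], "c": ["a"]}
w_default = ToyW(neighbors)
w_explicit = ToyW(neighbors, id_order=["c", "a", "b"])
```

Under this reading, a False flag always goes together with lexicographic order, and any other order goes together with a True flag.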

But, one can easily generate an example where id_order is not lexicographically ordered and id_order_set is False:

import pysal as ps

Qref = ps.queen_from_shapefile(ps.examples.get_path('south.shp'), idVariable='FIPS')

Qref.id_order_set
False

Qref.id_order
[u'54029', u'54009', u'54069', u'54051', u'10003', ...]
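A quick check, using only the five ids visible in the output above, confirms that this order is not lexicographic. The rest of the list is elided in the output, so this says nothing about the remaining ids.

```python
# First five ids from the output above, in the order they were reported.
file_order = ['54029', '54009', '54069', '54051', '10003']

# What lexicographic ordering of those same ids would look like.
lex_order = sorted(file_order)

is_lexicographic = (file_order == lex_order)
print(lex_order)
print(is_lexicographic)
```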



This is important because, when we construct weights from dataframes, we need to make a decision about what gets picked as an index and how to treat that index. Right now, I’ve made the executive decision to choose consistency in behavior, so that from_dataframe(df, idVariable='POLYGON_ID') will consider that column to be ordered from the start. This means that the resulting weights will have the same iteration order as weights from from_shapefile(filepath, idVariable='POLYGON_ID'), but the dataframe call will set the id_order_set flag, while the shapefile classmethod does not.
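The difference between the two paths can be sketched with two toy functions. Both are hypothetical and only mimic the behavior described in this post; neither is the real PySAL constructor, and the records are made up.

```python
def ids_from_dataframe(table, idVariable):
    """Hypothetical: take ids from a column, in row order, and mark them ordered."""
    id_order = list(table[idVariable])
    return id_order, True  # (id_order, id_order_set)


def ids_from_shapefile(records, idVariable):
    """Hypothetical: take ids in file read order, but leave the flag unset."""
    id_order = [record[idVariable] for record in records]
    return id_order, False  # (id_order, id_order_set)


# Made-up records standing in for the rows of a dbf file.
records = [{"POLYGON_ID": 3}, {"POLYGON_ID": 1}, {"POLYGON_ID": 2}]
# The same data laid out column-wise, as a dataframe would hold it.
table = {"POLYGON_ID": [3, 1, 2]}

df_order, df_flag = ids_from_dataframe(table, "POLYGON_ID")
shp_order, shp_flag = ids_from_shapefile(records, "POLYGON_ID")
```

Both calls give the same iteration order; only the flag differs, which is the one place where the two constructors would remain distinguishable.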

imported from: yetanothergeographer