RFC for GSOC Feedback

Clarifying a Core Data Model for PySAL

Gist here

From my perspective, I see my proposal for GSOC as containing two oblique foci. My hope in drafting this notebook for my fellow authors is to make this clear.

I think at this point, many people who use & develop PySAL would like to see better tooling for Pandas$\leftrightarrow$PySAL interaction. This was started by Luc, Dani, & myself in the pysal.pdio contrib module. At its most basic, pdio reads/writes dbfs using PySAL IO. In addition, all shapes in a shapefile are read in and stuffed into a column of the dataframe. Unlike Geopandas, this occurs on a standard Pandas dataframe, rather than constructing a special subclass.

In addition, interfaces between shapely and PySAL exist via the shapely_ext contrib module. The shapely extension provides an interface between pysal and shapely using geointerface at the individual polygon level.

My GSOC proposal’s main point was that I’d like to see PySAL gain a labelled array interface, and that a specification for this interface should make it easy to do GIS, regardless of source. How we want this done remains an open question. I think there are really two aims in my proposal that I’d like to come to a consensus on about proirities:

The directions are what I’d like to call:

  • NOGR: A pure python/cython GIS
  • Pandas as Idiom: A Tabular data idiom for PySAL

The NOGR concern perennially hits a wicket for PySALers. I’m sure those reading may have engaged in this no-win discussion before. The cycle is something like this:

  1. PySAL targets minimal dependencies from the scientific python stack.
  2. Spatial Analysis sometimes has tabular operations & GIS as a prerequisite.
  3. GeoPandas+Shapely/Pandas are the standard for GIS/tabular in Python.

The NOGR project is really providing a method of solving 2 within the constraints of 1.

The Pandas as Idiom direction leverages 3 within the constraints of 1 to provide a consistent labelled array interface.

These largely oblique to each other:

  • Pure Python GIS would not, by itself, define a tabular data model for PySAL

  • A tabular data model would make it much easier to interface with current GIS tools in python, but still means that users who need GIS would need GDAL.

Thus, I’d like the mentors’ & broader PySAL community’s perspectives on how whether & how urgently a NOGR GIS package is needed before commencing my GSOC work. This feedback will inform and direct what parts of the proposal get emphasized.

What we already have:

Just to be clear, though, I’d like to show exactly how useful the shapely extension is. I don’t really see it mentioned a lot, so I’d like to show what’s already possible using just it and the pdio-style PySAL + Pandas Dataframes:

In [2]:
import pysal as ps
from pysal.contrib import shapely_ext as psh
import shapely.geometry as sh
import shapely.ops as GIS
import numpy as np

In [3]:
data = ps.pdio.read_files(ps.examples.get_path(‘columbus.shp’))

Basic GIS:

We have wrappers for almost all of the GIS operations in Shapely. They all take/recieve PySAL shapes, but use shapely for the actual operations:

In [36]:
(set([x for x in dir(GIS) if not x.startswith(‘_’)]).difference(dir(psh)))


We can also use most shapely methods as standalone functions:

In [54]:
null = geom.Polygon()

In [89]:
[x for x in dir(null) if (not x.startswith(‘_’) and x in dir(psh))]


In [73]:
full_CO = psh.cascaded_union(data.geometry.values.tolist())

Since Shapely Polygons have a visual repr, I’ll be casting to shapely to quickly show the shapes. But, they’re still python shapes:

In [74]:

<pysal.cg.shapes.Polygon at 0x7f4a34285f28>

In [75]:


In [76]:
print(psh.contains(full_CO, geom.Point(0,0)))
print(psh.contains(full_CO, data.geometry[0]))


In [77]:
west_CO = (data[data.geometry.apply(lambda x: x.centroid[0] < 9)])

In [78]:
diff = psh.difference(full_CO, psh.unary_union(west_CO.geometry.tolist()))


In [79]:

<function pysal.contrib.shapely_ext.buffer>

In [80]:
buffer_query = psh.buffer(data.geometry[1], .6)
data[data.geometry.apply(lambda x: psh.intersects(buffer_query, x))]

0 0.309441 2.440629 2 5 1 5 80.467003 19.531 15.725980 2.850747 5.03 38.799999 44.070000 1.0 1.0 1.0 0.0 1000.0 1005.0 <pysal.cg.shapes.Polygon object at 0x7f4a3536b…
1 0.259329 2.236939 3 1 2 1 44.567001 21.232 18.801754 5.296720 4.27 35.619999 42.380001 1.0 1.0 0.0 0.0 1000.0 1001.0 <pysal.cg.shapes.Polygon object at 0x7f4a3536b…
2 0.192468 2.187547 4 6 3 6 26.350000 15.956 30.626781 4.534649 3.89 39.820000 41.180000 1.0 1.0 1.0 0.0 1000.0 1006.0 <pysal.cg.shapes.Polygon object at 0x7f4a3536b…
3 0.083841 1.427635 5 2 4 2 33.200001 4.477 32.387760 0.394427 3.70 36.500000 40.520000 1.0 1.0 0.0 0.0 1000.0 1002.0 <pysal.cg.shapes.Polygon object at 0x7f4a3536b…
4 0.488888 2.997133 6 7 5 7 23.225000 11.252 50.731510 0.405664 2.83 40.009998 38.000000 1.0 1.0 1.0 0.0 1000.0 1007.0 <pysal.cg.shapes.Polygon object at 0x7f4a3536b…
6 0.257084 2.554577 8 4 7 4 75.000000 8.438 0.178269 0.000000 2.74 33.360001 38.410000 1.0 1.0 0.0 0.0 1000.0 1004.0 <pysal.cg.shapes.Polygon object at 0x7f4a3536b…
7 0.204954 2.139524 9 3 8 3 37.125000 11.337 38.425858 3.483478 2.89 36.709999 38.709999 1.0 1.0 0.0 0.0 1000.0 1003.0 <pysal.cg.shapes.Polygon object at 0x7f4a3530c…

7 rows × 21 columns

Note that none of this raw shapely/pandas stuff uses any spatial indexing, which is something Geopandas uses.

In [81]:
US = ps.pdio.read_files(ps.examples.get_path(‘NAT.shp’))

In [82]:
states = US.groupby(‘STATE_NAME’).geometry.apply(psh.unary_union)

In [83]:

Alabama       <pysal.cg.shapes.Polygon object at 0x7f4a5d9e2...
Arizona       <pysal.cg.shapes.Polygon object at 0x7f4a351d1...
Arkansas      <pysal.cg.shapes.Polygon object at 0x7f4a3519b...
California    <pysal.cg.shapes.Polygon object at 0x7f4a351d1...
Colorado      <pysal.cg.shapes.Polygon object at 0x7f4a351ea...
Name: geometry, dtype: object

In [84]:
states.apply(lambda x: GIS.validate(sh.asShape(x)))

Alabama                 Valid Geometry
Arizona                 Valid Geometry
Arkansas                Valid Geometry
California              Valid Geometry
Colorado                Valid Geometry
Connecticut             Valid Geometry
Delaware                Valid Geometry
District of Columbia    Valid Geometry
Florida                 Valid Geometry
Georgia                 Valid Geometry
Idaho                   Valid Geometry
Illinois                Valid Geometry
Indiana                 Valid Geometry
Iowa                    Valid Geometry
Kansas                  Valid Geometry
Kentucky                Valid Geometry
Louisiana               Valid Geometry
Maine                   Valid Geometry
Maryland                Valid Geometry
Massachusetts           Valid Geometry
Michigan                Valid Geometry
Minnesota               Valid Geometry
Mississippi             Valid Geometry
Missouri                Valid Geometry
Montana                 Valid Geometry
Nebraska                Valid Geometry
Nevada                  Valid Geometry
New Hampshire           Valid Geometry
New Jersey              Valid Geometry
New Mexico              Valid Geometry
New York                Valid Geometry
North Carolina          Valid Geometry
North Dakota            Valid Geometry
Ohio                    Valid Geometry
Oklahoma                Valid Geometry
Oregon                  Valid Geometry
Pennsylvania            Valid Geometry
Rhode Island            Valid Geometry
South Carolina          Valid Geometry
South Dakota            Valid Geometry
Tennessee               Valid Geometry
Texas                   Valid Geometry
Utah                    Valid Geometry
Vermont                 Valid Geometry
Virginia                Valid Geometry
Washington              Valid Geometry
West Virginia           Valid Geometry
Wisconsin               Valid Geometry
Wyoming                 Valid Geometry
Name: geometry, dtype: object


In [85]:
from functools import partial

This is a very naive way to construct spatial weights, since it doesn’t use any prefiltering or spatial query and doesn’t recognize symmetry in touches. So, don’t use this on something big, like NAT.shp, because it will take forever.

But, for the purposes of illustration, Queen Weights can be constructed using chained method calls on tabular data. This pattern is how most of the Queen Weights functions in PostGIS I’ve seen work, using array queries.

Since, in PostGIS, the ST_Touches naturally uses a GIST, it does a similar kind of binning reduction we use in PySAL, and could use if we wrapped arbitrary geometry columns in a pandas dataframe. The following does not, but still yields Queen weights:

In [86]:
Wt = (data
.geometry #for all geometries
.apply(lambda x:
.geometry #go back through geometry
.apply(partial(psh.touches, x))) # check if any touch current x, partial to ensure closure
.values.astype(int)) #cast bools to ints
Wps = ps.queen_from_shapefile(ps.examples.get_path(‘columbus.shp’)).full()[0]

In [87]:
np.testing.assert_allclose(Wt, Wps)


So, in my vision, a NOGR focus would be to make the above code run without the lines:

import shapely.geom as geom
import shapely.ops as GIS
from pysal.contrib import shapely_ext as psh

In short, this means implementing a GIS module using only python/cython that can do what shapely.ops does on shapely.geometry objects. I would probably target replicating most of the functions provided in sections 8.5, 8.6, and 8.9 of the PostGIS documentation, ensuring they work when supplied with a dataframe with a column of known spatial type (i.e. containing shapely polygons (geopandas), pysal shapes (pdio), or object.geointerface for all objects).

Ultimately, returning to the cycle of discussion mentioned above, this would reflect PySAL prioritizing point 2 to achieve point 1:

Need: (1) PySAL needs its own tabular/GIS Module with minimal dependencies

Reason: (2) Spatial Analysis often requires tabular/GIS operations

This is where the proposal that suggests novel python/cython implementations of required algorithms from De Berg’s Computational Geometry or Xiao’s GIS Algorithms be the focus of the project. But, implementing a fast NOGR GIS library would be the core here. If there were time at the end, the datamodel could be extended to use the GIS operations provided here.

A Tabular Data Model

The Pandas as Idiom direction is centrally about PySAL’s interface with Pandas dataframes with geometric columns (or Geopandas Geodataframes) as if they were another native datamodel for PySAL. That is, if I prioritize in this direction, I would be attempting to make the boundary between PySAL, Pandas, & GeoPandas/Shapely a little smoother, without introducing new hard dependencies. This focuses on leveraging against packages in 3 within the constraints of 1:

Need: (1) PySAL needs to abstract its datastructures from its data representations/IO framework

Reason: (3) It is unclear how to leverage existing Tabular & GIS tools alongside of PySAL

It may end up looking like an answer to Issue #474: Add contrib module to explore interface with geopandas. But, instead of an exploration it would be an instantiation, would support both the PySAL geometry-based pdio & the Shapely-geometry Geopandas.

Instead of focusing on pure python/cython implementations of core GIS operations, this would focus on making PySAL’s interfaces become more polymorphic. It would reduce the gap between PySAL & Geopandas/Shapely, while keeping a bright line between what uses GDAL and what does not. I imagine this happening through extensive use of optional import strategies.

In it, I see three components:

  1. An adapter module in contrib, geotable, that does a few things:

    • manages transitions between Geopandas dataframes and pdio dataframes
    • minimizes the distance between GDAL-dependent libraries and PySAL saftely & clearly
    • extends shapely_ext to perform common tabular operations
  2. A contrib module, pda. This would:

    • subsume current pdio code
    • extend the pdio strategy to use PySAL’s full IO suite
    • If datatype isn’t supported, loudly call into geotable’s GeoPandas interface & fail safe if unavailable
    • contain the necessary bridge code for PySAL to use pandas/patsy for any purpose
  3. In core:

    • expose consistent labelled array interfaces across core modules
    • enforce consistent unlabelled array rules across core modules (i.e. resolve shaping & array/list inconsistencies)
    • finish & merge polymorphic weights constructors for arbitrary shape iterables

In my view

I think

  • the tabular data model is more important
  • GIS is best left to Shapely & GeoPandas
  • If NOGR is necessary, it will be easier to use it as a drop-in replacement for the shapely hooks in geotable than it would to build it first.

imported from: yetanothergeographer