Bringing Classifiers Alive in PySAL

I’ve talked a lot to fellow developers about making PySAL objects more than containers for the results of a statistical procedure.

One way I think we can do this is to focus on methods like predict, find, update, or reclassify.

So, here, I’ll show the way I’ve implemented a simple API to update map classifiers by defining their __call__ method.

In [2]:
import pysal as ps

The patch I applied to mapclassify should be in this GitHub branch. To get it, you’ll need to git fetch my repository and check out the reclassify branch. Alternatively, since what I added to Map_Classifier is so small, it’s easy to show here.

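First, I added a __call__ method:

def __call__(self, *args, **kwargs):
    """
    This will allow the classifier to be called like a
    function after instantiation
    """
    # new data is the optional first positional argument;
    # inplace is an optional keyword that defaults to False
    new_data = args[0] if args else None
    inplace = kwargs.pop('inplace', False)
    if inplace:
        self._update(new_data, **kwargs)
    else:
        new = copy.deepcopy(self)  # needs "import copy" at module level
        new._update(new_data, **kwargs)
        return new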

This will allow us to do something like:

classifier = ps.Quantiles(data)
classifier(k=4)
classifier(k=9)
classifier(new_data, inplace=True)

and proceed to interact with the classifier object over and over again. Since there’s an inplace toggle (False by default), users can choose when to mutate or when to copy.

In theory, the __call__ method can support all of the different __init__ signatures possible. I’ve defined it this way because most of the mapclassify classifiers I can think of take a mandatory data argument and optional keyword arguments. The only one that varies from this is User_Defined, whose method I overrode to handle it correctly.
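To make the mechanics concrete outside of PySAL, here is a minimal, self-contained sketch of the same call-to-reclassify pattern. ToyQuantiles, its crude quantile rule, and the explicit new_data and inplace parameters are all assumptions made for illustration; none of it is the actual Map_Classifier code.

```python
import copy
import math


class ToyQuantiles:
    """Hypothetical stand-in for a map classifier; not PySAL's implementation."""

    def __init__(self, y, k=5):
        self.y = list(y)
        self.k = k
        ordered = sorted(self.y)
        n = len(ordered)
        # Upper bound of each class: the last value in each k-th slice.
        self.bins = [ordered[math.ceil(n * (j + 1) / k) - 1] for j in range(k)]

    def _update(self, new_data=None, **kwargs):
        # Naive updater: pool old and new observations, then re-run __init__.
        # Any option not passed again (such as k) falls back to its default.
        data = self.y if new_data is None else list(new_data) + self.y
        self.__init__(data, **kwargs)

    def __call__(self, new_data=None, inplace=False, **kwargs):
        if inplace:
            self._update(new_data, **kwargs)
            return None
        new = copy.deepcopy(self)
        new._update(new_data, **kwargs)
        return new


clf = ToyQuantiles(range(1, 11))    # k defaults to 5
halves = clf(k=2)                   # returns a reclassified copy
clf([11, 12], inplace=True, k=3)    # mutates clf itself
```

Here halves is an independent two-class copy, while clf itself only changes on the inplace=True call.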

The main point here is that this enables users to quickly reclassify and view new classifications using the object they created! Thus, a common use case might be something like this:

In [4]:
df = ps.pdio.read_files(ps.examples.get_path('south.dbf'))

In [5]:
df.head()

Out[5]:
FIPSNO NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS STFIPS COFIPS SOUTH HR60 BLK90 GI59 GI69 GI79 GI89 FH60 FH70 FH80 FH90 geometry
0 54029 Hancock West Virginia 54 029 54029 54 29 1 1.682864 2.557262 0.223645 0.295377 0.332251 0.363934 9.981297 7.8 9.785797 12.604552 <pysal.cg.shapes.Polygon object at 0x7fc5495eb…
1 54009 Brooke West Virginia 54 009 54009 54 9 1 4.607233 0.748370 0.220407 0.318453 0.314165 0.350569 10.929337 8.0 10.214990 11.242293 <pysal.cg.shapes.Polygon object at 0x7fc5495eb…
2 54069 Ohio West Virginia 54 069 54069 54 69 1 0.974132 3.310334 0.272398 0.358454 0.376963 0.390534 15.621643 12.9 14.716681 17.574021 <pysal.cg.shapes.Polygon object at 0x7fc5495eb…
3 54051 Marshall West Virginia 54 051 54051 54 51 1 0.876248 0.546097 0.227647 0.319580 0.320953 0.377346 11.962834 8.8 8.803253 13.564159 <pysal.cg.shapes.Polygon object at 0x7fc549565…
4 10003 New Castle Delaware 10 003 10003 10 3 1 4.228385 16.480294 0.256106 0.329678 0.365830 0.332703 12.035714 10.7 15.169480 16.380903 <pysal.cg.shapes.Polygon object at 0x7fc549565…

5 rows × 70 columns

In [7]:
data = df['HR60'].values

In [8]:
classifier = ps.Quantiles(data)

In [9]:
classifier

Out[9]:
                Quantiles

Lower            Upper              Count
=========================================
         x[i] <=  2.497               283
 2.497 < x[i] <=  5.104               282
 5.104 < x[i] <=  7.621               282
 7.621 < x[i] <= 10.981               282
10.981 < x[i] <= 92.937               283

Once estimated, the user can reclassify based on the same API as the constructor:

In [10]:
classifier(k=3)

Out[10]:
                Quantiles

Lower            Upper              Count
=========================================
         x[i] <=  4.265               471
 4.265 < x[i] <=  8.679               470
 8.679 < x[i] <= 92.937               471

In [11]:
classifier(k=9)

Out[11]:
                Quantiles

Lower            Upper              Count
=========================================
         x[i] <=  0.000               180
 0.000 < x[i] <=  2.836               134
 2.836 < x[i] <=  4.265               157
 4.265 < x[i] <=  5.628               157
 5.628 < x[i] <=  7.137               156
 7.137 < x[i] <=  8.679               157
 8.679 < x[i] <= 10.600               157
10.600 < x[i] <= 13.924               157
13.924 < x[i] <= 92.937               157

It doesn’t mutate the object unless inplace is provided and is true:

In [13]:
classifier

Out[13]:
                Quantiles

Lower            Upper              Count
=========================================
         x[i] <=  2.497               283
 2.497 < x[i] <=  5.104               282
 5.104 < x[i] <=  7.621               282
 7.621 < x[i] <= 10.981               282
10.981 < x[i] <= 92.937               283

In [14]:
classifier(k=6, inplace=True)

In [15]:
classifier

Out[15]:
                Quantiles

Lower            Upper              Count
=========================================
         x[i] <=  1.993               236
 1.993 < x[i] <=  4.265               235
 4.265 < x[i] <=  6.245               235
 6.245 < x[i] <=  8.679               235
 8.679 < x[i] <= 11.850               235
11.850 < x[i] <= 92.937               236

This also enables users to add new data to the classifier.

Now, I bet there are better updating equations for the different classifiers than re-estimating the entire classifier, just as there are for running-median problems. I anticipate extending this work with more sophisticated updaters than simply reclassifying the entire set. This is why I split the __call__ method from what really does the updating:

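def _update(self, data, *args, **kwargs):
    # assumes numpy is imported as np at module level
    if data is not None:
        # pool the new observations with the data already classified
        data = np.append(data.flatten(), self.y)
    else:
        data = self.y
    self.__init__(data, *args, **kwargs)  # this is the most naive updater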

As the comment notes, this is the most universally acceptable updater, hence its definition in the Map_Classifier base class. Fortunately, this means that any new classifier defined as a subclass of it gets a very naive in-place reclassification method for free.
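As an illustration of what a smarter updater could look like, the running median mentioned above can be maintained with two heaps, so each new observation costs O(log n) instead of a full re-sort. This is a generic sketch and not something in PySAL; the RunningMedian class and its add method are hypothetical names.

```python
import heapq


class RunningMedian:
    """Hypothetical incremental updater: median without re-sorting all data."""

    def __init__(self):
        self._low = []   # max-heap (values stored negated) for the smaller half
        self._high = []  # min-heap for the larger half

    def add(self, value):
        # Push onto the lower half, then move its largest element up so that
        # every value in _low is <= every value in _high.
        heapq.heappush(self._low, -value)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Rebalance so _low never holds fewer items than _high.
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    @property
    def median(self):
        # Only meaningful once at least one value has been added.
        if len(self._low) > len(self._high):
            return -self._low[0]
        return (-self._low[0] + self._high[0]) / 2


rm = RunningMedian()
for value in [5.0, 1.0, 9.0, 3.0]:
    rm.add(value)
```

A classifier with an update rule of this kind could override _update with it; every other classifier falls back to the naive re-initialization inherited from the base class.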

Thus, you can do stuff like:

In [17]:
new_data = df['HR90'].values

In [19]:
classifier(new_data)

Out[19]:
                Quantiles

Lower            Upper              Count
=========================================
         x[i] <=  3.228               565
 3.228 < x[i] <=  5.912               565
 5.912 < x[i] <=  8.710               564
 8.710 < x[i] <= 12.735               565
12.735 < x[i] <= 92.937               565

In [20]:
classifier(new_data, k=14)

Out[20]:
                Quantiles

Lower            Upper              Count
=========================================
         x[i] <=  0.000               296
 0.000 < x[i] <=  2.200               108
 2.200 < x[i] <=  3.469               201
 3.469 < x[i] <=  4.483               202
 4.483 < x[i] <=  5.394               202
 5.394 < x[i] <=  6.282               201
 6.282 < x[i] <=  7.297               202
 7.297 < x[i] <=  8.266               202
 8.266 < x[i] <=  9.348               201
 9.348 < x[i] <= 10.628               202
10.628 < x[i] <= 12.217               202
12.217 < x[i] <= 14.603               201
14.603 < x[i] <= 18.544               202
18.544 < x[i] <= 92.937               202

In [21]:
classifier(new_data, k=6, inplace=True)

In [22]:
classifier

Out[22]:
                Quantiles

Lower            Upper              Count
=========================================
         x[i] <=  2.691               471
 2.691 < x[i] <=  5.069               471
 5.069 < x[i] <=  7.297               470
 7.297 < x[i] <=  9.736               471
 9.736 < x[i] <= 13.736               470
13.736 < x[i] <= 92.937               471

So, this is what I mean by “responsive” classes. They should:

  1. support updating and reuse with new data
  2. support changing the options and parameters given at __init__ time
  3. provide __call__ methods that consistently either update the object or use it.

In map classification, I think __call__ would be better suited to find_bin than to update_bins. In spatial regression, I think __call__ would be better suited to predict than to anything else.
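For a sense of what calling a classifier as find_bin might do, here is a small sketch that looks up the class of new values from a list of upper bounds. The standalone find_bin function below is hypothetical and is not PySAL's method of the same name; the bounds are copied from the five-class HR60 output shown earlier.

```python
from bisect import bisect_left


def find_bin(bins, values):
    """Return the class index of each value, given sorted upper bounds."""
    # bisect_left yields the first i with value <= bins[i], which matches
    # the "lower < x[i] <= upper" convention of the classifier tables.
    # A value above the last bound gets index len(bins).
    return [bisect_left(bins, v) for v in values]


# Upper bounds from the five-class Quantiles output for HR60.
bins = [2.497, 5.104, 7.621, 10.981, 92.937]
labels = find_bin(bins, [1.0, 5.104, 50.0])
```

Used this way, the call leaves the estimated bins untouched and only applies them, which is the "use" side of the update-or-use distinction.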

__call__ should never alias summary() methods, which probably belong in __repr__, anyway.
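As a rough sketch of keeping the summary in __repr__, the class below prints a table of bounds and counts when the object is echoed, leaving __call__ free for other work. BinSummary is a hypothetical name and the layout only approximates the classifier tables above.

```python
class BinSummary:
    """Hypothetical holder for class upper bounds and counts."""

    def __init__(self, bins, counts):
        self.bins = list(bins)
        self.counts = list(counts)

    def __repr__(self):
        # Build one "lower < x[i] <= upper  count" row per class.
        rows = []
        lower = None
        for upper, count in zip(self.bins, self.counts):
            left = "" if lower is None else f"{lower:.3f} <"
            rows.append(f"{left:>8} x[i] <= {upper:6.3f} {count:>8}")
            lower = upper
        return "\n".join(rows)


summary = BinSummary([4.265, 8.679, 92.937], [471, 470, 471])
text = repr(summary)
```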

imported from: yetanothergeographer