Today, we’ll ease into doing some more advanced stuff using the sf package in R. The sf package (short for simple features) is the state-of-the-art package for working with spatial data in R. It’s a whole lot easier to use than older packages (e.g. rgdal, or sp & maptools), but there are still some interesting/confusing elements. We’ll tour a large part of the package and identify a few tools or skills we’ll need going forward.

Intermediate Simple Features in R

sf tries to make it so that spatial data is not special in R. Instead, it aims to make all the standard tools and idioms that you know how to use on dataframes apply to sf dataframes, too.1 This means that the sf package uses similar strategies to other packages, like dplyr, for how its dataframes work. At its most basic, this means that reading/writing spatial data works very similarly to how reading/writing non-spatial data works:

library(sf)

bristol = sf::st_read('./data/bristol-imd.shp')
## Reading layer `bristol-imd' from data source `/home/lw17329/OneDrive/teaching/second_year_methods/spatial-practicals/data/bristol-imd.shp' using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 12 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 350239.4 ymin: 166638.9 xmax: 364618.1 ymax: 183052.8
## epsg (SRID):    27700
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
bristol
## Simple feature collection with 263 features and 12 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 350239.4 ymin: 166638.9 xmax: 364618.1 ymax: 183052.8
## epsg (SRID):    27700
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
## First 10 features:
##     LSOA11CD imd_rank imd_score income employment education health crime
## 1  E01014485    11727     23.53   0.18       0.13     15.75   0.41  0.32
## 2  E01014486     2408     49.04   0.34       0.26     31.45   1.25  0.99
## 3  E01014487    16452     17.36   0.07       0.06      0.28  -0.43  0.82
## 4  E01014488    11047     24.64   0.19       0.13      4.99   0.11  1.07
## 5  E01014489     5318     37.42   0.22       0.18     10.67   0.80  1.31
## 6  E01014491    19934     13.70   0.08       0.09      1.02  -0.48  0.34
## 7  E01014492     6812     33.27   0.22       0.19     45.80   0.76  0.37
## 8  E01014493     7510     31.54   0.19       0.15     50.05   0.81  0.31
## 9  E01014494    20047     13.60   0.09       0.08     15.85  -0.18  0.49
## 10 E01014495    11742     23.51   0.13       0.13     28.23   0.63  0.40
##    housing living_env idaci idaopi                       geometry
## 1    13.84      36.20  0.18   0.30 MULTIPOLYGON (((360256.3 17...
## 2    22.07      43.52  0.39   0.57 MULTIPOLYGON (((359676 1741...
## 3    22.84      62.34  0.04   0.14 MULTIPOLYGON (((359036.6 17...
## 4    16.49      38.16  0.28   0.31 MULTIPOLYGON (((359632.6 17...
## 5    20.97      62.02  0.29   0.36 MULTIPOLYGON (((359120.6 17...
## 6    18.62      41.51  0.05   0.15 MULTIPOLYGON (((359467.6 17...
## 7    20.54      22.51  0.35   0.21 MULTIPOLYGON (((353215.9 17...
## 8    20.09      30.64  0.29   0.23 MULTIPOLYGON (((352337 1770...
## 9    13.95      20.54  0.15   0.12 MULTIPOLYGON (((352702 1767...
## 10   17.86      27.29  0.21   0.15 MULTIPOLYGON (((353137.9 17...

The text above describes the dataframe we just read in. It gives most of the spatial information necessary to understand the geography:

  • the number of features (observations) and fields (columns) in the dataframe,
  • the bounding box bbox,
  • the type of geometry (geometry type) in the dataframe,
  • the coordinate system (in terms of its epsg code as well as its proj4 description)
  • and a preview of the first ten rows at the end

Check Yourself: Namespaces

“Namespaces are one honking great idea,” they say. A namespace is a sectioned-off area of a working environment (a space) where everything inside of that area is only addressable by referring to that space (by its name), and then to what you want that is inside of it.

Thus, something like an address is a great example of a namespace: BS8 1SS shows us that the 1SS is within BS8, which itself is a subset of all BS postcodes, and all of which are within the UK (since other countries use a different postal code system).

R has a bit of a complex relationship with namespaces. By default, when you load a package using library, it dumps out everything from within that package. This is kind of like starting work by dumping your entire rucksack out onto your desk, then sifting through its contents every time you need something. Thus, I’ll encourage you to use :: to be specific about which namespace a tool is in. When I write sf::st_read, this means “use the st_read function from within the sf package.” While R lets you use st_read by itself, writing sf::st_read is clearer, since it’s always apparent which package is being called.

Selection & focusing plots

Since an sf dataframe is simply a powered-up dataframe, we can treat it like a dataframe. This includes stuff like selection, subsetting, and aggregation. For example, the School of Geographical Sciences is located in the Lower Super Output Area (LSOA) with the code E01014542. We can grab just that one LSOA in a similar fashion to how we operate with other dataframes:

is_geog = bristol$LSOA11CD == 'E01014542'
geog = bristol[is_geog, ]
geog
## Simple feature collection with 1 feature and 12 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 357787.7 ymin: 172880.4 xmax: 358931.7 ymax: 173851.2
## epsg (SRID):    27700
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
##     LSOA11CD imd_rank imd_score income employment education health crime
## 55 E01014542    11034     24.66   0.04       0.03     22.01   0.67   0.9
##    housing living_env idaci idaopi                       geometry
## 55   31.76      72.08  0.04   0.36 MULTIPOLYGON (((358357.9 17...

Now, using the format of the data we explained above, we can see that this new geog dataframe has only one row. We can make a plot of this single row using the plot function. Let’s make a plot that highlights our geog LSOA on top of the rest of bristol LSOAs:

# sf::st_geometry extracts *just* the geometry of the LSOA in bristol
plot(sf::st_geometry(bristol))
plot(sf::st_geometry(geog), col='red', add=TRUE)

It’s hard to see the LSOA in the jumble there. So, let’s zoom in on the LSOA. Part of the nice thing about using the built-in plot function is that it provides us with built-in xlim and ylim options we can use to focus the map. To pick a good focus area, we could zoom in on an area that’s near the focal LSOA. To do this, we can:

  1. buffer the LSOA, making a larger dilated polygon
  2. grab the bounding box of this dilated polygon, which focuses the map on the polygon itself, but keeps a bit of the area around it, too.

To do this:

buffered = sf::st_buffer(geog, 2000)
bbox = sf::st_bbox(buffered)

plot(sf::st_geometry(bristol), xlim=bbox[c('xmin','xmax')],
                               ylim=bbox[c('ymin','ymax')])
plot(sf::st_geometry(geog), col='red', add=TRUE)

Exercise: Making Simple Maps

Now that you’ve seen a few things about making maps,

  1. adapt the code used above to make a new map focused on each of the following points of interest in Bristol:
    • The Downs (LSOA : E01014709)
    • The Memorial Stadium (LSOA : E01014510)
    • The Centre (LSOA : E01014540)
    • Lakota (LSOA: E01033348)
    • St. Werberghs City Farm (LSOA: E01014491)
    • North Street (LSOA: E01033362)
  2. sf makes it easy to make choropleth maps. To make a choropleth map from a column on our dataframe, we can use the plot function as follows:
plot(bristol['imd_score'])

with this realization,

  • make a map of the living_env deprivation score in the area around the Downs.
  • make a map of the crime deprivation score in the area around Lakota.
  • make a map of the housing deprivation score in the area around The Centre.

NOTE: Be sure to vary the buffer distance if need be.

Check Yourself: Classes & dispatch in R

This simple plotting works in a kind of subtle, magical way. R uses a system called method dispatch, which can be quite complex. Basically, what is helpful to understand is that most functions in R decide “what to do” with their input depending on what class their input is.

Like… take mean(). If the input is a vector, the method computes the mean for that input:

mean(c(2,3,7))
## [1] 4

However, after loading the mosaic library, the function can compute the mean of a formula, which is not a numeric argument. Before loading mosaic, however, the mean() function does not understand how to use formulae:

unloadNamespace('mosaic') #make sure mosaic isn't loaded
mean(~imd_score, data=bristol) # this will fail
## Warning in mean.default(~imd_score, data = bristol): argument is not
## numeric or logical: returning NA
## [1] NA
library(mosaic) #now load mosaic
mean(~imd_score, data=bristol) # and this will work
## [1] 26.95665

This is because the mean function decides what to do based on the class of its input. If you give it a vector or numeric argument, it’ll just compute the mean. If you give it a formula, the mean function will do something with that instead. But, mean doesn’t know how to deal with formulae by default; it has to be enabled to use formulas by the mosaic package.

The same idea happens for plot and our spatial data. For standard R dataframes, df['column'] gives you back a dataframe, but df$column gives you back the corresponding vector of values in that column. If we use sf_df['column'], we get back a sf dataframe, and plot knows how to make a choropleth map out of an sf dataframe with a single column. If we use plot(sf_df$column), then the result just looks like a standard plot of a vector.
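To see dispatch in action outside of plotting, here is a minimal sketch of how an S3 generic picks a method based on its input’s class. The function and class names here (describe, my_map) are made up purely for illustration:

```r
# a generic function: UseMethod hands the call off to a class-specific method
describe = function(x) UseMethod("describe")
describe.default = function(x) paste("a plain vector of length", length(x))
describe.my_map  = function(x) "something map-like: draw a map instead!"

v = c(1, 2, 3)
describe(v)            # falls through to describe.default
class(v) = "my_map"    # change the class...
describe(v)            # ...and the very same call now dispatches differently
```

This is exactly why plot(bristol['imd_score']) draws a map while plot(bristol$imd_score) draws an ordinary plot: the first argument has class sf, the second is a bare numeric vector.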

To check yourself, we’ll look into a few things:

  • what class does class(bristol['imd_score']) show? and what class does class(bristol$imd_score) show?
  • Then, look at the difference between doing something like head(bristol['imd_score']) and head(bristol$imd_score). How do they look different?
  • Now, run plot(bristol$imd_score). See that it gives a very different result from plot(bristol['imd_score'])?
  • Try to use lattice::histogram(bristol['imd_score']), and then try lattice::histogram(bristol$imd_score). From our test of histogram, do you think that lattice knows how to use sf objects?

Spatial Relationships

Spatial relationships are the structural/geographical relationships between observations in a dataset. You’ve worked a little bit with spatial relationships before, especially in your group projects. However, we can compute a large variety of spatial relationships using sf. One main conceptual model of how geographical objects can relate to one another is called the dimensionally-extended nine-intersection model, or DE-9IM for short. The DE-9IM, developed by Egenhofer & Franzosa (1991) to describe point set relations, is a standard model of how relationships between geographic data are represented, and is shown below for two shapes, A and B. (I know, DE-9IM was my favorite Star Wars droid, too.)

A few definitions are helpful to precisely define this idea:

  1. A point set is a collection of points in two dimensions. Points, by definition, have no width or height, and so represent a single abstract position in space.
  2. A geometry, or shape, is a collection of points. All geometries have three parts, which are their own distinct point sets:
    • the interior of a shape is the set of points where every adjacent point is also an interior point.2
    • the exterior of a shape is the set of points where every adjacent point is also an exterior point.
    • the boundary of a shape is the set of points where adjacent points may be interior, exterior, or boundary points.
  3. The intersection of set A and set B is the set of points that are members of both A and B.

With these definitions, the DE-9IM is a system that allows us to define how geometric objects relate to one another, based on how their interiors, exteriors, and boundaries intersect with one another. Further, basic geometric objects (like points, lines, and polygons) have precise definitions of their interiors/exteriors/boundaries:

  • A point is all boundary: it has no interior, and its exterior is all other points.
  • A line is many boundary points: lines still have no interior, and their exterior is all points not on the line.
  • A polygon is an interior bounded by a collection of boundary points; all exteriors are points that are not within either the interior or the boundary.

Together, this is somewhat arbitrary, but it allows us to talk about spatial relationships in a mathematically rigorous way. For instance, when we say that a line crosses a polygon, we are formally saying that the line’s boundary intersects the polygon’s exterior, interior, and boundary. If the line is within a polygon, then the line’s boundary must not intersect the exterior of the polygon. The full table of these relationships is somewhat complicated, but using the collection of how a geometry’s interior/exterior/boundary intersects another geometry’s interior/exterior/boundary, we can express the relationships between geographic observations.3

The full table of relations is shown below:

Name in Plain English Description sf function name
A intersects B Any point in A is also in B (opposite of disjoint) st_intersects(A,B)
A is disjoint from B No point in A is also in B (opposite of intersects) st_disjoint(A,B)
A touches B A’s boundary intersects B’s boundary, but their interiors do not intersect st_touches(A,B)
A crosses B A’s interior intersects B’s interior, and A’s interior also intersects B’s exterior st_crosses(A,B)
A is within B A’s interior intersects B’s interior, and A’s interior does not intersect B’s exterior (opposite of contains) st_within(A,B)
A contains B B’s interior intersects A’s interior, and B’s interior does not intersect A’s exterior (opposite of within) st_contains(A,B)
A overlaps B A’s interior intersects B’s interior, and each one’s interior also intersects the other’s exterior st_overlaps(A,B)
A covers B No point of B lies in A’s exterior (opposite of covered-by) st_covers(A,B)
A is covered by B No point of A lies in B’s exterior (opposite of covers) st_covered_by(A,B)
A equals B Every point in A is also in B, and every point in B is also in A st_equals(A,B)

These relations are “fundamental” in the sense that they describe a topology for geography. A topology is a formal mathematical representation of the relationships between objects in a set; most of the relationships you encounter in geographical analysis are fundamentally topological relationships.4

Building Spatial Relationships

One of the most common, fundamental, and thoroughly-used relations is touching. sf makes it very fast to find out who’s touching, and where. To collect the LSOAs that touch the School of Geography’s LSOA, we can use the st_touches operator:

# sparse=FALSE asks for a dense TRUE/FALSE matrix rather than a sparse list
nearby_touches = sf::st_touches(bristol, geog, sparse=FALSE)

However, st_touches(a,b) takes two spatial data frames of many geometries, so our result, nearby_touches, expresses the relationships between all the LSOAs in bristol and all the geometries in geog. Treated as a dense matrix (pass sparse=FALSE to st_touches if your version of sf returns a sparse list by default), it has \(n\) rows, where \(n\) is the number of LSOAs in bristol, and \(k\) columns, where \(k\) is the number of geometries in geog. We know, however, that there is only one geometry in geog, but sf gives us back a full matrix anyway. So, for us to simply get back the \(n\)-length vector of TRUE/FALSE describing whether a geometry touches geog, we need to use the apply function.

apply takes a matrix, a margin, and a function. A margin is kind of like the direction the function will be used. If we supply a margin of 1, the function will be computed over all the rows of the input matrix. If the margin is 2, then the function will be applied down the columns of the input matrix. We just want to know if any observation in bristol touches geog, so we’ll use the any function. And, since we only have one column, this gives us back an \(n\) length vector which is TRUE when that row of bristol touches geog, and is FALSE when that row of bristol does not touch geog.
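As a quick aside, here’s a toy illustration of how the margin argument works, using a small made-up logical matrix:

```r
# a 3x2 logical matrix: rows are observations, columns are relations
m = matrix(c(TRUE, FALSE, FALSE,
             FALSE, FALSE, TRUE), nrow=3)
apply(m, 1, any)  # margin 1: one TRUE/FALSE per *row*    -> TRUE FALSE TRUE
apply(m, 2, any)  # margin 2: one TRUE/FALSE per *column* -> TRUE TRUE
```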

nearby_touches = apply(nearby_touches, 1, any)
head(nearby_touches)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

Thus, we see that the first six observations in our table do not touch geog.

Check Yourself: Recalling skills about binary vectors

How many LSOAs touch geog?5

Now that we have the TRUE/FALSE for the geometries that touch geog, we can make a map of just the observations that touch geog using the standard subsetting methods you already know about:

plot(sf::st_geometry(bristol), xlim=bbox[c('xmin','xmax')],
                               ylim=bbox[c('ymin','ymax')])
plot(sf::st_geometry(bristol[nearby_touches,]), 
     col='lightblue', add=TRUE)
plot(sf::st_geometry(geog), col='red', add=TRUE)

Check Yourself: Recalling skills about inverting binary vectors

How could you make the same map, but color in the observations not touching geog?6

Composite Spatial Operations

Sometimes, you want something a little more than a primitive spatial operation like those defined in the table above. Like, sometimes you may want to string together a few spatial operations. For instance, what if we wanted to get all of the LSOAs within 500 meters of geog? Well, breaking our query down:

    LSOAs within 500 meters of geog

We see that there is a within statement, so we probably need a within relation. But, we want to apply a kind of distance filter: within 500 meters of geog. Fortunately, we can do that with tools you’ve already seen: st_buffer. We can build a 500-meter buffer around geog, which is our search_area. Then, any observations that are contained within that area will be considered as within 500 meters.

search_area = sf::st_buffer(geog, 500)
within_500m = sf::st_within(bristol, search_area, sparse=FALSE)
within_500m = apply(within_500m, 1, any)
plot(sf::st_geometry(bristol), xlim=bbox[c('xmin','xmax')],
                               ylim=bbox[c('ymin','ymax')])
plot(sf::st_geometry(bristol[within_500m,]), 
     col='lightblue', add=TRUE)
plot(sf::st_geometry(geog), col='red', add=TRUE)

Whoa! That’s fewer LSOAs than actually touch geog! This may not be what you expected, and here’s why: human beings commonly understand anything within 500 meters of geog to mean any area where any part of that area is within 500 meters of any part of geog… we have a slightly different semantics for within than the DE-9IM does.

So, thinking a little more critically about what we want, we may instead do something like:

  1. build a search area of 500 meters around geog
  2. select all LSOAs with any part in that search area.

For “any part in that search area”, we need to think about intersects, not within. So, using that concept of what our search query means, we do:

nearby_within_500m = sf::st_intersects(bristol, search_area, sparse=FALSE)
nearby_within_500m = apply(nearby_within_500m, 1, any)
plot(sf::st_geometry(bristol), xlim=bbox[c('xmin','xmax')],
                               ylim=bbox[c('ymin','ymax')])
plot(sf::st_geometry(bristol[nearby_within_500m,]), 
     col='lightblue', add=TRUE)
plot(sf::st_geometry(bristol[within_500m,]), 
     col='lavender', add=TRUE)
plot(sf::st_geometry(geog), col='red', add=TRUE)

That looks more like what folks think of when they say anything within 500 meters.

Exercise: The varying density of LSOAs

For the set of interesting areas we examined before:

  • The Downs (LSOA : E01014709)
  • The Memorial Stadium (LSOA : E01014510)
  • The Centre (LSOA : E01014540)
  • Lakota (LSOA: E01033348)
  • St. Werberghs City Farm (LSOA: E01014491)
  • North Street (LSOA: E01033362)
  1. Get the total number of observations that:
    • touch the area of interest
    • intersect a 500 meter search area around the area of interest
    • intersect a 1000 meter search area around the area of interest
    • intersect a 1500 meter search area around the area of interest
  2. Which area of interest is in the most densely-packed part of Bristol?

Challenge: Responsible to a Higher Order

The idea behind st_touches can, in graph theory, be used to build an adjacency matrix that connects observations together. There are many different ways to build adjacency matrices in R, but we won’t worry about that now. Suffice it to say, one useful concept that is simple to compute from an adjacency matrix is the higher-order neighbor of each observation. For our example above, all of the observations that touch geog are called first-order neighbors of geog. Second-order neighbors of geog are those that touch geog’s first-order neighbors! This proceeds so on and so forth until we get to the last-order neighbor, which is the observation the furthest number of steps from geog. So, if you’re up for it…

second_order = sf::st_touches(bristol, bristol[nearby_touches, ], sparse=FALSE)
second_order = apply(second_order, 1, any)

gives us the second-order neighbors of geog.

  1. Make a map of all the observations that are second-order neighbors of geog. Do not highlight geog in red. Is there anything surprising about the second-order neighbors of geog?
  2. Using second_order, use the same concepts to make a map of the third_order and fourth_order neighbors.
  3. For an extra-hard challenge: it’s common to say that the \(n\)-th order exclusive neighbors are those neighbors that first occur at the \(n\)-th order. By convention, the observation itself is considered the “zeroth”-order neighbor, and is always excluded from higher-order neighbors. For our second_order example above, we can filter out observations we’ve seen at lower orders using the exclusive-or function, xor, and the OR operator, |:
second_order_exclusive = xor(second_order, nearby_touches | is_geog) 

In this, we want to get the elements that are TRUE only in second_order and not in either nearby_touches or is_geog. Extending this idea,
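To convince yourself of what xor is doing here, try it on small made-up logical vectors. This sketch assumes (as in our case) that the candidate set contains everything already seen at lower orders:

```r
# toy vectors, four observations each
lower_orders = c(TRUE,  TRUE, FALSE, FALSE)  # e.g. nearby_touches | is_geog
second       = c(TRUE,  TRUE,  TRUE, FALSE)  # e.g. second_order
xor(second, lower_orders)  # -> FALSE FALSE TRUE FALSE: only the *new* neighbor
```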

  • make a plot of the nearby_touches and second_order_exclusive neighbors.
  • following the same pattern above, compute third_order_exclusive and fourth_order_exclusive.
  • Make a map of all of the exclusive neighbors together in a single image, with different colors for each level of exclusive neighbor.

Kernels and Distances in sf

The last major kind of spatial relationship tends to focus on distances. But, instead of using raw distances directly, a kernel function is a smoothly-varying expression of how close two observations are. Kernel functions usually take the distance between observations and then express it as a value between zero and one, where one means really close and zero means really far. This reflects a kind of standardized nearness score, and is incredibly common in machine learning, data science, and geographical analysis.

We can get the distances between geog and every other observation in bristol using st_distance:

distances = sf::st_distance(bristol, geog)

To see all the distances, use:

# because distances have units, and lattice doesn't understand them, 
# we *have* to use hist and not lattice::histogram!!
hist(distances)

Check Yourself: Near and Far

Add a column to bristol called distance_from_geog. Make a choropleth map of that new column.

Kernel functions

Now, given all these distances, we can compute a kernel around geog using the following approach:

kernel = exp(-as.numeric(distances)/1500)

With this, we’re saying that our nearness score is: \[ k(i,j) = e^{\left(-\frac{d_{ij}}{1500}\right)} \] The \(1500\) value is called the bandwidth, and it reflects a kind of baseline level of nearness in the map. This is a parameter that we can tune to increase or decrease the average nearness in the map.
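To get a feel for what the bandwidth does, we can evaluate the same fixed distance under a few different bandwidths (the numbers below are just illustrative):

```r
d = 1500  # a fixed distance, in metres
exp(-d / 150)    # narrow bandwidth: 1.5km away scores as nearly zero
exp(-d / 1500)   # the bandwidth above: 1.5km away scores about 0.37
exp(-d / 15000)  # wide bandwidth: 1.5km away still scores about 0.90
```

Smaller bandwidths make nearness decay quickly, so only very close observations count as “near”; larger bandwidths flatten the decay, so even distant observations keep high nearness scores.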

To visualize the kernel, you can assign a new column and make a map:

bristol['geog_building_kernel'] = kernel

plot(bristol['geog_building_kernel'])

Exercise: Building Kernels & Kernel Maps

  1. Try repeating the kernel & plotting code above using a few different values for the bandwidth.
  • Make a map for bandwidth values of 10, 100, 1000, and 10000. What happens to the similarity between observations as bandwidth gets really small? What about when it gets large?
  2. Kernel functions abound. Pick a kernel function listed on Wikipedia and define it as a function in R.7 For example, the negative exponential kernel we used above would be defined as:
nexp_kernel = function(distance, bandwidth=1500){
  u = as.numeric(distance) / bandwidth
  return(exp(-u))
}

In general, the kernel function defines u = distance / bandwidth, and then uses the equations you see online as standard kernel shapes. So, for one more example, we could implement the sigmoid kernel as:

sigmoid_kernel = function(distance, bandwidth=1500){
  u = as.numeric(distance) / bandwidth
  return( (2 / pi) * (1 / (exp(u) + exp(-u))) )
}

Then, make a map of the kernel using the same bandwidths you did in part 1.

  • How does the map change as bandwidth changes for the kernel function you picked?
  • How is the map of kernel weights you picked different from the negative exponential kernel you tried in part 1?

Spatial Lags

Now, we can talk about how spatial relationships help us talk about spatial relatedness and spatial similarity. Since we now know how to use DE-9IM language to define spatial relationships, we can just assume that we’ve picked some method of defining observations that are nearby and some that are not. Moving to something a bit more abstract, let’s move away from geog, and call it just any site, \(i\), with an observed value for some data \(x\), called \(x_i\). Then, let some other observations, \(j\), be found in our spatial search nearby \(i\). For now, let’s call the collection of these observations the neighborhood of site \(i\), and let’s spell that in Greek as \(\eta_i\). Then, for every observation in the neighborhood of geog (which is spelled \(j \in \eta_i\) in mathematical language), we observe data at the other sites \(j\) and call them \(x_j\).

If we wanted to summarize the values \(x_j\) for \(j\) in the neighborhood of \(i\), one common method is the local sum. This measure adds up all \(x_j\) for \(j \in \eta_i\): \[ \boldsymbol{l}_i = \sum_{j\in\eta_i} x_j \] Breaking this expression down, we see that the local sum for observation \(i\) is the sum of all of the features in the neighborhood around \(i\). For some other observation, the neighborhood will be different, as will the local sum. In addition, if we change what \(\eta_i\) is made of, like by changing from first-order to second-order neighbors, or from a very narrow/small bandwidth to a wide/large bandwidth, we will change the local sum.

The local sum is useful when we need the count of something nearby \(i\). This might make sense when you’re interested only in the total number of things in the neighborhood, like “I aced that SM2 test, how many pubs are within a 100-meter radius of Wills?” But, this does not give us a summary of what nearby areas are like individually. Instead, we might want to know the average, median, or standard deviation for all \(x_j\) in the neighborhood of \(i\). For the data we’ve been working with on the Bristol index of multiple deprivation, the local average is a useful statistic, since it gives us a single score for the deprivation surrounding geog that can be compared to the deprivation at geog itself. Thus, local weighted averages are the most common form of summary by far: \[ \boldsymbol{l}_i = \frac{1}{n_{\eta_i}}\sum_{j \in \eta_i} w_{ij}x_j \] where \(n_{\eta_i}\) is the number of observations in \(i\)’s neighborhood and \(w_{ij}\) is the weight relating sites \(i\) and \(j\). For some kinds of spatial relationships, \(w_{ij}\) is only zero or one, such as when we used st_touches(bristol, geog): there, \(w_{ij}\) is TRUE (or \(1\)) when row \(j\) of bristol touches geog, and is zero when they do not touch. Alternatively, \(w_{ij}\) can be some continuous value expressing the geographical nearness between sites \(i\) and \(j\), like the kernel functions we used before. Regardless, this kind of local weighted sum is the most common way of summarizing the surroundings of an observation.8 As a matter of terminology, this local weighted sum for variable \(x\), when it’s taken for every site instead of just at a single site, is usually called the spatial lag of \(x\), and likewise we’d refer to the spatial lag around \(i\) as the local weighted sum for site \(i\).

To actually compute these values, we can do something similar to what we did before:

  1. split the dataframe according to observations being in our spatial search area
  2. summarize the observations that are within our spatial search area using an appropriate function

Below, we can use our nearby_touches vector to extract only the rows of bristol that touch geog:

bristol[nearby_touches, ]
## Simple feature collection with 8 features and 13 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 357138.7 ymin: 172034.8 xmax: 359550.4 ymax: 174170.8
## epsg (SRID):    27700
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
##      LSOA11CD imd_rank imd_score income employment education health crime
## 52  E01014538    22829     11.15   0.03       0.01      0.48  -0.27 -0.04
## 53  E01014539    19569     14.06   0.07       0.06      0.38  -0.58  0.78
## 54  E01014540     9221     27.91   0.12       0.08     18.87   0.23  1.36
## 63  E01014551    20037     13.61   0.04       0.03      7.13  -0.18  0.14
## 66  E01014554    22105     11.74   0.03       0.03      5.02  -0.54  0.81
## 245 E01033342    10818     25.08   0.07       0.05     19.11   0.53  1.02
## 246 E01033344     8994     28.33   0.11       0.07      7.90   0.27  1.37
## 247 E01033345    14154     20.11   0.02       0.01      7.15  -0.59  1.45
##     housing living_env idaci idaopi                       geometry
## 52    24.08      54.13  0.10   0.19 MULTIPOLYGON (((357819.1 17...
## 53    19.90      48.72  0.08   0.09 MULTIPOLYGON (((358951 1740...
## 54    29.51      60.39  0.37   0.23 MULTIPOLYGON (((358672.4 17...
## 63    20.31      61.93  0.06   0.17 MULTIPOLYGON (((357442 1734...
## 66    14.30      49.44  0.06   0.14 MULTIPOLYGON (((357611.6 17...
## 245   41.72      51.00  0.15   0.38 MULTIPOLYGON (((358856 1736...
## 246   31.23      81.40  0.46   0.20 MULTIPOLYGON (((358121 1725...
## 247   32.78      63.40  0.10   0.05 MULTIPOLYGON (((358427.5 17...
##     geog_building_kernel
## 52                     1
## 53                     1
## 54                     1
## 63                     1
## 66                     1
## 245                    1
## 246                    1
## 247                    1

and, if we wanted the average of the imd_score from within that subset, we could grab the mean from the imd_score column of that new subsetted data:

mean(bristol[nearby_touches, ]$imd_score)
## [1] 18.99875

Exercise: Computing Spatial Lags

  1. Compute the overall/global average of bristol$imd_score.
  2. Compute the local average around geog using nearby_within_500m.
  • What is the local average around geog?
  • Is that larger or smaller than the global average?
  • Is that larger or smaller than the local average found using nearby_touches?
  • Given that we’ve seen that nearby_touches includes fewer, closer observations than nearby_within_500m, how do you geographically interpret the fact that one has a higher deprivation score than another?
  3. The function weighted.mean allows you to take a weighted mean in R. Using weighted.mean:
  • Compute the weighted mean of bristol$imd_score using kernel as the weights. Is this larger or smaller than the local weighted average from nearby_touches?
  • Change the bandwidth to be very small. Does the local weighted average increase or decrease?
  • Change the bandwidth to be very large. How does that affect the local weighted average? Does it approach the map average?
  • Compute the weighted mean of bristol$imd_score using nearby_touches as the weights vector. Is this the same as the local weighted average around \(i\)? Using the equation for the local weighted mean above, why does nearby_touches work this way?
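As a quick illustration of how weighted.mean behaves (using made-up numbers, not the Bristol data), note that a binary weight vector, like nearby_touches, simply selects which observations enter the average:

```r
values = c(10, 20, 30)
weights = c(0, 1, 1)  # a binary weight vector: only the last two observations count
weighted.mean(values, weights)  # (0*10 + 1*20 + 1*30) / (0 + 1 + 1) = 25
```

This is why using a logical neighbour vector as the weights argument gives you the local average directly.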

Challenge: Thinking About Spatial Dependence

  1. For the areas of interest we used above:
    • The Downs (LSOA : E01014709)
    • The Memorial Stadium (LSOA : E01014510)
    • The Centre (LSOA : E01014540)
    • Lakota (LSOA: E01033348)
    • St Werburghs City Farm (LSOA: E01014491)
    • North Street (LSOA: E01033362)

Compute the local average of observations that touch each area of interest.

  2. Make a dataframe9 for all the areas of interest and geog containing:

    • one column of the lsoa_id for that row
    • one column of the focal value for imd_score (i.e. the value of imd_score in that area of interest)
    • one column of the spatial lag value for imd_score (i.e. the local weighted average of imd_score in that area of interest)
  3. Compute the raw correlation between those two columns using cor. Disregarding any statistical significance (for now), is the relationship between those sites and their surroundings positive or negative? Put another way, are the surroundings of those sites similar to or dissimilar to the areas of interest themselves?
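To see the mechanics of this last step before applying it to your own dataframe, here is a sketch of cor using made-up focal and lagged scores (these numbers are hypothetical, not computed from the Bristol data):

```r
# Hypothetical focal values and their spatial lags for three areas of interest
toy = data.frame(
  imd_score = c(20, 45, 99),
  lag_imd_score = c(22, 34, 79)
)
# A positive correlation means the sites resemble their surroundings
cor(toy$imd_score, toy$lag_imd_score)
```

Your own answer will depend on the values you compute from the Bristol data; the sign of the correlation is what carries the geographical interpretation.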

Conclusion

Spatial relationships are fundamental to quantitative geography. In the abstract, we need to represent geographical structures in ways that we can use when we build models. Above, we’ve seen many different ways that spatial relationships are represented in mathematical or computational structures. We’ve also seen how we can use these structures to summarize the local structure of a map in a mathematical way; instead of simply computing statistics about the map as a whole, we can build local averages or local sums that describe the structure of a map around an area of interest. This is a very common idea in quantitative geography, and helps us do a lot of different, neat things.

Beyond quantitative geography, Geographical Information Science also focuses extensively on spatial relationships. The Dimensionally Extended 9-Intersection Model we discussed is a bedrock of GIS; the idea of the spatial predicate is a fundamental component of working with spatial databases and of understanding or expressing spatial relationships. Despite the fact that these relationships have been formalized, however, we’ve also now seen how they may not exactly reflect the spatial relationship that springs to mind when the technical term (e.g. within) is used. Overall, this means it is important both to understand the formal meaning of a predicate and to be very clear and precise in your thinking when discussing spatial relationships. Like any formal language (e.g. mathematics, programming languages), the formal language of spatial relationships can be difficult to speak correctly the first time, so be sure to play around with it until you can get it right.

Finally, we’ll move forward by seeing the ways in which geographical structures can be used to estimate spatial statistics and characterize the strength and direction of spatial relationships. This will extensively use sf and the ideas covered in this notebook, so please do be sure to review the concepts behind the local weighted average.

Notes


  1. This is a critical idea in system design, and even has its own fancy Russian-sounding theorem. Put a little more simply, this reduces to: special cases should not be special enough to break the rules.

  2. A clever student might think this is a reprisal of Zeno’s Dichotomy Paradox. Indeed, an open set in mathematics is kind of like that… for any point \(p\) that’s near the edge of the interior, we can always pick some really small distance \(\epsilon\) so that some point \(p'\) is closer to the edge of the interior, but still is inside the interior. Relocate \(p\) to \(p'\). Then, keep looking for another (smaller) \(\epsilon\) that keeps a new \(p'\) within the interior. In the limit, \(\epsilon\) will get smaller and smaller. However, because \(\epsilon\) can get infinitesimally small, we actually never have to pick a \(p'\) that is outside of the interior. In a slightly more formal way, this is exactly the mathematical definition of an open set. A closed set is one where there is some \(p'\) where we cannot pick \(\epsilon\) so that we get closer to the edge; at that point, we either are in or we are out.

  3. The DE9IM only applies for vector data of infinite resolution; raster data is much simpler to understand and represent, since it fundamentally is a collection of regularly-sampled positions with known values. There is no geometry other than the grid that raster data represents. And, practically speaking, computers can only represent vector data with finite precision, so the interior/exterior of shapes is not actually a fully open set. But, the DE9IM is a conceptual model for how we’ll talk about spatial relationships between geometries.

  4. In case you ever want to dress up what you do as a quantitative geographer, you can always call yourself a mathematician studying applied topology, but folks might look at you sideways.

  5. How would you get the sum total of TRUEs in a vector? Recall that, under the hood, a boolean variable is represented with one for TRUE and zero for FALSE.

  6. In R, you can change TRUE to FALSE by doing !TRUE. Try this yourself at the console. This is because ! is the negation operator in R, and will flip FALSE to TRUE and vice versa.

  7. If you pick the uniform kernel from Wikipedia, I won’t stop you, but I can promise it ain’t the most illuminating experience.

  8. But, it is definitely not the only way; we use this idea of summarizing the neighborhood of \(i\) in a very general way in quantitative geography. For instance, another well-known spatial statistic we will cover soon is built from \(\boldsymbol{l}_i = \frac{1}{n_{\eta_i}}\sum_{j \in \eta_i} w_{ij}(x_i - x_j)^2\), the local dissimilarity between \(i\) and its surroundings.
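    For concreteness, this statistic can be sketched in R for a single observation \(i\); the attribute values, neighbour set, and weights below are entirely made up for illustration:

```r
# Sketch of l_i = (1/n_i) * sum over j in eta_i of w_ij * (x_i - x_j)^2,
# using hypothetical data.
x = c(5, 7, 6, 10)   # attribute values for four areas (made up)
i = 1                # the focal observation
eta_i = c(2, 3)      # indices of i's neighbours (made up)
w_i = c(1, 1)        # binary weights for those neighbours
l_i = sum(w_i * (x[i] - x[eta_i])^2) / length(eta_i)
l_i  # (1*(5-7)^2 + 1*(5-6)^2) / 2 = 2.5
```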

  9. Making dataframes by ‘hand’ can be done using the data.frame function. For instance, if I had three areas of interest and their lsoa_id, imd_score, and lag_imd_score, I could do:

    data.frame(
      lsoa_id = c(1, 2, 3),
      imd_score = c(20, 45, 99),
      lag_imd_score = c(22, 34, 79)
    )