Today, we’ll ease into doing some more advanced stuff using the sf package in R. The sf package (short for simple features) is the state-of-the-art package for working with spatial data in R. It’s a whole lot easier to use than older packages (e.g. rgdal or sp & maptools), but there are still some interesting/confusing elements. We’ll try to take a tour of a large amount of the package, and identify a few tools or skills we’ll need going forward.

sf tries to make it so that spatial data is not special in R. Instead, it aims to make all the standard tools and idioms that you know how to use on dataframes apply to sf dataframes, too.1 This means that the sf package uses similar strategies to other packages, like dplyr, for how its dataframes work. At its most basic, this means that reading/writing spatial data works very similarly to how reading/writing non-spatial data works:
library(sf)
bristol = sf::st_read('./data/bristol-imd.shp')
## Reading layer `bristol-imd' from data source `/home/lw17329/OneDrive/teaching/second_year_methods/spatial-practicals/data/bristol-imd.shp' using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 12 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 350239.4 ymin: 166638.9 xmax: 364618.1 ymax: 183052.8
## epsg (SRID): 27700
## proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
bristol
## Simple feature collection with 263 features and 12 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 350239.4 ymin: 166638.9 xmax: 364618.1 ymax: 183052.8
## epsg (SRID): 27700
## proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
## First 10 features:
## LSOA11CD imd_rank imd_score income employment education health crime
## 1 E01014485 11727 23.53 0.18 0.13 15.75 0.41 0.32
## 2 E01014486 2408 49.04 0.34 0.26 31.45 1.25 0.99
## 3 E01014487 16452 17.36 0.07 0.06 0.28 -0.43 0.82
## 4 E01014488 11047 24.64 0.19 0.13 4.99 0.11 1.07
## 5 E01014489 5318 37.42 0.22 0.18 10.67 0.80 1.31
## 6 E01014491 19934 13.70 0.08 0.09 1.02 -0.48 0.34
## 7 E01014492 6812 33.27 0.22 0.19 45.80 0.76 0.37
## 8 E01014493 7510 31.54 0.19 0.15 50.05 0.81 0.31
## 9 E01014494 20047 13.60 0.09 0.08 15.85 -0.18 0.49
## 10 E01014495 11742 23.51 0.13 0.13 28.23 0.63 0.40
## housing living_env idaci idaopi geometry
## 1 13.84 36.20 0.18 0.30 MULTIPOLYGON (((360256.3 17...
## 2 22.07 43.52 0.39 0.57 MULTIPOLYGON (((359676 1741...
## 3 22.84 62.34 0.04 0.14 MULTIPOLYGON (((359036.6 17...
## 4 16.49 38.16 0.28 0.31 MULTIPOLYGON (((359632.6 17...
## 5 20.97 62.02 0.29 0.36 MULTIPOLYGON (((359120.6 17...
## 6 18.62 41.51 0.05 0.15 MULTIPOLYGON (((359467.6 17...
## 7 20.54 22.51 0.35 0.21 MULTIPOLYGON (((353215.9 17...
## 8 20.09 30.64 0.29 0.23 MULTIPOLYGON (((352337 1770...
## 9 13.95 20.54 0.15 0.12 MULTIPOLYGON (((352702 1767...
## 10 17.86 27.29 0.21 0.15 MULTIPOLYGON (((353137.9 17...
The text above describes the dataframe we just read in. It gives most of the spatial information necessary to understand the geography:

- the bounding box (bbox) and the type of geometries (geometry type) in the dataframe,
- the coordinate reference system (its epsg code as well as its proj4 description).

“Namespaces are one honking great idea,” they say. A namespace is a sectioned-off area of a working environment (a space) where everything inside of that area is only addressable by referring to that space (by its name), and then to what you want that is inside of it. Thus, something like an address is a great example of a namespace: BS8 1SS shows us that the 1SS is within BS8, which itself is a subset of all BS postcodes, and all of which are within the UK (since other countries use a different postal code system).
R has a bit of a complex relationship with namespaces. By default, when you load a package using library, it dumps out everything from within that library. This is kind of like starting work by dumping your entire rucksack out onto your desk, then sifting through its contents every time you need something. Thus, I’ll encourage you to use :: to be specific about which namespace a tool is in. When I write sf::st_read, this means “use the st_read function from within the sf package.” While R lets you use st_read by itself, writing sf::st_read is clearer, since it’s always apparent which package is being called.
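As a tiny illustration using only base R (the stats package is attached by default, so nothing extra needs installing), both of these calls run the same function; the second just says explicitly where it lives:

```r
x = c(2, 4, 6)

# works because stats is attached when R starts...
sd(x)

# ...but this names the package, so there is never any ambiguity
stats::sd(x)
```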
Since an sf dataframe is simply a powered-up dataframe, we can treat it like a dataframe. This includes stuff like selection, subsetting, and aggregation. For an example, the School of Geographical Sciences is located in the Lower layer Super Output Area (LSOA) with the index E01014542. We can grab just that one LSOA in a similar fashion to how we operate with other dataframes:
is_geog = bristol$LSOA11CD == 'E01014542'
geog = bristol[is_geog, ]
geog
## Simple feature collection with 1 feature and 12 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 357787.7 ymin: 172880.4 xmax: 358931.7 ymax: 173851.2
## epsg (SRID): 27700
## proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
## LSOA11CD imd_rank imd_score income employment education health crime
## 55 E01014542 11034 24.66 0.04 0.03 22.01 0.67 0.9
## housing living_env idaci idaopi geometry
## 55 31.76 72.08 0.04 0.36 MULTIPOLYGON (((358357.9 17...
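This logical-vector subsetting is not special to sf; it is exactly how plain dataframes work, too. A quick sketch with a small made-up dataframe (the names and values here are hypothetical):

```r
# a small, made-up dataframe
df = data.frame(code = c('A', 'B', 'C'), val = c(10, 20, 30))

# a logical vector with one entry per row
is_b = df$code == 'B'

# subsetting with the logical vector keeps only the matching row(s)
df[is_b, ]
```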
Now, using the format of the data we explained above, we can see that this new geog dataframe has only one row. We can make a plot of this single row using the plot function. Let’s make a plot that highlights our geog LSOA on top of the rest of the Bristol LSOAs:
# sf::st_geometry extracts *just* the geometry of the LSOA in bristol
plot(sf::st_geometry(bristol))
plot(sf::st_geometry(geog), col='red', add=TRUE)
It’s hard to see the LSOA in the jumble there. So, let’s zoom in on the LSOA. Part of the nice thing about using the built-in plot function is that it provides us with built-in xlim and ylim options we can use to focus the map. However, to pick a good focus area for the map, we could zoom in on an area that’s near the focal LSOA. To do this, we can:

1. buffer the LSOA, making a larger dilated polygon
2. grab the bounding box of this dilated polygon, which focuses the map on the polygon itself, but keeps a bit of the area around it, too.

To do this:
buffered = sf::st_buffer(geog, 2000)
bbox = sf::st_bbox(buffered)
plot(sf::st_geometry(bristol), xlim=bbox[c('xmin','xmax')],
ylim=bbox[c('ymin','ymax')])
plot(sf::st_geometry(geog), col='red', add=TRUE)
Now that you’ve seen a few things about making maps, make similar zoomed-in maps highlighting each of the following areas:

- E01014709
- E01014510
- E01014540
- E01033348
- E01014491
- E01033362

sf makes it easy to make choropleth maps. To make a choropleth map from a column on our dataframe, we can use the plot function as follows:

plot(bristol['imd_score'])

With this realization:

- map the living_env deprivation score in the area around the Downs.
- map the crime deprivation score in the area around Lakota.
- map the housing deprivation score in the area around The Centre.

NOTE: Be sure to vary the buffer distance if need be.
This simple plotting works in a kind of subtle, magical way. R uses a system called method dispatch, which is quite complex. Basically, what is helpful to understand is that most functions in R decide “what to do” with their input depending on what class their input is.

For instance, take mean(). If the input is a vector, the method computes the mean for that input:
mean(c(2,3,7))
## [1] 4
However, after loading the mosaic library, the function can compute the mean of a formula, which is not a numeric argument. Before loading mosaic, however, the mean() function does not understand how to use formulae:
unloadNamespace('mosaic') #make sure mosaic isn't loaded
mean(~imd_score, data=bristol) # this will fail
## Warning in mean.default(~imd_score, data = bristol): argument is not
## numeric or logical: returning NA
## [1] NA
library(mosaic) #now load mosaic
mean(~imd_score, data=bristol) # and this will work
## [1] 26.95665
This is because the mean function decides what to do based on the class of its input. If you give it a vector or numeric argument, it’ll just compute the mean. If you give it a formula, the mean function will do something with that instead. But, mean doesn’t know how to deal with a formula by default; it has to be enabled to use formulas by the mosaic package.
The same idea happens for plot and our spatial data. For standard R dataframes, df['column'] gives you back a dataframe, but df$column gives you back the corresponding vector of values in that column. If we use sf_df['column'], we get back an sf dataframe, and plot knows how to make a choropleth map out of an sf dataframe with a single column. If we use plot(sf_df$column), then the result just looks like a standard plot of a vector.
To check yourself, we’ll look into a few things:

1. What class does class(bristol['imd_score']) show? And what class does class(bristol$imd_score) show?
2. Compare head(bristol['imd_score']) and head(bristol$imd_score). How do they look different?
3. Try plot(bristol$imd_score). See that it gives a very different result from plot(bristol['imd_score'])?
4. Try lattice::histogram(bristol['imd_score']), and then try lattice::histogram(bristol$imd_score). From our test of histogram, do you think that lattice knows how to use sf objects?

Spatial relationships are the structural/geographical relationships between observations in a dataset. You’ve worked a little bit with spatial relationships before, especially in your group projects. However, we can compute a large variety of spatial relationships using sf. One main conceptual model of how geographical objects can relate to one another is called the dimensionally-extended 9 intersection model, or DE9-IM for short. The DE9-IM, developed by Egenhofer & Franzosa (1991) to describe point set relations, is a standard model of how relationships between geographic data are represented. It is shown below for two shapes, A and B:
A few definitions are helpful to precisely define this idea:

- The interior of a shape is the set of all points inside the shape, but not on its edge; this is an open set.2
- The boundary of a shape is the edge of the shape itself.
- The exterior of a shape is everything else: all points that are neither in the interior nor on the boundary.

With these definitions, the DE9-IM is a system that allows us to define how geometric objects relate to one another, based on how their interiors, exteriors, and boundaries intersect with one another. Further, basic geometric objects (like points, lines, and polygons) have precise definitions of their interiors/exteriors/boundaries: a point’s interior is the point itself and its boundary is empty; a line’s boundary is its endpoints; a polygon’s boundary is the ring (or rings) that delimit it.
Together, this is somewhat arbitrary, but it allows us to talk about spatial relationships in a mathematically rigorous way. For instance, when we say that a line crosses a polygon, we are formally saying that the line’s interior intersects both the polygon’s interior and the polygon’s exterior. If the line is within a polygon, then no part of the line, neither its interior nor its boundary, intersects the exterior of the polygon. The full table of these relationships is somewhat complicated, but using the collection of how a geometry’s interior/exterior/boundary intersects another geometry’s interior/exterior/boundary, we can express the relationships between geographic observations.3
The full table of relations is shown below:
Name in Plain English | Description | sf function name |
---|---|---|
A intersects B | Some point in A is also in B (opposite of disjoint) | st_intersects(A,B) |
A is disjoint from B | No point in A is also in B (opposite of intersects) | st_disjoint(A,B) |
A touches B | A point on A’s boundary touches a point on B’s boundary, but their interiors do not intersect | st_touches(A,B) |
A crosses B | A’s interior intersects both B’s interior and B’s exterior | st_crosses(A,B) |
A is within B | A’s interior intersects B’s interior, and no part of A intersects B’s exterior (opposite of contains) | st_within(A,B) |
A contains B | B’s interior intersects A’s interior, and no part of B intersects A’s exterior (opposite of within) | st_contains(A,B) |
A overlaps B | A’s interior intersects B’s interior, and A’s interior intersects B’s exterior | st_overlaps(A,B) |
A covers B | No part of B intersects A’s exterior (opposite of covered-by) | st_covers(A,B) |
A is covered by B | No part of A intersects B’s exterior (opposite of covers) | st_covered_by(A,B) |
A equals B | Every point in A is also in B, and every point in B is also in A | st_equals(A,B) |
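You can experiment with these predicates on tiny geometries built from well-known text; sf::st_relate even returns the raw DE9-IM pattern string. A minimal sketch (the coordinates here are made up):

```r
library(sf)

# a small square, and a line that starts inside it and ends outside it
poly = sf::st_as_sfc('POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))')
line = sf::st_as_sfc('LINESTRING(1 1, 3 1)')

# the raw DE9-IM pattern describing how line relates to poly
sf::st_relate(line, poly)

# the line's interior hits both the interior and the exterior of the
# polygon, so the line crosses the polygon
sf::st_crosses(line, poly, sparse=FALSE)
```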
These relations are “fundamental” in the sense that they describe a topology for geography. A topology is a formal mathematical representation of the relationships between objects in a set; most of the relationships you encounter in geographical analysis are fundamentally topological relationships.4
One of the most common, fundamental, and thoroughly-used relations is touching. sf makes it very fast to find out who’s touching, and where. To collect the LSOAs that touch the School of Geography’s LSOA, we can use the st_touches operator:
# sparse=FALSE asks for the full logical matrix described below
nearby_touches = sf::st_touches(bristol, geog, sparse=FALSE)
However, by default, st_touches(a,b) takes two spatial data frames of many geometries; our result, nearby_touches, is expressed in terms of a matrix of relationships between all the LSOAs in bristol and all the geometries in geog. This is expressed as a special kind of matrix. It has \(n\) rows, where \(n\) is the number of LSOAs in bristol, and \(k\) columns, where \(k\) is the number of geometries in geog. We know, however, that there is only one geometry in geog, but sf gives us back a full matrix anyway. So, for us to simply get back the \(n\)-length vector of TRUE/FALSE describing whether a geometry touches geog, we need to use the apply function.
apply takes a matrix, a margin, and a function. A margin is kind of like the direction the function will be used in. If we supply a margin of 1, the function will be computed over all the rows of the input matrix. If the margin is 2, then the function will be applied down the columns of the input matrix. We just want to know if any observation in bristol touches geog, so we’ll use the any function. And, since we only have one column, this gives us back an \(n\)-length vector which is TRUE when that row of bristol touches geog, and is FALSE when that row of bristol does not touch geog.
nearby_touches = apply(nearby_touches, 1, any)
head(nearby_touches)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
Thus, we see that the first six observations in our table do not touch geog.
How many LSOAs touch geog?5
Now that we have the TRUE/FALSE values for the geometries that touch geog, we can make a map of just the observations that touch geog using the standard subsetting methods you already know about:
plot(sf::st_geometry(bristol), xlim=bbox[c('xmin','xmax')],
ylim=bbox[c('ymin','ymax')])
plot(sf::st_geometry(bristol[nearby_touches,]),
col='lightblue', add=TRUE)
plot(sf::st_geometry(geog), col='red', add=TRUE)
How could you make the same map, but color in the observations not touching geog?6
Sometimes, you want something a little more than a primitive spatial operation like those defined in the table above. Like, sometimes you may want to string together a few spatial operations. For instance, what if we wanted to get all of the LSOAs within 500 meters of geog? Well, breaking our query down:

LSOAs within 500 meters of geog

We see that there is a within statement, so we probably need a within relation. But, we want to apply a kind of distance filter: within 500 meters of geog. Fortunately, we can do that with tools you’ve already seen: st_buffer. We can build a 500-meter buffer around geog, which is our search_area. Then, any observations that are contained within that area will be considered as within 500 meters.
search_area = sf::st_buffer(geog, 500)
within_500m = sf::st_within(bristol, search_area, sparse=FALSE)
within_500m = apply(within_500m, 1, any)
plot(sf::st_geometry(bristol), xlim=bbox[c('xmin','xmax')],
ylim=bbox[c('ymin','ymax')])
plot(sf::st_geometry(bristol[within_500m,]),
col='lightblue', add=TRUE)
plot(sf::st_geometry(geog), col='red', add=TRUE)
Woah! That’s fewer than the number that actually touch geog! This may not be what you expected, and here’s why: human beings commonly understand anything within 500 meters of geog to mean where any part of the area is within 500 meters of any part of geog… we have a slightly different semantics for within than the DE9-IM does.
So, thinking a little more critically about what we want, we may instead do something like:

1. build a search area of 500 meters around geog
2. select all LSOAs with any part in that search area.

For “any part in that search area”, we need to think about intersects, not within. So, using that concept of what our search query means, we do:
nearby_within_500m = sf::st_intersects(bristol, search_area, sparse=FALSE)
nearby_within_500m = apply(nearby_within_500m, 1, any)
plot(sf::st_geometry(bristol), xlim=bbox[c('xmin','xmax')],
ylim=bbox[c('ymin','ymax')])
plot(sf::st_geometry(bristol[nearby_within_500m,]),
col='lightblue', add=TRUE)
plot(sf::st_geometry(bristol[within_500m,]),
col='lavender', add=TRUE)
plot(sf::st_geometry(geog), col='red', add=TRUE)
That looks more like what folks think of when they say anything within 500 meters.
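You can see the within/intersects distinction with throwaway geometries, too (coordinates made up): a polygon only partly inside a search area intersects it, but is not within it:

```r
library(sf)

# a square search area
search = sf::st_as_sfc('POLYGON((0 0, 4 0, 4 4, 0 4, 0 0))')

# a polygon that straddles the right edge of the search area
straddler = sf::st_as_sfc('POLYGON((3 1, 5 1, 5 2, 3 2, 3 1))')

sf::st_within(straddler, search, sparse=FALSE)      # FALSE: not fully contained
sf::st_intersects(straddler, search, sparse=FALSE)  # TRUE: some part overlaps
```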
For the set of interesting areas we examined before:

- E01014709
- E01014510
- E01014540
- E01033348
- E01014491
- E01033362

repeat the 500-meter queries above.

The idea of st_touches, in graph theory, can be used to build an adjacency matrix that connects observations together. There are many different ways to build adjacency matrices in R, but we won’t worry about that now. Suffice it to say, one useful concept that is simple to compute from an adjacency matrix is the higher-order neighbor of each observation. For our example above, all of the observations that touch geog are called first-order neighbors of geog. Second-order neighbors of geog are those that touch geog’s first-order neighbors! This proceeds so on and so forth until we get to the last order neighbor, which is the observation the furthest number of steps from geog. So, if you’re up for it…
second_order = sf::st_touches(bristol, bristol[nearby_touches, ], sparse=FALSE)
second_order = apply(second_order, 1, any)
gives us the second-order neighbors of geog.

1. Make a map of the second-order neighbors of geog; do not highlight geog in red. Is there anything surprising about the second-order neighbors of geog?
2. Using second_order, use the same concepts to make a map of the third_order and fourth_order neighbors.
3. For an extra-hard challenge, it’s common to say that the \(n\)-th order exclusive neighbors are those neighbors that first occur at the \(n\)-th order. By convention, the observation itself is considered the “zeroth” order neighbor, and is always excluded from higher-order neighbors. For our second_order example above, we can filter out observations we’ve seen at lower orders using the exclusive or function, xor, and the OR operator, |:

second_order_exclusive = xor(second_order, nearby_touches | is_geog)

In this, we want to get the elements that are TRUE only in second_order and not in either nearby_touches or is_geog. Extending this idea:

- make a map of the nearby_touches and second_order_exclusive neighbors.
- compute third_order_exclusive and fourth_order_exclusive.
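The xor trick can be checked on plain logical vectors (toy values here, not our real neighbor structure):

```r
second = c(TRUE, TRUE, FALSE, TRUE)   # everything reachable within two steps
lower  = c(TRUE, FALSE, FALSE, FALSE) # first-order neighbors, or the site itself

# keep only the entries first reached at the second order
xor(second, lower)
```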
The last major kind of spatial relationship in sf focuses on distances. But, instead of using raw distances directly, a kernel function is a smoothly-varying expression of how close two observations are. Kernel functions usually take the distance between observations and then express it as a value between zero and one, where one means really close and zero means really far. This reflects a kind of standardized nearness score, and is incredibly common in machine learning, data science, and geographical analysis.
We can get the distances between geog and every other observation in bristol using sf::st_distance:
distances = st_distance(bristol, geog)
To see all the distances, use:
# because distances have units, and lattice doesn't understand them,
# we *have* to use hist and not lattice::histogram!!
hist(distances)
Add a column to bristol called distance_from_geog. Make a choropleth map of that new column.
Now, given all these distances, we can compute a kernel around geog using the following approach:
kernel = exp(-as.numeric(distances)/1500)
With this, we’re saying that our nearness score is: \[ k(i,j) = e^{\left(-\frac{d_{ij}}{1500}\right)} \] The \(1500\) value is called the bandwidth, and it reflects a kind of baseline level of nearness in the map. This is a parameter that we can tune to increase or decrease the average nearness in the map.
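To get a feel for the bandwidth, compare the nearness score at a fixed distance under two bandwidths (the numbers here are just for illustration):

```r
nearness = function(d, bandwidth) exp(-d / bandwidth)

# at 1500m with a 1500m bandwidth, the score is e^-1
nearness(1500, 1500)

# doubling the bandwidth raises the score at that same distance to e^-0.5,
# so the whole map reads as "nearer"
nearness(1500, 3000)
```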
To visualize the kernel, you can assign a new column and make a map:
bristol['geog_building_kernel'] = kernel
plot(bristol['geog_building_kernel'])
We can wrap this kernel up as a function of distance and bandwidth:

nexp_kernel = function(distance, bandwidth=1500){
  u = as.numeric(distance) / bandwidth
  return(exp(-u))
}
In general, the kernel function defines u = distance / bandwidth
, and then uses the equations you see online as standard kernel shapes. So, for one more example, we could implement the sigmoid kernel as:
sigmoid_kernel = function(distance, bandwidth=1500){
  u = as.numeric(distance) / bandwidth
  return( (2 / pi) * (1/(exp(u) + exp(-u))) )
}
Then, make a map of the kernel using the same bandwidths you did in part 1.
Now, we can talk about how spatial relationships help us talk about spatial relatedness and spatial similarity. Since we now know how to use DE9-IM language to define spatial relationships, we can just assume that we’ve picked some method of defining observations that are nearby and some that are not. Moving to something a bit more abstract, let’s move away from geog, and call it just any site, \(i\), with an observed value for some data \(x\), called \(x_i\). Then, let some other observations, \(j\), be found in our spatial search nearby \(i\). For now, let’s call the collection of these observations the neighborhood of site \(i\), and let’s spell that in Greek as \(\eta_i\). Then, for every observation in the neighborhood of geog (which is spelled \(j \in \eta_i\) in mathematical language), we observe data at the other sites \(j\) and call them \(x_j\).
If we wanted to summarize the values \(x_j\) for \(j\) in the neighborhood of \(i\), one common method is the local sum. This measure adds up all \(x_j\) for \(j \in \eta\): \[ \boldsymbol{l}_i = \sum_{j\in\eta_i} x_j \] Breaking this expression down, we see that the local sum for observation \(i\) is the sum of all of the features in the neighborhood around \(i\). For some other observation, the neighborhood will be different, as will the local sum. In addition, if we change what \(\eta_i\) is made of, like by changing from first-order to second-order neighbors, or from a very narrow/small bandwidth to a wide/large bandwidth, we will change the local sum.
The local sum is useful when we need the count of something nearby \(i\). This might make sense when you’re interested only in the total number of things in the neighborhood, like “I aced that SM2 test, how many pubs are within a 100-meter radius of Wills?” But, this does not give us a summary of what nearby areas are like individually. Instead, we might want to know the average, median, or standard deviation for all \(x_j\) in the neighborhood of \(i\). For the data we’ve been working with on the Bristol index of multiple deprivation, the local average is a useful statistic, since it gives us a single score for the deprivation surrounding geog that can be compared to the deprivation at geog itself. Thus, local weighted averages are the most common form of summary by far: \[
\boldsymbol{l}_i = \frac{1}{n_{\eta_i}}\sum_{j \in \eta_i} w_{ij}x_j
\] where \(n_{\eta_i}\) is the number of observations in \(i\)’s neighborhood and \(w_{ij}\) is the weight relating sites \(i\) and \(j\). For some kinds of spatial relationships, \(w_{ij}\) is only zero or one, such as when we used st_touches(bristol, geog): there, \(w_{ij}\) is TRUE (or \(1\)) when row \(j\) of bristol touches geog, and is zero when they do not touch. Alternatively, \(w_{ij}\) can be some continuous value expressing the geographical nearness between sites \(i\) and \(j\), like the kernel functions we used before. Regardless, this kind of local weighted sum is the most common way of summarizing the surroundings of an observation.8 As a matter of terminology, this local weighted sum for variable \(x\), when it’s taken for every site instead of just at a single site, is usually called the spatial lag of \(x\), and likewise we’d refer to the spatial lag around \(i\) as the local weighted sum for site \(i\).
To actually compute these values, we can do something similar to what we did before:

1. split the dataframe according to observations being in our spatial search area
2. summarize the observations that are within our spatial search area using an appropriate function
Below, we can use our nearby_touches vector to extract only the rows of bristol that touch geog:
bristol[nearby_touches, ]
## Simple feature collection with 8 features and 13 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 357138.7 ymin: 172034.8 xmax: 359550.4 ymax: 174170.8
## epsg (SRID): 27700
## proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
## LSOA11CD imd_rank imd_score income employment education health crime
## 52 E01014538 22829 11.15 0.03 0.01 0.48 -0.27 -0.04
## 53 E01014539 19569 14.06 0.07 0.06 0.38 -0.58 0.78
## 54 E01014540 9221 27.91 0.12 0.08 18.87 0.23 1.36
## 63 E01014551 20037 13.61 0.04 0.03 7.13 -0.18 0.14
## 66 E01014554 22105 11.74 0.03 0.03 5.02 -0.54 0.81
## 245 E01033342 10818 25.08 0.07 0.05 19.11 0.53 1.02
## 246 E01033344 8994 28.33 0.11 0.07 7.90 0.27 1.37
## 247 E01033345 14154 20.11 0.02 0.01 7.15 -0.59 1.45
## housing living_env idaci idaopi geometry
## 52 24.08 54.13 0.10 0.19 MULTIPOLYGON (((357819.1 17...
## 53 19.90 48.72 0.08 0.09 MULTIPOLYGON (((358951 1740...
## 54 29.51 60.39 0.37 0.23 MULTIPOLYGON (((358672.4 17...
## 63 20.31 61.93 0.06 0.17 MULTIPOLYGON (((357442 1734...
## 66 14.30 49.44 0.06 0.14 MULTIPOLYGON (((357611.6 17...
## 245 41.72 51.00 0.15 0.38 MULTIPOLYGON (((358856 1736...
## 246 31.23 81.40 0.46 0.20 MULTIPOLYGON (((358121 1725...
## 247 32.78 63.40 0.10 0.05 MULTIPOLYGON (((358427.5 17...
## geog_building_kernel
## 52 1
## 53 1
## 54 1
## 63 1
## 66 1
## 245 1
## 246 1
## 247 1
and, if we wanted the average of the imd_score from within that subset, we could grab the mean from the imd_score column of that new subsetted data:
mean(bristol[nearby_touches, ]$imd_score)
## [1] 18.99875
1. Compute the overall mean of bristol$imd_score.
2. Compute the local average of imd_score around geog using nearby_within_500m.
3. Is that local average larger or smaller than the local average around geog computed from nearby_touches?
4. Since nearby_touches includes fewer, closer observations than nearby_within_500m, how do you geographically interpret the fact that one has a higher deprivation score than another?

weighted.mean allows you to take a weighted mean in R. Using weighted.mean:

1. compute the local weighted average of bristol$imd_score using kernel as the weights. Is this larger or smaller than the local weighted average from nearby_touches?
2. compute the local weighted average of bristol$imd_score using nearby_touches as the weights vector. Is this the same as the local weighted average around \(i\)? Using the equation for the local weighted mean above, why does nearby_touches work this way?
work this way?E01014709
)E01014510
)E01014540
)E01033348
)E01014491
)E01033362
)Compute the local average of observations that touch each area of interest.
Make a dataframe9 for all the areas of interest and geog
containing:
lsoa_id
for that rowimd_score
(i.e. the value of imd_score
in that area of interest)imd_score
(i.e. the local weighted average of imd_score
in that area of interest)Compute the raw correlation between those two columns using cor
. Disregarding any statistical significance (for now), is the relationship between those sites and their surroundings positive or negative? Put another way, are the surroundings of those sites similar to or dissimilar to the areas of interest themselves?
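cor just takes the two numeric columns; a sketch with made-up scores (the same toy numbers used in the dataframe footnote):

```r
aoi = data.frame(
  lsoa_id = c(1, 2, 3),
  imd_score = c(20, 45, 99),
  lag_imd_score = c(22, 34, 79)
)

# a positive correlation: these (made-up) sites resemble their surroundings
cor(aoi$imd_score, aoi$lag_imd_score)
```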
Spatial relationships are fundamental to quantitative geography. In the abstract, we need to represent geographical structures in ways that we can use when we build models. Above, we’ve seen many different ways that spatial relationships are represented in mathematical or computational structures. We’ve also seen how we can use these structures to summarize the local structure of a map in a mathematical way; instead of simply computing statistics about the map as a whole, we can build local averages or local sums that describe the structure of a map around an area of interest. This is a very common idea in quantitative geography, and helps us do a lot of different, neat things.
Beyond quantitative geography, Geographical Information Science also focuses extensively on spatial relationships. The Dimensionally Extended 9-Intersection Model we discussed is a bedrock of GIS; the idea of the spatial predicate is a fundamental component of working with spatial databases and understanding or expressing spatial relationships. Despite the fact that these relationships have been formalized, however, we’ve also now seen how they may not exactly reflect the spatial relationship that springs to mind when the technical term (e.g. within) is used. Overall, this means it is important to understand the formal meaning of a predicate, and it reinforces that you need to be very clear and precise in your thinking when discussing spatial relationships. Like any formal language (e.g. mathematics, programming languages), the formal language of spatial relationships can be difficult to speak correctly the first time, so be sure to play around with it so that you can get it right.
Finally, we’ll move forward by seeing the ways in which geographical structures can be used to estimate spatial statistics and characterize the strength and direction of spatial relationships. This will extensively use sf and the ideas covered in this notebook, so please do be sure to review the concepts behind the local weighted average.
This is a critical idea in system design, and even has its own fancy Russian-sounding theorem. Put a little more simply, this reduces to: special cases should not be special enough to break the rules.↩
A clever student might think this is a reprisal of Zeno’s Dichotomy Paradox. Indeed, an open set in mathematics is kind of like that… for any point \(p\) that’s near the edge of the interior, we can always pick some really small distance \(\epsilon\) so that some point \(p'\) is closer to the edge of the interior, but still is inside the interior. Relocate \(p\) to \(p'\). Then, keep looking for another (smaller) \(\epsilon\) that keeps a new \(p'\) within the interior. In the limit, \(\epsilon\) will get smaller and smaller. However, because \(\epsilon\) can get infinitesimally small, we actually never have to pick a \(p'\) that is outside of the interior. In a slightly more formal way, this is exactly the mathematical definition of an open set. A closed set is one where there is some \(p'\) where we cannot pick \(\epsilon\) so that we get closer to the edge; at that point, we either are in or we are out.↩
The DE9IM only applies for vector data of infinite resolution; raster data is much simpler to understand and represent, since it fundamentally is a collection of regularly-sampled positions with known values. There is no geometry other than the grid that raster data represents. And, practically speaking, computers can only represent vector data with finite precision, so the interior/exterior of shapes is not actually a fully open set. But, the DE9IM is a conceptual model for how we’ll talk about spatial relationships between geometries.↩
In case you ever want to dress up what you do as a quantitative geographer, you can always call yourself a mathematician studying applied topology, but folks might look at you sideways.↩
How would you get the sum total of TRUEs in a vector? Recall that, under the hood, a boolean variable is represented with one for TRUE and zero for FALSE.↩
In R, you can change TRUE to FALSE by doing !TRUE. Try this yourself at the console. This is because ! is the negation operator in R, and will flip FALSE to TRUE and vice versa.↩
If you pick the uniform kernel from wikipedia, I won’t stop you, but I can promise it ain’t the most illuminating experience.↩
But, it is definitely not the only way; we use this idea of summarizing the neighborhood of \(i\) in a very general way in quantitative geography. For instance, another well-known spatial statistic we will cover soon is built from \(\boldsymbol{l}_i = \frac{1}{n_{\eta_i}}\sum_{j \in \eta_i} w_{ij}(x_i - x_j)^2\), the local dissimilarity between \(i\) and its surroundings.↩
Making dataframes by ‘hand’ can be done using the data.frame function. For instance, if I had three areas of interest and their lsoa_id, imd_score, and lag_imd_score, I could do:
data.frame(list(
lsoa_id = c(1, 2, 3),
imd_score = c(20, 45, 99),
lag_imd_score = c(22, 34, 79)
)
)
↩