To this point in Spatial Modelling (I or II), we have primarily dealt with regression models that happen to use spatial data. We have not really examined how spatial structures or relationships themselves can be modeled. In the Regression module, we did talk a little about spatial fixed effects, which allow for a different mean/baseline in each area.
library(sf)
library(mosaic)
library(spdep)
bristol = sf::read_sf('./data/bristol-imd.shp')
Let’s say we need the ward ID for each LSOA. Right now, the LSOAs don’t have information on their wards built into the dataframe. We do, however, have the Bristol ward boundaries in a separate geojson file.
Read in the data on ward boundaries:
wards = sf::read_sf('./data/wards.geojson')
What steps are necessary to join up the LSOAs that are within each ward? One workflow you’ve learned already might be:
First, we need to ensure that our data is in the same map projection.
To reproject the wards dataframe into the same projection as your bristol data, we use the sf::st_transform method. I find it easiest to set projections based on the exact projection string from the data we’re targeting:
wards = sf::st_transform(wards, sf::st_crs(bristol))
After we do the reprojection, we can check whether the data is in exactly the same projection:
sf::st_crs(wards) == sf::st_crs(bristol)
## [1] TRUE
and, we can check if everything “looks right” by making a map:
plot(bristol['imd_score'], reset=FALSE, lwd=.1)
plot(sf::st_geometry(wards), border='white', lwd=2, add=TRUE)
Finally, we need to do a spatial join to link the ward identifiers up to the LSOAs. This is a multi-step operation, so let’s walk through it.
1. A first attempt might be:
bristol_join = sf::st_join(bristol, wards, join=sf::st_within)
Do this, then plot(bristol_join['ward_id']). Why does that not look like you might expect?
2. A clever person might object, and think this might work:
bristol_join = sf::st_join(wards, bristol, join=sf::st_contains_properly, left=FALSE)
Why does that not work how you expect?
3. The way I’d recommend doing this is first getting the centroids of the lower-level units:
bristol_centroids = sf::st_centroid(bristol)
## Warning in st_centroid.sf(bristol): st_centroid assumes attributes are
## constant over geometries of x
Then, we can use st_within to assign each LSOA to the ward that contains that LSOA’s centroid!
bristol_centroid_join = sf::st_join(bristol_centroids, wards, join=sf::st_within)
Now, to get the ward_id back alongside our original polygons, you can do a dplyr::left_join based on the LSOA code. First, we should keep only the columns we need: LSOA11CD, name (the ward name), and ward_id. And, since we don’t want multiple geometry columns hanging around, we need to drop the geometry from bristol_centroid_join:
clean_ward_centroid = sf::st_drop_geometry(bristol_centroid_join)
clean_ward_centroid = clean_ward_centroid[c('LSOA11CD', 'name', 'ward_id')]
names(clean_ward_centroid) = c('LSOA11CD', 'ward_name','ward_id')
Now, our new clean_ward_centroid is a plain dataframe with three columns: the LSOA identifier, matched up to the ward_id and ward_name of the ward that contained that LSOA’s centroid. Next, we want to join this data back into our original Bristol LSOA data. A left_join takes two dataframes, x and y, and matches every row of x to a row in y, based on a “key” column that is shared between the two frames. So, for instance, looking in detail at a few rows of bristol:
## Simple feature collection with 10 features and 1 field
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 351942.9 ymin: 173775 xmax: 360652.9 ymax: 177452.7
## epsg (SRID): 27700
## proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
## # A tibble: 10 x 2
## LSOA11CD geometry
## <chr> <MULTIPOLYGON [m]>
## 1 E01014485 (((360256.3 174422.7, 360255 174422, 360251 174422, 360224 17…
## 2 E01014486 (((359676 174114.7, 359661.4 174127.8, 359661.7 174128.7, 359…
## 3 E01014487 (((359036.6 174653.1, 359035.2 174668.6, 359034.5 174676, 359…
## 4 E01014488 (((359632.6 174806.4, 359634.2 174807.4, 359648.1 174816, 359…
## 5 E01014489 (((359120.6 174097.5, 359120.3 174098.5, 359119.6 174102.1, 3…
## 6 E01014491 (((359467.6 175130.1, 359466.3 175131.7, 359461.6 175137.5, 3…
## 7 E01014492 (((353215.9 177024.6, 353210.2 177011.1, 353209.3 177009.1, 3…
## 8 E01014493 (((352337 177024, 352336.3 177025.2, 352331 177033.5, 352330 …
## 9 E01014494 (((352702 176779, 352703.2 176782.1, 352704.6 176785.9, 35271…
## 10 E01014495 (((353137.9 176866.4, 353138.4 176867, 353138.6 176867.3, 353…
and a few rows of our clean_ward_centroid:
## # A tibble: 10 x 3
## LSOA11CD ward_name ward_id
## <chr> <chr> <chr>
## 1 E01014485 Ashley E05010885
## 2 E01014486 Ashley E05010885
## 3 E01014487 Ashley E05010885
## 4 E01014488 Ashley E05010885
## 5 E01014489 Ashley E05010885
## 6 E01014491 Ashley E05010885
## 7 E01014492 Avonmouth & Lawrence Weston E05010886
## 8 E01014493 Avonmouth & Lawrence Weston E05010886
## 9 E01014494 Avonmouth & Lawrence Weston E05010886
## 10 E01014495 Avonmouth & Lawrence Weston E05010886
We want to stick the ward_id from clean_ward_centroid into each row of bristol where LSOA11CD is the same in both dataframes. This would match up each LSOA with its ward_id.
To do this, we’ll use dplyr joins. dplyr has a bunch of different kinds of joins, but they all have a similar structure: they take two dataframes (x, y) and a “key” column whose values are used to match each row of x to rows in y, and all columns in x and y appear in the final output.
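To see what this matching looks like concretely, here is a tiny, made-up example (the x and y frames and the key column are invented purely for illustration):
# tiny made-up example: 'key' is shared between x and y
x = data.frame(key=c('a', 'b', 'c'), value_x=1:3)
y = data.frame(key=c('a', 'c', 'd'), value_y=c(10, 30, 40))
dplyr::inner_join(x, y, by='key')  # only keys 'a' and 'c' survive
dplyr::left_join(x, y, by='key')   # 'b' is kept, with NA for value_y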
- inner_join(x, y, by='key') is a join between two tables x and y using a column (specified with the by option). The inner part means that only rows of x and y that have a matching value in the by column are kept; all rows in x or y with no match are dropped from the result.
- left_join(x, y, by='key') keeps all rows in x, but puts NA (missing) values wherever a row of x can’t be matched to any row of y.
- right_join(x, y, by='key') keeps all rows in y, putting missing values anywhere there is no match in x.
- full_join(x, y, by='key') (dplyr’s “outer” join) uses all rows of both x and y, with missing values on the y side if a row of x has no match in y, and missing values on the x side if a row of y has no match in x.
So, for our purposes, we want to match the ward_id inside clean_ward_centroid back up to the LSOAs in bristol. And, we want to see if our join failed to match up any wards to LSOAs. This means we want either a left or right join, since we want to keep values from bristol that might not have been matched successfully in our spatial join. Thus, I’d recommend:
bristol = dplyr::left_join(bristol, clean_ward_centroid, by='LSOA11CD')
# The equivalent Right Join would be flipped from the left join:
# bristol = dplyr::right_join(clean_ward_centroid, bristol, by='LSOA11CD')
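To make sure the join matched every LSOA, it’s worth a quick check for missing ward identifiers. This is just a minimal sanity-check sketch; any unmatched rows would show up as NA in the new ward_id column.
# count LSOAs that failed to get a ward_id in the left join
sum(is.na(bristol$ward_id))
# if this is not zero, inspect the unmatched LSOAs:
# bristol[is.na(bristol$ward_id), 'LSOA11CD']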
Now, to check everything’s done:
plot(bristol['ward_id'], reset=F)
plot(sf::st_centroid(bristol), add=T, col='black', pch=20)
## Warning in st_centroid.sf(bristol): st_centroid assumes attributes are
## constant over geometries of x
## Warning in plot.sf(sf::st_centroid(bristol), add = T, col = "black", pch =
## 20): ignoring all but the first attribute
plot(sf::st_geometry(wards), add=T, border='white',lwd=2)
Ooh! Look at that weird one! Let’s zoom in:
plot(bristol['ward_id'], reset=F,
xlim=c(357750, 359500), ylim=c(175500, 175700))
plot(sf::st_geometry(sf::st_centroid(bristol)),
     add=T, col='black', pch=20,
     xlim=c(357750, 359500), ylim=c(175500, 175700))
plot(sf::st_geometry(wards), add=T, border='white',lwd=5,
xlim=c(357750, 359500), ylim=c(175500, 175700))
Oof: even though this assignment is “correct”, it still yields results that we might consider counterintuitive. It’s always important to visualize the results of your joins when you do them, to make sure they look like what you expect.
We’ve talked before about structural relationships between features across a geography. Called “spatial autocorrelation,” we discussed this in terms of how a variable is correlated with itself over a given geography. So, for instance, we’ve spent a long time looking at or thinking about something that looks like this:
dev.off() # you may need to run this after making complicated maps!!
## null device
## 1
library(spdep)
nearby = spdep::dnearneigh(sf::st_centroid(bristol), 0, 2000)
nearby = spdep::nb2listw(nearby)
bristol['nearby_living_env'] = spdep::lag.listw(nearby, bristol$living_env)
plot(nearby_living_env ~ living_env, data=bristol, pch=20)
abline(lm(nearby_living_env ~ living_env, data=bristol), col='orangered', lwd=2)
abline(v=mean(bristol$living_env), col='slategrey', lty='dashed', lwd=2)
abline(h=mean(bristol$nearby_living_env), col='slategrey', lty='dashed', lwd=2)
But, most of our interest in social science comes from building models where some data \(X\) allows us to predict or explain some outcome, \(y\). How can we use geographical structure to account for this?
In our last lecture, we talked about four distinct kinds of spatial effects: spatial heterogeneity, spatial dependence, spatial nonstationarity, and spatial contextual effects.
Take the linear model with a single effect \(\beta\) and mean \(\alpha\).
\[ y = \alpha + x\beta + \epsilon\]
Let’s consider a really simple model relating housing quality and crime:
model = lm(crime ~ housing, data=bristol)
This model predicts crime prevalence in an LSOA using the housing deprivation of that LSOA. There are a few rudimentary theories we could posit about this structural relationship. For instance, low quality housing may be easier to break into or vandalize, making it more likely to be targeted by criminals.1 \[ \begin{split} crime = & \ \ \ \ \ baseline \ \ crime \\& + (housing\_deprivation) \bullet (extra\ \ crime\ \ per\ \ housing\_deprivation) \\&+ unexplained \ \ crime \end{split}\] spelled in a more terse way, this is: \[ y_i = \alpha + x_i\beta + \epsilon_i\] where \(y_i\) is crime in an LSOA, \(x_i\) is housing quality in that LSOA, and \(\epsilon_i\) is the error in predicting crime using housing deprivation in that LSOA.
One way we can figure out if our model might need to include spatial effects is to look at the model’s errors, or residuals: \[ e_i = y_i - \hat{y}_i = y_i - (\hat \alpha + x_i\hat{\beta})\]
So, let’s extract the residuals at LSOA level:
bristol['residual'] = residuals(model)
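Before reaching for any formal test, a quick map of these residuals gives a sense of whether they are spatially patterned. This is just a sketch using the same base sf plotting we’ve used above; clusters of similar colours hint that the model’s errors still have spatial structure in them.
# map the model residuals by LSOA; look for clusters of similar values
plot(bristol['residual'])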
- Using a boxplot, examine how the residuals are distributed across wards.2
- Fit a spatial fixed effect model that gives each ward its own baseline. Use -1 to avoid fitting a global intercept and just fit each ward mean: fe_model = lm(crime ~ -1 + housing + ward_name, data=bristol)
- Interpreting the ward_name effect as the baseline response in that ward, name two areas where there is a significantly higher-than-zero baseline crime in that ward, and two areas where there is significantly lower-than-zero baseline crime.
- The lmtest::bptest we discussed in the regression module can be used to check for spatial heterogeneity in variance, too. Using lmtest::bptest(model, ~ward_name, data=bristol), test whether or not there are significant differences in the variability of prediction errors between different neighborhoods.
- Using the same bptest, does the fe_model exhibit spatial heterogeneity in its prediction error? Run the bptest again to check. Thinking about distribution means and variances, why should you not be surprised that the spatial fixed effect model has this result from the bptest?
Next, we’ll move to spatially-dependent models. These are models that involve some amount of feedback or structural interaction between the outcome at one site (say, \(y_i\)) and the outcomes \(y_j\) at other sites that are nearby. Usually, these kinds of models are expressed in terms of the feedback between sites. They’re used to model spillovers, mainly in the analysis of social systems. One very common model used to express these kinds of feedbacks is called the simultaneous autoregressive model. It’s simultaneous because each site \(i\) is modeled as if it were affecting other sites \(j\) all in the same time period. And, it’s autoregressive because parts of \(y\) will be regressed on itself. Mathematically, the model is spelled: \[ y_i = \rho \sum_j^N w_{ij}y_j + x_i\beta + \epsilon_i\] for some outcome at site \(i\). Again, recall that \(\sum_j^N w_{ij}y_j\) should be thought of as the local average around \(y_i\), or the spatial lag of \(y_i\). Thus, this model uses the local average of \(y\) around each site to predict \(y_i\) itself.
This is a somewhat confusing model given the fact that \(y\) appears on both sides of the equation.4 We won’t delve into its interpretation here. But, given our nearby spatial weighting idea, this models the crime at site \(i\) as a function of the crime within \(2000\) meters and the housing deprivation at that site. In truth, we always end up removing \(y_j\) from the “wrong” side of the equation to isolate \(y_i\) on the left hand side alone. But, this requires some matrix mathematics. At a conceptual level, crime in the area around site \(i\) is itself affected by crime at \(i\); this is exactly what generates a feedback. However, because we have to isolate the effect of each \(y_i\) on other \(y_j\) in order to estimate the model, it is only the spatial structure of housing deprivation that ends up mattering: we have to get all of the \(y\) on one side of the equation in order to estimate the model.
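To make the \(\sum_j^N w_{ij}y_j\) term concrete, here is a minimal sketch (using the nearby weights we built earlier; the new nearby_crime column is just for illustration) of the spatial lag of crime plotted against crime itself. This local average is exactly the quantity the simultaneous autoregressive model places on the right-hand side.
# spatial lag of crime: the average crime score within 2km of each LSOA centroid
bristol['nearby_crime'] = spdep::lag.listw(nearby, bristol$crime)
plot(nearby_crime ~ crime, data=bristol, pch=20)
abline(lm(nearby_crime ~ crime, data=bristol), col='orangered', lwd=2)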
We’ll fit one of these models now, and maybe take a look at the estimates of spillover from this model.
- Check whether the residuals of our model are spatially autocorrelated with the spdep::lm.morantest function, using the nearby weights we made earlier.
- Fit the simultaneous autoregressive model: sar_model = spdep::lagsarlm(crime ~ housing, nearby, data=bristol)
- Examine summary(sar_model); think of rho as an analogue of Moran’s \(I\), which can vary from close to \(-1\) up to \(1\) (kind of like a correlation coefficient). Thinking about \(\rho\) in this way, is there evidence of significant feedback/spatial dependence between our responses at sites in Bristol?
Spatial nonstationarity involves changes in the effect of \(X\) on \(y\) over geography. This could mean that in some areas an increase in \(X\) is related to an increase in \(y\), while in other areas the reverse holds. Spatial nonstationarity is a particular problem for geographical analysis, because it is difficult to disentangle from measures of spatial dependence and measures of context. As such, models that examine spatial nonstationarity face many conceptual and practical challenges, such as:
- Where do things change? In what regions is the effect stable, and when does it become unstable?
- How does the effect change? Is there a clear theory for what we’re looking for, or are we simply looking for the existence of instability/effect change?
In a very limited sense, spatial nonstationarity is easy to examine: if (and only if) we have a known or received geography over which we believe the effect will change. One simple way we could do that right now is to examine whether different areas of Bristol have different relationships between living environment deprivation and housing quality. As such, let’s create a binary indicator variable that splits Bristol into the wards north and south of the river.
south_of_river = c('Bedminster', 'Bishopsworth', 'Brislington East', 'Brislington West',
'Filwood', 'Hartcliffe & Withywood', 'Hengrove & Whitchurch Park',
'Knowle', 'Southville', 'Stockwood', 'Windmill Hill')
bristol['is_south'] = bristol$ward_name %in% south_of_river
ggplot(bristol, aes(x=housing, y=living_env, color=is_south)) +
geom_point() +
geom_smooth(aes(x=housing, y=living_env), method='lm', inherit.aes=F, color='black')
ggplot(bristol, aes(x=housing, y=living_env, color=is_south)) +
geom_point() +
geom_smooth(method='lm')
Run this code for yourself. Visually, did you win your bets from the previous question?
- Split the data into its northern and southern halves, fit a separate model of living_env ~ housing in each half, and examine the results7.
- Then, fit a single model with an interaction effect, living_env ~ housing*is_south, and compare the two slopes it implies to your split-sample estimates. (A minimal starting sketch for both follows after this list.)
The last kind of effect that is common is a spatial contextual effect. This captures the effect of surroundings on a model outcome. Practically speaking, this is often done by using the local average of a predictor variable as an extra predictor itself: \[ y_i = \alpha + x_i\beta_1 + \left(\sum_j^N w_{ij}x_j \right)\beta_2 + \epsilon_i\] This model (called the SLX model, from Halleck Vega & Elhorst (2015)) is one of a few different kinds of contextual effect models out there. The SLX model is interesting because it provides a distinct effect for the site (\(\beta_1\)) and the situation (\(\beta_2\)) that can be interpreted independently of one another. Unlike the models considered under spatial dependence, the SLX model is mathematically much simpler to estimate, interpret, and discuss.
We’ll examine the SLX model for a very simple regression that relates crime and housing.
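As a warm-up before the exercises, here is a minimal sketch of the SLX idea using the nearby weights we already have; the exercises below ask you to do the analogous thing with a touches relation, so treat the nearby_housing column here purely as an illustration.
# local average of housing deprivation within 2km of each LSOA
bristol['nearby_housing'] = spdep::lag.listw(nearby, bristol$housing)
# SLX: the site's own housing and its surroundings both enter as predictors
slx_model = lm(crime ~ housing + nearby_housing, data=bristol)
summary(slx_model)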
- Compute the local average of housing using a “touches” (contiguity) relation, and save it into the bristol dataframe as a new column called touches_housing.
- Fit the SLX model crime ~ housing + touches_housing.
- Now, consider the relationship housing ~ income. Arguably, income is what largely drives the process of choosing housing, both in a city-wide sense and in a local sense.
- Compute the local average of income using the touches relation. Then, fit a model of housing ~ income + touches_income.
- Compute the local average of income using the nearby relation we built before, which is the local average within 2km.
- Map the local average of income built from the touches relation, and the average built from the nearby relation. How do the two maps differ?
- Finally, fit housing ~ income + nearby_income.
This is the last lecture in the Topics in Advanced Spatial Modelling module. Together, this presents the main concepts behind nearly any kind of geographical/statistical regression analysis you’ll see in more advanced spatial modelling.11 For instance, multilevel models are kind of like our spatial heterogeneity & spatial nonstationarity exercises, but combined together in a single model. Geographically-weighted regression is kind of like the contextual, heterogeneity, and nonstationarity exercises blended together. Much of spatial econometrics deals with autoregressive models that explicitly have spillovers/interactions between adjacent units, as well as dealing with contextual effects and data-driven heterogeneity discovery.
For a long time, quantitative geography was mainly interested in the existence of heterogeneity (e.g. Hartshornian regional geography focusing on place-differences). Then, arguably after the Whittle (1955) model and later Tobler (1970)’s analysis of Detroit, analytical/quantitative geography became intensely interested in dependence. Alongside this, and with Casetti’s original work on the “expansion” method (Casetti, 1972), geographers began to take nonstationarity seriously, culminating in a great deal of work on the subject today. Contextual effects in geographical/spatial models really began in earnest in the late 80s (e.g. Raudenbush & Bryk (1986)) and (as such) are one of the more recent contributions to the conceptual overview of the way geography is modeled. What spatial effects quantitative geographers will focus on in the future remains to be seen.
But, of course, we’re leaving out important elements that a model like this should consider, such as income or legacies of oppression/social domination.↩
Using boxplot(x ~ category, data=dataset) will give you boxplots of the variable x stratified by the categorical variable category for the observations in dataset.↩
Think of this like you do in R. If \(j\) were a vector containing the group in which each of the sites \(i\) fell, then \(j[i]\) would be \(i\)’s group. Then, \(\alpha_{j[i]}\) could be thought of as the \(\alpha\) corresponding to the group for observation \(i\).↩
We can restate this in something called reduced form, where \(y\) only shows up on one side of the equation, if we use a little matrix algebra. We start with an expression of the same model above, recognizing that the restatement of \(\sum_j^N w_{ij}y_j\) can be re-done in terms of matrix multiplication between the matrix \(\mathbf{W}\) and vector \(y\): \[ y = \rho \mathbf{W}y + x\beta + \epsilon\] This model can no longer be thought of in terms of individual sites; instead, it models the whole map at once, as a single entity. This is because each site outcome is related to every other site outcome, since \(y\) is on both sides of the equation. Using a little more algebra, we can take a few steps: \[\begin{align} y &= \rho \mathbf{W} y + x\beta + \epsilon \\ y - \rho \mathbf{W} y &= x\beta + \epsilon \end{align}\] Now, we can “factor out” that \(y\). But, when we do factoring in matrix algebra, we usually have to replace \(y\) with an \(I\), the identity matrix… This is just a trick to ensure that the shapes of our final matrix product are “correct,” but we still get \(y\) at the end of the day, since \(I y = y\). \[ (I - \rho \mathbf{W}) y = x\beta + \epsilon \] Finally, we have to “divide” out that \((I - \rho \mathbf{W})\) term to get \(y\) by itself. But, we can’t really just divide by a matrix; instead we have to multiply both sides by the matrix inverse of what we want to get rid of: \[ (I - \rho \mathbf{W})^{-1}(I - \rho \mathbf{W}) y = (I - \rho \mathbf{W})^{-1}(x\beta + \epsilon)\] Since a matrix times its inverse cancels into \(I\) (just like \(\frac{x}{x} = 1\)), we get: \[ y = (I - \rho \mathbf{W})^{-1}(x\beta + \epsilon)\] You do not need to know this for the exam.↩
It’s important to note that the \(\beta\) effect we’ve just estimated in this SAR model is not actually the change in \(y\) given a unit change in \(x\). This is because it matters where the “unit change in \(x\)” occurs! Some sites will have stronger spillovers than others, and so will exert a stronger impact on \(y\) in these models.↩
One simple way you can think about this is to just eyeball a line of best fit for one color of points. Alternatively, think about how you’d shift the black line to better fit one of the colors than the other. How would you have to tilt the line up or down (slope) or shift it left or right (intercept)?↩
You’ll probably want to use the subset() function to split the data into two halves.↩
Remember: if two estimates’ confidence intervals overlap, the true value of the parameter might just be in that interval, and the two estimates are actually uncertain measurements of the same value.↩
For example, an interaction effect model of y ~ x*is_red would be written as: \[ y = \alpha + is\_red \cdot \gamma + x\beta_1 + (x \cdot is\_red)\beta_2 + \epsilon \] where \(\gamma\) is the shift in the intercept for the is_red group.↩
remember, in spdep, this is the spdep::poly2nb function. And, don’t forget that you have to convert the nb to a listw using something in spdep…↩
save for maybe agent-based/geosimulation-style models.↩