Everything is related to everything else, but near things are more related than distant things. [This] is thus very parochial, and ignores most of the world. — Tobler (1970)
Today, we’ll be using the previous practical’s information about spatial relationships, functions, and modelling to understand the spatial structure of deprivation in Bristol. We’ll only be doing exploratory spatial data analysis this time, with an eye to doing some more advanced modelling in the weeks ahead.
Our main interest for this problem will be to examine:
1. Is deprivation clustered in Bristol?
2. Where are the clusters of deprivation in Bristol?
3. At what scale does deprivation cluster in Bristol (very local, ward-level, or regional)?
To do this, we’ll first read in our data from last time:
library(sf)
library(dplyr)
bristol = sf::read_sf('./data/bristol-imd.shp')
Note that imd_score indicates the deprivation of a Lower layer Super Output Area (LSOA); when imd_score is large, that LSOA is more deprived. Now for something completely different, let's prepare a bit more with an interlude on R functions.
At a very high level of abstraction, a function in R is a distinct and defined set of code that performs some operation or procedure. Last time, we wrote a kernel in terms of a function:
nexp_kernel = function(distance, bandwidth=1500){
    u = distance / bandwidth
    return(exp(-u))
}
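Just to see what this function does (a quick check of my own, not part of the original practical), we can call it at a few distances; the weight shrinks as the distance grows:

# evaluate the kernel at a few distances (bandwidth defaults to 1500)
nexp_kernel(0)       # exp(0)  = 1
nexp_kernel(1500)    # exp(-1), about 0.37
nexp_kernel(3000)    # exp(-2), about 0.14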
In the abstract, functions are descriptions of what you want to do to a set of inputs. Breaking it down, all functions are created using the same basic pattern:
name_of_function = function(argument_1, argument_2, argument_with_default = TRUE)
{
    if(argument_with_default){
        return(argument_1 * argument_2)
    }
    else{
        return(argument_1 / argument_2)
    }
}
- name_of_function is an arbitrary name, and should be something that clearly describes what the function does (like load_data() or simulate_next_year()).
- Inside of function(), the names argument_1, argument_2, and argument_with_default are called arguments. Basically, an argument is an input provided to a function. Functions can have an unlimited number of arguments, but it's best to only take arguments you need to use later in the function. The actual words used for an argument like argument_1 can be anything; they should be descriptive, clear, and helpful (such as distance and bandwidth when we were defining nexp_kernel).
- Finally, the body of the function comes between the braces. In our example, this is where the if/else stuff is. Otherwise, you can think of it like: function(){This is the body of the function}. The body is where you describe the operations you want to use on the arguments.

Above, our example function will:

- return argument_1 * argument_2 if argument_with_default is true;
- return argument_1 / argument_2 otherwise.

You send values outside of the function into the rest of your code using return(). Once you return(something), the function stops. For instance, in our example code, if we return(argument_1 * argument_2), we never even check the other branch of code after the else{}.
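As a quick sanity check (a small sketch of my own, not from the original text), here is what calling the example function looks like with and without overriding the default argument:

# uses the default argument_with_default = TRUE, so the arguments are multiplied
name_of_function(3, 4)                               # returns 12
# overriding the default takes the else branch, so the arguments are divided
name_of_function(3, 4, argument_with_default=FALSE)  # returns 0.75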
Functions are very helpful for code re-use, avoiding typos, and ensuring that if you have to change or fix something, you change it in one place and one place only. If you find yourself doing the same sequence of steps many times in an R script, try writing a function that does what you want to an argument, like:
dataset_2010 = read.csv('work/data/2010.csv')
missing_score = is.na(dataset_2010$score)
dataset_2010[missing_score, 'score'] = mean(dataset_2010$score, na.rm=TRUE)
clean_model_2010 = lm(score ~ class + race + background, data=dataset_2010)

# copy and paste, hope that you change all the 2010s into 2011s
dataset_2011 = read.csv('work/data/2011.csv')
missing_score = is.na(dataset_2011$score)
dataset_2011[missing_score, 'score'] = mean(dataset_2011$score, na.rm=TRUE)
clean_model_2011 = lm(score ~ class + race + background, data=dataset_2011)

# copy and paste, hope that you change all the 2011s into 2012s
dataset_2012 = read.csv('work/data/2012.csv')
missing_score = is.na(dataset_2012$score)
dataset_2012[missing_score, 'score'] = mean(dataset_2012$score, na.rm=TRUE)
clean_model_2012 = lm(score ~ class + race + background, data=dataset_2012)
might become:
clean_and_model_year = function(year){
    filename = paste('work/data/', year, '.csv', sep='')
    dataset = read.csv(filename)
    missing_score = is.na(dataset$score)
    dataset[missing_score, 'score'] = mean(dataset$score, na.rm=TRUE)
    clean_model = lm(score ~ class + race + background,
                     data = dataset)
    return(clean_model)
}
clean_model_2010 = clean_and_model_year(2010)
clean_model_2011 = clean_and_model_year(2011)
clean_model_2012 = clean_and_model_year(2012)
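If you had many more years to process, you could go one step further and call the function inside a loop. This is only a sketch, under the same assumption that the files work/data/2010.csv through work/data/2012.csv exist:

# store one fitted model per year in a named list
models = list()
for(year in 2010:2012){
    models[[as.character(year)]] = clean_and_model_year(year)
}
# e.g. models[['2011']] holds the same model as clean_model_2011 above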
Now that you know all about functions in R, let's break down what the following function does, line by line. At a higher level, this function returns a vector of TRUE/FALSE; the output is TRUE in rows where rest is within distance of target, and FALSE in rows where rest is not within distance of target.
within_distance = function(target, rest, distance){
    buffer = sf::st_buffer(target, distance)
    hits_buffer = sf::st_intersects(rest, buffer, sparse=FALSE)
    return(hits_buffer)
}
- In the function above, what does the buffer object created by sf::st_buffer represent?
- And what does the hits_buffer object created by sf::st_intersects represent?

Now, let's look into the scale of the process in Bristol. Using our aforementioned within_distance function, let's plot a few distance bands around the School of Geography, or collections of observations that are within a given number of meters of geog. First, though, let's take one step at a time and visualize the observations that are within, say… 1.5km of the School of Geographical Sciences (SoGS):
is_geog = bristol$LSOA11CD == 'E01014542'
geog = bristol[is_geog, ]
is_within_1500m = within_distance(geog, bristol, 1500)
plot(sf::st_geometry(bristol))
plot(sf::st_geometry(bristol[is_within_1500m, ]),
     col='forestgreen', add=TRUE)
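Since within_distance returns TRUE/FALSE values, we can also summarise it directly. This quick check (not in the original practical) counts how many LSOAs fall inside the 1.5km band:

# how many LSOAs are within 1.5km of the School of Geographical Sciences?
sum(is_within_1500m)
# and what share of all Bristol LSOAs is that?
mean(is_within_1500m)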
Combining the techniques we’ve seen before, we could make a big plot of a few different bands at the same time:
# put down the base map
plot(sf::st_geometry(bristol))
# counting down from 6 to 1,
for(i in 6:1){
    # let's focus on all observations within i*500 meters
    distance = 500*i
    # and build a filter for *only* these observations
    is_within = within_distance(geog, bristol, distance)
    # let's pick a pretty colour
    colour = c('red','blue','green',
               'purple','yellow','slategray')[i]
    # and plot them onto the map
    plot(sf::st_geometry(bristol[is_within,]),
         col=colour, add=TRUE)
    # for a little bit of explanation, let's display a message
    # describing what we're doing each time
    message = paste('Colouring observations within',
                    distance, 'meters of SoGS', colour)
    message(message)
}
## Colouring observations within 3000 meters of SoGS slategray
## Colouring observations within 2500 meters of SoGS yellow
## Colouring observations within 2000 meters of SoGS purple
## Colouring observations within 1500 meters of SoGS green
## Colouring observations within 1000 meters of SoGS blue
## Colouring observations within 500 meters of SoGS red
Now we can see the observations that fall within each band of 500 meters around geog. But, how are the observations in each of these neighborhoods related to the value at geog? If Tobler's law held around geog, we should expect its deprivation to be more related to observations nearby and less related to observations far away.
To explore this, let’s use the exact same code from above and get the mean value within each band:
# we need a place to stick our results in each iteration
means = matrix(nrow=6, ncol=1)
# counting down from 6 to 1,
for(i in 6:1){
    # let's focus on all observations within i*500 meters
    distance = 500*i
    # and build a filter for *only* these observations
    is_within = within_distance(geog, bristol, distance)
    # and compute the mean of all values within the band
    # that are *not* geog. Recalling last time, we can use
    # the weighted.mean function, since observations that are
    # not near are given zero weight, and observations that *are*
    # near are given a weight of 1/n_nearby.
    within_not_geog = is_within & !is_geog
    local_mean = weighted.mean(bristol$imd_score, within_not_geog)
    means[i] = local_mean
}
Remember that within_not_geog has a one wherever an observation is near geog and a zero otherwise. And, remember that we use weighted.mean(bristol$imd_score, within_not_geog) to get the mean according to the weights in within_not_geog. Let's think of each element of within_not_geog as a weight, \(w_{ij}\), that governs how relevant another site \(j\) is to the local average around \(i\).
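To see why this works, here is a tiny made-up illustration (my own example, not from the practical): with weights of zero and one, weighted.mean() is just the ordinary mean of the observations that receive a weight of one.

x = c(10, 20, 30, 40)
w = c(1, 1, 0, 0)
weighted.mean(x, w)    # 15: only the first two values count
mean(x[w == 1])        # also 15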
To check to see if you understand how weights work, using the following data, choose the right weights for the remaining observations in order to get the desired weighted means:
data = c(2, 3, 5, 9)
weights = c(1, ,0, )
# give the 2nd and 4th observations a weight of zero or one
# so that the weighted mean is 2.5:
weighted.mean(data, weights)
weights = c(.25, , ,0)
# give the middle two observations a weight of .5, zero, or one
# so that the weighted mean is 4:
weighted.mean(data, weights)
weights = c(1, 0, 0, )
# give the last observation a weight so that the mean is 5:
weighted.mean(data, weights)
Given this, we can plot the mean value of imd_score as a function of the distance away from geog:
plot((1:6) * 500, means, xlab='Distance (m) from Geography',
     ylab='Mean imd_score within distance')
- Describe how the mean imd_score changes as you move further away from the School of Geographical Sciences.
- Compare the mean imd_score shown above to the value of imd_score at geog, and to the map-wide average imd_score. Relative to its immediate surroundings, is geog more or less deprived? Relative to the map as a whole, is geog more or less deprived?
- Repeat this comparison for a few other LSOAs, such as E01014709, E01014510, E01014540, E01033348, E01014491, and E01033362.

Above, we've simply taken a snapshot of the mean deprivation at each 500-meter search area. In the code below, I've scaffolded a loop that can have a flexible number of steps and step size for the local averages around geog.
N_STEPS =
STEPSIZE =
means = matrix(nrow=N_STEPS, ncol=1)
for(step in 1:N_STEPS){
    distance = STEPSIZE*step
    is_within = within_distance(geog, bristol, distance)
    is_within_not_geog = is_within & !is_geog
    local_mean = mean(bristol[is_within_not_geog, ]$imd_score)
    means[step] = local_mean
}
By putting in values of N_STEPS and STEPSIZE, you can control the number of "steps" (i.e. distance bands around geog) as well as how far each step goes. This is kind of like increasing the resolution of an image: we can build many more distance bands with many more fine-grained steps. This lets us pay much closer attention to the changes in the average imd_score as a function of distance.
- Try setting N_STEPS = 75 and STEPSIZE = 100 and computing the local means. Then, using the following code, make a plot of the local mean as a function of distance:

plot((1:N_STEPS)*STEPSIZE, means, type='l',
     ylab='Mean Neighborhood IMD Score', xlab='distance (m)')

- abline can plot horizontal and vertical lines, as well as lines defined by an intercept and a slope. Noting that the h parameter can set the height of a horizontal line (like abline(h=height)), add two horizontal lines to the plot above. Make one line have the height of the imd_score for geog and make it a pretty colour; have the other line be the height of the average imd_score for the entire map, and make that a pretty colour, too.

Above, we've seen the relationship for imd_score between geog and the surrounding observations near geog. Thinking critically, let's say we fix the value at a specific, known distance. Then, what if we were to get the local_mean for all observations, using the neighborhood around each site? Just for an example, we might fix this distance at \(1500\) meters. Then, we might do something like:
n_observations = nrow(bristol)
local_means = matrix(nrow=n_observations, ncol=1)
# for each row in our dataset:
for(row in 1:n_observations){
    # get one single row of `bristol`
    this_geometry = bristol[row, ]
    # and tell me which row this is:
    message(this_geometry$LSOA11CD)
    # find observations within 1500m of this row
    nearby = within_distance(this_geometry, bristol, 1500)
    nearby_not_self = nearby & !(1:n_observations == row)
    # and get the mean of values near this row, but that aren't this row
    local_mean = mean(bristol[nearby_not_self,]$imd_score)
    # and save the results
    local_means[row] = local_mean
}
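As an aside, the same local means can be computed without an explicit loop by buffering every LSOA at once. This is only a sketch of an alternative, assuming bristol is loaded as above; the loop version is what the practical uses:

# buffer every LSOA by 1500m, then check which LSOAs fall in each buffer
buffers = sf::st_buffer(bristol, 1500)
neighbours = sf::st_intersects(bristol, buffers, sparse=FALSE)
# drop each LSOA from its own neighbourhood
diag(neighbours) = FALSE
# column i of `neighbours` flags the LSOAs near site i, so summarise by column
local_means_check = apply(neighbours, 2,
                          function(is_near) mean(bristol$imd_score[is_near]))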
Now, we’ve got the average deprivation within 1500 meters around each observation. We can plot this like any other data:
plot(bristol['imd_score'],
     breaks=seq(0,80,10))
bristol['local_mean_1500m'] = local_means
plot(bristol['local_mean_1500m'],
     breaks=seq(0,80,10))
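To make the comparison below easier, you can also ask sf to draw both attributes in one call with the same classification breaks (a small sketch, assuming the local_mean_1500m column was attached as above):

# plot both columns as side-by-side panels with shared breaks
plot(bristol[c('imd_score', 'local_mean_1500m')],
     breaks=seq(0,80,10))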
- How does this map of local means compare to the map of imd_score itself? Are there any visual patterns that are present in this map that are not present in the map of imd_score?
- Describe the relationship between imd_score and the 1500 meter local means. [3]

Now that we've gotten these local means, one simple and quick way we can look at the relationship between each site and its surroundings is through the correlation between the local means and the values themselves. So, let's run a cor.test on the local means and imd_score:
library(mosaic)
correlation = cor.test(local_mean_1500m ~ imd_score, data=bristol)
correlation
##
## Pearson's product-moment correlation
##
## data: local_mean_1500m and imd_score
## t = 11.108, df = 261, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4783903 0.6434261
## sample estimates:
## cor
## 0.5665631
Aha, our correlation test is significant! This suggests that the imd_score
at each location is correlated with the average imd_score
at locations that are up to 1500 meters away! Stated most basically, this means that deprived LSOAs tend to be surrounded by other deprived areas, and non-deprived areas tend to also be surrounded by non-deprived areas. Like clusters with like when it comes to deprivation in Bristol.
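If you want to see the same relationship without the mosaic formula interface, base R gives the same headline number, and the lm slope is its unstandardized cousin (a quick sketch, assuming local_mean_1500m has been attached to bristol as above):

# the correlation (standardized units)
cor(bristol$imd_score, bristol$local_mean_1500m)
# the slope of the line of best fit (unstandardized units)
coef(lm(local_mean_1500m ~ imd_score, data=bristol))['imd_score']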
We can also see a scatterplot of this relationship with the line of best fit [1] using the code below:
plot(local_mean_1500m ~ imd_score, data=bristol, pch=19)
abline(lm(local_mean_1500m ~ imd_score, data=bristol),
       col='salmon', lwd=2)
abline(v=mean(bristol$imd_score),
       col='slategray', lty='dotted', lwd=2)
abline(h=mean(bristol$local_mean_1500m),
       col='slategray', lty='dotted', lwd=2)
Thinking about the plot above, each site \(i\) that falls in the top right of the plot is in a location where (A) it is more deprived than the average locality and (B) its surroundings are more deprived than the average locality. Sites in the bottom left are the opposite: they are less deprived than the average locality and are also in less deprived surroundings. This is because observations in these quadrants give us evidence of positive association: their \(x\) variable and their \(y\) variable are both on the same side of the mean. In general, since our correlation between each site and its surroundings is positive, the plot is dominated by observations in the top right and bottom left.
However, observations in the bottom right and top left of the plot provide evidence of negative association. In the top left, these are observations with below-average deprivation but that are situated in surroundings with higher-than-average deprivation. Likewise, in the bottom right, observations have higher-than-average deprivation, but have lower-than-average deprivation in their surroundings.
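One quick way to see this dominance of the positive-association quadrants is to count how many observations land on each side of the two dotted mean lines (a small check of my own, assuming the columns used above):

# TRUE/FALSE: is each site above the mean, and is its neighbourhood above the mean?
above_mean      = bristol$imd_score > mean(bristol$imd_score)
nbhd_above_mean = bristol$local_mean_1500m > mean(bristol$local_mean_1500m)
# cross-tabulate the quadrants of the Moran scatterplot
table(site = above_mean, surroundings = nbhd_above_mean)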
This scatterplot, relating a centered variable to the average of its surroundings, is called the Moran Scatterplot. The slope of the line of best fit for this plot is a value called Moran's \(I\) [2]. The formula for Moran's I commonly comes from the following equation: \[
I = \frac{N}{\sum_i^N\sum_j^N w_{ij}}\frac{\sum_i^N \sum_j^N w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i^N (x_i - \bar{x})^2}
\] This statement is more general than our scatterplot about local averages. For instance, when we estimated the regression (i.e. line of best fit) relating local_mean ~ imd_score, the local_mean variable around site \(i\) assumes that each weight relating site \(i\) to site \(j\), \(w_{ij}\), is equal to one divided by the number of neighbors around \(i\) (and zero for sites outside the band). This is so that \(\sum_j^N w_{ij}x_j\) is still the local average.
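To spell that assumption out in the notation above (my own restatement, not an equation from the practical): if site \(i\) has \(n_i\) neighbors within the distance band, then \[ w_{ij} = \begin{cases} 1/n_i & \text{if } j \text{ lies within the band around } i \\ 0 & \text{otherwise} \end{cases} \qquad\text{so that}\qquad \sum_j^N w_{ij}x_j = \frac{1}{n_i}\sum_{j \text{ near } i} x_j, \] which is exactly the band average we computed with weighted.mean.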
To show that Moran’s \(I\) is the slope of this regression of local averages onto the data, let’s make things easy on ourselves by first centering our variable \(z = x - \bar{x}\). And, let’s remember that our local mean at site \(i\) is a weighted average of the \(j\) sites around site \(i\): \[\tilde{z}_i = \sum_j^N w_{ij}z_j\] where \(w_{ij}\) is the weight that represents the spatial relationship between sites \(i\) and \(j\).
Then, our line of best fit shown in the scatterplot above corresponds to the line relating \(\tilde{z}_i\) and \(z_i\): \[ \tilde{z}_i = \beta z_i + \epsilon\] where \(\beta\) is the slope of the line of best fit on our centered data, and \(\epsilon\) is a normally-distributed error term with a constant variance. Because we’ve removed the mean from \(z_i\), we’ve centered the regression at the origin, and don’t have to worry about fitting an intercept term.
Like usual in regression, let’s minimize the sum of squared errors. So, this means that for our best guess of \(\beta\) (let’s call that \(\hat{\beta}\)), we can define the predicted value \(\hat{\tilde{z}}_i = \hat{\beta}z_i\). Using this “best guess” \(\hat{\beta}\), we can define the prediction error or residual in our regression in the same way we used to: \[ e_i = \tilde{z}_i - \hat{\beta}z_i = \sum_{j}^Nw_{ij}z_j - \hat{\beta}z_i\] For your own benefit, it’s useful to just stick with the \(\tilde{z}_i\) notation, but substitute back in \(\sum_j^N w_{ij}z_j\) if you forget what the local average, \(\tilde{z}_i\), is.
OK: we’ve defined the residual in predicting our local mean from the site-specific value \(z_i\). Then, since we’re looking for some value of \(\hat{\beta}\) that minimizes the sum of squared errors, we can express the sum of squared errors, \(SSE\), as: \[ SSE(\hat{\beta}) = \sum_i^N \left(\tilde{z}_i - \hat{\beta}z_i\right)^2 = \sum_i^N\left(\left(\sum_j^Nw_{ij}z_j\right) - \hat{\beta}z_i\right)^2\] From here, if you know calculus or have worked on minimization problems before, you might know that this function is minimized when the derivative of \(SSE\) is zero. That is, because we have a quadratic function (something of the form \(f(x) = x^2\)), our minimum value for \(SSE\) will occur at the \(\hat{\beta}\) value at the bottom of the cup, where the curve stops decreasing and starts increasing again. So, this point where the function stops decreasing and starts increasing is where the derivative of \(SSE(\hat{\beta})\) is zero. This point is at: \[ 2 \sum_i^N \left[ -z_i\left(\tilde{z}_i - \hat\beta z_i \right) \right]= 0\]

1. Solve this for \(\hat\beta\). [4] Then, turn \(\tilde{z}_i\) back into \(\sum_j^N w_{ij}z_j\). Finally, turn \(z_i\) back into \(x_i - \bar{x}\). This should look very close to the equation used above for Moran's \(I\).
While it may be true that observations are related to their surroundings, we have not necessarily demonstrated that observations are more related to their nearby surroundings than to distant ones. For that, we need to compare how related an observation is to locations "nearby" versus those that are far away. To do that, let us revisit our code to compute local means as a function of distance. A slight modification, shown below, runs this code for each site. This will take a bit, so please only run this code once:
N_STEPS = 10
STEPSIZE = 800
means = matrix(nrow=N_STEPS, ncol=n_observations)
# For each site in the problem:
for(site in 1:n_observations){
    # let's give that row a special name
    row = bristol[site,]
    # and come up with a TRUE/FALSE filter for that site
    is_site = ((1:n_observations) == site)
    # and, just to be careful, let's track our progress
    message(paste('Doing site:', site))
    # Then, let's build those concentric neighborhoods around the site
    for(step in 1:N_STEPS){
        distance = STEPSIZE*step
        is_within = within_distance(row, bristol, distance)
        is_within_not_site = is_within & !is_site
        local_mean = mean(bristol[is_within_not_site, ]$imd_score)
        # and save each so that the column is the site
        # and the row is the distance band
        means[step,site] = local_mean
    }
}
The \(j\)th column of the \(i\)th row of the means matrix measures the local mean around site \(j\) for distance band \(800 \times i\). The correlation between each row of means and imd_score gives us the correlation between (A) the deprivation at each site and (B) the deprivation surrounding that site. Since this is the correlation of an attribute with itself over space, it is called spatial autocorrelation, where auto means "with self" and correlation has its standard meaning.
- Write a loop that computes the correlation between each row of means and imd_score, and saves it into a matrix called correlations. If you can't figure this out, follow this footnote for a hint. [5]
- Noting that test$p.value is the p-value of the correlation computed using cor.test, get the p-value for the correlation at each distance band as well.

Overall, this lab practical should give you an appreciation for how geographical structure in data can be measured, such as how nearby areas tend to experience similar deprivation. Specifically, we worked extensively with the idea of spatial autocorrelation: the correlation of a variable \(x\) with itself across distance. In general, spatial autocorrelation is an instance of spatial dependence in a single variable; spatial dependence can cross variables, meaning that the values of some independent variable \(x\) can affect nearby dependent variables \(y\), or can even involve multivariate structures.
At its core, however, spatial autocorrelation, the Moran Scatterplot, and all of these analyses relating local means to the site values… they all intend to model how nearby things are related. This is an empirical characterization of Tobler’s “First Law of Geography,” and it is useful in modelling geographical processes. In addition to this structure helping us make better predictive models, it can also affect our inference about processes in the data itself. Next time, we’ll revisit the idea of spatial heterogeneity: when geographical areas are distinct from one another in their behavior.
[1] Remember! The line of best fit (i.e. the regression line from lm) is in unstandardized units. The correlation (i.e. the value from cor.test) is in standardized units.
[2] So called because P.A.P. Moran used \(I\) as the symbol of the statistic in the original paper. This is kind of like how we refer to Pearson's \(r\) or Kendall's \(\tau\)… There's no committee or naming convention for these things, so it can get kind of confusing, given that \(I\) is not an easy letter to distinguish from \(l\).
[3] Use lm(local_mean_1500m ~ imd_score, data=bristol) to compute the slope of a special line of best fit.
[4] Start by taking the product, \(-z_i(\tilde{z}_i - \hat\beta z_i)\). Then, split the sum into parts that contain \(\hat{\beta}\) and parts that do not. Finally, separate \(\hat{\beta}\) from anything containing \(z_i\) or \(\tilde{z}_i\).
[5] Think about what you want to do for a single row, cor(means[i,], bristol$imd_score). Then, make a loop where \(i\) changes from \(1\) to the largest number of distance bands, N_STEPS.
correlations = matrix(nrow=N_STEPS, ncol=1)
for(i in 1:N_STEPS){
    test = cor.test(means[i,], bristol$imd_score)
    correlations[i] = test$estimate
}