Spatial Relationships in Geographic Data

“Everything is related to everything else, but near things are more related than distant things. [This] is thus very parochial, and ignores most of the world.” — Tobler (1970)

Today, we’ll be using the previous practical’s information about spatial relationships, functions, and modelling to understand the spatial structure of deprivation in Bristol. We’ll only be doing exploratory spatial data analysis this time, with an eye to doing some more advanced modelling in the weeks ahead.

Our main interest for this problem will be to examine:

  1. Is deprivation clustered in Bristol?
  2. Where are the clusters of deprivation in Bristol?
  3. At what scale does deprivation cluster in Bristol (very local, ward-level, or regional)?

To do this, we’ll first read in our data from last time:

library(sf)
library(dplyr)
bristol = sf::read_sf('./data/bristol-imd.shp')

Note that imd_score indicates the deprivation of a Lower Super Output Area (LSOA); when imd_score is large, that LSOA is more deprived.
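
As a quick sanity check, we can peek at the distribution of imd_score before doing anything else (a minimal sketch; summary() and the plot method for sf columns are the only tools used):

# five-number summary (plus mean) of the deprivation scores
summary(bristol$imd_score)
# and a simple map of the raw scores
plot(bristol['imd_score'])

Now for something completely different: let’s prepare a bit more with an interlude on R functions.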

Check Yourself: A Little Bit More on Functions

At a high level of abstraction, a function in R is a distinct, self-contained set of code that performs some operation or procedure. Last time, we wrote a kernel in terms of a function:

nexp_kernel = function(distance, bandwidth=1500){
    u = distance / bandwidth
    return(exp(-u))
}

In the abstract, functions are descriptions of what you want to do to a set of inputs. Breaking it down, all functions are created using the same pattern:

name_of_function = function(argument_1, argument_2, argument_with_default = TRUE){
    if(argument_with_default){
        return(argument_1 * argument_2)
    } else {
        return(argument_1 / argument_2)
    }
}
  • The name_of_function is an arbitrary name, and should be something that clearly describes what the function does (like, load_data(), or simulate_next_year()).
  • Then, the words inside of function(), the argument_1, argument_2, argument_with_default, are called arguments. Basically, an argument is an input provided to a function. Functions can have an unlimited number of arguments, but it’s best to only take arguments you need to use later in the function. The actual words used to describe argument_1 can be anything. They should be descriptive, clear, and helpful (such as distance and bandwidth when we were defining nexp_kernel).
  • Finally, the body of the function comes between the braces. In our example, this is where the if/else stuff is. Otherwise, you can think of it like: function(){This is the body of the function}. The body is where you describe the operations you want to use on the arguments. Above, our example function will:

    1. check if the argument_with_default is true
    2. if it is, then it will compute argument_1 * argument_2.
    3. if it’s not, then it will compute argument_1 / argument_2.
  • You send values outside of the function into the rest of your code using return(). Once you return(something), the function stops. For instance, in our example code, if we return(argument_1 * argument_2), we never even check the other branch of code after the else{}.
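
To see all these pieces working together, here’s a quick demonstration of calling the example function above (a minimal sketch; the input values are arbitrary):

# argument_with_default is TRUE by default, so we get 3 * 4
name_of_function(3, 4)
## [1] 12
# overriding the default sends us down the else{} branch: 3 / 4
name_of_function(3, 4, argument_with_default = FALSE)
## [1] 0.75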

Functions are very helpful for code re-use: they help you avoid typos and ensure that if you have to change or fix something, you change it in one place and one place only. If you find yourself repeating a sequence of steps many times in an R script, try writing a function that does what you want to an argument, like:

dataset_2010 = read.csv('work/data/2010.csv')
missing_score = is.na(dataset_2010$score)
dataset_2010[missing_score, 'score'] = mean(dataset_2010$score, na.rm=TRUE)
clean_model_2010 = lm(score ~ class + race + background, data=dataset_2010)

# copy and paste, hope that you change all the 2010s into 2011s
dataset_2011 = read.csv('work/data/2011.csv')
missing_score = is.na(dataset_2011$score)
dataset_2011[missing_score, 'score'] = mean(dataset_2011$score, na.rm=TRUE)
clean_model_2011 = lm(score ~ class + race + background, data=dataset_2011)

# copy and paste, hope that you change all the 2011s into 2012s
dataset_2012 = read.csv('work/data/2012.csv')
missing_score = is.na(dataset_2012$score)
dataset_2012[missing_score, 'score'] = mean(dataset_2012$score, na.rm=TRUE)
clean_model_2012 = lm(score ~ class + race + background, data=dataset_2012)

might become:

clean_and_model_year = function(year){
    filename = paste('work/data/', year, '.csv', sep='')
    dataset = read.csv(filename)
    missing_score = is.na(dataset$score)
    dataset[missing_score, 'score'] = mean(dataset$score, na.rm=TRUE)
    clean_model = lm(score ~ class + race + background, 
                     data = dataset)
    return(clean_model)
}
clean_model_2010 = clean_and_model_year(2010)
clean_model_2011 = clean_and_model_year(2011)
clean_model_2012 = clean_and_model_year(2012)

Exercise: Breaking Down a Function

Now that you know all about functions in R, let’s break down what the following function does, line by line. At a high level, this function returns a vector of TRUE/FALSE values; the output is TRUE in rows where rest is within distance of target, and FALSE in rows where it is not.

within_distance = function(target, rest, distance){
    buffer = sf::st_buffer(target, distance)
    hits_buffer = sf::st_intersects(rest, buffer, sparse=FALSE)
    return(hits_buffer)
}
  1. What is the name of the function?
  2. What are the arguments of the function? Judging from the body of the function, what does each argument represent?
  3. Using your past knowledge about sf::st_buffer, what does the buffer object represent?
  4. Using your past knowledge about sf::st_intersects, what does the hits_buffer object represent?

The “Scale” of Deprivation

Now, let’s look into the scale of the process in Bristol. Using our aforementioned within_distance function, let’s plot a few distance bands around the School of Geography, i.e. collections of observations that are within a given number of meters of geog. First, though, let’s take one step at a time and visualize the observations that are within, say… 1.5 km of the School of Geographical Sciences (SoGS):

is_geog = bristol$LSOA11CD == 'E01014542'
geog = bristol[is_geog, ]

is_within_1500m = within_distance(geog, bristol, 1500)
plot(sf::st_geometry(bristol))
plot(sf::st_geometry(bristol[is_within_1500m, ]), 
     col='forestgreen', add=TRUE)

Combining the techniques we’ve seen before, we could make a big plot of a few different bands at the same time:

# put down the base map
plot(sf::st_geometry(bristol))

# counting down from 6 to 1,
for(i in 6:1){
    # let's focus on all observations within i*500 meters
    distance = 500*i
    # and build a filter for *only* these observations
    is_within = within_distance(geog, bristol, distance)
    # let's pick a pretty colour
    colour = c('red','blue','green',
               'purple','yellow','slategray')[i]
    # and plot them onto the map
    plot(sf::st_geometry(bristol[is_within, ]), 
         col=colour, add=TRUE)
    
    # for a little bit of explanation, let's display a message
    # describing what we're doing each time

    msg = paste('Colouring observations within', 
                distance, 'meters of SoGS', colour)
    message(msg)
}
## Colouring observations within 3000 meters of SoGS slategray
## Colouring observations within 2500 meters of SoGS yellow
## Colouring observations within 2000 meters of SoGS purple
## Colouring observations within 1500 meters of SoGS green
## Colouring observations within 1000 meters of SoGS blue
## Colouring observations within 500 meters of SoGS red

Now we can see the observations that fall within each band of 500 meters around geog. But, how are the observations in each of these neighborhoods related to the value at geog? If Tobler’s law held around geog, we should expect its deprivation to be more related to observations nearby and less related to observations far away.

To explore this, let’s use the exact same code from above and get the mean value within each band:

# we need a place to stick our results in each iteration
means = matrix(nrow=6, ncol=1)
# counting down from 6 to 1,
for(i in 6:1){
    # let's focus on all observations within i*500 meters
    distance = 500*i
    # and build a filter for *only* these observations
    is_within = within_distance(geog, bristol, distance)
    # and compute the mean of all values within the band 
    # that are *not* geog. Recalling last time, we can use
    # the weighted.mean function, since observations that are
    # not near are given zero weight, and observations that *are*
    # near are given a weight of 1/n_nearby. 
    within_not_geog = is_within & !is_geog
    local_mean = weighted.mean(bristol$imd_score, within_not_geog)
    means[i] = local_mean
}

Check Yourself: Local Weighted Means

Remember that within_not_geog has a one wherever an observation is near geog and a zero otherwise. And, remember that we use weighted.mean(bristol$imd_score, within_not_geog) to get the mean according to the weights in within_not_geog. Let’s think of each element of within_not_geog as a weight, \(w_{ij}\), that governs how relevant another site \(j\) is to the local average around \(i\).
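
For a concrete toy example of how zero/one weights pick out a subset (made-up numbers, not our Bristol data):

toy = c(10, 20, 30, 40)
# only the first two observations get any weight, so the
# weighted mean is just their average: (10 + 20) / 2
weighted.mean(toy, c(1, 1, 0, 0))
## [1] 15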

To check to see if you understand how weights work, using the following data, choose the right weights for the remaining observations in order to get the desired weighted means:

data = c(2, 3, 5, 9)


weights = c(1, ,0, )
# give the 2nd and 4th observations a weight of zero or one
# so that the weighted mean is 2.5:
weighted.mean(data, weights) 

weights = c(.25, , ,0)
# give the middle two observations a weight of .5, zero, or one
# so that the weighted mean is 4:
weighted.mean(data, weights)

weights = c(1, 0, 0, )
# give the last observation a weight so that the mean is 5:
weighted.mean(data, weights)

Local Averages as Distance Increases

Given this, we can plot the mean value of imd_score as a function of the distance away from geog:

plot((1:6) * 500, means, xlab='Distance (m) from Geography', 
                         ylab='Mean imd_score within distance')

Exercise: Interpreting Plots

  1. In plain language, describe the plot above. This description should explain how the mean imd_score changes as you move further away from the School of Geographical Sciences.
  2. Compare the neighborhood averages for imd_score shown above to the value of imd_score at geog, and to the map-wide average of imd_score.
  • Relative to its surroundings, is geog more or less deprived?
  • Relative to the map as a whole, is the area around geog more or less deprived?
  3. Compute the six local averages for the following areas of interest as well. Describe how they change as distance increases, too:
  • The Downs (LSOA : E01014709)
  • The Memorial Stadium (LSOA : E01014510)
  • The Centre (LSOA : E01014540)
  • Lakota (LSOA: E01033348)
  • St. Werberghs City Farm (LSOA: E01014491)
  • North Street (LSOA: E01033362)

Challenge: Increasing the Resolution

Above, we’ve simply taken a snapshot of the mean deprivation at each 500-meter search area. In the code below, I’ve scaffolded a loop that can have a flexible number of steps and step size for the local averages around geog.

N_STEPS = 
STEPSIZE = 
means = matrix(nrow=N_STEPS, ncol=1)

for(step in 1:N_STEPS){
    distance = STEPSIZE*step
    is_within = within_distance(geog, bristol, distance)
    is_within_not_geog = is_within & !is_geog
    local_mean = mean(bristol[is_within_not_geog, ]$imd_score)
    means[step] = local_mean
}

By putting in values of N_STEPS and STEPSIZE, you can control the number of “steps” (i.e. distance bands around geog) as well as how far each step goes. This is kind of like increasing the resolution of an image: we can build many more distance bands with many more fine-grained steps. This lets us pay much closer attention to the changes in the average imd_score as a function of distance.
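
As an aside, this scaffold is a good candidate for the function-writing advice from earlier. Here’s a sketch (the function name and arguments are my own invention, not from the practical) that wraps the loop so it can be re-used for any area of interest:

local_means_by_distance = function(target, n_steps, stepsize){
    means = matrix(nrow=n_steps, ncol=1)
    # a TRUE/FALSE filter for the target itself, so we can exclude it
    is_target = bristol$LSOA11CD == target$LSOA11CD
    for(step in 1:n_steps){
        distance = stepsize * step
        is_within = within_distance(target, bristol, distance)
        is_within_not_target = is_within & !is_target
        means[step] = mean(bristol[is_within_not_target, ]$imd_score)
    }
    return(means)
}
# for example: means = local_means_by_distance(geog, 75, 100)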

  1. Try setting the values for N_STEPS = 75 and STEPSIZE=100 and computing the local means. Then, using the following code, make a plot of the local mean as a function of distance:
plot((1:N_STEPS)*STEPSIZE, means, type='l', 
     ylab='Mean Neighborhood IMD Score', xlab='distance (m)')
  • describe the change in the average deprivation as a function of distance, now that we have many more steps.
  • If you’ve done the problem correctly, you should see that the function levels off at a specific value. Before doing anything else, what do you think that value is?
  • The abline function can plot horizontal and vertical lines, as well as lines defined by an intercept and a slope. Noting that the h parameter sets the height of a horizontal line (like abline(h=height)), add two horizontal lines to the plot above. Make one line have the height of the imd_score for geog and make it a pretty colour; have the other line be the height of the average imd_score for the entire map, and make that a pretty colour, too.
  • Now that you can see the lines, were you correct in your belief about the value at which the local mean levels off? Why do you think that the line hits this value and stops changing?
  2. Pick one of the six other areas of interest listed above and create the same plot of local means as a function of distance.

Moving to Many Neighborhoods

Above, we’ve seen the relationship for imd_score between geog and the surrounding observations near geog. Thinking critically, let’s say we fix the distance at a specific, known value. Then, what if we were to get the local_mean for all observations, using the neighborhood around each site? Just for an example, we might fix this distance at \(1500\) meters. Then, we might do something like:

n_observations = nrow(bristol)
local_means = matrix(nrow=n_observations, ncol=1)

# for each row in our dataset:
for(row in 1:n_observations){
    # get one single row of `bristol`
    this_geometry = bristol[row, ]
    # and tell me which row this is:
    message(this_geometry$LSOA11CD)
    # find observations within 1500 of this row
    nearby = within_distance(this_geometry, bristol, 1500)
    nearby_not_self = nearby & !(1:n_observations == row)
    # and get the mean of values near this row, but that aren't this row
    local_mean = mean(bristol[nearby_not_self,]$imd_score)
    # and save the results
    local_means[row] = local_mean
}

Now, we’ve got the average deprivation within 1500 meters around each observation. We can plot this like any other data:

plot(bristol['imd_score'], 
     breaks=seq(0,80,10))

bristol['local_mean_1500m'] = local_means
plot(bristol['local_mean_1500m'], 
     breaks=seq(0,80,10))

Check Yourself: Interpreting Maps

  1. How does the map of the local mean at 1500 meters compare to the map of imd_score itself? Are there any visual patterns present in this map that are not present in the map of imd_score?
  2. Run the same code above to generate the 3000 meter local averages and 6000 meter local averages. Describe how these maps look compared to the maps for imd_score and the 1500 meter local means.

Correlations and Local Averages

Now that we’ve gotten these local means, one simple and quick way we can look at the relationship between each site and its surroundings is through the correlation between the local means and the values themselves. So, let’s run a cor.test on the local means and imd_score:

library(mosaic)
correlation = cor.test(local_mean_1500m ~ imd_score, data=bristol)
correlation
## 
##  Pearson's product-moment correlation
## 
## data:  local_mean_1500m and imd_score
## t = 11.108, df = 261, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4783903 0.6434261
## sample estimates:
##       cor 
## 0.5665631

Aha, our correlation test is significant! This suggests that the imd_score at each location is correlated with the average imd_score at locations that are up to 1500 meters away! Stated most basically, this means that deprived LSOAs tend to be surrounded by other deprived areas, and non-deprived areas tend to also be surrounded by non-deprived areas. Like clusters with like when it comes to deprivation in Bristol.

We can also see a scatterplot of this relationship with the line of best fit[1] using the code below:

plot(local_mean_1500m ~ imd_score, data=bristol, pch=19)
abline(lm(local_mean_1500m ~ imd_score, data=bristol), 
       col='salmon', lwd=2)
abline(v=mean(bristol$imd_score), 
       col='slategray', lty='dotted', lwd=2)
abline(h=mean(bristol$local_mean_1500m), 
       col='slategray', lty='dotted', lwd=2)

Thinking about the plot above, sites \(i\) that fall in the top right of the plot are in locations where (A) they’re more deprived than the average locality and (B) their surroundings are more deprived than the average locality. Sites in the bottom left are the opposite: they are less deprived than the average locality and are also in less deprived surroundings. This is because observations in these quadrants give us evidence of positive association: their \(x\) variable and their \(y\) variable are both on the same side of the mean. In general, since our correlation between each site and its surroundings is positive, the plot is dominated by observations in the top right and bottom left.

However, observations in the bottom right and top left of the plot provide evidence of negative association. In the top left, these are observations with below-average deprivation but that are situated in surroundings with higher-than-average deprivation. Likewise, in the bottom right, observations have higher-than-average deprivation, but have lower-than-average deprivation in their surroundings.
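
If you’d like to check this numerically rather than visually, one quick sketch is to count how many observations fall on each side of each mean:

# which side of the mean is each site on, for its own score
# and for the average of its surroundings?
site_above = bristol$imd_score > mean(bristol$imd_score)
surroundings_above = bristol$local_mean_1500m > mean(bristol$local_mean_1500m)
# TRUE/TRUE and FALSE/FALSE are the positive-association quadrants
table(site_above, surroundings_above)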

Exercise: The Moran Scatterplot

This scatterplot, relating a centered variable to the average of its surroundings, is called the Moran Scatterplot. The slope of the line of best fit for this plot is a value called Moran’s \(I\).[2] The formula for Moran’s I commonly comes from the following equation: \[ I = \frac{N}{\sum_i^N\sum_j^Nw_{ij}}\frac{\sum_i^N \sum_j^N w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2} \] This statement is more general than our scatterplot of local averages. For instance, when we estimated the regression (i.e. line of best fit) relating local_mean ~ imd_score, the local_mean variable around site \(i\) assumes that each weight relating site \(i\) to site \(j\), \(w_{ij}\), is equal to one over the number of neighbors around \(i\). This is so that \(\sum_j^N w_{ij}x_j\) is still the local average.
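
To make the formula concrete, here is a minimal sketch that computes Moran’s \(I\) directly from the equation above, assuming you already have an \(n \times n\) spatial weights matrix W (we haven’t built one in this practical, so treat this purely as illustration):

morans_I = function(x, W){
    z = x - mean(x)   # center the variable
    N = length(x)
    # (N over the total weight) times the ratio of the weighted
    # cross-product to the total squared deviation
    (N / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}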

  1. What is the Moran’s \(I\) for the 1500m notion of proximity?[3]
  2. Focusing again on the Moran Scatterplot, I’ve highlighted a few points and a few of the corresponding polygons in distinct colours:

Challenge: The Moran Estimator

To show that Moran’s \(I\) is the slope of this regression of local averages onto the data, let’s make things easy on ourselves by first centering our variable \(z = x - \bar{x}\). And, let’s remember that our local mean at site \(i\) is a weighted average of the \(j\) sites around site \(i\): \[\tilde{z}_i = \sum_j^N w_{ij}z_j\] where \(w_{ij}\) is the weight that represents the spatial relationship between sites \(i\) and \(j\).

Then, our line of best fit shown in the scatterplot above corresponds to the line relating \(\tilde{z}_i\) and \(z_i\): \[ \tilde{z}_i = \beta z_i + \epsilon\] where \(\beta\) is the slope of the line of best fit on our centered data, and \(\epsilon\) is a normally-distributed error term with a constant variance. Because we’ve removed the mean from \(z_i\), we’ve centered the regression at the origin, and don’t have to worry about fitting an intercept term.

Like usual in regression, let’s minimize the sum of squared errors. So, this means that for our best guess of \(\beta\) (let’s call that \(\hat{\beta}\)), we can define the predicted value \(\hat{\tilde{z}}_i = \hat{\beta}z_i\). Using this “best guess” \(\hat{\beta}\), we can define the prediction error or residual in our regression in the same way we used to: \[ e_i = \tilde{z}_i - \hat{\beta}z_i = \sum_{j}^Nw_{ij}z_j - \hat{\beta}z_i\] For your own benefit, it’s useful to just stick with the \(\tilde{z}_i\) notation, but substitute back in \(\sum_j^N w_{ij}z_j\) if you forget what the local average, \(\tilde{z}_i\), is.

OK: we’ve defined the residual in predicting our local mean from the site-specific value \(z_i\). Then, since we’re looking for some value of \(\hat{\beta}\) that minimizes the sum of squared errors, we can express the sum of squared errors, \(SSE\), as: \[ SSE(\hat{\beta}) = \sum_i^N \left(\tilde{z}_i - \hat{\beta}z_i\right)^2 = \sum_i^N\left(\left(\sum_j^Nw_{ij}z_j\right) - \hat{\beta}z_i\right)^2\] From here, if you know calculus or have worked on minimization problems before, you might know that this function is minimized when the derivative of \(SSE\) is zero. That is, because we have a quadratic function (something of the form \(f(x) = x^2\)), our minimum value for \(SSE\) will occur at the bottom of the cup-shaped curve, where the curve stops decreasing and starts increasing again. This point, where the derivative of \(SSE(\hat{\beta})\) is zero, is at: \[ 2 \sum_i^N \left[ -z_i\left(\tilde{z}_i - \hat\beta z_i \right) \right]= 0\]

  1. Solve this for \(\hat\beta\).[4] Then, turn \(\tilde{z}_i\) back into \(\sum_j^N w_{ij}z_j\). Finally, turn \(z_i\) back into \(x_i - \bar{x}\). This should look very close to the equation used above for Moran’s \(I\).
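
If you want to check the algebra numerically before (or after) working through the derivation, here’s a small sketch with made-up data and a made-up, row-standardized weights matrix; the no-intercept slope and the Moran’s \(I\) formula should agree:

set.seed(123)
x = rnorm(10)
# a made-up weights matrix, row-standardized so each row sums to one
W = matrix(runif(100), nrow=10, ncol=10)
diag(W) = 0
W = W / rowSums(W)
z = x - mean(x)              # center the variable
z_lag = as.vector(W %*% z)   # the local average of the centered data
# the slope of the no-intercept regression of local averages on the data...
beta_hat = coef(lm(z_lag ~ z - 1))
# ...should equal Moran's I computed from the formula
moran = (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
c(beta_hat, moran)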

Measuring Distance Decay

While it may be true that observations are related to their surroundings, we have not necessarily demonstrated that observations are more related to their nearby surroundings than to distant ones. For that, we need to compare how related an observation is to locations “nearby” versus those that are far away. To do that, let us revisit our code to compute local means as a function of distance. A slight modification, below, runs this code for each site. This will take a while, so please only run this code once:

N_STEPS = 10
STEPSIZE = 800
means = matrix(nrow=N_STEPS, ncol=n_observations)

# For each site in the problem:
for(site in 1:n_observations){
  # let's give that row a special name
  row = bristol[site,]
  # and come up with a TRUE/FALSE filter for that site
  is_site = ((1:n_observations) == site)
  # and, just to be careful, let's track our progress
  message(paste('Doing site:', site))
  # Then, let's build those concentric neighborhoods around the site
  for(step in 1:N_STEPS){
    distance = STEPSIZE*step
    is_within = within_distance(row, bristol, distance)
    is_within_not_site = is_within & !is_site
    local_mean = mean(bristol[is_within_not_site, ]$imd_score)
    # and save each so that the column is the site 
    # and the row is the local mean
    means[step,site] = local_mean
  }
}

The \(j\)th column of the \(i\)th row of the means matrix measures the local mean around site \(j\) for distance band \(800 * i\). The correlation between each row of means and imd_score gives us the correlation between (A) the deprivation at each site and (B) the deprivation surrounding that site. Since this is the correlation of an attribute with itself over space, it is called spatial autocorrelation, where auto means “with self” and correlation has its standard meaning.
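
Before writing the full loop (the first exercise below), you can sanity-check a single band directly; for example, with STEPSIZE = 800, the second row of means is the 1600 meter band:

# correlation between each site's score and its 1600 m local average
cor(means[2, ], bristol$imd_score)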

Exercise: Testing Tobler’s Law

  1. Try to write your own for-loop that computes the correlation between each row of means and imd_score, and saves it into a matrix called correlations. If you can’t figure this out, refer to this footnote for a hint.[5]
  2. Make a plot of the correlations as a function of distance. Describe how the correlation between sites changes as the distance away from the site increases.

Challenge: Significance in Correlation

  1. Using the fact that test$p.value is the p-value of the correlation computed using cor.test, get the p-value for the correlation at each distance band as well.
  • How far away do you have to go before the local averages stop being correlated with the site values at a statistically-significant level?
  • Do you see a relationship between the size of correlation and the significance of correlation? Knowing what you know about null hypotheses and test statistics, why might this be the case?

Conclusion

Overall, this lab practical should give you an appreciation for how geographical structure in data can be measured, such as how nearby areas tend to experience similar deprivation. Specifically, we worked extensively with the idea of spatial autocorrelation: the correlation of a variable \(x\) with itself across distance. In general, spatial autocorrelation is an instance of spatial dependence in a single variable; spatial dependence can cross variables, meaning that the values of some independent variable \(x\) can affect nearby dependent variables \(y\), or can even involve multivariate structures.

At its core, however, spatial autocorrelation, the Moran Scatterplot, and all of these analyses relating local means to the site values… they all intend to model how nearby things are related. This is an empirical characterization of Tobler’s “First Law of Geography,” and it is useful in modelling geographical processes. In addition to this structure helping us make better predictive models, it can also affect our inference about processes in the data itself. Next time, we’ll revisit the idea of spatial heterogeneity: when geographical areas are distinct from one another in their behavior.

Footnotes


  1. Remember! The line of best fit (i.e. the regression line from lm) is in unstandardized units. The correlation (i.e. the value of correlation from cor.test) is in standardized units.

  2. So called because P.A.P. Moran used \(I\) as the symbol of the statistic in the original paper. This is kind of like how we refer to Pearson’s \(r\) or Kendall’s \(\tau\)… There’s no committee or naming convention for these things, so it can get kind of confusing, given that \(I\) is not an easy letter to tell apart from \(l\).

  3. Use lm(local_mean_1500m ~ imd_score, data=bristol) to compute the slope of a special line of best fit.

  4. Start by taking the product, \(-z_i(\tilde{z}_i - \hat\beta z_i)\). Then, split the sum into parts that contain \(\hat{\beta}\) and parts that do not. Finally, separate \(\hat{\beta}\) from anything containing \(z_i\) or \(\tilde{z}_i\).

  5. Think about what you want to do for a single row, cor(means[i,], bristol$imd_score). Then, make a loop where \(i\) changes from \(1\) to the number of distance bands, N_STEPS.

    correlations = matrix(nrow=N_STEPS, ncol=1)
    for(i in 1:N_STEPS){
      test = cor.test(means[i,], bristol$imd_score)
      correlations[i] = test$estimate
    }