Today, we’re going to look at local spatial autocorrelation. Like a kind of outlier diagnostic, local spatial autocorrelation measures how the local structure of a spatial relationship around each site either conforms to, or departs from, what we expect. More generally, local spatial statistics are a branch of statistics that aims to characterize the relationship between a single observation and the sites surrounding it.
Often, local spatial autocorrelation is contrasted with global spatial autocorrelation, which describes the structural relationship between sites (in the abstract) and their surroundings (again, in the abstract). That global structure can look quite different depending on how we define what surrounds each observation. Thus, local statistics are an attempt to measure the geographical behavior of a given social, physical, or behavioral process around each observation.
library(sf)
library(mosaic)
bristol = sf::st_read('./data/bristol-imd.shp')
## Reading layer `bristol-imd' from data source `/home/lw17329/OneDrive/teaching/second_year_methods/spatial-practicals/data/bristol-imd.shp' using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 12 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 350239.4 ymin: 166638.9 xmax: 364618.1 ymax: 183052.8
## epsg (SRID): 27700
## proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
First, though, let’s refresh our understanding of plain, a-spatial correlation, and how our sense of outliers works in two dimensions for a standard relationship between two variables. Here, let’s look at the relationship between housing deprivation and crime in Bristol:
correlation = cor.test(housing ~ crime, data=bristol)
correlation
##
## Pearson's product-moment correlation
##
## data: housing and crime
## t = 7.8199, df = 261, p-value = 1.309e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3322359 0.5287749
## sample estimates:
## cor
## 0.4356841
While there is some correlation between these two scores (areas with housing deprivation tend to have higher crime), we can also see that not every observation agrees with this trend. Namely, we can highlight one such observation with really low crime rates, but relatively high housing deprivation:
plot(housing ~ crime, data=bristol, pch=19)
abline(lm(housing ~ crime, data=bristol),
col='orangered', lwd=2)
abline(v=mean(bristol$crime), lwd=1.5, col='slategrey', lty='dashed')
abline(h=mean(bristol$housing), lwd=1.5, col='slategrey', lty='dashed')
housing_outlier = bristol[bristol$LSOA11CD == 'E01014714',]
points(housing_outlier$crime, housing_outlier$housing, col='red', pch=20)
- Find the four observations with about the same housing deprivation as our candidate outlier. Approximately (using your eyeballs, not R), what’s the mean of those four observations’ crime levels?
- Find the three observations with about the same crime levels as our candidate outlier. Approximately (using your eyeballs, not R), what’s the mean of those three observations’ housing deprivation?
- Make maps of the housing and crime values. Highlight the potential outlier (don’t forget, it’s named housing_outlier) in white. Describe the difference in the two maps around the housing_outlier. Where is the housing_outlier?

Humans do a few things when they “see” outliers in a scatterplot. Mainly, our intuition about which observations are (visual) outliers comes from the distance between an observation and the center of the data being analyzed. This is why we are apt to think that the red observation is an outlier, but the blue may not be.
Fundamentally, this is because the data at our outlier site, \((x_i, y_i)\), is still very far away from the center of the point cloud, \((\bar x, \bar y)\), despite the variability and co-variability in the data. The co-variability matters greatly for how easy it is to detect outliers. For instance, it’s much easier for you to just “see” an outlier in this data:
Than in this data:
even though those two plots have means in about the same location. The strength of covariation matters greatly to our perception of what data are outliers in 2 dimensions, and this holds in higher dimensions as well (though it becomes incredibly difficult to see).
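If you want to see this effect for yourself, here is a purely illustrative sketch (with simulated, made-up data, not the Bristol data) that draws one strongly-correlated and one weakly-correlated cloud, with the same off-trend point added to each:
# Purely illustrative: simulate strongly- and weakly-correlated data
# to see why outliers are easier to spot when covariation is strong.
set.seed(42)
x = rnorm(200)
strong = x + rnorm(200, sd = 0.3)  # strong covariation with x
weak   = x + rnorm(200, sd = 2)    # weak covariation with x
par(mfrow = c(1, 2))
plot(x, strong, pch = 19, main = 'Strong covariation')
points(2, -2, col = 'red', pch = 19)  # easy to spot against the tight cloud
plot(x, weak, pch = 19, main = 'Weak covariation')
points(2, -2, col = 'red', pch = 19)  # the same point hides in the loose cloud
par(mfrow = c(1, 1))
The red point sits at exactly the same coordinates in both panels; it only looks like an outlier when the covariation around it is strong.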
One way that we can mathematically model this idea of extreme values (those that are unlike other values) using the mean co-variance in our problem is to use another kind of variance-weighted distance. Recently, we worked with Cohen’s \(d\), a univariate measure of the distance between two samples, standardized by the variance of the samples. Another measure of difference for 2 (or more) dimensional data is the Mahalanobis distance. This is a statistical distance that is defined as a weighted difference between observations and the “center” of the point cloud. The weighting is a function of the covariation between variables, and the “center” is the mean of each variable. The Mahalanobis distance itself measures how far out of the point cloud a given point is.
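For reference, though we won’t derive it here, the squared Mahalanobis distance of the \(i\)-th observation \(\mathbf{x}_i\) from the center of the point cloud \(\bar{\mathbf{x}}\) is
\[ D_i^2 = (\mathbf{x}_i - \bar{\mathbf{x}})^\top S^{-1}(\mathbf{x}_i - \bar{\mathbf{x}}) \]
where \(S\) is the sample covariance matrix of the variables. The mahalanobis() function in R returns exactly this squared distance, given the center and covariance you supply.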
1. Compute the Mahalanobis distance between each observation and the mean of housing and crime in R using the following code:
# Compute the Mahalanobis distance:
just_data = sf::st_drop_geometry(bristol[c('housing','crime')])
housing_mean = mean(bristol$housing)
crime_mean = mean(bristol$crime)
mdistance = mahalanobis(just_data,
c(housing_mean, crime_mean),
cov(just_data))
Then, make a histogram of mdistance. Describe the distribution of Mahalanobis distances. Why might the distances be distributed this way if the Mahalanobis distance measures how far out of the point cloud each point is?1

2. Make a scatterplot of the same housing ~ crime relationship we’ve seen before, but make each point’s color a function of the Mahalanobis distance we computed before.2 Below is the code (using ggplot, which we haven’t explicitly used yet) that can build the plot:
# We'll use ggplot because it's easier for the coloring:
library(ggplot2)
# Start a ggplot
ggplot2::ggplot(bristol, aes(x=crime, y=housing, color=mdistance)) +
# Then, add points at each x,y in (crime, housing)
ggplot2::geom_point() +
# add an "lm" smooth to the scatterplot with no standard error:
ggplot2::geom_smooth(method='lm', se=FALSE) +
# add a horizontal line at the average of housing
ggplot2::geom_hline(yintercept = mean(bristol$housing)) +
# add a vertical line at the average of crime
ggplot2::geom_vline(xintercept = mean(bristol$crime)) +
# use viridis colors instead of the default
ggplot2::scale_color_viridis_c(option='plasma') +
# and use a dark scheme so it's easier to see
ggplot2::theme_dark()
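One small housekeeping step (footnote 2 hints at this, though it isn’t shown in the code above): if you’d like the distances stored alongside the rest of the data, attach them as a column of bristol:
# Attach the Mahalanobis distance as a column of bristol,
# so it can be mapped like any other variable:
bristol$mdistance = mdistance
Then plot(bristol['mdistance']) will map the distances in the same way we map crime below.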
The idea of observations that just don’t fit within an expected relationship or structure in our data takes on special meaning in a geographical context. Given our existing explorations into spatial autocorrelation, measures of local spatial autocorrelation measure how outlier-y each observation is relative to its surroundings. In a similar manner to our previous discussions of outliers in a scatterplot, it is easy to develop a measure of local spatial autocorrelation from the Moran Scatterplot.
To get started, let us build touches weights, also known as contiguity weights, for our data. This can be done quickly using the spdep package in R. Using spdep, we can build an object that contains all the information about which areas touch which other areas:
library(spdep)
touches = spdep::poly2nb(bristol)
# spdep stores weights in several formats; convert the nb object to a listw:
touches = spdep::nb2listw(touches)
While before we used sf to build the within_distance relationships between every pair of observations manually, spdep wraps these kinds of requests into handy functions (which are also substantially faster). We can compute a few different types of weights using spdep, and do so much faster than we did before:
# All of the following are `nb` objects. Don't forget to convert them to listw!
within_2000m = spdep::dnearneigh(sf::st_centroid(bristol), 0, 2000)
within_4000m = spdep::dnearneigh(sf::st_centroid(bristol), 0, 4000)
nearest_10 = spdep::knearneigh(sf::st_centroid(bristol), k=10)
nearest_10 = spdep::knn2nb(nearest_10)
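As the comment notes, these are nb objects; a minimal sketch of converting them, mirroring what we did with touches, looks like this:
# Convert each nb object into a listw object, as we did with touches.
# (If any area ends up with no neighbours, add zero.policy=TRUE.)
within_2000m = spdep::nb2listw(within_2000m)
within_4000m = spdep::nb2listw(within_4000m)
nearest_10 = spdep::nb2listw(nearest_10)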
One common kind of local statistic used in spatial statistics relates to the Moran scatterplot.3 We can make this plot below in the way we did last practical:
nearby_crime = spdep::lag.listw(touches, bristol$crime)
bristol['nearby_crime'] = nearby_crime
plot(nearby_crime ~ crime, data=bristol, pch=20)
abline(lm(nearby_crime ~ crime, data=bristol), lwd=2, col='orangered')
abline(h=mean(bristol$nearby_crime), col='slategrey', lwd=2, lty='dashed')
abline(v=mean(bristol$crime), col='slategrey', lwd=2, lty='dashed')
And this scatterplot corresponds to the map of crime in our data:
plot(bristol['crime'], lwd=.1)
Recalling the formula of the line of best fit, we can express the relationship between the variable observed at sites \(i\) and \(j\) (spelled \(x_i\) and \(x_j\)) as a function of the variability (and co-variability) of \(x_i\) and \(x_j\) around the mean, weighted by the spatial relationship between sites \(i\) and \(j\), modeled by \(w_{ij}\). Formally, this is stated:
\[ I = \frac{N}{\sum_i^N\sum_j^N w_{ij}}\frac{\sum_i^N\sum_j^Nw_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i^N \left(x_i - \bar{x}\right)^2}\]
But, practically speaking, you can think of the top as the spatially-weighted co-variation for all pairs of sites \(i\) and \(j\). You can think of the bottom as the intrinsic variation, or just the variance, of \(x\), regardless of site.

Like the Mahalanobis distance we looked at above, we might want to try to build a measure of how outlier-y the data at each site \(x_i\) is relative to its local average, \(\sum_j^N w_{ij}(x_j - \bar{x})\). To build this, Anselin (1995) broke the estimator up into two distinct parts:4
\[ I = \sum_i^N (x_i - \bar{x}) \left[\frac{N}{\sum_i^N\sum_j^Nw_{ij}} \frac{\sum_j w_{ij}(x_j - \bar{x})}{\sum_i^N(x_i - \bar{x})^2}\right]\]
Then, we can say that the contribution of each site to Moran’s \(I\), the slope of the line shown above, is a result of the partial sums over \(i\). That is, we just pick some \(i\), and then the local Moran’s \(I\), the co-variation between the site and the local average, becomes:
\[ I_i = (x_i - \bar{x})\left[\frac{N}{\sum_i^N\sum_j^N w_{ij}} \frac{\sum_j^Nw_{ij}(x_j - \bar{x})}{\sum_i^N (x_i - \bar{x})^2} \right] \]
This statistic is called the local Moran’s \(I\).
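As a quick sanity check (a sketch, not a required step), the slope of the best-fit line in the Moran scatterplot above should come out very close to the global Moran’s \(I\) that spdep computes directly:
# With row-standardised weights, the slope of the Moran scatterplot's
# best-fit line is (approximately) the global Moran's I:
coef(lm(nearby_crime ~ crime, data=bristol))['crime']
spdep::moran.test(bristol$crime, touches)$estimate['Moran I statistic']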
Local Moran’s \(I\) statistics are one of the most common ways to understand the local variability of data.

1. Compute the Local Moran statistic for crime using spdep:
local_moran = spdep::localmoran(bristol$crime, touches)
2. Compute the Local Moran statistic for the other notions of nearness as well (within_2000m, within_4000m, and nearest_10).5

3. The local_moran result is a matrix with a few columns. Ii is the Moran score itself. Using one of the senses of geographical nearness, make a Moran scatterplot for nearby_crime ~ crime, using the first column of the local_moran matrix (e.g. within_2000m_moran[,1]) to color in the points in the scatterplot. Where are the large, positive values of the Local Moran statistic? Where are the large, negative values?6

4. Make a map of the Local Moran’s \(I_i\) values. Keeping in mind that large local statistics are not necessarily large crime values (recall the scatterplot above), how can you interpret the areas of light yellow and areas of dark blue in the plot?
5. Do the same for housing, income, education, and employment. Interpret the patterns in the Local Moran scores there, too. Which patterns are the most similar? The most different?

The Local Moran statistic (like the Mahalanobis statistic) is actually amenable to statistical inference, in addition to simply visualizing the structure of the data. Here, let’s do some significance testing on our results. The null hypothesis for a Local Moran’s \(I_i\) test is that there is no local clustering or outlier structure around the site in question.
1. How many of the scores in the local_moran matrix are significant at a 95% significance level? How about at a 99% significance level?
local_moran_scores = local_moran[ ,1]
p_values = local_moran[ ,5]
is_significant = p_values <= # set your significance level here!

2. Make a Moran scatterplot of nearby_crime ~ crime like we have done before. Then, in addition to the lines we add for axes and a line of best fit, use points to plot over the significant \(I_i\) values in red.

3. Make a map of crime like we have done before. Be sure to set the reset=FALSE option, however. It also will help legibility if you use lwd=.01, to make the width of the lines between LSOAs very small. Then, add the following line to plot the outlines of the significant values in white (a sketch putting this together follows after this list):
plot(sf::st_geometry(bristol[is_significant,]), lwd=2, border='white', add=TRUE, reset=FALSE)
Local spatial statistics are a form of outlier diagnostic, designed to give you insight into how the structure of the relationship between an observation and its surroundings changes right around the observation itself. Because they are so flexible, local spatial statistics are among the most commonly-used spatial statistics across the social sciences. These statistics, especially the local Moran’s \(I\), are easy to understand, simple to visualize, and quick to compute. Thus, as you’ve seen, they give a good indication of how to explore the local structure of univariate data in a given dataset.
In the future, however, we will need to analyze more than just one variable at a time. Indeed, most social or physical modelling problems involve multiple variables all at once. To do this, we need to build more models, not just use exploratory statistics. Our final lab for this segment will consider how geography is built into models, and strategies that are common in all forms of spatial models (not just regression).
1. Put another way, if everyone is unique, how can anyone be unique? Some points have to be more “typical” than others.
2. Add the Mahalanobis distance as a column.
3. As a bit of an R challenge, try making this scatterplot in ggplot instead of basic plot. No pressure; ggplot is tough.
4. This is possible because you can factor out each \(x_i - \bar{x}\) from the inner sum over \(j\). If you think of site \(i\) as the “focus,” then the sum over \(i\) in the numerator is capturing the contribution of each site. The sum over \(j\) happens for each site, and reflects the local average at site \(i\). Thus, we can pull out that sum over \(i\) from the part that talks about how to get the local average at each site. The denominator is the variance of \(x\), and never changes regardless of where we take the sum or where we compute the local average.
5. Don’t forget: we had to use spdep::nb2listw to convert the touches object from one type of representation to another. This is simply a legacy issue for spdep.
6. Even if you didn’t attempt the challenge on the Mahalanobis distance, there is R code in part 2 that may be of assistance.
7. These are just nicknames for these structures of relationships, and many people use different ones. Think of what they represent: a focal point and its surroundings are all higher than average, a focal point and its surroundings are all below average, or the focal point is below (above) average while its surroundings are above (below) average. The names are only meant to be evocative and geographical; other kinds of names include hotspot/coldspot (for plateau/playa), donut/diamond-in-the-rough (for pit/peak), or cluster/outlier (for both plateau & playa and peak & pit). There is no standardization here.