```r
library(tidyverse) # for tidy tools and reshaping methods
library(sf)        # for working with spatial data
library(ggplot2)   # for plotting

weca = st_read('../data/weca.gpkg', quiet = TRUE)
knitr::kable(head(st_drop_geometry(weca))[, c(4, 5, 6, 7)])
```
lsoa_name | price_dec_1995 | price_mar_1996 | price_jun_1996 |
---|---|---|---|
Bath and North East Somerset 007A | 49000 | 54000 | 85500 |
Bath and North East Somerset 007B | 68250 | 68950 | 67000 |
Bath and North East Somerset 007C | 72500 | 73750 | 66500 |
Bath and North East Somerset 010E | 111250 | 110000 | 103500 |
Bath and North East Somerset 010B | 70000 | 89500 | 89500 |
Bath and North East Somerset 010A | 62500 | 60500 | 64000 |
The following dataset of temperatures from the Filton weather station has a different tidiness problem. Which problem is it, and how can you tell?
```r
library(readr)
temps = read_csv("../data/bristol_yearly_temps.csv")
temps
```
year | stat | degrees_celsius |
---|---|---|
2016 | maximum | 18.7 |
2016 | median | 10.6 |
2016 | minimum | 6.1 |
2017 | maximum | 18.4 |
2017 | median | 12.4 |
2017 | minimum | 5.5 |
2018 | maximum | 21.1 |
2018 | median | 11.5 |
2018 | minimum | 4.4 |
This temperature data is too long. It mixes information about different kinds of statistics in the `stat` column, and therefore stores many different kinds of temperature in the `degrees_celsius` column. This is Wickham's second problem: multiple variables are stored in a single column; here, the maximum, minimum, and median temperatures are all stored in `degrees_celsius`.
How would you tidy the data & compute the range of temperatures for each year?
To resolve this issue, we need to spread (or `pivot_wider`) the data, splitting the `stat` column into three separate variables.
```r
temps_wide = pivot_wider(temps,
    id_cols = year,                 # this uniquely identifies each row
    names_from = stat,              # this contains the "names" of our variables
    values_from = degrees_celsius   # this contains the values of our variables
)
temps_wide %>% mutate(range = maximum - minimum)
```
year | maximum | median | minimum | range |
---|---|---|---|---|
2013 | 20.9 | 10.55 | 4.4 | 16.5 |
2014 | 20.0 | 12.80 | 6.5 | 13.5 |
2015 | 17.3 | 12.05 | 5.3 | 12.0 |
2016 | 18.7 | 10.60 | 6.1 | 12.6 |
2017 | 18.4 | 12.40 | 5.5 | 12.9 |
2018 | 21.1 | 11.50 | 4.4 | 16.7 |
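As an aside, the range can also be computed directly on the long data, because `max()` and `min()` within each year pick out the extreme rows regardless of their `stat` labels. A minimal sketch on a small stand-in for `temps` (values taken from the table above):

```r
library(dplyr)

# Stand-in for the temps data, using values from the table above
temps_demo <- tibble::tribble(
  ~year, ~stat,     ~degrees_celsius,
  2016,  "maximum", 18.7,
  2016,  "minimum",  6.1,
  2018,  "maximum", 21.1,
  2018,  "minimum",  4.4
)

# No pivot needed: summarize the long data per year
ranges <- temps_demo %>%
  group_by(year) %>%
  summarize(range = max(degrees_celsius) - min(degrees_celsius))
ranges
```

This sidesteps the pivot entirely, though the wide form above is still the more useful shape when you need the maximum, median, and minimum as separate columns.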
An alternative approach to computing the temperature range is to filter the `temps` data. Even though it gives the right result, it is more complicated, because the actual "pivot" happens manually: you select the maximum and minimum values individually using the `filter` and `select` verbs. This only works because you already know which columns the range requires, and that number of columns is relatively small. If the spread were more complicated, or involved more variables, this approach would entail significant copying & pasting.
```r
maxes = temps %>% filter(stat == 'maximum') %>% # get the maximum rows
    select(-stat) %>%                           # drop the stat column, just keep year & value
    rename(maximum = degrees_celsius)           # rename the degrees to maximum
mins = temps %>% filter(stat == 'minimum') %>%
    select(-stat) %>%                           # drop the stat column, as before
    rename(minimum = degrees_celsius)           # rename the degrees to minimum
join = merge(maxes, mins, by = 'year')          # join the minima and maxima by year
join %>% mutate(range = maximum - minimum)      # compute the range
```
| year | maximum | minimum | range |
---|---|---|---|---|
114 | 2013 | 20.9 | 4.4 | 16.5 |
115 | 2014 | 20.0 | 6.5 | 13.5 |
116 | 2015 | 17.3 | 5.3 | 12.0 |
117 | 2016 | 18.7 | 6.1 | 12.6 |
118 | 2017 | 18.4 | 5.5 | 12.9 |
119 | 2018 | 21.1 | 4.4 | 16.7 |
The `weca` data has one of these issues. Which one does it have, and how can you tell?

The `weca` data is too wide. It mixes information about the quarter/year in with the information about price. This means that the data has Wickham's first problem: column headers are values, not variable names. In theory, lsoa, la, price, quarter, and year are distinct variables that each encode a distinct bit of information about the observation. Further, this means that the "tidy" version of this data contains a single row for the price of a single LSOA in a single quarter in a single year. The data needs to be "melted" (pivoted longer) to become tidy, which will be completed in the next step.
It will be helpful to have tidy data for some analyses in the remainder of the exercise. Use a pivot to transform the data so that each row measures the price of an LSOA in a specific month & year.

```r
wecatidy = pivot_longer(weca, price_dec_1995:price_dec_2018,
                        names_to = c(NA, 'quarter', 'year'), # NA drops the "price" prefix
                        names_sep = '_', values_to = 'price')
```
la_code | la_name | lsoa_code | lsoa_name | lsoa11cd | geom | quarter | year | price |
---|---|---|---|---|---|---|---|---|
E06000022 | Bath and North East Somerset | E01014370 | Bath and North East Somerset 007A | E01014370 | MULTIPOLYGON (((-2.357691 5… | dec | 1995 | 49000 |
E06000022 | Bath and North East Somerset | E01014370 | Bath and North East Somerset 007A | E01014370 | MULTIPOLYGON (((-2.357691 5… | mar | 1996 | 54000 |
E06000022 | Bath and North East Somerset | E01014370 | Bath and North East Somerset 007A | E01014370 | MULTIPOLYGON (((-2.357691 5… | jun | 1996 | 85500 |
E06000022 | Bath and North East Somerset | E01014370 | Bath and North East Somerset 007A | E01014370 | MULTIPOLYGON (((-2.357691 5… | sep | 1996 | 122000 |
E06000022 | Bath and North East Somerset | E01014370 | Bath and North East Somerset 007A | E01014370 | MULTIPOLYGON (((-2.357691 5… | dec | 1996 | 129000 |
E06000022 | Bath and North East Somerset | E01014370 | Bath and North East Somerset 007A | E01014370 | MULTIPOLYGON (((-2.357691 5… | mar | 1997 | 129000 |
Using your new tidy data, use `ggplot2` to make a boxplot where the horizontal axis is years, the vertical axis is price, and boxes are colored by local authority.

```r
ggplot(wecatidy, aes(x = year, y = price, color = la_name)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
```
## Warning: Removed 581 rows containing non-finite values (stat_boxplot).
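The warning reports LSOA–quarter combinations with no recorded price (`NA` values), which `geom_boxplot` silently drops. If you prefer to make that explicit, you can remove the missing rows yourself before plotting. A sketch of the idea on a toy tibble (the column names mirror `wecatidy`, but the values are made up):

```r
library(dplyr)

# Toy data in the shape of wecatidy's year/price columns
toy <- tibble::tibble(
  year  = c('1995', '1995', '1996'),
  price = c(49000, NA, 54000)
)

# Dropping the NA prices beforehand avoids the non-finite-values warning
clean <- toy %>% filter(!is.na(price))
nrow(clean)
```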
Using tidy verbs such as `group_by`, `select`, `mutate`, and `merge`, can you:

```r
wecatidy %>% group_by(year) %>% summarize(mean_price = mean(price, na.rm = TRUE))
```
year | mean_price |
---|---|
1995 | 58154.10 |
1996 | 58988.64 |
1997 | 62572.58 |
1998 | 69444.76 |
1999 | 76872.93 |
2000 | 91882.90 |
If you use `group_by(year, la_name)`, the summary is built for each combination of `year` and `la_name`; `summarize` then drops the innermost grouping (`la_name`), so the result can be summarized again by `year` if requested:

```r
wecatidy %>% group_by(year, la_name) %>%
    summarize(mean_price = mean(price, na.rm = TRUE))
```
year | la_name | mean_price |
---|---|---|
1995 | Bath and North East Somerset | 70375.12 |
1995 | Bristol, City of | 51259.92 |
1995 | South Gloucestershire | 60340.01 |
1996 | Bath and North East Somerset | 72169.71 |
1996 | Bristol, City of | 51352.70 |
1996 | South Gloucestershire | 61347.14 |
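By default, `summarize` drops only the innermost grouping variable, so the result above is still grouped by `year`. If you want a completely ungrouped result, dplyr's `.groups` argument makes this explicit. A minimal sketch on a toy tibble (hypothetical values, not from the data):

```r
library(dplyr)

toy <- tibble::tibble(
  year    = c(1995, 1995, 1996),
  la_name = c('A', 'B', 'A'),
  price   = c(10, 20, 30)
)

# .groups = 'drop' returns an ungrouped tibble instead of one grouped by year
out <- toy %>%
  group_by(year, la_name) %>%
  summarize(mean_price = mean(price), .groups = 'drop')
dplyr::is_grouped_df(out)
```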
```r
wecatidy %>% filter(year >= 2008) %>%
    group_by(lsoa_code) %>% summarize(price = median(price))
```
lsoa_code | price |
---|---|
E01014370 | 254375.0 |
E01014371 | 251662.5 |
E01014372 | 280000.0 |
E01014373 | 224250.0 |
E01014374 | 317500.0 |
E01014375 | 399499.0 |
```r
library(rcartocolor)

price_changes = wecatidy %>% filter(year >= 2008) %>%
    group_by(lsoa_code, year) %>%
    summarize(price = median(price, na.rm = TRUE)) %>%
    arrange(year) %>%
    summarize(pct_change = (last(price) - first(price)) / first(price) * 100)
price_changes = merge(weca[c('lsoa_code', 'geom')], price_changes, by = 'lsoa_code')
```
The price change table now looks like this:
lsoa_code | pct_change | geometry |
---|---|---|
E01014370 | 26.378667 | MULTIPOLYGON (((-2.357691 5… |
E01014371 | 86.639470 | MULTIPOLYGON (((-2.351819 5… |
E01014372 | 33.333333 | MULTIPOLYGON (((-2.356748 5… |
E01014373 | 25.000000 | MULTIPOLYGON (((-2.320094 5… |
E01014374 | 34.724409 | MULTIPOLYGON (((-2.307303 5… |
E01014375 | 2.054054 | MULTIPOLYGON (((-2.342037 5… |
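The `pct_change` column is just the relative change between the first and last yearly medians after sorting by year. A minimal sketch of the `first()`/`last()` arithmetic on made-up prices (the values here are hypothetical, not from the data):

```r
library(dplyr)

# Hypothetical yearly medians for one LSOA, already arranged by year
prices <- c(200000, 220000, 250000)

# Percentage change from the earliest to the latest year
pct_change <- (last(prices) - first(prices)) / first(prices) * 100
pct_change
```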
And we can make a map using either a continuous fill:

```r
price_changes %>%
    ggplot(., aes(fill = pct_change)) +
    geom_sf(lwd = .1) +
    scale_fill_carto_c('% Change in Price', palette = 'Sunset', direction = -1)
```
or discrete map classification methods, which emphasize the visual differences between groups of observations:

```r
price_changes %>%
    mutate(map_class = cut(pct_change,
                           quantile(pct_change, probs = seq(0, 1, .2)),
                           include.lowest = TRUE)) %>% # keep the minimum value in the first class
    ggplot(., aes(fill = map_class)) +
    geom_sf(lwd = .1) +
    scale_fill_carto_d('% Change in Price', palette = 'Sunset', direction = -1)
```