library(tidyverse)
## ── Attaching packages ───────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
temps = read_csv('../data/bristol_temps.csv')
knitr::kable(head(temps))
year | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1900 | 5.3 | 3.6 | 4.3 | 9.3 | 11.2 | 15.0 | 18.9 | 16.5 | 15.2 | 11.0 | 8.0 | 7.8 |
1901 | 4.1 | 2.8 | 5.3 | 9.6 | 12.7 | 15.0 | 18.4 | 16.3 | 14.7 | 10.4 | 5.1 | 4.2 |
1902 | 5.6 | 2.2 | 7.6 | 8.2 | 10.1 | 14.5 | 16.2 | 15.9 | 14.1 | 10.6 | 7.6 | 5.1 |
1903 | 5.1 | 7.4 | 8.1 | 7.7 | 12.3 | 13.6 | 16.4 | 15.4 | 14.3 | 11.7 | 7.5 | 4.5 |
1904 | 5.0 | 4.5 | 5.1 | 9.7 | 11.8 | 14.5 | 18.1 | 16.2 | 13.4 | 10.3 | 6.3 | 5.3 |
1905 | 4.1 | 5.9 | 7.6 | 8.4 | 12.0 | 15.4 | 18.5 | 15.8 | 13.2 | 8.1 | 5.4 | 5.2 |
This data is messy. Each value is a temperature for a given month in a given year, so we should have a column for year, a column for month, and a column for temperature. Instead, the months are spread across the columns and the years run down the rows, which means the data is not tidy. Tidy data might look something like this:
year | month | temperature |
---|---|---|
1900 | jan | 5.3 |
1900 | feb | 3.6 |
1900 | mar | 4.3 |
1900 | apr | 9.3 |
1900 | may | 11.2 |
1900 | jun | 15.0 |
1900 | jul | 18.9 |
1900 | aug | 16.5 |
1900 | sep | 15.2 |
1900 | oct | 11.0 |
1900 | nov | 8.0 |
1900 | dec | 7.8 |
But, who cares? I can use the `rowMeans()` function to get the average temperature in each year. That answers my question:
plot(temps$year, rowMeans(temps[ , -1]))
But… what if we want something that's not the mean? We have `rowSums()`, but there is no `rowSds()`, `rowMedians()`, or much of anything else, so each new summary needs its own workaround (a sketch follows below). If we had tidy data, we would not have to rely on special functions like `rowMeans()` at all. So, let's do something better.
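For completeness, here is what a yearly median might look like with the wide data, using base R's `apply()` (a rough sketch of the workaround, not the path we take below):

# apply median() across the rows of the wide data, dropping the year column
yearly_median = apply(temps[ , -1], 1, median)
plot(temps$year, yearly_median)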
The data currently has a variable (`month`) spread across its columns. Focusing on the values as we discussed above, there are only three variables: `year`, `month`, and `temperature`. If the data were tidy, then these would be the only columns.
To make data longer, we use the `pivot_longer()` function:
temps %>% pivot_longer(jan:dec)
## # A tibble: 1,428 x 3
## year name value
## <dbl> <chr> <dbl>
## 1 1900 jan 5.3
## 2 1900 feb 3.6
## 3 1900 mar 4.3
## 4 1900 apr 9.3
## 5 1900 may 11.2
## 6 1900 jun 15
## 7 1900 jul 18.9
## 8 1900 aug 16.5
## 9 1900 sep 15.2
## 10 1900 oct 11
## # … with 1,418 more rows
This is almost our tidy example from above, except the column names are wrong. We can rename the columns inside the `pivot_longer()` call:
long_temps = temps %>% pivot_longer(jan:dec, names_to='month', values_to='temp')
knitr::kable(head(long_temps))
year | month | temp |
---|---|---|
1900 | jan | 5.3 |
1900 | feb | 3.6 |
1900 | mar | 4.3 |
1900 | apr | 9.3 |
1900 | may | 11.2 |
1900 | jun | 15.0 |
Together, this gives us what we need to replicate the `rowMeans()` solution in a tidy fashion: computing the average temperature in each year from this long data. That is, we need to group the data by a specific column and then summarize the values within each group.
We can do that with `group_by()` and `summarize()`. `group_by()` collects together sets of rows that share a common value. Then, we typically use the `summarize()` function (or `summarise()`) to reduce the values in each group according to some rule. Here, we use the standard R `mean()` function to get the mean of the temperatures in each group. This gives us a result similar to the `rowMeans()` one above.
long_temps %>% group_by(year) %>% summarize(mean_temp = mean(temp))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 119 x 2
## year mean_temp
## <dbl> <dbl>
## 1 1900 10.5
## 2 1901 9.88
## 3 1902 9.81
## 4 1903 10.3
## 5 1904 10.0
## 6 1905 9.97
## 7 1906 10.4
## 8 1907 9.88
## 9 1908 10.5
## 10 1909 9.53
## # … with 109 more rows
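And once the data is long, swapping in a different summary is trivial; for example, a quick sketch with the median and standard deviation instead of the mean:

# any summary function works the same way inside summarize()
long_temps %>%
  group_by(year) %>%
  summarize(median_temp = median(temp),
            sd_temp = sd(temp))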
We can plot the yearly means in the same way, or use `ggplot2` to make a plot that's a little prettier:
long_temps %>%
group_by(year) %>% # make the groups for each year
summarize(temp = mean(temp)) %>% # get the average temperature
ggplot(aes(x=year, y=temp)) + # plot the result
geom_point() +
geom_line() +
geom_smooth()
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Alternatively, since the data is already tidy, we can plot the values by year in quite a few ways. Just as a bunch of dots in each year:
ggplot(long_temps, aes(x=year, y=temp)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
or as a set of boxplots in each year:
ggplot(long_temps, aes(x=year, group=year, y=temp)) + geom_boxplot()
A climate curve shows the range of temperatures a place experiences every month. It's a fairly standard summary of the climate of a place, and basically every Wikipedia page on a place has a version of one. For this, we need a summary of the typical temperature per month, which we then plot.
There are a few ways this could be done with the tidy data, but doing it with the original wide data would be very verbose.
First, we could use the `group` aesthetic in `ggplot()`:
ggplot(long_temps, aes(x=month, y=temp, group=year)) + geom_line()
But that doesn't look quite right: it's sorting the months in alphabetical order, not calendar order. We can force the right order by making the months an ordered factor, which is a categorical variable with a set ordering. To do this, we first need the desired order of the levels; I'll take them from the columns of our original un-tidy dataframe:
month_order = colnames(select(temps, -year))
and then use `ordered()` to force the months into an ordered categorical variable:
long_temps$month = ordered(long_temps$month, levels = month_order)
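We can quickly check that the levels are now in calendar order:

levels(long_temps$month)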
Finally we can make our plot:
long_temps %>% ggplot(aes(x=month, y=temp, group=year)) +
geom_line(alpha=.1) +
geom_smooth(color='orangered', se=F, aes(group=NA))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
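For a numeric version of the same monthly summary, we could also group by month instead of year; a quick sketch:

# typical temperature (and its spread) in each month, across all years
long_temps %>%
  group_by(month) %>%
  summarize(mean_temp = mean(temp),
            sd_temp = sd(temp))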
This next dataset contains information on sample sites in the ocean. It provides measures of nutrients, as well as the prevalence of several species of phytoplankton.
species = read_csv('../data/species.csv')
knitr::kable(head(species))
Year | Date | Latitude | Longitude | Depth | A. robusta | A. unicornis | C. divergens | C. dentata | E. huxleyi | G. ericsonii | S. florida | NiCompounds | Phosphate | Silicate | Salinity | Temperature | Chlorophyll |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2003 | 2003-05-22 | -29.5 | -26.9 | 13 | 0.00000 | 60.75334 | 0.000 | 0 | 1761.8469 | 1397.3269 | 0 | 0.080900 | 0.0991000 | 1.08 | 35.804 | 21.330 | 0.03 |
2003 | 2003-05-22 | -29.5 | -26.9 | 116 | 353.35689 | 117.78563 | 0.000 | 0 | 353.3569 | 1060.0707 | 0 | 2.402083 | 0.3227000 | 1.32 | 35.740 | 17.220 | 0.11 |
2003 | 2003-05-24 | -21.6 | -25.0 | 22 | 71.27584 | 0.00000 | 0.000 | 0 | 997.8617 | 142.5517 | 0 | 0.012900 | 0.2010156 | 1.07 | 37.290 | 26.618 | 0.03 |
2003 | 2003-05-24 | -21.6 | -25.0 | 34 | 0.00000 | 112.10762 | 0.000 | 0 | 1681.6143 | 224.2152 | 0 | 0.013100 | 0.2034769 | 1.08 | 37.291 | 26.620 | 0.03 |
2003 | 2003-05-24 | -21.6 | -25.0 | 60 | 0.00000 | 102.45902 | 102.459 | 0 | 1434.4262 | 2356.5574 | 0 | 0.016300 | 0.2082746 | 1.10 | 37.290 | 26.624 | 0.03 |
2003 | 2003-05-24 | -21.6 | -25.0 | 140 | 186.56716 | 0.00000 | 0.000 | 0 | 149.2537 | 298.5075 | 0 | 0.025800 | 0.3755803 | 1.23 | 36.608 | 21.342 | 0.11 |
For this specific question, the data is actually tidy! The values in, say, the `Salinity` or `Phosphate` columns represent the salinity or phosphate concentration of a specific sample site in the ocean. If we want to ask questions about the relationship between those variables for each observation, then the data is tidy enough:
species %>% ggplot(aes(x=Phosphate, y=Salinity)) + geom_point()
However, an enterprising student will notice that the species/genera prevalence columns are not tidy! Each genus is denoted by the first letter in the column name. So, the values in the column called `E. huxleyi` actually reflect the prevalence (in organisms per volume) of the organism E. huxleyi. Further, that column name mixes the genus and species names together! So, asking questions about the relationship between species and genera, or the co-occurrence of genera/species, requires tidying this data a bit.
To split these columns up, we can use `pivot_longer()` again, in much the same way as before:
species_long = species %>% pivot_longer(`A. robusta`:`S. florida`, names_to=c("Genus", "Species"), names_sep='. ', values_to='prevalence')
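Peeking at a handful of rows and columns shows what the split produced (this particular column subset is just for display):

species_long %>%
  select(Date, Depth, Salinity, Genus, Species, prevalence) %>%
  head(8)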
But note that we've now duplicated tons of measurements from the other columns! This immediately demonstrates why it's sometimes not economical to record data in a tidy format… it often involves a lot of repetition! Now, though, we can look at the distributions of species/genera in our data, since each value of the `prevalence` column is a single observation of the prevalence of a single organism.
species_long %>%
ggplot(aes(x=Genus, y=prevalence, color=Species)) + geom_boxplot() +
scale_y_continuous(trans='log10') + annotation_logticks(sides='l') + ylab("Prevalence (log10)")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 664 rows containing non-finite values (stat_boxplot).
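With the tidy prevalence values, per-organism summaries are also one-liners; for example, a sketch of the average prevalence of each genus/species pair across all samples:

species_long %>%
  group_by(Genus, Species) %>%
  summarize(mean_prevalence = mean(prevalence))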
Be patient!

Here is a quick reference for the verbs used above (and a few related ones); a toy example follows the list.

- `pivot_wider()`: take rows and convert them into columns
- `pivot_longer()`: take columns and convert them into rows
- `group_by(group_name) %>% summarize(group_summary = ..., group_summary2 = ...)`: split the dataframe into groups based on the unique values of `group_name` & summarize those groups using one (or more) functions
  - `count()`/`n()`: count the number of observations in the group
  - `first(variable)`: get the first value of `variable` in the group
  - `last(variable)`: get the last value of `variable` in the group
  - `n_distinct(variable)`: get the number of unique values of `variable` in the group (with several variables, the number of unique combinations)
- `filter()`: subset the data by rows according to some condition(s)
  - `between(variable, lower, higher)`: is `TRUE` for rows where the value of `variable` is at least as large as `lower` and at most as large as `higher`
  - `distinct(variable)`: keep only the first row for each distinct value of `variable` (going from the first row to the last)
- `select()`: subset the data by columns according to some condition(s)
  - `contains(value)`: selects columns whose names contain the string `value`. See `??tidy_select` for more.
- `mutate(new_variable = do_something(old_variable))`: add a `new_variable` to the data that is computed using an `old_variable` of the dataframe
  - `transmute(new_variable = do_something(old_variable))`: same as `mutate()`, but only `new_variable` is returned
  - `rename(new_name = old_name)`: rename the `old_name` column to be called `new_name`
- `separate(variable, sep="_")` / `unite(variable1, variable2, sep="_")`: split (or join) a column into new columns based on a separator, `sep`
  - `separate_rows(variable, sep=':')`: split a column into rows, where each new row is a copy of the original, but with its value of `variable` split according to `sep`
- `arrange(variable)`: sort the dataframe based on increasing values of `variable` (i.e. the sort is ascending)
  - `arrange(desc(variable))`: sort the dataframe based on decreasing values of `variable`
- `left_join()`/`right_join()`/`inner_join()`/`full_join()`: merge two dataframes by matching key columns, keeping all rows of the left, the right, only the matching rows, or all rows, respectively
- `semi_join()`/`anti_join()`: keep (or drop) rows of the first dataframe that have a match in the second
- `expand_grid(a, b, c)`: make a new dataframe out of all the combinations of values in `a`, `b`, and `c`
- `seq(from, to, by)`: make a sequence from `from` to `to` with a step size of `by`
- `rep(vector, number)`: repeat the `vector` `number` times
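As promised above, here is a toy sketch stringing a few of these verbs together; the tibble and its values are made up purely for illustration:

# a made-up tibble to exercise a few of the verbs listed above
toy = tibble(id = c('a_1', 'a_2', 'b_1'),
             value = c(10, 20, 30))

toy %>%
  separate(id, into = c('group', 'replicate'), sep = '_') %>%  # split one column into two
  mutate(double_value = value * 2) %>%                         # add a computed column
  filter(between(value, 15, 35)) %>%                           # keep rows within a range
  arrange(desc(value))                                         # sort descending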