In the topic selection quiz, we had 17 voters (out of an electorate of 26 people), for a turnout of 65.4 percent! The responses from Microsoft Forms are shown below:
library(readxl)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.1 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(knitr)
responses = read_excel("../content/data/topic-survey.xlsx", sheet='Form1')
responses %>% head(1) %>% kable()
ID | Start time | Completion time | Email | Rank the following options below in terms of which topic you’d like to study for the final section of the course, either by clicking and dragging the option to its desired place, or by clicking/ta… |
---|---|---|---|---|
1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | Splines and friends [ISL Ch. 7] Splines (and other “Generalized Additive Models”) are an extension of typical regression models that allow for nonlinearity. They’re what you see by default any time you use “geom_smooth()” in ggplot2 (which is a kernel regression called LOWESS). They’re most useful when trends are nonlinear. They see loads of applications in biogeography, epidemiology, public health, and conservation. ;Unsupervised Learning by Data Reduction: reducing redundant variables [ISL Ch. 12.2] Oftentimes, data that we analyze is redundant, meaning that two (or more) variables often measure basically the same thing. Data Reduction is a practice involving a variety of approaches (such as principal components analysis) that can “compress” many different variables into a few “composite” variables to simplify analysis. This is very often done when building indexes or other hybrid measures of an underlying phenomena that is not directly measured. Classic examples of this are in the SoVI social vulnerability index used in assessing hazard risk or the Index of Multiple Deprivation in the UK. As with all unsupervised learning techniques, there is no “right answer,” only various ways to assess quality. ;Unsupervised Learning with Clustering: building “types” of data [ISL Ch. 12.4] Clustering involves building up “types” of observations as a kind of statistical shorthand to describe your data. This is often used in demographic research, ecology, and is one of the widest used methods for data exploration. Clustering methods generally are useful when you want to learn about structure in your data, and understand what observations are similar to one another. As an “unsupervised technique,” there generally is no “right answer,” but there are metrics that tell you which answers are better than others. ;Multilevel modelling: learning from context [SR 12] Multilevel models are a technique from statistics that sees increasing adoption across social and environmental science. The principle behind multilevel models is similar to that in other data science approaches: nest simpler models within one another. Multilevel models allow you to specify how parts of your model may themselves be outcomes from another process. For example, your income may depend on your age, your job, and your seniority. Your seniority also depends on your age. A multilevel model will recognize this, and can simultaneously estimate the relationship between age and seniority, and use that to “correctly” assign how much information about your earnings comes from age directly, versus what information about age “leaks in” through seniority. ;Regression trees and forests [ISL Ch. 8] Regression trees are “rule-based” predictors of an outcome. That is, they learn the associations between an outcome and the input data with a “decision tree” (see, for instance, one predicting survival on the Titanic: https://en.wikipedia.org/wiki/Decision_tree_learning#/media/File:Decision_Tree.jpg). With enough variables, we can build a forest of decision trees that can be useful in predicting new data, and also which gives us an indication of which variables are “useful” in predicting an outcome. These see wide use across social and environmental sciences. ;None of the above If you rank this option, tell me what you want in the free response below. Otherwise, leave this last! ; |
This data is not tidy! In order for me to run an election, I need to make it tidy.
The first step, I think, is to make the column names easier to work with. So, let’s shorten them:
colnames(responses) <- c('id', 'start', 'finish', 'email', 'ranking')
Then, Microsoft Forms stitches together the text of the ranked options using the “;” separator. So, I can split the rankings apart using the separate() function!
responses_ranked <- responses %>%
separate(ranking, sep=';', into=c('1','2','3','4','5','6'))
## Warning: Expected 6 pieces. Additional pieces discarded in 17 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17].
responses_ranked %>% head(2) %>% kable()
id | start | finish | email | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|---|---|
1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | Splines and friends [ISL Ch. 7] Splines (and other “Generalized Additive Models”) are an extension of typical regression models that allow for nonlinearity. They’re what you see by default any time you use “geom_smooth()” in ggplot2 (which is a kernel regression called LOWESS). They’re most useful when trends are nonlinear. They see loads of applications in biogeography, epidemiology, public health, and conservation. | Unsupervised Learning by Data Reduction: reducing redundant variables [ISL Ch. 12.2] Oftentimes, data that we analyze is redundant, meaning that two (or more) variables often measure basically the same thing. Data Reduction is a practice involving a variety of approaches (such as principal components analysis) that can “compress” many different variables into a few “composite” variables to simplify analysis. This is very often done when building indexes or other hybrid measures of an underlying phenomena that is not directly measured. Classic examples of this are in the SoVI social vulnerability index used in assessing hazard risk or the Index of Multiple Deprivation in the UK. As with all unsupervised learning techniques, there is no “right answer,” only various ways to assess quality. | Unsupervised Learning with Clustering: building “types” of data [ISL Ch. 12.4] Clustering involves building up “types” of observations as a kind of statistical shorthand to describe your data. This is often used in demographic research, ecology, and is one of the widest used methods for data exploration. Clustering methods generally are useful when you want to learn about structure in your data, and understand what observations are similar to one another. As an “unsupervised technique,” there generally is no “right answer,” but there are metrics that tell you which answers are better than others. | Multilevel modelling: learning from context [SR 12] Multilevel models are a technique from statistics that sees increasing adoption across social and environmental science. The principle behind multilevel models is similar to that in other data science approaches: nest simpler models within one another. Multilevel models allow you to specify how parts of your model may themselves be outcomes from another process. For example, your income may depend on your age, your job, and your seniority. Your seniority also depends on your age. A multilevel model will recognize this, and can simultaneously estimate the relationship between age and seniority, and use that to “correctly” assign how much information about your earnings comes from age directly, versus what information about age “leaks in” through seniority. | Regression trees and forests [ISL Ch. 8] Regression trees are “rule-based” predictors of an outcome. That is, they learn the associations between an outcome and the input data with a “decision tree” (see, for instance, one predicting survival on the Titanic: https://en.wikipedia.org/wiki/Decision_tree_learning#/media/File:Decision_Tree.jpg). With enough variables, we can build a forest of decision trees that can be useful in predicting new data, and also which gives us an indication of which variables are “useful” in predicting an outcome. These see wide use across social and environmental sciences. | None of the above If you rank this option, tell me what you want in the free response below. Otherwise, leave this last! |
2 | 2021-11-02 13:26:01 | 2021-11-02 13:27:12 | anonymous | Unsupervised Learning by Data Reduction: reducing redundant variables [ISL Ch. 12.2] Oftentimes, data that we analyze is redundant, meaning that two (or more) variables often measure basically the same thing. Data Reduction is a practice involving a variety of approaches (such as principal components analysis) that can “compress” many different variables into a few “composite” variables to simplify analysis. This is very often done when building indexes or other hybrid measures of an underlying phenomena that is not directly measured. Classic examples of this are in the SoVI social vulnerability index used in assessing hazard risk or the Index of Multiple Deprivation in the UK. As with all unsupervised learning techniques, there is no “right answer,” only various ways to assess quality. | Unsupervised Learning with Clustering: building “types” of data [ISL Ch. 12.4] Clustering involves building up “types” of observations as a kind of statistical shorthand to describe your data. This is often used in demographic research, ecology, and is one of the widest used methods for data exploration. Clustering methods generally are useful when you want to learn about structure in your data, and understand what observations are similar to one another. As an “unsupervised technique,” there generally is no “right answer,” but there are metrics that tell you which answers are better than others. | Splines and friends [ISL Ch. 7] Splines (and other “Generalized Additive Models”) are an extension of typical regression models that allow for nonlinearity. They’re what you see by default any time you use “geom_smooth()” in ggplot2 (which is a kernel regression called LOWESS). They’re most useful when trends are nonlinear. They see loads of applications in biogeography, epidemiology, public health, and conservation. | Regression trees and forests [ISL Ch. 8] Regression trees are “rule-based” predictors of an outcome. That is, they learn the associations between an outcome and the input data with a “decision tree” (see, for instance, one predicting survival on the Titanic: https://en.wikipedia.org/wiki/Decision_tree_learning#/media/File:Decision_Tree.jpg). With enough variables, we can build a forest of decision trees that can be useful in predicting new data, and also which gives us an indication of which variables are “useful” in predicting an outcome. These see wide use across social and environmental sciences. | Multilevel modelling: learning from context [SR 12] Multilevel models are a technique from statistics that sees increasing adoption across social and environmental science. The principle behind multilevel models is similar to that in other data science approaches: nest simpler models within one another. Multilevel models allow you to specify how parts of your model may themselves be outcomes from another process. For example, your income may depend on your age, your job, and your seniority. Your seniority also depends on your age. A multilevel model will recognize this, and can simultaneously estimate the relationship between age and seniority, and use that to “correctly” assign how much information about your earnings comes from age directly, versus what information about age “leaks in” through seniority. | None of the above If you rank this option, tell me what you want in the free response below. Otherwise, leave this last! |
Now you can see that the ranked responses are spread across the columns, and the cell values contain the topic being ranked. This means that the “topic” variable has no column of its own! We need to pivot the data longer in order to capture this in a tidy format. I suggest a dataset where the “topic” and the “rank” are separate columns:
responses_long <- responses_ranked %>%
pivot_longer('1':'6', names_to='rank', values_to='topic')
responses_long %>% head(3) %>% kable()
id | start | finish | email | rank | topic |
---|---|---|---|---|---|
1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 1 | Splines and friends [ISL Ch. 7] Splines (and other “Generalized Additive Models”) are an extension of typical regression models that allow for nonlinearity. They’re what you see by default any time you use “geom_smooth()” in ggplot2 (which is a kernel regression called LOWESS). They’re most useful when trends are nonlinear. They see loads of applications in biogeography, epidemiology, public health, and conservation. |
1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 2 | Unsupervised Learning by Data Reduction: reducing redundant variables [ISL Ch. 12.2] Oftentimes, data that we analyze is redundant, meaning that two (or more) variables often measure basically the same thing. Data Reduction is a practice involving a variety of approaches (such as principal components analysis) that can “compress” many different variables into a few “composite” variables to simplify analysis. This is very often done when building indexes or other hybrid measures of an underlying phenomena that is not directly measured. Classic examples of this are in the SoVI social vulnerability index used in assessing hazard risk or the Index of Multiple Deprivation in the UK. As with all unsupervised learning techniques, there is no “right answer,” only various ways to assess quality. |
1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 3 | Unsupervised Learning with Clustering: building “types” of data [ISL Ch. 12.4] Clustering involves building up “types” of observations as a kind of statistical shorthand to describe your data. This is often used in demographic research, ecology, and is one of the widest used methods for data exploration. Clustering methods generally are useful when you want to learn about structure in your data, and understand what observations are similar to one another. As an “unsupervised technique,” there generally is no “right answer,” but there are metrics that tell you which answers are better than others. |
Finally, I want to cut off all the extra text that just served to explain each topic to you. To do this, I’ll use separate() again to drop everything in the values after the first square bracket, which I used to indicate the readings for the topics:
responses_final <- responses_long %>%
separate(topic, into=c('topic', NA), sep='\\[')
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 17 rows [6, 12,
## 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96, 102].
responses_final %>% head(6) %>% kable()
id | start | finish | email | rank | topic |
---|---|---|---|---|---|
1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 1 | Splines and friends |
1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 2 | Unsupervised Learning by Data Reduction: reducing redundant variables |
1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 3 | Unsupervised Learning with Clustering: building “types” of data |
1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 4 | Multilevel modelling: learning from context |
1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 5 | Regression trees and forests |
1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 6 | None of the above If you rank this option, tell me what you want in the free response below. Otherwise, leave this last! |
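Before moving on, a quick structural check (my own addition, not in the original write-up): 17 voters each ranking 6 options should give exactly 102 rows in the tidy data, one per voter-topic pair:
nrow(responses_final)
## [1] 102
n_distinct(responses_final$id)
## [1] 17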
With this, we can do some interesting analytics. First, we can just make a simple crosstab of the responses:
responses_final %>%
xtabs(~ topic + rank, data=.) %>%
kable()
topic | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Multilevel modelling: learning from context | 6 | 3 | 1 | 3 | 4 | 0 |
None of the above If you rank this option, tell me what you want in the free response below. Otherwise, leave this last! | 0 | 0 | 0 | 0 | 0 | 17 |
Regression trees and forests | 6 | 4 | 4 | 2 | 1 | 0 |
Splines and friends | 2 | 4 | 3 | 4 | 4 | 0 |
Unsupervised Learning by Data Reduction: reducing redundant variables | 1 | 2 | 6 | 3 | 5 | 0 |
Unsupervised Learning with Clustering: building “types” of data | 2 | 4 | 3 | 5 | 3 | 0 |
From this, we can tell a few things: everyone ranked “None of the above” dead last, and Multilevel modelling and Regression trees are tied with six first-choice votes apiece. We can count those first choices directly:
responses_final %>%
group_by(id) %>% # for each person, make a sub-dataframe
  arrange(rank) %>% # sort each sub-dataframe by rank
summarize(choice = first(topic)) %>% # grab the first topic in each sub-df
group_by(choice) %>% # now, group by the first choices
summarize(n_first_choice = n()) %>% # and count the number of first choices
  arrange(n_first_choice) # sort ascending by the number of first choices
## # A tibble: 5 x 2
## choice n_first_choice
## <chr> <int>
## 1 "Unsupervised Learning by Data Reduction: reducing redundant v… 1
## 2 "Splines and friends " 2
## 3 "Unsupervised Learning with Clustering: building \"types\" of … 2
## 4 "Multilevel modelling: learning from context " 6
## 5 "Regression trees and forests " 6
To resolve this tie between multilevel models and regression trees, let’s use two methods: Instant Runoff Voting and the Borda count. Let’s hope they agree, since there’s no guarantee they will!
To run an IRV, we need to eliminate the topic with the fewest first-choice votes and move those voters’ ballots on to their next choices. The fewest first-choice votes went to the Unsupervised Learning by Data Reduction topic, so we remove that option and re-compute the top-ranked choice for each person:
responses_final %>%
# remove the Unsupervised Learning by data reduction topic
filter(!str_detect(topic, '^Unsupervised Learning by')) %>%
# same steps as before
group_by(id) %>%
arrange(rank) %>%
# now this may select topics of rank 1
# (for folks whose first choice is still in the running)
# or 2 (for the one person who wanted unsupervised learning most)
summarize(choice = first(topic)) %>%
group_by(choice) %>%
summarize(n_first_choice = n()) %>%
arrange(n_first_choice) %>%
kable()
choice | n_first_choice |
---|---|
Splines and friends | 2 |
Unsupervised Learning with Clustering: building “types” of data | 3 |
Multilevel modelling: learning from context | 6 |
Regression trees and forests | 6 |
We’re still tied at the top, so now we need to eliminate the topic with the fewest first choices again: this time, the Splines topic:
responses_final %>%
# remove the Unsupervised Learning by data reduction topic
filter(!str_detect(topic, '^Unsupervised Learning by')) %>%
# remove the splines topic
filter(!str_detect(topic, '^Splines')) %>%
# same steps as before
group_by(id) %>%
arrange(rank) %>%
# now this may select topics of rank 1
# (for folks whose first choice is still in the running)
# or 2 (for the one person who wanted unsupervised learning most)
# or possibly three (if the UL person also ranked splines 2nd)
summarize(choice = first(topic)) %>%
group_by(choice) %>%
summarize(n_first_choice = n()) %>%
arrange(n_first_choice) %>%
kable()
choice | n_first_choice |
---|---|
Unsupervised Learning with Clustering: building “types” of data | 4 |
Multilevel modelling: learning from context | 6 |
Regression trees and forests | 7 |
So, Regression trees and forests wins in an instant runoff 🥳!
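If you’d like to see those elimination rounds automated, here is a rough sketch (my own addition, not part of the tally above). It repeats the same tally-and-eliminate step on the tidy responses_final data, and it stops as soon as one topic leads outright, which is the same point where we stopped by hand, rather than continuing on to a strict majority:
run_irv <- function(ballots) {
  # every topic starts out still in the running
  remaining <- unique(ballots$topic)
  repeat {
    # tally each voter's highest-ranked topic among those still remaining
    tallies <- ballots %>%
      filter(topic %in% remaining) %>%
      group_by(id) %>%
      arrange(rank, .by_group = TRUE) %>%
      summarize(choice = first(topic), .groups = 'drop') %>%
      group_by(choice) %>%
      summarize(n_first_choice = n(), .groups = 'drop')
    # stop once a single topic leads outright
    if (sum(tallies$n_first_choice == max(tallies$n_first_choice)) == 1) {
      return(arrange(tallies, desc(n_first_choice)))
    }
    # otherwise, eliminate the topic with the fewest first-choice votes
    remaining <- setdiff(remaining, tallies$choice[which.min(tallies$n_first_choice)])
  }
}
run_irv(responses_final)
Running this walks through the same rounds we just did by hand and should put Regression trees and forests on top.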
If you don’t like this model of election, let’s look at the Borda count. Here, we give each option “points” based on the rank it receives: with six options, someone’s first choice gets 6 points, their second gets 5, and their last choice gets only 1 point. This is very easy to compute using a tidy recipe:
responses_final %>%
# compute the "score" for each choice:
mutate(score = 7-as.numeric(rank)) %>%
# group by the topic
group_by(topic) %>%
# get the topic's total score:
summarize(overall_score = sum(score)) %>%
# show me the winners!
arrange(desc(overall_score)) %>%
kable()
topic | overall_score |
---|---|
Regression trees and forests | 80 |
Multilevel modelling: learning from context | 72 |
Unsupervised Learning with Clustering: building “types” of data | 65 |
Splines and friends | 64 |
Unsupervised Learning by Data Reduction: reducing redundant variables | 59 |
None of the above If you rank this option, tell me what you want in the free response below. Otherwise, leave this last! | 17 |
So, Regression trees and forests wins here, too 🎉!
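One last arithmetic check (my addition): with 17 ballots and six options scored from 6 down to 1, the Borda points should total 17 * 21 = 357, which is exactly what the overall_score column above sums to:
17 * sum(6:1)
## [1] 357
sum(c(80, 72, 65, 64, 59, 17))
## [1] 357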