This assessment will test your knowledge & abilities to work with data in R. Complete what you can; incomplete answers will still be able to achieve partial credit, and non-working code or descriptions of what you think you should do will also receive partial credit. Also, if a question has a (c) next to it, it’s considered a “Challenge” question! So, you can still score OK if you leave it blank. Submit the HTML or PDF of the “knitted” document for maximum points.
Data for the assessment is available alongside this document on the course blackboard, or here.
The data sets shown in the tabs below are included with your midterm assessment. Each of the data sets may have one (or more) “tidy” issues, and thus may violate the three rules of tidy data we’ve discussed before in the course:
Using this information,1 describe the following data sets in terms of their “tidyness.” Are the data sets tidy? If so, why? If not, why not? Note that I’m asking for a short paragraph; no code is needed to answer these fully.
Sometimes, I also provide a rationale for why the data is displayed this way. This* does not mean *the data is tidy!
Note: click the tabs to cycle through each of the data sets.
This data set describes the finances of movies in terms of the movie budget (budget
), the domestic (US) gross box office revenue (domgross
) and the international box office revenue (intgross
) from 1970 to 2013.2
year | movie_name | finance | dollars |
---|---|---|---|
1970 | Beyond the Valley of the Dolls | budget | 1000000 |
1970 | Beyond the Valley of the Dolls | domgross | 9000000 |
1970 | Beyond the Valley of the Dolls | intgross | 9000000 |
1971 | Escape from the Planet of the Apes | budget | 2500000 |
1971 | Escape from the Planet of the Apes | domgross | 12300000 |
1971 | Escape from the Planet of the Apes | intgross | 12300000 |
1971 | Shaft | budget | 53012938 |
1971 | Shaft | domgross | 70327868 |
1971 | Shaft | intgross | 107190108 |
1971 | Straw Dogs | budget | 25000000 |
1971 | Straw Dogs | domgross | 10324441 |
1971 | Straw Dogs | intgross | 11253821 |
This dataset counts the numbers of bird species recorded in urban or rural parts of bioregions across Australia. The survey was conducted from 2014 to 2015. Shown below are the first 10 columns of six random rows from the dataframe, as there are many more bird species in Australia. This data is formatted this way in order to make the selection of specific species easy.
birds <- read_csv('./midterm-birds.csv')
birds %>% drop_na() %>% arrange(bioregions) %>% sample_n(6) %>% select(1:10) %>% kable()
survey_year | urban_rural | bioregions | Bassian Thrush | Chestnut-breasted Mannikin | Wild Duck | Willie Wagtail | Regent Bowerbird | Rufous Fantail | Spiny-cheeked Honeyeater |
---|---|---|---|---|---|---|---|---|---|
2014 | Urban | South Eastern Highlands | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
2015 | Rural | Brigalow Belt South | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
2015 | Rural | Flinders Lofty Block | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
2015 | Urban | Victorian Midlands | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2015 | Urban | South East Coastal Plain | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
2015 | Urban | Sydney Basin | 0 | 0 | 3 | 3 | 0 | 2 | 0 |
This dataset describes 32,833 songs on Spotify in terms of statistics that Spotify record about the song.3 Below, I only a show a few columns from the dataset, but a few columns are explained.4 This data is formatted directly from the Spotify API, and so is structured for ease of use in a database.
spotify <- read_csv('./midterm-songs.csv')
spotify %>% select(track_name, track_artist,
track_popularity,
danceability, loudness, duration_ms) %>%
sample_n(10, weight=track_popularity) %>%
kable()
track_name | track_artist | track_popularity | danceability | loudness | duration_ms |
---|---|---|---|---|---|
Stellar | Jamila Woods | 49 | 0.684 | -9.751 | 122346 |
Peeping Tom - Original Mix | Jamie Berry | 53 | 0.720 | -4.713 | 210000 |
Transmission - 2010 Remaster | Joy Division | 48 | 0.497 | -8.088 | 215148 |
Tungba | Ajebutter22 | 54 | 0.749 | -5.850 | 188421 |
Bounce Back | Big Sean | 76 | 0.780 | -5.628 | 222360 |
Holy Diver | Dio | 35 | 0.469 | -9.913 | 350013 |
Hit and Run | Midnght | 43 | 0.775 | -10.914 | 172203 |
You should be sad | Halsey | 86 | 0.591 | -6.350 | 205473 |
HUMBLE. | Kendrick Lamar | 84 | 0.908 | -6.638 | 177000 |
Stay High - Habits Remix | Tove Lo | 69 | 0.736 | -7.036 | 258387 |
These are the number of trains that have run between any two stations for services operated by the National Society of French Railways, the state-owned rail company in France, from 2015 to 2018. Below, I just show the first ten originating stations (where the trains leave from) and the first four destination stations (where the trains arrive). This kind of data is often called “spatial interaction data,” and is used to measure the “interaction” between different spatial units. It is often presented in this format for usability: readers can scan across a row to quickly compare the number of trains that destinations receive from a specific origin.
departure_station | POITIERS | QUIMPER | RENNES | ST PIERRE DES CORPS |
---|---|---|---|---|
PARIS MONTPARNASSE | 31632 | 16242 | 38850 | 24930 |
TOURS | 0 | 0 | 0 | 0 |
LYON PART DIEU | 0 | 0 | 5814 | 0 |
PARIS EST | 0 | 0 | 0 | 0 |
NANCY | 0 | 0 | 0 | 0 |
STRASBOURG | 0 | 0 | 0 | 0 |
NANTES | 0 | 0 | 0 | 0 |
DUNKERQUE | 0 | 0 | 0 | 0 |
MARSEILLE ST CHARLES | 0 | 0 | 0 | 0 |
BORDEAUX ST JEAN | 0 | 0 | 0 | 0 |
The following table records the percentage energy supplied by different forms of energy across countries in the EU, as well as the “EU-28” and “EA-19” groups of European member nations. This kind of wide-but-short display format is often useful to fit tables like this alongside text in a document.
energy_type | EU_28 | EA_19 | EE | CY | MT | PL | NL | EL | IE | IT | LV |
---|---|---|---|---|---|---|---|---|---|---|---|
Conventional thermal | 45.9 | 43.6 | 93.9 | 91.4 | 90.9 | 90.2 | 83.9 | 68.6 | 68.3 | 66.0 | 61.0 |
Nuclear | 25.5 | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.9 | 0.0 | 0.0 | 0.0 | 0.0 |
Hydro | 11.8 | 11.9 | 0.2 | 0.0 | 0.0 | 1.5 | 0.1 | 11.4 | 3.2 | 17.6 | 37.2 |
Wind | 12.2 | 12.2 | 6.0 | 4.6 | 0.0 | 8.1 | 10.9 | 12.4 | 28.6 | 6.2 | 1.8 |
Solar | 4.0 | 4.7 | 0.0 | 4.0 | 0.0 | 0.2 | 2.2 | 7.5 | 0.0 | 8.2 | 0.0 |
Geothermal & others | 0.4 | 0.6 | 0.0 | 0.0 | 9.1 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |
Total | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
This dataset describes MarioKart 64 “World Record” races. There are typically two “categories” in MarioKart racing: the “Single Lap” category—how fast a racer can complete a single lap—and “Three Lap,” which measures the time to complete a typical three-lap race. Along with the times (in seconds), the date for the most recent world record is recorded. This is often the format in which tables like this are viewed.
track | single_lap_record | single_lap_record_date | three_lap_record | three_lap_record_date |
---|---|---|---|---|
Banshee Boardwalk | 40.78 | 2020-06-11 | 124.09 | 2021-01-15 |
Bowser’s Castle | 43.15 | 2021-02-02 | 132.00 | 2021-02-02 |
Choco Mountain | 38.02 | 2020-07-17 | 115.93 | 2020-07-14 |
D.K.’s Jungle Parkway | 42.04 | 2020-10-26 | 131.62 | 2019-12-31 |
Frappe Snowland | 38.27 | 2019-12-25 | 119.95 | 2017-12-23 |
Kalimari Desert | 38.96 | 2020-11-07 | 123.94 | 2018-04-20 |
Koopa Troopa Beach | 30.78 | 2020-10-18 | 95.25 | 2020-07-13 |
Luigi Raceway | 37.58 | 2021-01-10 | 117.77 | 2020-11-06 |
Mario Raceway | 27.62 | 2021-01-26 | 87.51 | 2020-08-13 |
Moo Moo Farm | 27.80 | 2018-12-29 | 85.93 | 2020-02-18 |
Rainbow Road | 116.35 | 2020-09-04 | 351.87 | 2020-09-28 |
Royal Raceway | 55.50 | 2020-06-11 | 171.25 | 2020-09-07 |
Sherbet Land | 37.72 | 2021-02-19 | 115.15 | 2021-01-26 |
Toad’s Turnpike | 58.69 | 2020-09-20 | 177.80 | 2020-09-28 |
Wario Stadium | 85.82 | 2021-01-26 | 260.01 | 2019-10-11 |
Yoshi Valley | 31.25 | 2018-01-18 | 102.13 | 2021-01-26 |
The following are Paini et al.’s estimates of the potential cost (in millions of USD) of invasive species to the ecosystem of each country. This is a direct digitization of table S2 in the Supplemental Material, so the formatting of the table is decided by concerns of printing and typesetting.5
rank | country | damage | rank | country | damage | rank | country | damage |
---|---|---|---|---|---|---|---|---|
1 | China | $117,290 | 43 | Denmark | $1,417 | 85 | Burundi | $397.9 |
2 | USA | $70,381 | 44 | Nepal | $1,411 | 86 | Lithuania | $392.4 |
3 | Brazil | $33,760 | 45 | Sudan | $1,373 | 87 | Moldova | $387.5 |
4 | India | $33,065 | 46 | Portugal | $1,365 | 88 | Armenia | $336.0 |
5 | Japan | $23,490 | 47 | Belgium | $1,351 | 89 | Malaysia | $333.0 |
6 | Korea Republic of | $14,349 | 48 | Kazakhstan | $1,344 | 90 | Bosnia and Herzegovina | $327.4 |
7 | Turkey | $13,267 | 49 | Czech Republic | $1,336 | 91 | Kyrgyzstan | $302.0 |
8 | Argentina | $13,204 | 50 | Austria | $1,304 | 92 | Georgia (Republic) | $301.5 |
9 | France | $12,532 | 51 | Iraq | $1,234 | 93 | Tajikistan | $297.1 |
10 | Mexico | $11,277 | 52 | Kenya | $1,230 | 94 | Ireland | $277.6 |
11 | Iran | $11,276 | 53 | Mozambique | $1,218 | 95 | Lebanon | $276.8 |
12 | Nigeria | $10,251 | 54 | Cambodia | $1,121 | 96 | Nicaragua | $264.1 |
13 | Indonesia | $9,550 | 55 | Ghana | $1,114 | 97 | Rwanda | $255.1 |
14 | Thailand | $8,066 | 56 | Bulgaria | $1,112 | 98 | Mauritius | $227.6 |
15 | Australia | $7,815 | 57 | Madagascar | $1,074 | 99 | Macedonia | $218.4 |
16 | Vietnam | $7,490 | 58 | Malawi | $1,071 | 100 | Congo (Republic of) | $212.8 |
17 | Ukraine | $6,953 | 59 | Paraguay | $1,012 | 101 | Slovenia | $202.0 |
18 | Egypt | $6,737 | 60 | Guinea | $977.5 | 102 | Niger | $197.3 |
19 | Canada | $6,694 | 61 | Tunisia | $949.2 | 103 | Latvia | $187.3 |
20 | Pakistan | $6,630 | 62 | Ecuador | $934.7 | 104 | Panama | $161.2 |
21 | Germany | $6,481 | 63 | Switzerland | $924.1 | 105 | Togo | $153.2 |
22 | Bangladesh | $5,623 | 64 | Dominican Republic | $873.0 | 106 | Jordan | $116.4 |
23 | Spain | $5,576 | 65 | Jamaica | $871.7 | 107 | Guinea-Bissau | $114.3 |
24 | Russian Federation | $5,084 | 66 | Sri Lanka | $829.1 | 108 | Cyprus | $108.5 |
25 | Philippines | $4,839 | 67 | Yemen | $806.0 | 109 | Estonia | $102.1 |
26 | Greece | $4,342 | 68 Saudi Arabia | NA | 110 | Fiji | NA | NA |
27 | United Kingdom | $4,005 | 69 | Honduras | $794.3 | 111 | Mongolia | $64.7 |
28 | South Africa | $3,922 | 70 | Croatia | $755.6 | 112 | Luxembourg | $64.7 |
29 | Romania | $3,524 | 71 | Azerbaijan | $730.7 | 113 | Belize | $42.2 |
30 | Algeria | $2,862 | 72 | New Zealand | $639.7 | 114 | Cape Verde | $40.8 |
31 | Morocco | $2,531 | 73 | Albania | $637.2 | 115 | Gambia | $37.6 |
32 | Colombia | $2,476 | 74 | Finland | $600.7 | 116 | Suriname | $36.5 |
33 | Poland | $2,449 | 75 | Slovakia | $573.4 | 117 | Trinidad and Tobago | $28.8 |
34 | Ethiopia | $2,312 | 76 | Burkina Faso | $557.7 | 118 | Vanuatu | $23.5 |
35 | Venezuela | $2,167 | 77 | Costa Rica | $556.8 | 119 | Barbados | $20.3 |
36 | Chile | $2,095 | 78 | Sweden | $546.9 | 120 Equatorial Guinea | NA | NA |
37 | Netherlands | $1,981 | 79 | Israel | $518.4 | 121 | Malta | $14.4 |
38 | Hungary | $1,979 | 80 | Uruguay | $509.1 | 122 | Qatar | $5.0 |
39 | Belarus | $1,777 | 81 | Laos | $508.2 | 123 | Iceland | $4.8 |
40 | Peru | $1,580 | 82 | Mali | $504.7 | 124 | Singapore | $0.7 |
41 | Cameroon | $1,574 | 83 | El Salvador | $475.1 | NA | NA | NA |
42 | Italy | $1,447 | 84 | Norway | $419.4 | NA | NA | NA |
For each of the data sets in Section 1, can you create a tidy version of the data set?
In this section, I’ll ask some specific questions about two of the data sets: Movies & Songs.
Which ten movies lost the most money domestically? Are these the same movies that lost the most money overall?
What is the average budget for a movie in each year?
Which movie had the largest gap between domestic and overseas box office performance?
Make a visualization that shows how the budget for movies has increased over time. Discuss how you designed the plot in order to emphasize this message.
Make a visualization that shows how the typical profit movies make has generally not changed over time, but that a few outliers do make increasingly more money. Discuss how you designed the plot in order to emphasize this message.
You’re a data scientist working for a movie studio. Your executive is considering whether to take a risk on making a “new” movie, or whether it’d be a safer bet to make a sequel to an existing movie. So, she asks:
Do sequels make more profit per dollar spent than non-sequels?
Can you answer her question?6
What’s your best guess about the length of a track with higher than 75% popularity? How about your best guess for the popularity of a track between 2 and 3 minutes long? Which “piece” of information (popularity > 75% or duration between 2 & 3 minutes) gives you more useful information, and why do you think that?7
What is the typical “energy” of each of the playlist genres? How about the typical “valence,” meaning “happiness,” of the genres?
Make four plots8 to visualize the relationship between danceability and a few variables:
tempo
energy
valence
playlist_genre
Make sure to take into account whether the relationship is linear and address overplotting if necessary. Given these plots, what kinds of songs tend to be danceable?
Let’s assume that the difference between a band’s median track popularity and its maximum track popularity represents that band’s one-hit-wonderness. If this is the case, what are the Top 10 “one-hit-wonder” bands in the dataset? Given the results, does this comport with your understanding of what a “one hit wonder” is?
You may also find the original Tidy Data paper useful in describing the different commonly-encountered issues in data formatting.↩︎
Remember: I’m just using the knitr::kable()
function to print the table all pretty-like in the RMarkdown.↩︎
This is scraped using the spotifyr
package.↩︎
danceability
measures how suitable a track is for dancing, varies from 0 (not danceable) to 1 (danceable). energy
is a measure of the “intensity” of the song, varies from 0 (very relaxing, Shoegaze) to 1 (very energetic, Death Metal). loudness
is the average decibel value of the track, varies from -60 to zero. speechiness
gives the level of “talkiness” in a track, varies from 0 (no spoken words) to 1 (all spoken words), but tracks over .66 are probably nearly all spoken word, and tracks below .33 are probably songs. acousticness
is the same as speechiness
, but measuring whether instruments are not amplified; liveness
but for whether the track has a live audience. valence
tells you whether a track is “happy,” with higher scores indicate happier music. Finally, tempo
records the average beats per minute for the track and duration_ms
provides the duration of the song in milliseconds. ↩︎
Note that I’m using the name_repair
option of read_csv()
in order to get exactly the column names that are in midterm-risk.csv
. Without this option, readr::read_csv
will append extra values to the column names in order to ensure they are each unique. Try this by removing the name_repair
argument, or setting it equal to "unique"
instead of "minimal"
.↩︎
Note that you can use the str_detect()
function in the tidyverse’s stringr
package to give TRUE
when a movie name contains 2
or a II
, and FALSE
otherwise. Also, it’s ok if you accidentally pick up movies that have III
or IIII
or H20
in their name using this strategy; we’re just making an educated guess.↩︎
You may find it helpful to make a plot!↩︎
It’s OK if they’re totally separate plots! That is, I don’t expect you to use facet_grid()
↩︎