This assessment will test your knowledge of and ability to work with data in R. Complete what you can; incomplete answers can still earn partial credit, and non-working code or descriptions of what you think you should do will also receive partial credit. If a question has a (c) next to it, it's a "Challenge" question, so you can still score well if you leave it blank. Submit the HTML or PDF of the "knitted" document for maximum points.
Data for the assessment is available alongside this document on the course blackboard, or here.
The data sets shown in the tabs below are included with your midterm assessment. Each of the data sets may have one (or more) "tidy" issues, and thus may violate the three rules of tidy data we've discussed before in the course:
1. Each variable gets its own column.
2. Each observation gets its own row.
3. Each value gets its own cell.
Using this information,1 describe the following data sets in terms of their "tidiness." Are the data sets tidy? If so, why? If not, why not? Note that I'm asking for a short paragraph; no code is needed to answer these fully.
Sometimes, I also provide a rationale for why the data is displayed this way. This *does not mean* the data is tidy!
Note: click the tabs to cycle through each of the data sets.
This data set describes the finances of movies in terms of the movie budget (budget), the domestic (US) gross box office revenue (domgross), and the international box office revenue (intgross) from 1970 to 2013.2
movies <- read_csv('./midterm-movies.csv')
movies %>% head(12) %>% kable()
year | movie_name | finance | dollars |
---|---|---|---|
1970 | Beyond the Valley of the Dolls | budget | 1000000 |
1970 | Beyond the Valley of the Dolls | domgross | 9000000 |
1970 | Beyond the Valley of the Dolls | intgross | 9000000 |
1971 | Escape from the Planet of the Apes | budget | 2500000 |
1971 | Escape from the Planet of the Apes | domgross | 12300000 |
1971 | Escape from the Planet of the Apes | intgross | 12300000 |
1971 | Shaft | budget | 53012938 |
1971 | Shaft | domgross | 70327868 |
1971 | Shaft | intgross | 107190108 |
1971 | Straw Dogs | budget | 25000000 |
1971 | Straw Dogs | domgross | 10324441 |
1971 | Straw Dogs | intgross | 11253821 |
This dataset is not tidy. The "finance" column contains the names of different variables, which need to be turned into columns of their own. The "dollars" column does contain only dollar amounts, but those amounts measure different variables on the same sample: the movie. Your target should be a dataframe where each row is a movie, with columns expressing the year, the budget, the domestic gross, and the international gross.
This dataset counts the number of birds of each species recorded in urban and rural parts of bioregions across Australia. The survey was conducted from 2014 to 2015. Shown below are the first 10 columns of six random rows from the dataframe, as there are many more bird species in Australia. The data is formatted this way in order to make the selection of specific species easy.
birds <- read_csv('./midterm-birds.csv')
birds %>% drop_na() %>% arrange(bioregions) %>% sample_n(6) %>% select(1:10) %>% kable()
survey_year | urban_rural | bioregions | Bassian Thrush | Chestnut-breasted Mannikin | Wild Duck | Willie Wagtail | Regent Bowerbird | Rufous Fantail | Spiny-cheeked Honeyeater |
---|---|---|---|---|---|---|---|---|---|
2015 | Rural | Victorian Midlands | 0 | 0 | 0 | 6 | 0 | 1 | 0 |
2015 | Rural | Flinders Lofty Block | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
2015 | Rural | South Eastern Queensland | 0 | 1 | 0 | 8 | 1 | 2 | 0 |
2015 | Rural | South East Coastal Plain | 1 | 0 | 0 | 2 | 0 | 0 | 0 |
2015 | Urban | Brigalow Belt South | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2014 | Urban | South Eastern Highlands | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
This dataset is not tidy. The "species" variable is implicit, spread across the columns. To tidy it, your final dataframe should have columns for survey year, urban/rural classification, bioregion, species, and count.
This dataset describes 32,833 songs on Spotify in terms of statistics that Spotify records about each song.3 Below, I show only a few columns from the dataset, but a few of the columns are explained in the footnote.4 This data is formatted directly from the Spotify API, and so is structured for ease of use in a database.
spotify <- read_csv('./midterm-songs.csv')
spotify %>% select(track_name, track_artist,
                   track_popularity,
                   danceability, loudness, duration_ms) %>%
  sample_n(10, weight=track_popularity) %>%
  kable()
track_name | track_artist | track_popularity | danceability | loudness | duration_ms |
---|---|---|---|---|---|
My Best Life (feat. Mike Waters) - Club Mix | KSHMR | 66 | 0.603 | -3.693 | 198533 |
We Should Tell the Truth | metr | 44 | 0.808 | -12.306 | 267292 |
Nothing in Return | Monsune | 52 | 0.672 | -5.747 | 232200 |
Are You Gonna Be My Girl - UK Acoustic Version | Jet | 52 | 0.672 | -5.287 | 237173 |
What About Your Friends (Album Radio Edit) [W/Rap] | TLC | 27 | 0.752 | -4.454 | 244560 |
More Than a Feeling | Boston | 78 | 0.377 | -8.039 | 285133 |
Takeaway | The Chainsmokers | 85 | 0.528 | -8.144 | 209880 |
Stay The Night - Featuring Hayley Williams Of Paramore | Zedd | 61 | 0.596 | -3.109 | 217347 |
Killing My Love - Radio Edit | THIS IS ELLE | 18 | 0.206 | -5.168 | 194047 |
That’s Tuff (feat. Quavo) | Rich The Kid | 75 | 0.903 | -4.313 | 154000 |
This dataset is tidy! Each row denotes a song on Spotify. The only way you might argue it's not tidy is if the track name contains information on the remix and collaborator(s) on the song. This is true, but the distinction between where one variable stops and another starts is not always so clear. If the track title includes "(feat. Jay-Z)", then you can use feature engineering to build a new variable that tries to extract this information. However, the cells were not intentionally coded to pack "feat." or "Remix" information into the title, so it's less reasonable to suggest that all of track_name contains two (or more) variables.
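If you did want to attempt that kind of feature engineering, a minimal sketch might look like the following (has_feature and is_remix are illustrative names I've chosen, not columns in the data):

spotify %>%
  mutate(has_feature = str_detect(track_name, "feat\\."),
         is_remix = str_detect(track_name, "[Rr]emix")) %>%
  select(track_name, has_feature, is_remix) %>%
  head()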
These are the number of trains that have run between any two stations for services operated by the National Society of French Railways, the state-owned rail company in France, from 2015 to 2018. Below, I just show the first ten originating stations (where the trains leave from) and the first four destination stations (where the trains arrive). This kind of data is often called “spatial interaction data,” and is used to measure the “interaction” between different spatial units. It is often presented in this format for usability: readers can scan across a row to quickly compare the number of trains that destinations receive from a specific origin.
trains <- read_csv('./midterm-trains.csv')
trains %>%
  select(1:5) %>%
  slice(1:10) %>%
  kable()
departure_station | POITIERS | QUIMPER | RENNES | ST PIERRE DES CORPS |
---|---|---|---|---|
PARIS MONTPARNASSE | 31632 | 16242 | 38850 | 24930 |
TOURS | 0 | 0 | 0 | 0 |
LYON PART DIEU | 0 | 0 | 5814 | 0 |
PARIS EST | 0 | 0 | 0 | 0 |
NANCY | 0 | 0 | 0 | 0 |
STRASBOURG | 0 | 0 | 0 | 0 |
NANTES | 0 | 0 | 0 | 0 |
DUNKERQUE | 0 | 0 | 0 | 0 |
MARSEILLE ST CHARLES | 0 | 0 | 0 | 0 |
BORDEAUX ST JEAN | 0 | 0 | 0 | 0 |
This dataset is not tidy! The "destination" variable is spread implicitly over the columns. That is, the data is wider than it should be to be tidy. The tidy dataset would have source, destination, and count columns.
The following table records the percentage energy supplied by different forms of energy across countries in the EU, as well as the “EU-28” and “EA-19” groups of European member nations. This kind of wide-but-short display format is often useful to fit tables like this alongside text in a document.
<- read_csv('./midterm-energy.csv')
energy %>% select(1:12) %>% kable() energy
energy_type | EU_28 | EA_19 | EE | CY | MT | PL | NL | EL | IE | IT | LV |
---|---|---|---|---|---|---|---|---|---|---|---|
Conventional thermal | 45.9 | 43.6 | 93.9 | 91.4 | 90.9 | 90.2 | 83.9 | 68.6 | 68.3 | 66.0 | 61.0 |
Nuclear | 25.5 | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.9 | 0.0 | 0.0 | 0.0 | 0.0 |
Hydro | 11.8 | 11.9 | 0.2 | 0.0 | 0.0 | 1.5 | 0.1 | 11.4 | 3.2 | 17.6 | 37.2 |
Wind | 12.2 | 12.2 | 6.0 | 4.6 | 0.0 | 8.1 | 10.9 | 12.4 | 28.6 | 6.2 | 1.8 |
Solar | 4.0 | 4.7 | 0.0 | 4.0 | 0.0 | 0.2 | 2.2 | 7.5 | 0.0 | 8.2 | 0.0 |
Geothermal & others | 0.4 | 0.6 | 0.0 | 0.0 | 9.1 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |
Total | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
This dataset is not tidy! Its transpose, however, is! One "measurement" or "sample" should correspond to a row. So, generally speaking, we'd want one set of observations to hang together logically, recording all the information we have about a unit of analysis. The clearest "tidy" version of this data is therefore long, with one column for "energy type," one column for "country," and one column for the percentage supplied. The EU/EA groupings of countries can be handled by redefining what the "country" variable means (it could be "energy_market" or "supply_zone"), by separating those values into their own table, or by using them to indicate the aggregation into which each country falls. The "Total" energy type, though, needs to be removed from this representation. I wouldn't expect the last option, but either of the first two is reasonable.
One common interpretation of the tidy version of this data is just the transpose of the current dataframe. I can understand this, but it’s not as clear; the “energy type” variable is split across columns in this case.
This dataset describes MarioKart 64 “World Record” races. There are typically two “categories” in MarioKart racing: the “Single Lap” category—how fast a racer can complete a single lap—and “Three Lap,” which measures the time to complete a typical three-lap race. Along with the times (in seconds), the date for the most recent world record is recorded. This is often the format in which tables like this are viewed.
<- read_csv('./midterm-races.csv')
mk_records %>% kable() mk_records
track | single_lap_record | single_lap_record_date | three_lap_record | three_lap_record_date |
---|---|---|---|---|
Banshee Boardwalk | 40.78 | 2020-06-11 | 124.09 | 2021-01-15 |
Bowser’s Castle | 43.15 | 2021-02-02 | 132.00 | 2021-02-02 |
Choco Mountain | 38.02 | 2020-07-17 | 115.93 | 2020-07-14 |
D.K.’s Jungle Parkway | 42.04 | 2020-10-26 | 131.62 | 2019-12-31 |
Frappe Snowland | 38.27 | 2019-12-25 | 119.95 | 2017-12-23 |
Kalimari Desert | 38.96 | 2020-11-07 | 123.94 | 2018-04-20 |
Koopa Troopa Beach | 30.78 | 2020-10-18 | 95.25 | 2020-07-13 |
Luigi Raceway | 37.58 | 2021-01-10 | 117.77 | 2020-11-06 |
Mario Raceway | 27.62 | 2021-01-26 | 87.51 | 2020-08-13 |
Moo Moo Farm | 27.80 | 2018-12-29 | 85.93 | 2020-02-18 |
Rainbow Road | 116.35 | 2020-09-04 | 351.87 | 2020-09-28 |
Royal Raceway | 55.50 | 2020-06-11 | 171.25 | 2020-09-07 |
Sherbet Land | 37.72 | 2021-02-19 | 115.15 | 2021-01-26 |
Toad’s Turnpike | 58.69 | 2020-09-20 | 177.80 | 2020-09-28 |
Wario Stadium | 85.82 | 2021-01-26 | 260.01 | 2019-10-11 |
Yoshi Valley | 31.25 | 2018-01-18 | 102.13 | 2021-01-26 |
This data is not tidy! The variable "race type" is spread over the columns, and is mixed into each of the relevant features. This will be somewhat more challenging to tidy! The target dataframe should have one world record for each row, with track, event, time, and date as columns.
The following are Paini et al.’s estimates of the potential cost (in millions of USD) of invasive species to the ecosystem of each country. This is a direct digitization of table S2 in the Supplemental Material, so the formatting of the table is decided by concerns of printing and typesetting.5
risk <- read_csv('./midterm-risk.csv', name_repair='minimal')
risk %>% kable()
rank | country | damage | rank | country | damage | rank | country | damage |
---|---|---|---|---|---|---|---|---|
1 | China | $117,290 | 43 | Denmark | $1,417 | 85 | Burundi | $397.9 |
2 | USA | $70,381 | 44 | Nepal | $1,411 | 86 | Lithuania | $392.4 |
3 | Brazil | $33,760 | 45 | Sudan | $1,373 | 87 | Moldova | $387.5 |
4 | India | $33,065 | 46 | Portugal | $1,365 | 88 | Armenia | $336.0 |
5 | Japan | $23,490 | 47 | Belgium | $1,351 | 89 | Malaysia | $333.0 |
6 | Korea Republic of | $14,349 | 48 | Kazakhstan | $1,344 | 90 | Bosnia and Herzegovina | $327.4 |
7 | Turkey | $13,267 | 49 | Czech Republic | $1,336 | 91 | Kyrgyzstan | $302.0 |
8 | Argentina | $13,204 | 50 | Austria | $1,304 | 92 | Georgia (Republic) | $301.5 |
9 | France | $12,532 | 51 | Iraq | $1,234 | 93 | Tajikistan | $297.1 |
10 | Mexico | $11,277 | 52 | Kenya | $1,230 | 94 | Ireland | $277.6 |
11 | Iran | $11,276 | 53 | Mozambique | $1,218 | 95 | Lebanon | $276.8 |
12 | Nigeria | $10,251 | 54 | Cambodia | $1,121 | 96 | Nicaragua | $264.1 |
13 | Indonesia | $9,550 | 55 | Ghana | $1,114 | 97 | Rwanda | $255.1 |
14 | Thailand | $8,066 | 56 | Bulgaria | $1,112 | 98 | Mauritius | $227.6 |
15 | Australia | $7,815 | 57 | Madagascar | $1,074 | 99 | Macedonia | $218.4 |
16 | Vietnam | $7,490 | 58 | Malawi | $1,071 | 100 | Congo (Republic of) | $212.8 |
17 | Ukraine | $6,953 | 59 | Paraguay | $1,012 | 101 | Slovenia | $202.0 |
18 | Egypt | $6,737 | 60 | Guinea | $977.5 | 102 | Niger | $197.3 |
19 | Canada | $6,694 | 61 | Tunisia | $949.2 | 103 | Latvia | $187.3 |
20 | Pakistan | $6,630 | 62 | Ecuador | $934.7 | 104 | Panama | $161.2 |
21 | Germany | $6,481 | 63 | Switzerland | $924.1 | 105 | Togo | $153.2 |
22 | Bangladesh | $5,623 | 64 | Dominican Republic | $873.0 | 106 | Jordan | $116.4 |
23 | Spain | $5,576 | 65 | Jamaica | $871.7 | 107 | Guinea-Bissau | $114.3 |
24 | Russian Federation | $5,084 | 66 | Sri Lanka | $829.1 | 108 | Cyprus | $108.5 |
25 | Philippines | $4,839 | 67 | Yemen | $806.0 | 109 | Estonia | $102.1 |
26 | Greece | $4,342 | 68 | Saudi Arabia | NA | 110 | Fiji | NA,NA |
27 | United Kingdom | $4,005 | 69 | Honduras | $794.3 | 111 | Mongolia | $64.7 |
28 | South Africa | $3,922 | 70 | Croatia | $755.6 | 112 | Luxembourg | $64.7 |
29 | Romania | $3,524 | 71 | Azerbaijan | $730.7 | 113 | Belize | $42.2 |
30 | Algeria | $2,862 | 72 | New Zealand | $639.7 | 114 | Cape Verde | $40.8 |
31 | Morocco | $2,531 | 73 | Albania | $637.2 | 115 | Gambia | $37.6 |
32 | Colombia | $2,476 | 74 | Finland | $600.7 | 116 | Suriname | $36.5 |
33 | Poland | $2,449 | 75 | Slovakia | $573.4 | 117 | Trinidad and Tobago | $28.8 |
34 | Ethiopia | $2,312 | 76 | Burkina Faso | $557.7 | 118 | Vanuatu | $23.5 |
35 | Venezuela | $2,167 | 77 | Costa Rica | $556.8 | 119 | Barbados | $20.3 |
36 | Chile | $2,095 | 78 | Sweden | $546.9 | 120 | Equatorial Guinea | NA,NA |
37 | Netherlands | $1,981 | 79 | Israel | $518.4 | 121 | Malta | $14.4 |
38 | Hungary | $1,979 | 80 | Uruguay | $509.1 | 122 | Qatar | $5.0 |
39 | Belarus | $1,777 | 81 | Laos | $508.2 | 123 | Iceland | $4.8 |
40 | Peru | $1,580 | 82 | Mali | $504.7 | 124 | Singapore | $0.7 |
41 | Cameroon | $1,574 | 83 | El Salvador | $475.1 | NA | NA | NA |
42 | Italy | $1,447 | 84 | Norway | $419.4 | NA | NA | NA |
This data is not tidy! It should only contain three columns in the "cleaned" version. The most reasonable way to clean this data may also not involve pivots…
For each of the data sets in Section 1, can you create a tidy version of the data set?
hint: this may need to take rows and turn them into columns!
This one is a very straightforward pivot_wider():
movies_tidy <- movies %>%
  pivot_wider(id_cols=c(year, movie_name),
              names_from=finance,
              values_from=dollars)
movies_tidy %>% head(6) %>% kable()
year | movie_name | budget | domgross | intgross |
---|---|---|---|---|
1970 | Beyond the Valley of the Dolls | 1000000 | 9000000 | 9000000 |
1971 | Escape from the Planet of the Apes | 2500000 | 12300000 | 12300000 |
1971 | Shaft | 53012938 | 70327868 | 107190108 |
1971 | Straw Dogs | 25000000 | 10324441 | 11253821 |
1971 | The French Connection | 2200000 | 41158757 | 41158757 |
1971 | Willy Wonka & the Chocolate Factory | 3000000 | 4000000 | 4000000 |
hint: this may need to take columns and turn them into rows!
This one is a very straightforward pivot_longer(), where the bird columns are converted into rows:
birds_tidy <- birds %>% pivot_longer(`Bassian Thrush`:`Grey Fantail`)
birds_tidy %>% head(6) %>% kable()
survey_year | urban_rural | bioregions | name | value |
---|---|---|---|---|
2014 | Urban | South Eastern Queensland | Bassian Thrush | 0 |
2014 | Urban | South Eastern Queensland | Chestnut-breasted Mannikin | 2 |
2014 | Urban | South Eastern Queensland | Wild Duck | 1 |
2014 | Urban | South Eastern Queensland | Willie Wagtail | 12 |
2014 | Urban | South Eastern Queensland | Regent Bowerbird | 0 |
2014 | Urban | South Eastern Queensland | Rufous Fantail | 3 |
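One optional polish: the default name/value columns can be given the species/count names described in Section 1, either with the names_to=/values_to= arguments of pivot_longer() or with a rename() afterwards:

birds_tidy %>%
  rename(species = name, count = value) %>%
  head(6) %>% kable()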
hint: pay attention to the three rules; sometimes, you get lucky!
This one is already tidy:
spotify <- spotify # no changes needed; the data is already tidy
spotify %>% head(6) %>% kable()
track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |
hint: think very carefully about what the variables are here!
This one is again a pivot_longer(), but you need to make sure that only the departure station gets duplicated down the rows!
trains_tidy <- trains %>% pivot_longer(-departure_station,
                                       names_to='destination_station',
                                       values_to='n_trains')
trains_tidy %>% head(6) %>% kable()
departure_station | destination_station | n_trains |
---|---|---|
PARIS MONTPARNASSE | POITIERS | 31632 |
PARIS MONTPARNASSE | QUIMPER | 16242 |
PARIS MONTPARNASSE | RENNES | 38850 |
PARIS MONTPARNASSE | ST PIERRE DES CORPS | 24930 |
PARIS MONTPARNASSE | PARIS MONTPARNASSE | 0 |
PARIS MONTPARNASSE | NANCY | 0 |
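As an aside: for many "spatial interaction" analyses, you may want to drop the station-to-itself pairs (and possibly the structural zeroes) before modelling; a hedged sketch:

trains_tidy %>%
  filter(departure_station != destination_station,
         n_trains > 0) %>%
  head(6) %>% kable()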
hint: sometimes, we must pivot one direction before we pivot another!
This one is a pivot_longer(), with a filter() to make the energy_type column a true variable. Alternatively, you could pivot_wider() after this, as discussed above, but that is not quite tidy: the energy_type variable would be spread over columns.
energy_tidy <- energy %>%
  pivot_longer(EU_28:GE, names_to="market", values_to="percent") %>%
  filter(energy_type != "Total")
energy_tidy %>% head(6) %>% kable()
energy_type | market | percent |
---|---|---|
Conventional thermal | EU_28 | 45.9 |
Conventional thermal | EA_19 | 43.6 |
Conventional thermal | EE | 93.9 |
Conventional thermal | CY | 91.4 |
Conventional thermal | MT | 90.9 |
Conventional thermal | PL | 90.2 |
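If you want to keep the EU_28/EA_19 rows without treating them as countries, one option sketched here (is_aggregate is a name I've chosen for illustration) is to flag them:

energy_tidy %>%
  mutate(is_aggregate = market %in% c("EU_28", "EA_19")) %>%
  head(6) %>% kable()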
hint: sometimes, you may need to pivot twice in the same direction!
This one can be done by splitting and re-combining the data. It uses two pivots: either pivot twice on the original data, or split the dataset first. This kind of "split and pivot" takes more steps, but each step is easier to understand. As a rule, split the data into parts that are themselves tidy, and then join them back together if that's useful.
# clean only times
times <- mk_records %>%
  select(track, ends_with("record")) %>%
  pivot_longer(ends_with("record"),
               names_to='race_type',
               values_to='duration') %>%
  # "single_lap_record" splits into three pieces; keep only the first, as "event"
  separate(race_type, c('event', NA, NA))

# clean only dates
dates <- mk_records %>%
  select(track, ends_with('date')) %>%
  pivot_longer(ends_with("date"),
               names_to='race_type',
               values_to='datetime') %>%
  # "single_lap_record_date" splits into four pieces
  separate(race_type, c('event', NA, NA, NA))

# merge everything back together; could use merge() or the *_join() functions from the reading
mk_tidy <- inner_join(times, dates, by=c("track", "event"))

mk_tidy %>% head(6) %>% kable()
hint: this data frame is uniquely messy! Try splitting it into parts that all look tidy, and then bringing them back together.
This one basically must be split and recombined. In theory, one could use the "melt and cast" strategy from the Tidy Data paper to first make the data as long as possible, then pivot it based on the columns you want. The pivot_longer() function is less powerful in this way, because it does not convert data types by default. So, the easiest direct approach is the split & recombine, as follows:
risk_tidy <- rbind(risk[1:3], # first set of columns
                   risk[4:6], # second set of columns
                   risk[7:9]) # third set of columns
risk_tidy %>% head(6) %>% kable()
For the melt & cast strategy from the very first reading, we leverage the fact that the “melt” function forces everything into the same type (if it can) so that it can all be stacked on top of one another. This usually means everything is converted into character data, and we have to manually convert it back at the end:
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
risk %>%
  melt(id.vars = 'country') %>%
  pivot_wider(id_cols=country,
              names_from=variable,
              values_from=value) %>%
  mutate(rank = as.numeric(rank)) %>%
  head(7) %>% kable()
country | rank | damage |
---|---|---|
China | 1 | $117,290 |
USA | 2 | $70,381 |
Brazil | 3 | $33,760 |
India | 4 | $33,065 |
Japan | 5 | $23,490 |
Korea Republic of | 6 | $14,349 |
Turkey | 7 | $13,267 |
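One further cleaning step you might consider (the question doesn't require it): the damage column is still text, like "$117,290". A hedged sketch using readr::parse_number() to strip the currency formatting; damage_musd is a name I've chosen for illustration, and the "NA,NA" entries will simply parse to NA (with a warning):

risk_tidy %>%
  mutate(rank = as.numeric(rank),
         # parse_number("$117,290") returns 117290
         damage_musd = parse_number(damage)) %>%
  drop_na(country) %>%
  head(6) %>% kable()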
In this section, I’ll ask some specific questions about two of the data sets: Movies & Songs.
Which ten movies lost the most money domestically? Are these the same movies that lost the most money overall?
There are a few ways you can do this. The very simple way is to compute the profit, sort on the profit, and then show the two separate dataframes:
movies_tidy %>%
  ### do the domestic profits
  # compute the domestic profit
  mutate(domestic_profit = domgross - budget) %>%
  # sort the data by domestic profit
  arrange(domestic_profit) %>%
  # grab the lowest 10 profits, which will be the
  # first ten rows:
  head(10) %>%
  kable()
year | movie_name | budget | domgross | intgross | domestic_profit |
---|---|---|---|---|---|
2012 | John Carter | 2.75e+08 | 73058679 | 282778100 | -201941321 |
2013 | 47 Ronin | 2.25e+08 | 38362475 | 145803842 | -186637525 |
2013 | The Lone Ranger | 2.75e+08 | 89289910 | 259989910 | -185710090 |
2012 | Battleship | 2.09e+08 | 65235400 | 304142720 | -143764600 |
2007 | The Golden Compass | 2.05e+08 | 70107728 | 372234864 | -134892272 |
2013 | Jack the Giant Slayer | 1.95e+08 | 65187603 | 197387603 | -129812397 |
2011 | Mars Needs Moms | 1.50e+08 | 21392758 | 39549758 | -128607242 |
2010 | Prince of Persia: The Sands of Time | 2.00e+08 | 90759676 | 335059676 | -109240324 |
2011 | Hugo | 1.80e+08 | 73864507 | 185033215 | -106135493 |
2005 | Stealth | 1.38e+08 | 32116746 | 76416746 | -105883254 |
You can do this again to get a second dataframe, this time ranking the movies on their overall profit:
movies_tidy %>%
  ### do the total profits
  # compute the total profit
  mutate(profit = intgross - budget) %>%
  # sort the data by total profit
  arrange(profit) %>%
  # grab the lowest 10 profits, which will be the
  # first ten rows:
  head(10) %>%
  kable()
year | movie_name | budget | domgross | intgross | profit |
---|---|---|---|---|---|
2011 | Mars Needs Moms | 1.50e+08 | 21392758 | 39549758 | -110450242 |
2002 | The Adventures of Pluto Nash | 1.00e+08 | 4411102 | 7094995 | -92905005 |
2013 | 47 Ronin | 2.25e+08 | 38362475 | 145803842 | -79196158 |
2001 | Monkeybone | 7.50e+07 | 5409517 | 5409517 | -69590483 |
1999 | The 13th Warrior | 1.25e+08 | 32698899 | 61698899 | -63301101 |
1997 | The Postman | 8.00e+07 | 17650704 | 17650704 | -62349296 |
2005 | Stealth | 1.38e+08 | 32116746 | 76416746 | -61583254 |
2001 | Osmosis Jones | 7.00e+07 | 13596911 | 13596911 | -56403089 |
2003 | Timeline | 8.00e+07 | 19480739 | 26703184 | -53296816 |
2013 | R.I.P.D. | 1.30e+08 | 33618855 | 79019947 | -50980053 |
This is nice, but it makes it hard to see the exact relationship between the top 10 films in either category; we just see the 10 loss-making movies for one kind of profit, and then the 10 for the other. We cannot say, for instance, where the 10 domestic losers fall in the overall profit table. Interestingly, the only place we "slice" the data is at the head() function. So, we can string these two computations together and just do the filtering at the end:
movies_tidy %>%
  ### do the domestic profits
  # compute the domestic profit
  mutate(domestic_profit = domgross - budget) %>%
  # sort the data by domestic profit
  arrange(domestic_profit) %>%
  # assign the "row number" to each of these movies, so that
  # the movie with the smallest domestic profit gets "domestic_rank" of 1
  # and the next-smallest has a "domestic_rank" of 2
  mutate(domestic_rank = row_number()) %>%
  ### do the overall profits
  # compute the overall profit
  mutate(profit = intgross - budget) %>%
  # sort the data by this profit
  arrange(profit) %>%
  # again, assign the row number of the sorted data to a rank variable
  mutate(rank = row_number()) %>%
  # select out the columns we want to see
  select(year, movie_name, domestic_rank, domestic_profit, rank, profit) %>%
  ### keep only the top 10 movies either domestically or overall:
  filter(domestic_rank <= 10 | rank <= 10) %>%
  # sort by domestic rank, just because:
  arrange(domestic_rank) %>%
  kable()
year | movie_name | domestic_rank | domestic_profit | rank | profit |
---|---|---|---|---|---|
2012 | John Carter | 1 | -201941321 | 510 | 7778100 |
2013 | 47 Ronin | 2 | -186637525 | 3 | -79196158 |
2013 | The Lone Ranger | 3 | -185710090 | 97 | -15010090 |
2012 | Battleship | 4 | -143764600 | 1210 | 95142720 |
2007 | The Golden Compass | 5 | -134892272 | 1432 | 167234864 |
2013 | Jack the Giant Slayer | 6 | -129812397 | 394 | 2387603 |
2011 | Mars Needs Moms | 7 | -128607242 | 1 | -110450242 |
2010 | Prince of Persia: The Sands of Time | 8 | -109240324 | 1348 | 135059676 |
2011 | Hugo | 9 | -106135493 | 450 | 5033215 |
2005 | Stealth | 10 | -105883254 | 7 | -61583254 |
2013 | R.I.P.D. | 15 | -96381145 | 10 | -50980053 |
2002 | The Adventures of Pluto Nash | 16 | -95588898 | 2 | -92905005 |
1999 | The 13th Warrior | 17 | -92301101 | 5 | -63301101 |
2001 | Monkeybone | 33 | -69590483 | 4 | -69590483 |
1997 | The Postman | 46 | -62349296 | 6 | -62349296 |
2003 | Timeline | 53 | -60519261 | 9 | -53296816 |
2001 | Osmosis Jones | 59 | -56403089 | 8 | -56403089 |
Now, we see that the movies that lost the most money domestically can have wildly different profits internationally. For example, John Carter lost the most money domestically, but actually made a profit internationally! Same with Prince of Persia and Hugo. But some movies did lose a lot of money both internationally and domestically, such as 47 Ronin or Mars Needs Moms.
What is the average budget for a movie in each year?
movies_tidy %>%
  group_by(year) %>%
  summarize(avg_budget = mean(budget, na.rm=T)) %>%
  tail(10) %>%
  kable()
year | avg_budget |
---|---|
2004 | 49043124 |
2005 | 45754250 |
2006 | 41924722 |
2007 | 47113533 |
2008 | 49651485 |
2009 | 50151613 |
2010 | 51915116 |
2011 | 50357137 |
2012 | 64458120 |
2013 | 72097980 |
Which movie had the largest gap between domestic and overseas box office performance?
Depending on how you solve this problem, you may need to drop the NA values. You can use the drop_na() function.
movies_tidy %>%
  mutate(gap = abs(domgross - intgross)) %>%
  drop_na(gap) %>%
  arrange(gap) %>%
  select(year, movie_name, gap) %>%
  tail(10) %>% kable()
year | movie_name | gap |
---|---|---|
2009 | Ice Age: Dawn of the Dinosaurs | 690394990 |
2012 | The Hobbit: An Unexpected Journey | 711700000 |
2003 | The Lord of the Rings: The Return of the King | 763562762 |
2011 | Transformers: Dark of the Moon | 771403533 |
2011 | Pirates of the Caribbean: On Stranger Tides | 802600000 |
2013 | Iron Man 3 | 803700000 |
2012 | Skyfall | 804333804 |
2011 | Harry Potter and the Deathly Hallows: Part 2 | 947100000 |
1997 | Titanic | 1527000000 |
2009 | Avatar | 2023411357 |
Make a visualization that shows how the budget for movies has increased over time. Discuss how you designed the plot in order to emphasize this message.
You may think that something like the following plot illustrates this well:
ggplot(movies_tidy, aes(x=year, y=budget/1000000)) +
geom_jitter(alpha=.25) +
geom_smooth(se=F, color='orangered') +
xlab("Year Movie was Made") +
ylab("Budget (Millions USD)")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
But, visually, half of this plot is empty! There are a few ways you can address this. First, you could keep the smoothed average and zoom in on the range where most budgets fall:
ggplot(movies_tidy, aes(x=year, y=budget/1000000)) +
geom_jitter(alpha=.1) +
geom_smooth(se=F, color='orangered') +
xlab("Year Movie was Made") +
ylab("Budget (Millions USD)") +
ylim(0,100)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 191 rows containing non-finite values (stat_smooth).
## Warning: Removed 204 rows containing missing values (geom_point).
Or, you could change the scale:
ggplot(movies_tidy, aes(x=year, y=budget/1000000)) +
geom_jitter(alpha=.1) +
geom_smooth(se=F, color='orangered') +
xlab("Year Movie was Made") +
ylab("Budget (Millions USD)") +
scale_y_log10(limits=c(.1, 200))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 35 rows containing non-finite values (stat_smooth).
## Warning: Removed 44 rows containing missing values (geom_point).
Make a visualization that shows how the typical profit movies make has generally not changed over time, but that a few outliers do make increasingly more money. Discuss how you designed the plot in order to emphasize this message.
A very simple plot might look like the following:
movies_tidy %>%
  mutate(profit = intgross - budget) %>%
  ggplot(aes(x=year, y=profit)) +
  geom_jitter(alpha=.1) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).
A more sophisticated perspective might aim to strategically define the outliers, and then label them separately:
outlier_earners <- movies_tidy %>%
  # get the profit
  mutate(profit = intgross - budget) %>%
  # drop any movies that have NA profits
  drop_na(profit) %>%
  ### within each year
  group_by(year) %>%
  # arrange the movies by profit
  arrange(desc(profit)) %>%
  # then compute the difference between the movie and the next-highest earner
  mutate(profit_gap_to_next_movie = profit - lead(profit)) %>%
  # sort on this value
  arrange(profit_gap_to_next_movie) %>%
  # and drop any NAs from this
  drop_na(profit_gap_to_next_movie) %>%
  # take the 20 movies that have the largest gap between their
  # earnings and the next highest earnings:
  tail(20)

outlier_earners %>% kable()
year | movie_name | budget | domgross | intgross | profit | profit_gap_to_next_movie |
---|---|---|---|---|---|---|
2003 | The Lord of the Rings: The Return of the King | 9.40e+07 | 377845905 | 1141408667 | 1047408667 | 234943111 |
1973 | The Exorcist | 1.20e+07 | 204868002 | 402735134 | 390735134 | 236618807 |
1994 | The Lion King | 7.93e+07 | 422780140 | 952880140 | 873580140 | 249179615 |
1999 | Star Wars: Episode I - The Phantom Menace | 1.15e+08 | 474544677 | 1007044677 | 892044677 | 259238385 |
1972 | The Godfather | 7.00e+06 | 134966411 | 268500000 | 261500000 | 261098198 |
2011 | Harry Potter and the Deathly Hallows: Part 2 | 1.25e+08 | 381011219 | 1328111219 | 1203111219 | 274317143 |
1977 | Close Encounters of the Third Kind | 2.00e+07 | 166000000 | 337700000 | 317700000 | 283448575 |
1985 | Back to the Future | 1.90e+07 | 210609762 | 383874862 | 364874862 | 286285161 |
1994 | Forrest Gump | 5.50e+07 | 329694499 | 679400525 | 624400525 | 290780396 |
1981 | Raiders of the Lost Ark | 2.00e+07 | 248159971 | 389925971 | 369925971 | 296955634 |
1975 | Jaws | 1.20e+07 | 260000000 | 470700000 | 458700000 | 320023583 |
2001 | The Lord of the Rings: The Fellowship of the Ring | 1.09e+08 | 315544750 | 887217688 | 778217688 | 333459969 |
1996 | Independence Day | 7.50e+07 | 306169255 | 817400878 | 742400878 | 334500878 |
1983 | Star Wars: Episode VI - Return of the Jedi | 3.25e+07 | 309205079 | 572700000 | 540200000 | 345736426 |
1980 | Star Wars: Episode V - The Empire Strikes Back | 2.30e+07 | 290271960 | 534171960 | 511171960 | 431218421 |
1977 | Star Wars | 1.10e+07 | 460998007 | 797900000 | 786900000 | 469200000 |
1993 | Jurassic Park | 6.30e+07 | 395708305 | 1035626872 | 972626872 | 556340869 |
1982 | E.T.: The Extra-Terrestrial | 1.05e+07 | 435110554 | 792965326 | 782465326 | 620265326 |
1997 | Titanic | 2.00e+08 | 658672302 | 2185672302 | 1985672302 | 1273985623 |
2009 | Avatar | 4.25e+08 | 760507625 | 2783918982 | 2358918982 | 1561950287 |
You can see that, clearly, these are the big blockbuster movies that earn a ton of money. So, we can treat these as a separate dataset, and use the two datasets in tandem to highlight the outliers' location on the plot:
non_outliers = movies_tidy %>%
  # This ensures that movies that are in the outliers are
  # dropped from the "non-outliers".
  # Breaking it down:
  # ! means "not",
  # movie_name %in% outlier_earners$movie_name is TRUE when
  # the movie_name is contained within the set of outlier_earners
  # movie names.
  filter(!(movie_name %in% outlier_earners$movie_name)) %>%
  # Then, we'll need the profit variable to do the plotting
  mutate(profit = intgross - budget)
ggplot() +
# non-outlier points
geom_jitter(data=non_outliers,
aes(x=year, y=profit), color='black', alpha=.1) +
# non-outlier profit trend
geom_smooth(data=non_outliers,
aes(x=year, y=profit), se=F) +
# outliers
geom_point(data=outlier_earners,
aes(x=year, y=profit), color='red') +
# trend among outliers
geom_smooth(data=outlier_earners,
aes(x=year, y=profit), color='red', se=F)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 11 rows containing missing values (geom_point).
You’re a data scientist working for a movie studio. Your executive is considering whether to take a risk on making a “new” movie, or whether it’d be a safer bet to make a sequel to an existing movie. So, she asks:
Do sequels make more profit per dollar spent than non-sequels?
Can you answer her question?6
So, let’s first look at the movies that have a “2” or “II” in their title:
# "2|II" matches a 2 OR a II anywhere in the name; passing c("2", "II")
# would recycle the two patterns across movies rather than check both
is_sequel = str_detect(movies_tidy$movie_name, "2|II")
movies_tidy %>%
  mutate(is_sequel = is_sequel) %>%
  filter(is_sequel) %>% kable()
year | movie_name | budget | domgross | intgross | is_sequel |
---|---|---|---|---|---|
1974 | The Godfather: Part II | 13000000 | 57300000 | 57300000 | TRUE |
1981 | Friday the 13th Part 2 | 1250000 | 21722776 | 21722776 | TRUE |
1981 | Halloween II | 2500000 | 25533818 | 25533818 | TRUE |
1983 | Superman III | 39000000 | 59950623 | 59950623 | TRUE |
1984 | Star Trek III: The Search for Spock | 18000000 | 76471046 | 87000000 | TRUE |
1987 | Evil Dead II | 3500000 | 5923044 | 5923044 | TRUE |
1989 | Back to the Future Part II | 40000000 | 118450002 | 332000000 | TRUE |
1990 | Back to the Future Part III | 40000000 | 87666629 | 243700000 | TRUE |
1990 | Child's Play 2 | 13000000 | 26904572 | 34166572 | TRUE |
1990 | Die Hard 2 | 70000000 | 117323878 | 239814025 | TRUE |
1991 | Teenage Mutant Ninja Turtles II: The Secret of the Ooze | 25000000 | 78656813 | 78656813 | TRUE |
1991 | Terminator 2: Judgement Day | 100000000 | 204859496 | 516816151 | TRUE |
1991 | The Naked Gun 2 1/2: The Smell of Fear | 23000000 | 86930411 | 86930411 | TRUE |
1992 | Home Alone 2: Lost in New York | 20000000 | 173585516 | 358994850 | TRUE |
1998 | Halloween H20: 20 Years Later | 17000000 | 55041738 | 55041738 | TRUE |
1999 | Toy Story 2 | 90000000 | 245852179 | 511358276 | TRUE |
2000 | 28 Days | 43000000 | 37035515 | 62063972 | TRUE |
2000 | Pokemon: The Movie 2000 | 30000000 | 43746923 | 133946923 | TRUE |
2002 | 28 Days Later… | 8000000 | 45064915 | 82955633 | TRUE |
2002 | Blade II | 54000000 | 81676888 | 154338601 | TRUE |
2002 | Star Wars: Episode II - Attack of the Clones | 115000000 | 310676740 | 656695615 | TRUE |
2002 | The Santa Clause 2 | 65000000 | 139225854 | 172825854 | TRUE |
2003 | 21 grams | 20000000 | 16248701 | 59667625 | TRUE |
2003 | Cradle 2 the Grave | 25000000 | 34657731 | 56434942 | TRUE |
2003 | Jeepers Creepers II | 25000000 | 35623801 | 35623801 | TRUE |
2003 | Legally Blonde 2: Red, White & Blonde | 25000000 | 90639088 | 125339088 | TRUE |
2003 | X2 (X-Men 2) | 125000000 | 214949694 | 407711549 | TRUE |
2004 | 2046 | 12000000 | 1442338 | 19202856 | TRUE |
2004 | The Princess Diaries 2: Royal Engagement | 45000000 | 95149435 | 122071435 | TRUE |
2005 | Miss Congeniality 2: Armed & Fabulous | 60000000 | 48478006 | 101382396 | TRUE |
2005 | Star Wars: Episode III - Revenge of the Sith | 115000000 | 380270577 | 848998877 | TRUE |
2006 | Clerks II | 5000000 | 24148068 | 25894473 | TRUE |
2006 | The Grudge 2 | 20000000 | 39143839 | 70743839 | TRUE |
2007 | 28 Weeks Later | 15000000 | 28638916 | 64232714 | TRUE |
2007 | Hostel: Part II | 7500000 | 17544812 | 33606409 | TRUE |
2007 | White Noise 2: The Light | 10000000 | NA | 8243567 | TRUE |
2008 | The Sisterhood of the Traveling Pants 2 | 27000000 | 44089964 | 44270131 | TRUE |
2009 | 2012 | 200000000 | 166112167 | 788408539 | TRUE |
2009 | The Boondock Saints II: All Saints Day | 8000000 | 10273187 | 10273187 | TRUE |
2009 | The Taking of Pelham 123 | 110000000 | 65452312 | 152364370 | TRUE |
2010 | 127 Hours | 18000000 | 18335230 | 60735230 | TRUE |
2010 | Iron Man 2 | 170000000 | 312433331 | 623561331 | TRUE |
2010 | Paranormal Activity 2 | 3000000 | 84752907 | 177512032 | TRUE |
2010 | Sex and the city 2 | 95000000 | 95347692 | 294680778 | TRUE |
2011 | Cars 2 | 200000000 | 191450875 | 560155383 | TRUE |
2012 | Men in Black III | 215000000 | 179020854 | 624821154 | TRUE |
2012 | The Twilight Saga: Breaking Dawn - Part 2 | 136200000 | 292324737 | 832660037 | TRUE |
2013 | 2 Guns | 61000000 | 75612460 | 132493015 | TRUE |
2013 | 42 | 40000000 | 95020213 | 95020213 | TRUE |
2013 | Cloudy with a Chance of Meatballs 2 | 78000000 | 119640264 | 271725448 | TRUE |
2013 | Kick-Ass 2 | 28000000 | 28795985 | 60839197 | TRUE |
2013 | Red 2 | 84000000 | 53262560 | 137162560 | TRUE |
2013 | The Smurfs 2 | 110000000 | 71017784 | 348545841 | TRUE |
Looks pretty good! So, let’s look at the “return”, or profit per dollar, of these two sets of movies:
movies_tidy %>%
  mutate(is_sequel = is_sequel) %>%
  mutate(profit = intgross - budget) %>%
  mutate(return_on_investment = profit/budget) %>%
  group_by(is_sequel) %>%
  summarise(median_return = median(return_on_investment, na.rm=T)) %>%
  kable()
is_sequel | median_return |
---|---|
FALSE | 1.622181 |
TRUE | 2.261692 |
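Where does the "about 39%" figure below come from? It's just the ratio of the two medians from the table above:

2.261692 / 1.622181 - 1  # about 0.39: the median sequel returns ~39% more per dollar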
The median sequel has about 39% higher returns than the median non-sequel. But! There is a very long tail:
movies_tidy %>%
  mutate(is_sequel = is_sequel) %>%
  mutate(profit = intgross - budget) %>%
  mutate(return_on_investment = profit/budget) %>%
  ggplot(aes(x=return_on_investment, group=is_sequel, color=is_sequel)) +
  geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).
You can see that the non-sequels have some amazing returns on investment (400 here means that the profit is 400 times the budget!), but this long tail is very sparse compared to the middle of the data. Let's focus in on the medians, which both sit between 1 and 2.5:
movies_tidy %>%
  mutate(is_sequel = is_sequel) %>%
  mutate(profit = intgross - budget) %>%
  mutate(return_on_investment = profit/budget) %>%
  ggplot(aes(x=return_on_investment, group=is_sequel, color=is_sequel)) +
  geom_boxplot() +
  xlim(-1,4)
## Warning: Removed 464 rows containing non-finite values (stat_boxplot).
So, we see that sequels tend to have somewhat higher returns on investment on average, but that the long tail of non-sequels can yield some seriously high returns, if you get lucky.
What’s your best guess about the length of a track with higher than 75% popularity? How about your best guess for the popularity of a track between 2 and 3 minutes long? Which “piece” of information (popularity > 75% or duration between 2 & 3 minutes) gives you more useful information, and why do you think that?7
ggplot(spotify, aes(x=duration_ms/1000/60, y=track_popularity)) +
geom_hex() +
xlab("Duration (minutes)") +
ylab("Track Popularity") +
scale_fill_viridis_c(option='plasma')
Just describing what we see: most of the popular songs are somewhere between 2.5 and 3 minutes long. You can also see that songs of about this length (between 2.5 and 3 minutes) can have basically any popularity… it seems like there are a lot of 3-minute tracks with zero popularity! Thus, knowing a song is popular gives you much more information about its length than knowing its length gives you about its popularity.
Doing this in a table, rather than graphically:
spotify %>%
  mutate(duration_mins = duration_ms/1000/60) %>%
  filter(track_popularity > 75) %>%
  summarize(mean_duration = mean(duration_mins),
            sd_duration = sd(duration_mins)) %>% kable()
mean_duration | sd_duration |
---|---|
3.554781 | 0.7395706 |
spotify %>%
  mutate(duration_mins = duration_ms/1000/60) %>%
  filter((2 <= duration_mins) & (duration_mins <= 3)) %>%
  summarize(mean_popularity = mean(track_popularity),
            sd_popularity = sd(track_popularity)) %>% kable()
mean_popularity | sd_popularity |
---|---|
45.48043 | 23.05022 |
You can see that the standard deviation of track popularity is about half of the mean popularity for songs between 2 and 3 minutes, while the standard deviation of track duration is only about a fifth of the mean duration for popular tracks. This suggests that there is much more variability around the mean popularity for songs between 2 and 3 minutes than there is around the mean duration of songs with popularity above 75!
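To make that comparison concrete, here are the two relative spreads (standard deviation divided by mean), computed from the tables above:

# duration, given popularity > 75: about a fifth of the mean
0.7395706 / 3.554781
# popularity, given a 2-3 minute duration: about half of the mean
23.05022 / 45.48043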
What is the typical “energy” of each of the playlist genres? How about the typical “valence,” meaning “happiness,” of the genres?
spotify %>%
  group_by(playlist_genre) %>%
  summarize(valence = median(valence), energy = median(energy)) %>%
  kable()
playlist_genre | valence | energy |
---|---|---|
edm | 0.370 | 0.830 |
latin | 0.628 | 0.729 |
pop | 0.500 | 0.727 |
r&b | 0.542 | 0.596 |
rap | 0.517 | 0.665 |
rock | 0.531 | 0.775 |
Make four plots8 to visualize the relationship between danceability and a few variables:
tempo
energy
valence
playlist_genre
Make sure to take into account whether the relationship is linear and address overplotting if necessary. Given these plots, what kinds of songs tend to be danceable?
library(ggpubr)
tempo <- ggplot(spotify,
                aes(x=tempo, y=danceability)) +
  geom_point(alpha=.1) +
  geom_smooth(aes(color=playlist_genre), se=F)
energy <- ggplot(spotify, aes(x=energy, y=danceability)) +
  geom_point(alpha=.1) +
  geom_smooth(aes(color=playlist_genre), se=F)
valence <- ggplot(spotify, aes(x=valence, y=danceability)) +
  geom_point(alpha=.1) +
  geom_smooth(aes(color=playlist_genre), se=F)
playlist_genre <- ggplot(spotify,
                         aes(group=playlist_genre,
                             y=danceability,
                             color=playlist_genre)) +
  geom_boxplot()
ggarrange(tempo, energy, valence, playlist_genre, ncol=2, nrow=2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Interpreting this: (1a) most songs are danceable if their tempo is around 100 to 120 beats per minute; as songs become too fast or too slow, they become less danceable; (1b) but latin songs have a "second peak" (slightly below 200) where songs get a little more danceable again; (2) songs of moderate energy tend to be danceable, though high-energy latin songs are more danceable, whereas low-energy rap songs tend to be more danceable; (3) as songs get happier, they get more danceable, regardless of genre; (4) rock songs are clearly less danceable in general, and the most danceable genre is rap, closely followed by latin.
Let’s assume that the difference between a band’s median track popularity and its maximum track popularity represents that band’s one-hit-wonderness. If this is the case, what are the Top 10 “one-hit-wonder” bands in the dataset? Given the results, does this comport with your understanding of what a “one hit wonder” is?
one_hit_wonders = spotify %>%
  group_by(track_artist) %>%
  summarize(one_hit_wonderness = max(track_popularity) - median(track_popularity)) %>%
  arrange(desc(one_hit_wonderness))
one_hit_wonders %>% head(10) %>% kable()
track_artist | one_hit_wonderness |
---|---|
Lynyrd Skynyrd | 81 |
Duki | 76 |
50 Cent | 75 |
Childish Gambino | 75 |
KISS | 74 |
YG | 74 |
Sam Feldt | 73 |
John Newman | 72 |
Sheck Wes | 72 |
Thin Lizzy | 71 |
Sure; Lynyrd Skynyrd makes sense given the perennial popularity of "Sweet Home Alabama," but you may not know who, say, Duki or YG are. And, alternatively, KISS and Thin Lizzy are certainly not one-hit wonders; they have a ton of very well-rated songs. So, one thing that may be missing from this measure of one-hit-wonderness is a measure of how many albums the artist has, or how many popular songs the artist has.
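A hedged sketch of that augmented measure (n_albums and n_popular are illustrative names; this assumes counting distinct track_album_id values is a fair proxy for catalog size):

spotify %>%
  group_by(track_artist) %>%
  summarize(one_hit_wonderness = max(track_popularity) - median(track_popularity),
            n_albums = n_distinct(track_album_id),
            n_popular = sum(track_popularity > 75)) %>%
  # a "true" one-hit wonder has a big gap AND few other popular tracks
  filter(n_popular <= 1) %>%
  arrange(desc(one_hit_wonderness)) %>%
  head(10) %>% kable()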
You may also find the original Tidy Data paper useful in describing the different commonly-encountered issues in data formatting.↩︎
Remember: I'm just using the knitr::kable() function to print the table all pretty-like in the RMarkdown.↩︎
This is scraped using the spotifyr package.↩︎
danceability measures how suitable a track is for dancing, varying from 0 (not danceable) to 1 (danceable). energy is a measure of the "intensity" of the song, varying from 0 (very relaxing, Shoegaze) to 1 (very energetic, Death Metal). loudness is the average decibel value of the track, varying from -60 to zero. speechiness gives the level of "talkiness" in a track, varying from 0 (no spoken words) to 1 (all spoken words); tracks over .66 are probably nearly all spoken word, and tracks below .33 are probably songs. acousticness is the same as speechiness, but measures whether instruments are not amplified; liveness is similar, but for whether the track has a live audience. valence tells you whether a track is "happy," with higher scores indicating happier music. Finally, tempo records the average beats per minute for the track, and duration_ms provides the duration of the song in milliseconds.↩︎
Note that I'm using the name_repair option of read_csv() in order to get exactly the column names that are in midterm-risk.csv. Without this option, readr::read_csv will append extra values to the column names in order to ensure they are each unique. Try this by removing the name_repair argument, or setting it equal to "unique" instead of "minimal".↩︎
Note that you can use the str_detect() function in the tidyverse's stringr package to give TRUE when a movie name contains a 2 or a II, and FALSE otherwise. Also, it's OK if you accidentally pick up movies that have III or IIII or H20 in their name using this strategy; we're just making an educated guess.↩︎
You may find it helpful to make a plot!↩︎
It's OK if they're totally separate plots! That is, I don't expect you to use facet_grid().↩︎