# Twitter Data {#sec-twitter-data}
This appendix takes a problem-based approach to demonstrating how to use tidyverse functions to summarise and visualise Twitter data.
```{r setup-app-h, message=FALSE}
library(tidyverse) # data wrangling functions
library(lubridate) # for handling dates and times
```
## Single Data File
### Export Data
You can export your organisation's Twitter data from <https://analytics.twitter.com/>. Go to the Tweets tab, choose a time period, and export the last month's data by day (or use the files from the [class data](data/data.zip)).
### Import Data
```{r, message=FALSE}
file <- "data/tweets/daily_tweet_activity_metrics_LisaDeBruine_20210801_20210901_en.csv"
daily_tweets <- read_csv(file, show_col_types = FALSE)
```
### Select Relevant Data
The file contains a bunch of columns about "promoted" tweets that will be blank unless your organisation pays for those. Let's get rid of them. We can use the select helper `starts_with()` to get all the columns that start with `"promoted"` and remove them by prefacing the function with `!`. Now there should be 20 columns, which we can inspect with `glimpse()`.
```{r, message=FALSE}
daily_tweets <- read_csv(file) %>%
select(!starts_with("promoted")) %>%
glimpse()
```
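If you're curious which columns were dropped, you can flip the logic and select the promoted columns instead of removing them. This is just a quick sketch that re-reads the same file to list their names:

```{r, message=FALSE}
# List the columns the previous step removed by selecting,
# rather than dropping, everything that starts with "promoted"
read_csv(file, show_col_types = FALSE) %>%
  select(starts_with("promoted")) %>%
  names()
```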
### Plot Likes per Day
Now let's plot likes per day. The `scale_x_date()` function lets you format an x-axis with dates.
```{r likes-per-day-plot, fig.cap="Likes per day."}
ggplot(daily_tweets, aes(x = Date, y = likes)) +
geom_line() +
scale_x_date(name = "",
date_breaks = "1 day",
date_labels = "%d",
expand = expansion(add = c(.5, .5))) +
ggtitle("Likes: August 2021")
```
### Plot Multiple Engagements
What if we want to plot likes, retweets, and replies on the same plot? We need to get all of the numbers into the same column, plus a column containing the "engagement type" that we can use to set different line colours. When data you need in one column are spread across several columns, the data are in wide format and you need to pivot them longer.
```{r}
long_tweets <- daily_tweets %>%
select(Date, likes, retweets, replies) %>%
pivot_longer(cols = c(likes, retweets, replies),
names_to = "engage_type",
values_to = "n")
head(long_tweets)
```
Now we can plot the number of engagements per day by engagement type by making the line colour determined by the value of the `engage_type` column.
```{r eng-per-day-plot, fig.cap="Engagements per day by engagement type."}
ggplot(long_tweets, aes(x = Date, y = n, colour = engage_type)) +
geom_line() +
scale_x_date(name = "",
date_breaks = "1 day",
date_labels = "%d",
expand = expansion(add = c(.5, .5))) +
scale_y_continuous(name = "Engagements per Day") +
scale_colour_discrete(name = "") +
ggtitle("August 2021") +
theme(legend.position = c(.9, .8),
panel.grid.minor = element_blank())
```
## Multiple Data Files
Maybe now you want to compare the data from several months. First, get a list of all the files you want to combine. It's easiest if they're all in the same directory, although you can use a pattern to select the files you want if they have a systematic naming structure.
```{r}
files <- list.files(
path = "data/tweets",
pattern = "daily_tweet_activity_metrics",
full.names = TRUE
)
```
Then use `map_df()` to iterate over the list of file paths, open each one with `read_csv()`, and return one big data frame with all the combined data. We can pipe that to `select()` to get rid of the "promoted" columns.
```{r, message=FALSE}
all_daily_tweets <- purrr::map_df(files, read_csv) %>%
select(!starts_with("promoted"))
```
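A quick sanity check that the files combined as expected is to look at the date range and the row count (these exports have one row per day, so the count should be close to the number of days covered):

```{r}
# Sanity check: the combined data should span several months,
# with roughly one row per day
range(all_daily_tweets$Date)
nrow(all_daily_tweets)
```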
Now you can make a plot of likes per day for all of the months.
```{r}
ggplot(all_daily_tweets, aes(x = Date, y = likes)) +
geom_line(colour = "dodgerblue") +
scale_y_continuous(name = "Likes per Day") +
scale_x_date(name = "",
date_breaks = "1 month",
date_labels = "%B",
expand = expansion(add = c(.5, .5))) +
ggtitle("Likes 2021")
```
::: {.callout-note}
Notice that we changed the date breaks and labels for the x-axis. `%B` is the date code for the full month name. See `?strptime` for all of the date codes.
:::
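If you want to experiment with the date codes mentioned above, a quick way is to apply `format()` to a single date; for example:

```{r}
# Try out a few date codes on a single date
format(as.Date("2021-08-03"), "%B")    # full month name
format(as.Date("2021-08-03"), "%b %d") # abbreviated month name and day
format(as.Date("2021-08-03"), "%A")    # full weekday name
```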
### Likes by Month
If you want to plot likes by month, first you need a column for the month. Use `mutate()` to make a new column, using `lubridate::month()` to extract the month name from the `Date` column.
Then group by the new `month` column and calculate the sum of `likes`. The `group_by()` function causes all of the subsequent functions to operate inside of groups, until you call `ungroup()`. In the example below, the `sum(likes)` function calculates the sum total of the `likes` column separately for each month.
```{r}
likes_by_month <- all_daily_tweets %>%
mutate(month = month(Date, label = TRUE)) %>%
group_by(month) %>%
summarise(total_likes = sum(likes)) %>%
ungroup()
likes_by_month
```
A column plot might make more sense than a line plot for this summary.
```{r likes-by-month-plot, fig.cap = "Likes by month."}
ggplot(likes_by_month, aes(x = month, y = total_likes, fill = month)) +
geom_col(color = "black", show.legend = FALSE) +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Total Likes per Month",
breaks = seq(0, 10000, 1000),
labels = paste0(0:10, "K")) +
scale_fill_brewer(palette = "Spectral")
```
::: {.callout-note .try}
How would you change the code in this section to plot the number of tweets published per week?
Hint: if the <pkg>lubridate</pkg> function for the month is `month()`, what is the function for getting the week likely to be?
```{r, webex.hide="Summarise Data"}
tweets_by_week <- all_daily_tweets %>%
mutate(week = week(Date)) %>%
group_by(week) %>%
summarise(start_date = min(Date),
total_tweets = sum(`Tweets published`)) %>%
ungroup()
```
```{r, webex.hide="Plot Data"}
ggplot(tweets_by_week, aes(x = start_date, y = total_tweets)) +
geom_col(fill = "hotpink") +
scale_x_date(name = "",
date_breaks = "1 month",
date_labels = "%B") +
scale_y_continuous(name = "Total Tweets per Week")
```
:::
## Data by Tweet
You can also download your Twitter data by tweet instead of by day. This usually takes a little longer to download. We can use the same pattern to read and combine all of the tweet data files.
The `^` at the start of the pattern means that the file name has to start with this string, so the pattern won't match the "daily_tweet..." files (there's a quick check of this after the next code chunk).
```{r}
tweet_files <- list.files(
path = "data/tweets",
pattern = "^tweet_activity_metrics",
full.names = TRUE
)
```
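To see what the `^` anchor does, here is a minimal check with two made-up file names; only the first matches, because only it *starts* with the pattern:

```{r}
# Only names that start with the pattern match when ^ is used
grepl("^tweet_activity_metrics",
      c("tweet_activity_metrics_202108.csv",
        "daily_tweet_activity_metrics_202108.csv"))
```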
First, let's open only the first file and see if we need to do anything to it.
```{r, message=FALSE}
tweets <- read_csv(tweet_files[1])
```
If you look at the file in the Viewer, you can see that the `Tweet id` column is using scientific notation (`1.355500e+18`) instead of the full 18-digit tweet ID, which gives different IDs the same value. We won't ever want to *add* ID numbers, so it's safe to represent these as characters (there's a quick demonstration of why after the next code chunk). Set up the map over all the files with the `col_types` specified, then get rid of all the promoted columns and add `month` and `hour` columns (reading the date from the `time` column in these data).
```{r, warning=FALSE}
ct <- cols("Tweet id" = col_character())
all_tweets <- map_df(tweet_files, read_csv, col_types = ct) %>%
select(!starts_with("promoted")) %>%
mutate(month = lubridate::month(time, label = TRUE),
hour = lubridate::hour(time))
```
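If you want to convince yourself that storing 18-digit IDs as doubles really does collapse distinct IDs, here is a minimal check with two made-up adjacent IDs (a double only holds about 15 to 16 significant digits):

```{r}
# Two different 18-digit IDs become indistinguishable as doubles
as.double("1355500000000000001") == as.double("1355500000000000002")
```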
### Impressions per Tweet
Now we can look at the distribution of impressions per tweet for each month.
```{r imp-month-plot, fig.cap="Impressions per tweet per month."}
ggplot(all_tweets, aes(x = month, y = impressions, fill = month)) +
geom_violin(show.legend = FALSE, alpha = 0.8) +
scale_fill_brewer(palette = "Spectral") +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Impressions per Tweet",
breaks = c(0, 10^(1:6)),
labels = c(0, 10, 100, "1K", "10K", "100K", "1M"),
trans = "pseudo_log") +
ggtitle("Distribution of Twitter Impressions per Tweet in 2021")
```
::: {.callout-note .try}
The y-axis has been transformed with "pseudo_log" to show very skewed data more clearly (most tweets get a few hundred impressions, but a few can get thousands). See what the plot looks like if you change the y-axis transformation.
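For example, here is a sketch of one option, swapping the transformation to a square root (the choice of `"sqrt"` is just one possibility; everything else stays the same):

```{r, webex.hide="One possible solution"}
# Same plot as above, but with a square-root y-axis transformation
ggplot(all_tweets, aes(x = month, y = impressions, fill = month)) +
  geom_violin(show.legend = FALSE, alpha = 0.8) +
  scale_fill_brewer(palette = "Spectral") +
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "Impressions per Tweet",
                     trans = "sqrt") +
  ggtitle("Distribution of Twitter Impressions per Tweet in 2021")
```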
:::
### Top Tweet
You can display Lisa's top tweet for the year.
```{r, results='asis'}
top_tweet <- all_tweets %>%
slice_max(order_by = likes, n = 1)
glue::glue("[Top tweet]({top_tweet$`Tweet permalink`}) with {top_tweet$likes} likes:
---------------------------
{top_tweet$`Tweet text`}
---------------------------
") %>% cat()
```
### Word Cloud
Or you can make a word cloud of the top words they tweet about. (You'll learn how to do this in @sec-word-clouds).
```{r, echo = FALSE, message=FALSE}
library(tidytext)
library(ggwordcloud)
omitted <- c(stop_words$word, 0:9,
"=", "+", "lt", "gt",
"im", "id", "ill", "ive", "isnt",
"doesnt", "dont", "youre", "didnt")
words <- all_tweets %>%
unnest_tokens(output = "word",
input = "Tweet text",
token = "tweets") %>%
count(word) %>%
filter(!word %in% omitted) %>%
slice_max(order_by = n, n = 50, with_ties = FALSE)
ggplot(words, aes(label = word, colour = word, size = n)) +
geom_text_wordcloud_area() +
scale_size_area(max_size = 17) +
theme_minimal(base_size = 14) +
scale_color_hue(h = c(100, 420), l = 50)
```
### Tweets by Hour
To make a plot of tweets by hour, colouring the data by whether or not the sun is up, we can join data from a table of sunrise and sunset times by day for Glasgow (or [download the table for your region](https://www.schoolsobservatory.org/learn/astro/nightsky/sunrs_set){target="_blank"}).
The `Day` column would otherwise read in as a character column, so convert it to a date on import using the `col_types` argument.
```{r}
sun <- read_csv("data/sunfact2021.csv",
col_types = cols(
Day = col_date(format="%d/%m/%Y"),
RiseTime = col_double(),
SetTime = col_double()
))
```
Create a matching `Day` column for `all_tweets`, plus an `hour` column for plotting (the factor structure starts the day at 8:00), and a `tweet_time` column for comparing to the `RiseTime` and `SetTime` columns, which are decimal hours.
Then join the `sun` table and create a `timeofday` column that equals "day" if the sun is up and "night" if the sun has set.
```{r}
sun_tweets <- all_tweets %>%
select(time) %>%
mutate(Day = date(time),
hour = factor(hour(time),
levels = c(8:23, 0:7)),
tweet_time = hour(time) + minute(time)/60) %>%
left_join(sun, by = "Day") %>%
mutate(timeofday = ifelse(tweet_time>RiseTime &
tweet_time<SetTime,
yes = "day",
no = "night"))
```
Check a few random rows to make sure you did everything correctly.
```{r}
slice_sample(sun_tweets, n = 10)
```
Plot the `hour` along the x-axis and set the fill and colour by `timeofday`. Use `scale_*_manual()` functions to set custom colours for day and night.
```{r hour-tweet-plot, fig.cap="Tweets per hour of the day"}
map <- aes(x = hour,
fill = timeofday,
colour = timeofday)
ggplot(sun_tweets, mapping = map) +
geom_bar(show.legend = FALSE) +
labs(x = "",
y = "",
title = "Number of tweets by hour of the day") +
scale_x_discrete(breaks = c(8:23, 0:7)[c(T, F, F, F)],
drop = FALSE) +
scale_y_continuous(expand = c(0, 0, .1, 0)) +
scale_fill_manual(values = c("gold", "midnightblue")) +
scale_colour_manual(values = c("darkgoldenrod1", "black")) +
facet_wrap(~month(time, label = TRUE, abbr = FALSE), nrow = 3) +
theme(strip.background = element_rect(fill = "white",
color = "transparent"),
panel.grid = element_blank())
```
::: {.callout-note .try}
Go through each line of the plot above and work out what each function and argument does by changing or removing it.
:::