# Twitter Data {#sec-twitter-data}
This appendix takes a problem-based approach to demonstrating how to use tidyverse functions to summarise and visualise Twitter data.
```{r setup-app-h, message=FALSE}
library(tidyverse) # data wrangling functions
library(lubridate) # for handling dates and times
```
## Single Data File
### Export Data
You can export your organisation's Twitter data from <https://analytics.twitter.com/>. Go to the Tweets tab, choose a time period, and export the last month's data by day (or use the files from the [class data](data/data.zip)).
### Import Data
```{r, message=FALSE}
file <- "data/tweets/daily_tweet_activity_metrics_LisaDeBruine_20210801_20210901_en.csv"
daily_tweets <- read_csv(file, show_col_types = FALSE)
```
### Select Relevant Data
The file contains a bunch of columns about "promoted" tweets that will be blank unless your organisation pays for those. Let's get rid of them. We can use the select helper `starts_with()` to get all the columns that start with `"promoted"` and remove them by prefacing the function with `!`. Now there should be 20 columns, which we can inspect with `glimpse()`.
```{r, message=FALSE}
daily_tweets <- read_csv(file) %>%
select(!starts_with("promoted")) %>%
glimpse()
```
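If you're curious which columns were dropped, you can flip the logic and select the promoted columns instead of removing them. This is just a quick sketch that re-reads the same file to list their names:

```{r, message=FALSE}
# List the columns the previous step removed by selecting,
# rather than dropping, everything that starts with "promoted"
read_csv(file, show_col_types = FALSE) %>%
  select(starts_with("promoted")) %>%
  names()
```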
### Plot Likes per Day
Now let's plot likes per day. The `scale_x_date()` function lets you format an x-axis with dates.
```{r likes-per-day-plot, fig.cap="Likes per day."}
ggplot(daily_tweets, aes(x = Date, y = likes)) +
geom_line() +
scale_x_date(name = "",
date_breaks = "1 day",
date_labels = "%d",
expand = expansion(add = c(.5, .5))) +
ggtitle("Likes: August 2021")
```
### Plot Multiple Engagements
What if we want to plot likes, retweets, and replies on the same plot? We need to get all of the numbers into the same column, plus a column containing the "engagement type" that we can use to set different line colours. When data you need in one column are spread across several columns, the data are in wide format and you need to pivot them longer.
```{r}
long_tweets <- daily_tweets %>%
select(Date, likes, retweets, replies) %>%
pivot_longer(cols = c(likes, retweets, replies),
names_to = "engage_type",
values_to = "n")
head(long_tweets)
```
Now we can plot the number of engagements per day by engagement type by making the line colour determined by the value of the `engage_type` column.
```{r eng-per-day-plot, fig.cap="Engagements per day by engagement type."}
ggplot(long_tweets, aes(x = Date, y = n, colour = engage_type)) +
geom_line() +
scale_x_date(name = "",
date_breaks = "1 day",
date_labels = "%d",
expand = expansion(add = c(.5, .5))) +
scale_y_continuous(name = "Engagements per Day") +
scale_colour_discrete(name = "") +
ggtitle("August 2021") +
theme(legend.position = c(.9, .8),
panel.grid.minor = element_blank())
```
## Multiple Data Files
Maybe now you want to compare the data from several months. First, get a list of all the files you want to combine. It's easiest if they're all in the same directory, although you can use a pattern to select the files you want if they have a systematic naming structure.
```{r}
files <- list.files(
path = "data/tweets",
pattern = "daily_tweet_activity_metrics",
full.names = TRUE
)
```
Then use `map_df()` to iterate over the list of file paths, open each one with `read_csv()`, and return one big data frame with all the combined data. We can pipe that to `select()` to get rid of the "promoted" columns.
```{r, message=FALSE}
all_daily_tweets <- purrr::map_df(files, read_csv) %>%
select(!starts_with("promoted"))
```
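A quick sanity check that the files combined as expected is to look at the date range and the row count (these exports have one row per day, so the count should be close to the number of days covered):

```{r}
# Sanity check: the combined data should span several months,
# with roughly one row per day
range(all_daily_tweets$Date)
nrow(all_daily_tweets)
```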
Now you can make a plot of likes per day for all of the months.
```{r}
ggplot(all_daily_tweets, aes(x = Date, y = likes)) +
geom_line(colour = "dodgerblue") +
scale_y_continuous(name = "Likes per Day") +
scale_x_date(name = "",
date_breaks = "1 month",
date_labels = "%B",
expand = expansion(add = c(.5, .5))) +
ggtitle("Likes 2021")
```
::: {.callout-note}
Notice that we changed the date breaks and labels for the x-axis. `%B` is the date code for the full month name. See `?strptime` for all of the date codes.
:::
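If you want to experiment with the date codes mentioned above, a quick way is to apply `format()` to a single date; for example:

```{r}
# Try out a few date codes on a single date
format(as.Date("2021-08-03"), "%B")    # full month name
format(as.Date("2021-08-03"), "%b %d") # abbreviated month name and day
format(as.Date("2021-08-03"), "%A")    # full weekday name
```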
### Likes by Month
If you want to plot likes by month, first you need a column for the month. Use `mutate()` to make a new column, using `lubridate::month()` to extract the month name from the `Date` column.
Then group by the new `month` column and calculate the sum of `likes`. The `group_by()` function causes all of the subsequent functions to operate inside of groups, until you call `ungroup()`. In the example below, the `sum(likes)` function calculates the sum total of the `likes` column separately for each month.
```{r}
likes_by_month <- all_daily_tweets %>%
mutate(month = month(Date, label = TRUE)) %>%
group_by(month) %>%
summarise(total_likes = sum(likes)) %>%
ungroup()
likes_by_month
```
A column plot might make more sense than a line plot for this summary.
```{r likes-by-month-plot, fig.cap = "Likes by month."}
ggplot(likes_by_month, aes(x = month, y = total_likes, fill = month)) +
geom_col(color = "black", show.legend = FALSE) +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Total Likes per Month",
breaks = seq(0, 10000, 1000),
labels = paste0(0:10, "K")) +
scale_fill_brewer(palette = "Spectral")
```
::: {.callout-note .try}
How would you change the code in this section to plot the number of tweets published per week?
Hint: if the <pkg>lubridate</pkg> function for the month is `month()`, what is the function for getting the week likely to be?
```{r, webex.hide="Summarise Data"}
tweets_by_week <- all_daily_tweets %>%
mutate(week = week(Date)) %>%
group_by(week) %>%
summarise(start_date = min(Date),
total_tweets = sum(`Tweets published`)) %>%
ungroup()
```
```{r, webex.hide="Plot Data"}
ggplot(tweets_by_week, aes(x = start_date, y = total_tweets)) +
geom_col(fill = "hotpink") +
scale_x_date(name = "",
date_breaks = "1 month",
date_labels = "%B") +
scale_y_continuous(name = "Total Tweets per Week")
```
:::
## Data by Tweet
You can also download your Twitter data by tweet instead of by day. This usually takes a little longer to download. We can use the same pattern to read and combine all of the tweet data files.
The `^` at the start of the pattern means that the file name has to start with this string, so the pattern won't match the "daily_tweet..." files (there's a quick check of this after the next code chunk).
```{r}
tweet_files <- list.files(
path = "data/tweets",
pattern = "^tweet_activity_metrics",
full.names = TRUE
)
```
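To see what the `^` anchor does, here is a minimal check with two made-up file names; only the first matches, because only it *starts* with the pattern:

```{r}
# Only names that start with the pattern match when ^ is used
grepl("^tweet_activity_metrics",
      c("tweet_activity_metrics_202108.csv",
        "daily_tweet_activity_metrics_202108.csv"))
```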
First, let's open only the first file and see if we need to do anything to it.
```{r, message=FALSE}
tweets <- read_csv(tweet_files[1])
```
If you look at the file in the Viewer, you can see that the `Tweet id` column is using scientific notation (`1.355500e+18`) instead of the full 18-digit tweet ID, which gives different IDs the same value. We won't ever want to *add* ID numbers, so it's safe to represent these as characters (there's a quick demonstration of why after the next code chunk). Set up the map over all the files with the `col_types` specified, then get rid of all the promoted columns and add `month` and `hour` columns (reading the date from the `time` column in these data).
```{r, warning=FALSE}
ct <- cols("Tweet id" = col_character())
all_tweets <- map_df(tweet_files, read_csv, col_types = ct) %>%
select(!starts_with("promoted")) %>%
mutate(month = lubridate::month(time, label = TRUE),
hour = lubridate::hour(time))
```
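If you want to convince yourself that storing 18-digit IDs as doubles really does collapse distinct IDs, here is a minimal check with two made-up adjacent IDs (a double only holds about 15 to 16 significant digits):

```{r}
# Two different 18-digit IDs become indistinguishable as doubles
as.double("1355500000000000001") == as.double("1355500000000000002")
```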
### Impressions per Tweet
Now we can look at the distribution of impressions per tweet for each month.
```{r imp-month-plot, fig.cap="Impressions per tweet per month."}
ggplot(all_tweets, aes(x = month, y = impressions, fill = month)) +
geom_violin(show.legend = FALSE, alpha = 0.8) +
scale_fill_brewer(palette = "Spectral") +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Impressions per Tweet",
breaks = c(0, 10^(1:6)),
labels = c(0, 10, 100, "1K", "10K", "100K", "1M"),
trans = "pseudo_log") +
ggtitle("Distribution of Twitter Impressions per Tweet in 2021")
```
::: {.callout-note .try}
The y-axis has been transformed with "pseudo_log" to show very skewed data more clearly (most tweets get a few hundred impressions, but a few can get thousands). See what the plot looks like if you change the y-axis transformation.
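For example, here is a sketch of one option, swapping the transformation to a square root (the choice of `"sqrt"` is just one possibility; everything else stays the same):

```{r, webex.hide="One possible solution"}
# Same plot as above, but with a square-root y-axis transformation
ggplot(all_tweets, aes(x = month, y = impressions, fill = month)) +
  geom_violin(show.legend = FALSE, alpha = 0.8) +
  scale_fill_brewer(palette = "Spectral") +
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "Impressions per Tweet",
                     trans = "sqrt") +
  ggtitle("Distribution of Twitter Impressions per Tweet in 2021")
```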
:::
### Top Tweet
You can display Lisa's top tweet for the year.
```{r, results='asis'}
top_tweet <- all_tweets %>%
slice_max(order_by = likes, n = 1)
glue::glue("[Top tweet]({top_tweet$`Tweet permalink`}) with {top_tweet$likes} likes:
---------------------------
{top_tweet$`Tweet text`}
---------------------------
") %>% cat()
```
### Word Cloud
Or you can make a word cloud of the top words they tweet about. (You'll learn how to do this in @sec-word-clouds).
```{r, echo = FALSE, message=FALSE}
library(tidytext)
library(ggwordcloud)
omitted <- c(stop_words$word, 0:9,
"=", "+", "lt", "gt",
"im", "id", "ill", "ive", "isnt",
"doesnt", "dont", "youre", "didnt")
words <- all_tweets %>%
unnest_tokens(output = "word",
input = "Tweet text",
token = "tweets") %>%
count(word) %>%
filter(!word %in% omitted) %>%
slice_max(order_by = n, n = 50, with_ties = FALSE)
ggplot(words, aes(label = word, colour = word, size = n)) +
geom_text_wordcloud_area() +
scale_size_area(max_size = 17) +
theme_minimal(base_size = 14) +
scale_color_hue(h = c(100, 420), l = 50)
```
### Tweets by Hour
To make a plot of tweets by hour, colouring the data by whether or not the sun is up, we can join data from a table of sunrise and sunset times by day for Glasgow (or [download the table for your region](https://www.schoolsobservatory.org/learn/astro/nightsky/sunrs_set){target="_blank"}).
The `Day` column would otherwise read in as a character column, so convert it to a date on import using the `col_types` argument.
```{r}
sun <- read_csv("data/sunfact2021.csv",
col_types = cols(
Day = col_date(format="%d/%m/%Y"),
RiseTime = col_double(),
SetTime = col_double()
))
```
Create a matching `Day` column for `all_tweets`, plus an `hour` column for plotting (the factor structure starts the day at 8:00), and a `tweet_time` column for comparing to the `RiseTime` and `SetTime` columns, which are decimal hours.
Then join the `sun` table and create a `timeofday` column that equals "day" if the sun is up and "night" if the sun has set.
```{r}
sun_tweets <- all_tweets %>%
select(time) %>%
mutate(Day = date(time),
hour = factor(hour(time),
levels = c(8:23, 0:7)),
tweet_time = hour(time) + minute(time)/60) %>%
left_join(sun, by = "Day") %>%
mutate(timeofday = ifelse(tweet_time>RiseTime &
tweet_time<SetTime,
yes = "day",
no = "night"))
```
Check a few random rows to make sure you did everything correctly.
```{r}
slice_sample(sun_tweets, n = 10)
```
Plot the `hour` along the x-axis and set the fill and colour by `timeofday`. Use `scale_*_manual()` functions to set custom colours for day and night.
```{r hour-tweet-plot, fig.cap="Tweets per hour of the day"}
map <- aes(x = hour,
fill = timeofday,
colour = timeofday)
ggplot(sun_tweets, mapping = map) +
geom_bar(show.legend = FALSE) +
labs(x = "",
y = "",
title = "Number of tweets by hour of the day") +
scale_x_discrete(breaks = c(8:23, 0:7)[c(T, F, F, F)],
drop = FALSE) +
scale_y_continuous(expand = c(0, 0, .1, 0)) +
scale_fill_manual(values = c("gold", "midnightblue")) +
scale_colour_manual(values = c("darkgoldenrod1", "black")) +
facet_wrap(~month(time, label = TRUE, abbr = FALSE), nrow = 3) +
theme(strip.background = element_rect(fill = "white",
color = "transparent"),
panel.grid = element_blank())
```
::: {.callout-note .try}
Go through each line of the plot above and work out what each function and argument does by changing or removing it.
:::