Repo containing code for the R package academictwitteR, which collects tweets from the v2 API endpoint for the Academic Research Product Track.
To cite package ‘academictwitteR’ in publications use:
- Barrie, Christopher and Ho, Justin Chun-ting. (2021). academictwitteR: an R package to access the Twitter Academic Research Product Track v2 API endpoint. Journal of Open Source Software, 6(62), 3272, https://doi.org/10.21105/joss.03272
A BibTeX entry for LaTeX users is:
@article{BarrieHo2021,
doi = {10.21105/joss.03272},
url = {https://doi.org/10.21105/joss.03272},
year = {2021},
publisher = {The Open Journal},
volume = {6},
number = {62},
pages = {3272},
author = {Christopher Barrie and Justin Chun-ting Ho},
title = {academictwitteR: an R package to access the Twitter Academic Research Product Track v2 API endpoint},
journal = {Journal of Open Source Software}
}
You can install the package with:
install.packages("academictwitteR")
Alternatively, you can install the development version with:
devtools::install_github("cjbarrie/academictwitteR", build_vignettes = TRUE)
Get started by reading vignette("academictwitteR-intro").
The academictwitteR package has been designed with the efficient storage of data in mind. Queries to the API include arguments to specify whether tweets should be stored as a .rds file, using the file argument, or as separate JSON files for tweet- and user-level information, using the data_path argument.
Tweets are returned as a data.frame object and, when a file argument has been included, will also be saved as a .rds file.
Getting tweets of specified users via get_user_tweets(). This function captures tweets for a particular user or set of users between specified dates, avoiding rate limits by sleeping between calls. A call may look like:
bearer_token <- "" # Insert bearer token
users <- c("TwitterDev", "jack")
tweets <-
  get_user_tweets(users,
                  "2010-01-01T00:00:00Z",
                  "2020-01-01T00:00:00Z",
                  bearer_token)
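The date arguments in the call above are UTC timestamp strings of the form YYYY-MM-DDTHH:MM:SSZ. As a minimal sketch, a calendar date could be turned into such a string with base R. The helper name to_api_timestamp is hypothetical and is not part of academictwitteR; it assumes the time of day should be midnight UTC.

```r
# Hypothetical helper (not part of academictwitteR): format a calendar date
# as a midnight-UTC timestamp string like the ones used in the calls above.
to_api_timestamp <- function(date) {
  format(as.Date(date), "%Y-%m-%dT00:00:00Z")
}

start_tweets <- to_api_timestamp("2010-01-01")
end_tweets <- to_api_timestamp("2020-01-01")

print(start_tweets)
print(end_tweets)
```

The resulting strings can be passed in place of the hand-typed timestamps.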
Getting tweets of specified string or series of strings via get_all_tweets(). This function captures tweets containing a particular string or set of strings between specified date ranges, avoiding rate limits by sleeping between calls.
This function can also capture tweets for a particular hashtag or set of hashtags when specified with the # operator.
For a particular set of strings a call may look like:
bearer_token <- "" # Insert bearer token
tweets <-
  get_all_tweets("apples OR oranges",
                 "2020-01-01T00:00:00Z",
                 "2020-01-05T00:00:00Z",
                 bearer_token)
For a particular set of hashtags a call may look like:
bearer_token <- "" # Insert bearer token
tweets <-
  get_all_tweets(
    "#BLM OR #BlackLivesMatter",
    "2020-01-01T00:00:00Z",
    "2020-01-05T00:00:00Z",
    bearer_token
  )
Alternatively, we can specify a character vector comprising several elements. For example, if we wanted to search multiple hashtags, we could specify a query as follows:
bearer_token <- "" # Insert bearer token
htagquery <- c("#BLM", "#BlackLivesMatter", "#GeorgeFloyd")
tweets <-
  get_all_tweets(
    htagquery,
    "2020-01-01T00:00:00Z",
    "2020-01-05T00:00:00Z",
    bearer_token
  )
, which will achieve the same thing as typing out OR between our strings.
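To make the equivalence concrete, the vector of hashtags can be collapsed by hand into the single OR string it stands for. This is only an illustration in base R; how the package joins the elements internally is not shown here.

```r
# Collapse a character vector of search terms into one OR query string.
htagquery <- c("#BLM", "#BlackLivesMatter", "#GeorgeFloyd")
or_query <- paste(htagquery, collapse = " OR ")

print(or_query)
```

Passing or_query as the query should then describe the same search as passing htagquery.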
Note that the "AND" operator is implicit when a single query string contains more than one term separated by spaces. See Twitter's documentation on building queries for search tweets for more information. Thus, when searching for tweets that contain all of the terms in a character string, a call may look like:
bearer_token <- "" # Insert bearer token
tweets <-
  get_all_tweets("apples oranges",
                 "2020-01-01T00:00:00Z",
                 "2020-01-05T00:00:00Z",
                 bearer_token)
, which will capture tweets containing both the words "apples" and "oranges." The same logic applies for hashtag queries.
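The difference between the implicit AND and an explicit OR can be illustrated locally on a few made-up strings. The matching itself happens on Twitter's side when a real query is run, so this toy filter only mirrors the boolean logic and is not how tweets are actually matched.

```r
# Toy texts standing in for tweets (made-up data).
texts <- c("apples and oranges", "just apples", "only oranges", "pears")

has_apples <- grepl("apples", texts, fixed = TRUE)
has_oranges <- grepl("oranges", texts, fixed = TRUE)

# "apples oranges": both terms must be present.
and_matches <- texts[has_apples & has_oranges]
# "apples OR oranges": either term is enough.
or_matches <- texts[has_apples | has_oranges]

print(and_matches)
print(or_matches)
```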
Files are stored as JSON files in the specified directory when a data_path is specified. Tweet-level data is stored in files beginning "data_"; user-level data is stored in files beginning "users_".
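As a rough sketch of how the two kinds of file can be told apart by their prefixes, the snippet below creates a few empty placeholder files in a temporary directory and lists them with base R. The file names after the "data_" and "users_" prefixes, and the .json extension, are assumptions made for illustration only.

```r
# Create a throwaway directory with placeholder files (hypothetical names).
demo_path <- file.path(tempdir(), "academictwitteR_demo")
dir.create(demo_path, showWarnings = FALSE)
file.create(file.path(demo_path,
                      c("data_1.json", "data_2.json", "users_1.json")))

# Separate tweet-level and user-level files by their prefixes.
tweet_files <- list.files(demo_path, pattern = "^data_")
user_files <- list.files(demo_path, pattern = "^users_")

print(tweet_files)
print(user_files)
```

The same two list.files() calls could be pointed at a real data_path to check how much has been collected so far.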
If a filename is supplied, the functions will save the resulting tweet-level information as a .rds file.
Functions always return a data.frame object unless a data_path is specified and bind_tweets is set to FALSE. When collecting large amounts of data, we recommend using the data_path option with bind_tweets = FALSE. This mitigates potential data loss in case the query is interrupted.
An example of such a query would be:
bearer_token <- "" # Insert bearer token
tweets <-
  get_all_tweets(
    "#BLM OR #BlackLivesMatter",
    "2014-01-01T00:00:00Z",
    "2020-01-01T00:00:00Z",
    bearer_token,
    data_path = "data/",
    bind_tweets = FALSE
  )
, which would collect all tweets containing the hashtags "#BLM" or "#BlackLivesMatter" over a six-year period.
Users can then use the bind_tweets convenience function to bundle the JSON files into a data.frame object for analysis in R as such:
tweets <- bind_tweets(data_path = "data/")
users <- bind_tweets(data_path = "data/", user = TRUE)
get_all_tweets() accepts a range of arguments, which can be combined to generate a more precise query.
Arguments | Description |
---|---|
query | Search query or queries, e.g. "cat" |
exclude | Tweets containing the keyword(s) will be excluded, e.g. "grumpy" |
is_retweet | If TRUE, only retweets will be returned; if FALSE, retweets will not be returned, only tweets will be returned; if NULL, both retweets and tweets will be returned. |
is_reply | If TRUE, only reply tweets will be returned |
is_quote | If TRUE, only quote tweets will be returned |
is_verified | If TRUE, only tweets whose authors are verified by Twitter will be returned |
place | Name of place, e.g. "London" |
country | Name of country as ISO alpha-2 code, e.g. "GB" |
point_radius | A vector of two point coordinates latitude, longitude, and point radius distance (in miles) |
bbox | A vector of four bounding box coordinates from west longitude to north latitude |
geo_query | If TRUE, user will be prompted to enter relevant information for bounding box or point radius geo buffers |
remove_promoted | If TRUE, tweets created for promotion only on ads.twitter.com are removed |
has_hashtags | If TRUE, only tweets containing hashtags will be returned |
has_cashtags | If TRUE, only tweets containing cashtags will be returned |
has_links | If TRUE, only tweets containing links and media will be returned |
has_mentions | If TRUE, only tweets containing mentions will be returned |
has_media | If TRUE, only tweets containing a recognized media object, such as a photo, GIF, or video, as determined by Twitter will be returned |
has_images | If TRUE, only tweets containing a recognized URL to an image will be returned |
has_videos | If TRUE, only tweets containing native Twitter videos, uploaded directly to Twitter, will be returned |
has_geo | If TRUE, only tweets containing Tweet-specific geolocation data provided by the Twitter user will be returned |
lang | A single BCP 47 language identifier, e.g. "fr" |
An example would be:
bearer_token <- "" # Insert bearer token
tweets <-
  get_all_tweets(
    query = "cat",
    exclude = "grumpy",
    "2020-01-01T00:00:00Z",
    "2020-01-02T00:00:00Z",
    bearer_token,
    has_images = TRUE,
    has_hashtags = TRUE,
    country = "GB",
    lang = "en"
  )
The above query will fetch all tweets that contain the word "cat" but not "grumpy", were posted on 1 January 2020 in the UK, have an image attachment, include at least one hashtag, and are written in English.
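To give a feel for how such arguments combine, the sketch below assembles a single search string from a few of them using operators from Twitter's v2 search syntax (negation with "-", has:images, has:hashtags, place_country: and lang:). The function build_query_sketch is hypothetical and covers only five arguments; it is an assumption about the general idea, not the package's actual internal code.

```r
# Hypothetical helper (not part of academictwitteR): combine a base query
# with a handful of Twitter API v2 search operators.
build_query_sketch <- function(query, exclude = NULL, has_images = FALSE,
                               has_hashtags = FALSE, country = NULL,
                               lang = NULL) {
  parts <- query
  if (!is.null(exclude)) parts <- c(parts, paste0("-", exclude))
  if (has_images) parts <- c(parts, "has:images")
  if (has_hashtags) parts <- c(parts, "has:hashtags")
  if (!is.null(country)) parts <- c(parts, paste0("place_country:", country))
  if (!is.null(lang)) parts <- c(parts, paste0("lang:", lang))
  # Terms separated by spaces are combined with an implicit AND.
  paste(parts, collapse = " ")
}

full_query <- build_query_sketch("cat", exclude = "grumpy", has_images = TRUE,
                                 has_hashtags = TRUE, country = "GB",
                                 lang = "en")
print(full_query)
```

Each argument narrows the search, because every added term is joined to the others by the implicit AND.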
The package offers two functions to deal with interruptions and to continue a previous data collection session. If you set a data_path and export_query was set to TRUE during the original collection, you can use resume_collection() to resume a previously interrupted collection session. An example would be:
bearer_token <- "" # Insert bearer token
resume_collection(data_path = "data", bearer_token)
If a previous data collection session has completed, you can use update_collection() to continue data collection with a new end date. This function is particularly useful for getting data for ongoing events. An example would be:
bearer_token <- "" # Insert bearer token
update_collection(data_path = "data", "2020-05-10T00:00:00Z", bearer_token)
For more information on the parameters and fields available from the v2 Twitter API endpoint see: https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all.
The API call returns both the tweet data and the user information separately, but currently only the former is parsed. It is possible to obtain other user information such as user handle and display name. These can then be merged with the dataset using the author_id field.
bearer_token <- "" # Insert bearer token
users <- c("TwitterDev", "jack")
tweets_df <-
  get_user_tweets(users,
                  "2020-01-01T00:00:00Z",
                  "2020-01-05T00:00:00Z",
                  bearer_token)
users_df <-
  get_user_profile(unique(tweets_df$author_id), bearer_token)
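The merge on author_id can be sketched with two small made-up data frames and base R's merge(). The column names other than author_id (tweet_id, text, id, username) are assumptions chosen for illustration; the columns actually returned by the package may differ.

```r
# Toy tweet-level data (made-up values); author_id links tweets to users.
tweets_toy <- data.frame(
  tweet_id = c("t1", "t2", "t3"),
  author_id = c("u1", "u2", "u1"),
  text = c("first", "second", "third"),
  stringsAsFactors = FALSE
)

# Toy user-level data; the id column name is an assumption.
users_toy <- data.frame(
  id = c("u1", "u2"),
  username = c("TwitterDev", "jack"),
  stringsAsFactors = FALSE
)

# Keep every tweet and attach the matching user information.
merged <- merge(tweets_toy, users_toy,
                by.x = "author_id", by.y = "id", all.x = TRUE)
print(merged)
```

Using all.x = TRUE keeps tweets whose author has no matching user row, with NA in the user columns.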
Function originally taken from Gist by https://github.com/schochastics.
Please note that the academictwitteR project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.