-
Notifications
You must be signed in to change notification settings - Fork 44
/
Copy path1_stringr_tutorial.qmd
304 lines (215 loc) · 11.6 KB
/
1_stringr_tutorial.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
---
title: 'Text analysis workshop: stringr package and regex'
author: "Casey O'Hara"
format:
html:
toc: true
number-sections: true
embed-resources: true
execute:
warning: false
message: false
---
# Overview
This tutorial will walk through an exercise in extracting specific information from untidily formatted blocks of text, i.e. sentences and paragraphs rather than a nice data frame or .csv.
The example comes from a paper I led that examined species ranges from different datasets, and found some discrepancies that resulted from systematic errors. Many of the coral species ranges for IUCN rangemaps extended off the continental shelf into very deep waters; but most corals require shallower water and are dependent upon photosynthesis. So I wanted to examine whether these corals, according to the IUCN's own information, could be found in waters deeper than 200 meters.
# Load packages and data
The data are narratives pulled from the IUCN API (http://apiv3.iucnredlist.org/) for coral species, in order to identify their maximum depth. We'll also pull up a set of data on species areas, but mostly just because that data includes scientific names for the corals so we can refer to species rather than ID numbers.
``` {r}
library(tidyverse)
# library(stringr)
### original dataset from the manuscript is here:
# data_dir <- 'https://raw.githubusercontent.com/OHI-Science/IUCN-AquaMaps/master/clip_depth'
coral_narrs <- read_csv('data/iucn_narratives.csv')
# head(coral_narrs)
### interested in species_id, habitat
coral_info <- read_csv('data/coral_spp_info.csv')
# head(coral_info)
### info for species mapped in both datasets
### create a dataframe with just ID, scientific name, and habitat
coral_habs_raw <- coral_narrs %>%
left_join(coral_info, by = 'iucn_sid') %>%
select(iucn_sid, sciname, habitat)
```
## examine a few habitat descriptions
``` {r}
coral_habs_raw$habitat[1:2]
```
# How can we extract depth information from these descriptions?
In pseudocode, we can think of a process as:
```r
coral_habs <- coral_habs_raw %>%
split into individual sentences %>%
keep the sentences with numbers in them %>%
isolate the numbers
```
# Intro to `stringr` functions
Here we'll play a little with some basic stringr functions, and pattern vs. vector of strings. Consider especially how we can use `str_split`, `str_detect`, `str_replace`; later we'll see how to make effective use of `str_extract` as well.
- `str_match`, `str_match_all`
- `str_detect`
- `str_split`
- `str_replace`, `str_replace_all`
- `str_subset`, `str_count`, `str_locate`
- `str_trim`, `tolower`, `toupper`, `tools::toTitleCase`
``` {r}
x <- "Everybody's got something to hide except for me and my monkey"
stringr::str_to_title(x)
str_to_lower(x)
str_split(x, 'hide'); str_split(x, 't')
str_replace(x, 'except for', 'including')
str_replace(x, ' ', '_')
str_replace_all(x, ' ', '_')
str_detect(x, 't'); str_detect(x, 'monk') ### is pattern in the string? T/F
str_match(x, 't'); str_match_all(x, 'y') ### return every instance of the pattern in the string
### more useful when using wildcards as a pattern...
str_extract(x, 't'); str_extract_all(x, 'y') ### return every instance of the pattern in the string
### more useful when using wildcards as a pattern...
str_locate(x, 't'); str_locate_all(x, 'y')
```
# Use `stringr` functions on coral data
First we can use `stringr::str_split()` to break down the habitat column into manageable chunks, i.e. sentences. What is an easily accessible delimiter we can use to separate a paragraph into sentences?
### Take 1:
``` {r}
coral_habs <- coral_habs_raw %>%
mutate(hab_cut = str_split(habitat, '.'))
coral_habs$hab_cut[1]
```
Well that didn't work! In a moment we'll see that a period is actually a special character we will later use as a wild card in a "regular expression" or "regex" pattern. Some other characters have special uses as well; so if we want them to be interpreted literally, we need to "escape" them.
Some languages use a single backslash to escape a character (or turn a letter into a special function, e.g. '\\n' indicates a line break). In R stringr functions, usually you end up having to use a double backslash (e.g. to get this to render a backslash-n, I had to type an extra backslash that doesn't show up)
Also: why is just separating on a period probably a bad idea? what else could we use?
### Take 2:
``` {r}
# coral_habs <- coral_habs_raw %>%
# mutate(hab_cut = str_split(habitat, '\. '))
### Error: '\.' is an unrecognized escape in character string starting "'\."
coral_habs <- coral_habs_raw %>%
mutate(hab_cut = str_split(habitat, '\\. '))
### creates a cell with a vector of broken up sentences!
```
![](expert_regex.jpg)
### Use `unnest()` to separate out vector into rows
The str_split function leaves the chopped string in a difficult format - a vector within a dataframe cell. `unnest()` will unpack that vector into individual rows for us.
``` {r}
coral_habs <- coral_habs_raw %>%
mutate(hab_cut = str_split(habitat, '\\. ')) %>%
unnest(hab_cut)
```
Note the number of observations skyrocketed! Each paragraph was a single observation (for one coral species); now each species description is separated out into rows containing sentences.
# Identify numbers and keep the sentences with number in 'em
Without wildcards, we'd have to identify each specific number. This would be annoying. Instead we can use some basic "regular expressions" or "regex" as wild card expressions. We put these in square brackets to create a list of everything we would want to match, e.g. `[aeiou]` would match any instance of lower case vowels.
Helpful for testing regex: https://regex101.com/
``` {r}
### Without wildcards
coral_habs <- coral_habs_raw %>%
mutate(hab_cut = str_split(habitat, '\\. ')) %>%
unnest(hab_cut) %>%
filter(str_detect(hab_cut, '1') | str_detect(hab_cut, '2'))
### With wildcards
coral_habs <- coral_habs_raw %>%
mutate(hab_cut = str_split(habitat, '\\. ')) %>%
unnest(hab_cut) %>%
filter(str_detect(hab_cut, '[0-9]'))
### also works with [3-7], [a-z], [A-Z], [a-z0-9A-Z]
```
### But not all numbers are depths
How can we differentiate further to get at depth info?
- exclude years? Knowing a bit about corals, can probably exclude any four-digit numbers; problems with that?
- match pattern of number followed by " m"
``` {r}
coral_depth <- coral_habs %>%
filter(str_detect(hab_cut, '[0-9] m')) %>%
mutate(depth = str_extract(hab_cut, '[0-9] m'))
```
Why didn't that work???? Only matched the single digit next to the "m"!
We need to use a quantifier:
- `+` means one or more times
- `*` means zero or more times
- `?` means zero or one time
- `{3}` means exactly three times
- `{2,4}` means two to four times; `{2,}` means two or more times
``` {r}
years <- coral_habs %>%
mutate(year = str_extract(hab_cut, '[0-9]{4}'))
### looks for four numbers together
coral_depth <- coral_habs %>%
filter(str_detect(hab_cut, '[0-9] m')) %>%
mutate(depth = str_extract(hab_cut, '[0-9]+ m'))
### looks for one or more numbers, followed by ' m'
### Still misses the ranges e.g. "3-30 m" - how to capture?
### let it also capture "-" in the brackets
coral_depth <- coral_habs %>%
filter(str_detect(hab_cut, '[0-9] m')) %>%
mutate(depth = str_extract(hab_cut, '[0-9-]+ m'))
```
Also can use a "not" operator inside the brackets:
- `'[^a-z]'` matches "anything that is not a lower case letter"
- BUT: `'^[a-z]'` matches a start of a string, then a lower case letter.
- NOTE: `^` outside brackets means start of a string, inside brackets means "not"
``` {r}
### split 'em (using the "not" qualifier), convert to numeric, keep the largest
coral_depth <- coral_habs %>%
filter(str_detect(hab_cut, '[0-9] m')) %>%
mutate(depth_char = str_extract(hab_cut, '[0-9-]+ m'),
depth_num = str_split(depth_char, '[^0-9]')) %>%
unnest(depth_num)
coral_depth <- coral_depth %>%
mutate(depth_num = as.numeric(depth_num)) %>%
filter(!is.na(depth_num)) %>%
group_by(iucn_sid, sciname) %>%
mutate(depth_num = max(depth_num),
n = n()) %>%
distinct()
```
Note, still some issues in here: some fields show size e.g. 1 m diameter; other fields have slightly different formatting of depth descriptors; so it's important to make sure the filters (a) get everything you want and (b) exclude everything you don't want. We could keep going but we'll move on for now...
# Other Examples
## start string, end string, and "or" operator
Combining multiple tests using "or", and adding string start and end characters.
``` {r}
coral_threats <- coral_narrs %>%
select(iucn_sid, threats) %>%
mutate(threats = tolower(threats),
threats_cut = str_split(threats, '\\. ')) %>%
unnest(threats_cut) %>%
filter(str_detect(threats_cut, '^a|s$'))
### NOTE: ^ outside brackets is start of a string, but inside brackets it's a negation
```
# And even more (not run in workshop)
## cleaning up column names in a data frame
Spaces and punctuation in column names can be a hassle, but often when reading in .csvs and Excel files, column names include extra stuff. Use regex and `str_replace` to get rid of these! (or `janitor::clean_names()`...)
``` {r}
crappy_colname <- 'Per-capita income ($US) (2015 dollars)'
tolower(crappy_colname) %>%
str_replace_all('[^a-z0-9]+', '_') %>%
str_replace('^_|_$', '') ### in case any crap at the start or end
```
## Lazy vs. greedy evaluation
When using quantifiers in regex patterns, we need to consider lazy vs. greedy evaluation of quantifiers. "Lazy" will find the shortest piece of a string that matches the pattern (gives up as early as it can); "greedy" will match the largest piece of a string that matches the pattern (takes as much as it can get). "Greedy" is the default behavior, but if we include a question mark after the quantifier we force it to evaluate in the lazy manner.
``` {r}
x <- "Everybody's got something to hide except for me and my monkey"
x %>% str_replace('b.+e', '...')
x %>% str_replace('b.+?e', '...')
```
## Lookaround (Lookahead and lookbehind) assertions
A little more advanced - Lookahead and lookbehind assertions are useful to match a pattern led by or followed by another pattern. The lookaround pattern is not included in the match, but helps to find the right neighborhood for the proper match.
``` {r}
y <- 'one fish two fish red fish blue fish'
y %>% str_locate('(?<=two) fish') ### match " fish" immediately preceded by "two"
y %>% str_locate('fish (?=blue)') ### match "fish " immediately followed by "blue"
y %>% str_replace_all('(?<=two|blue) fish', '...')
```
## Using regex in `list.files()` to automate file finding
`list.files()` is a ridiculously handy function when working with tons of data sets. At its most basic, it simply lists all the non-hidden files in a given location. But if you have a folder with more folders with more folders with data you want to pull in, you can get fancy with it:
* use `recursive = TRUE` to find files in subdirectories
* use `full.names = TRUE` to catch the entire path to the file (otherwise just gets the basename of the file)
* use `all.files = TRUE` if you need to find hidden files (e.g. `.gitignore`)
* use `pattern = 'whatever'` to only select files whose basename matches the pattern - including regex!
``` {r}
list.files(path = 'sample_files')
list.files(path = 'sample_files', pattern = 'jpg$', full.names = TRUE, recursive = TRUE)
list.files(path = '~/github/text_workshop/sample_files', pattern = '[0-9]{4}',
full.names = TRUE, recursive = TRUE)
raster_files <- list.files('sample_files', pattern = '^sample.+[0-9]{4}.tif$')
### note: should technically be '\\.tif$' - do you see why?
### then create a raster stack from the files in raster_files, or loop
### over them, or whatever you need to do!
```