-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathUsing_world_map2.Rmd
403 lines (279 loc) · 15.9 KB
/
Using_world_map2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
---
title: "Using world_map2"
date: "`r Sys.Date()`"
output:
html_document:
toc: yes
toc_float: yes
toc: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
fig.align = "center",
fig.fullwidth = TRUE,
message = FALSE,
warning = FALSE
)
```
```{r libs_one}
library(tidyverse)
library(here)
library(visdat)
```
## Brief Introduction
The data set `world_map2`, a derivative and update of `ggplot2::map_data("world")`, was created for easier mapping of Global Studies data in the [Tidyverse](https://jhudatascience.org/tidyversecourse/). To not solely rely on matching country or region names, which can vary considerably per the data source, `world_map2` contains following the ISO country codes: [Alpha-2 code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), [Alpha-3 code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3), and [Numeric code](https://en.wikipedia.org/wiki/ISO_3166-1_numeric). (To download and/or improve `world_map2`: [world_map2_project.rda](https://github.com/Thom-J-H/map_Gap_2_Tidy/blob/main/world_map2_project.rda)).
The process behind making `world_map2` is described in a [separate markdown](https://rpubs.com/Thom_JH/798825). This markdown showcases how to use `world_map2` with Global Studies data. This markdown uses data from the [Gapminder.org foundation](https://www.gapminder.org/data/) as examples of typical Global Studies data. We want to show the annual results in [Choropleth maps](https://en.wikipedia.org/wiki/Choropleth_map).
```{r data_map_gs}
# our custom map data
load(here::here("data",
"tidy_data",
"maps",
"world_map2_project.rda"))
# raw Global Studies dat from Gapminder.org
gni_percapita <- read_csv(here::here("data",
"raw_data",
"gnipercapita_ppp_current_international.csv") )
# raw Global Studies dat from Gapminder.org
water_access <- read_csv(here::here("data",
"raw_data",
"at_least_basic_water_source_overall_access_percent.csv") )
# raw Global Studies dat from Gapminder.org
energy_capita <- read_csv(here::here("data",
"raw_data",
"energy_use_per_person.csv"))
```
## Case Study: Energy
The first data set, "Energy use, per person", is from Gapminder.org but sourced from there to the World Bank. We will proceed by exploring it, creating a Tidy version, and then mapping the Tidy version. Our try at mapping will introduce and troubleshoot a common problem.
### EDA Enegry
We see that the data is in a wide format, and not Tidy. It does not use ISO country codes. In fact, none of data sets available from [Gapminder.org/data](https://www.gapminder.org/data/) do so. So we will need to match `country` names.
```{r energy_eda}
energy_capita %>% vis_dat()
energy_capita %>% glimpse()
```
When converting to a Tidy data format, we want to make sure the variable `year` is not a character. But first, let's check to see if the country names in `energy_capita` have a match in our mapping data.
```{r energy_diff}
setdiff(energy_capita$country,world_map2$country)
setdiff(world_map2$country, energy_capita$country) %>%
enframe(name = NULL, value ="diff")
```
Yes! All the values for the `country` variable in our Global Studies data have a match in `world_map2`. We have only 169 countries in `energy_capita`, so this comprises only a subset of the countries in `world_map2`.
### Tidy Energy
```{r eng_tidy}
energy_tidy <- energy_capita %>%
pivot_longer(cols = !country,
names_to = "year",
names_transform = list(year = as.integer),
values_to = "energy")
energy_tidy %>% vis_dat()
energy_tidy %>% glimpse()
```
### Map left_join()
We have `energy_tidy`. It nows seems a simple step to `left_join` it to the map data, and then plot the energy values for a given year. But this will result in void spaces on our map.
```{r missing_dat}
energy_tidy %>%
filter(year == 2004) %>%
left_join(world_map2, by = "country") %>%
ggplot(aes(x = long,
y = lat,
group = group,
label = country)) +
geom_polygon(aes(fill = energy) )+
scale_fill_viridis_c(option = "C") +
labs(fill = "Energy Use\nPer Capita",
title = "Gapminder Data: 2004") +
theme_void()
```
The white spaces for Afghanistan and various nations in Africa and Southeast Asia are unnecessarily confusing. Although these nations were not reported on in `energy_tidy`, that does not mean they were replaced by ocean or otherwise ceased to exist.
We want the map to fail gracefully. Missing countries would be better treated as NA cases. And NA cases should be displayed. knowing that we have no data for certain nations is both important and useful.
### Using complete()
So we want to expand the data set before we join it with the mapping data. The Tidyverse verb [complete()](https://tidyr.tidyverse.org/reference/complete.html) helps solve this problem. For the missing `energy` values, we will assign `NA` using `replace_na()`. This ensures that our plot will use all listed countries, showing the `NA` when applicable. Again, we want both graceful failure and to know when we are missing information.
```{r better_ver}
energy_tidy %>%
filter(year == 2004) %>%
complete(country = world_map2$country,
fill = (list(energy = NA )) ) %>%
left_join(world_map2, by = "country") %>%
replace_na(list(year = 2004)) %>%
ggplot(aes(x = long,
y = lat,
group = group,
label = country)) +
geom_polygon(aes(fill = energy) )+
scale_fill_viridis_c(option = "C") +
labs(fill = "Energy Use\nPer Capita",
title = "Gapminder Data: 2000") +
theme_void()
```
This version fails much more gracefully. It also makes clearer for what regions of the world we lack data.
#### Interactivity made easy
Let's say that someone thinks [Antarctica](https://en.wikipedia.org/wiki/Antarctica) justs wastes space in this or like plots as it seldom generates economic, political science, or public health data. We can remove it with a simple filter call. Likewise, we can make the map interactive with one additional library and function call: [plotly](https://plotly.com/r/) and `ggplotly()`.
For `plotly::ggplotly()`, the default legend guide will not carry over. We can solve this by changing the title.
In the code below, this is broken down into three steps. First, create the data set for the plot. Second, create the ggplot object. Third, display it as plotly graphic.
```{r plt_ver}
# Map data
enegry_dat_2004 <- energy_tidy %>%
filter(year == 2004) %>%
complete(country = world_map2$country,
fill = (list(energy = NA )) ) %>%
left_join(world_map2, by = "country") %>%
replace_na(list(year = 2004))
# ggplot object
enegry_map_2004 <- enegry_dat_2004 %>%
filter(code_3 != "ATA") %>%
ggplot(aes(x = long,
y = lat,
group = group,
label = country)) +
geom_polygon(aes(fill = energy) )+
scale_fill_viridis_c(option = "C") +
labs(fill = "",
title = "Energy Use Per Capita (2004)") +
theme_void()
# interactive version
plotly::ggplotly(enegry_map_2004)
```
So we now have a procedure. This project originated with a course entitled "Telling Stories with Data" (*TSD*), which introduced non-STEM majors to the [Tidyverse](https://jhudatascience.org/tidyversecourse/) and basic data visualization and analysis. Beginning students find it useful to have a proven model to follow. We have one now for our [Choropleth maps](https://en.wikipedia.org/wiki/Choropleth_map), Let's apply it to two other Gapminder.org data sets.
## CS: Water Access
The second data set, "At least basic water source, overall access", is likewise from Gapminder.org but sourced from there to the UN's Millennium Development Goals. We will apply the procedure -- the "recipe" of steps -- developed when mapping Energy as above.
### EDA: Water Access
The data set "At least basic water source, overall access" has values for 194 countries, some `NA`, for the years 2000 to 2017.
```{r water_eda}
water_access %>% vis_dat()
water_access %>% glimpse()
```
This data set again is in wide format, and not Tidy. But we know how to fix, and can largely recycle our solution for `energy_capita` to `energy_tidy`. We now need to check the how well the country names match -- do we have coverage?
```{r water_diff}
setdiff(water_access$country,world_map2$country)
setdiff(world_map2$country, water_access$country) %>%
enframe(name = NULL, value ="diff")
```
So our map data provides complete coverage for our statistical data. Now, let's Tidy the data, and then use `complete()` with `left_join()` to create our plot data sets.
### Tidy Water
```{r water_tidy}
water_tidy <- water_access %>%
pivot_longer(cols = !country,
names_to = "year",
names_transform = list(year = as.integer),
values_to = "water")
water_tidy %>%
vis_dat()
water_tidy %>%
glimpse()
```
All good.
### Map & Plot
```{r use_complete}
water_dat_2010 <- water_tidy %>%
filter(year == 2010) %>%
complete(country = world_map2$country,
fill = (list(water = NA )) ) %>%
left_join(world_map2, by = "country") %>%
replace_na(list(year = 2010))
```
#### Create Plot Object
```{r com_ex}
water_mpa_2010 <- water_dat_2010 %>%
filter(code_3 != "ATA") %>%
ggplot(aes(x = long,
y = lat,
group = group,
label = country)) +
geom_polygon(aes(fill = water) )+
scale_fill_viridis_c(option = "C") +
labs(fill = "",
title = "Basic Water Access (2010)") +
theme_void()
water_mpa_2010
```
#### Plotly version
```{r gni_plotly}
plotly::ggplotly(water_mpa_2010)
```
Since we have a proven procedure, we can now tackle a slightly more difficult case.
## CS: GNI Per Capita
Our third data set, "GNI per capita (PPP, current international $)", is likewise from Gapminder.org, but sourced from there to the World Bank. It also regretfully does not use any ISO codes for identifying the countries reported on. We are back to matching names.
### EDA: GNI Per Capita
This data is wide, not Tidy, and has one additional issue. It reports on 196 countries over the time span 1990-2019. The values for [GNI Per Capita](https://en.wikipedia.org/wiki/Gross_national_income) should be numeric. Instead, they are character data. Values under 10000 appear as numbers; values over 10000 are expressed as k-units: for example, 13.2k for 13200.
```{r gni_eda}
gni_percapita %>%
vis_dat()
gni_percapita %>%
glimpse()
```
To properly parse this data, we need to explore a bit more and think our way through the problem. We also need to check the country coverage.
```{r gni_diff}
setdiff(gni_percapita$country, world_map2$country) %>%
enframe(name = NULL, value ="diff")
setdiff(world_map2$country, gni_percapita$country) %>%
enframe(name = NULL, value ="diff")
```
### Troubleshoot to Tidy
We have two mismatches: "Curaçao", which uses the special character "ç"; and "Sint Maarten (Dutch part)", which contains the addition of "(Dutch part)". Although both these names are now becoming more regularly used as the short designations for these countries, we will still normalize them to simpler versions: "Curaçao" will be simplified to "Curacao", and "Sint Maarten (Dutch part)" to "Sint Maarten". Why? Special characters do not always play well in data sets or in coding, and the parenthetical information can also cause problems. It would be better if the data sets used the ISO country codes in addition to the short version of the country names. But the Gapminder.org data sets so far do not.
Now, back to the bigger issue. We need to convert the GNI Per Capita values to numbers. But the values previously expressed as *something k* need to be multiplied by 1000.
We have several ways of doing this. Below is one solution that relies on a cut-off value of 175. This assumes -- based an exploration of the data -- that we have no nations in this set with a reported GNI Per Capita greater than 199000, and that we have no nations with a reported GNI Per Capita less than 200. The top and bottom values will shortly be displayed.
```{r gni_tidy}
gni_tidy <- gni_percapita %>%
pivot_longer(cols = !country,
names_to = "year",
names_transform = list(year = as.integer),
values_to = "gni_ppp_cap") %>%
mutate(gni_ppp_cap = readr::parse_number(gni_ppp_cap) )%>%
mutate(gni_ppp_cap = case_when(gni_ppp_cap < 200 ~ gni_ppp_cap * 1000,
TRUE ~ gni_ppp_cap) )
gni_tidy <- gni_tidy %>%
mutate( country = case_when(country == "Curaçao" ~ "Curacao",
country == "Sint Maarten (Dutch part)" ~ "Sint Maarten" ,
TRUE ~ country) )
# Brief check
gni_tidy %>% slice_min(gni_ppp_cap, n = 4)
gni_tidy %>% slice_max(gni_ppp_cap, n = 4)
```
So as it turns out, our cut-off was effective. The lowest reported GNI Per Capita: 270, for Mozambique in 1992. The highest, 132000 for Qatar in 2012 and Macao in 2014.
We should now recheck country coverage.
```{r gni_check}
setdiff(gni_tidy$country, world_map2$country)
gni_tidy %>%
vis_dat()
```
Complete coverage. And all variables the right types. We have mapping data for all the countries in `gni_tidy`. We are now ready to make our maps. From here, we follow the earlier recipe.
### Map & Plot
```{r gni_dat}
gni_dat_2017 <- gni_tidy %>%
filter(year == 2017) %>%
complete(country = world_map2$country,
fill = (list(gni_ppp_cap = NA)) ) %>%
left_join(world_map2, by = "country") %>%
replace_na(list(year = 2017))
gni_map_2017 <- gni_dat_2017%>%
filter(code_3 != "ATA") %>%
ggplot(aes(x = long,
y = lat,
group = group,
label = country)) +
geom_polygon(aes(fill = gni_ppp_cap) )+
scale_fill_viridis_c(option = "C") +
labs(fill = "",
title = "GNI Per Capita for 2017 (in PPP dollars)") +
theme_void()
gni_map_2017
```
```{r gni_plotly3}
plotly::ggplotly(gni_map_2017 )
```
## Summary
The three above case studies show how to use Global studies data and`world_map2` with `complete()` to produce better [Choropleth maps](https://en.wikipedia.org/wiki/Choropleth_map). As opposed to the default`ggplot2::map_data("world")`, the updated [world_map2](https://github.com/Thom-J-H/map_Gap_2_Tidy/blob/main/world_map2_project.rda) both fails gracefully when we have missing countries or data, and contains the following standard ISO country codes: [Alpha-2 code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), [Alpha-3 code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3), and [Numeric code](https://en.wikipedia.org/wiki/ISO_3166-1_numeric). As such, we can match by `country` name to the majority of Gapminder.org data sets, and we can match by to any Global Studies data set which likewise uses one or more of the above ISO country codes. The bottom line: better graphs and improved interoperability while keeping it (relatively) simple in the Tidyverse.
### Download from
Or, please improve and share: [github.com/Thom-J-H/map_Gap_2_Tidy](https://github.com/Thom-J-H/map_Gap_2_Tidy)
<hr />
<div style = "background-color: #F0F8FF; padding: 1em;">
<p>Thomas J. Haslam <br />
`r Sys.Date()`
</p>
</div>
<hr />
#### Session Info
```{r package info, echo = FALSE}
ses_info <- sessioninfo::package_info() %>% tibble()
ses_info %>% select(c(1,3,8,9)) %>% DT::datatable()
```