diff --git a/notes.txt b/notes.txt
deleted file mode 100644
index f2f3dae..0000000
--- a/notes.txt
+++ /dev/null
@@ -1,34 +0,0 @@
-Structure:
-2 hrs of interactive instruction
-1 hr group consultation / exercise
-
-To do:
-- create git repo
-x book location
-- ask microdata team for joining consultation
-- group people https://docs.google.com/spreadsheets/d/1fOWyx2WHRVasjVY2aJgAb-q9x9apAycrT0jctYrk7bk/edit#gid=2123602320
-x send email to Kasia
-
-fundamentals
-- infrastructure
-- information about files
-- additional (meta) data available (postal codes etc.)
-
-project structure & reproducibility
-- separate preprocessing and analysis
-- simple folder structure, R project, relative paths
-- simulated data
-- code import/export is free
-- working offline / collaborating
-- changing versions: put paths on top of your file
-
-code efficiency
-- import/export/writing to disk
-- separate scripts, each works by itself
-- reading spolis in chunks
-- arrow & dplyr?
-
-consultation in groups
-- consult each other
-- additional files (network analysis in python?)
-- exercise with 10GB dataset
diff --git a/questions.txt b/questions.txt
deleted file mode 100644
index dc8eb72..0000000
--- a/questions.txt
+++ /dev/null
@@ -1,40 +0,0 @@
-FUNDAMENTALS
-General tips on how to start using CBS Microdata in policy research
-I have no topics yet.
-No questions at the moment
-Not sure yet, I will get access to our CBS microdata on April 1st.
-How to navigate the CBS environment, and how to use multiple CBS data files, and other fundamentals of using CBS data in R
-Structure of CBS data, access and procedures.
-No specific topic to discuss.
-Introduction to working with CBS data and the general structure of the data and how to combine different sources of data.
-Project linkage and organization between LISS and CBS
-
-
-PROJECT STRUCTURE & REPRODUCIBILITY
-(1) How to efficiently organize codes/data/results for different papers within a CBS-project
-(2) How to create reproducible and efficient codings (Stata) for the whole pipeline (raw - modified / generate - analyses)
-I would like to learn how to ensure reproducibility in the CBS environment, find folder structures that are flexible and suitable for long-term projects, and create efficient data pipelines.
-I am mainly interested in the research reproducibility part of the workshop. Depending on what I learn there, I might have some more questions related to R coding practices, environment and state management and data wrangling.
-Efficient data pipelines under the constraint that the data is always in the remote access environment, while researchers may also want to work offline
-version control in RA environment (working with git not possible?), improving efficiency in R-coding, preparing replication packages for projects with microdata
-(1) how to use version control in the RA environment
-(2) How to make reproducible scripts that meet journal guidelines. I'm struggling a lot with memory limits and scripts that run very long (i.e. aggregate SPOLISBUS from monthly to annual), and I need to produce a "one-click" script for a journal submission.
-
-CODE EFFICIENCY
-data import/export
-efficient coding to work around RA constraints
-Efficiency in handling the large datafiles such as Polis. And tips for R.
-Efficient handling of these large micro datasets
-Handling big datasets in R
-improve reproducibility and efficiency in data pipelines; databases.
-Python-specific best practises, if possible
-The speed of data processing; the harmonization of different bestanden
-How to work with CBS’s labeled SPSS strings in R most effectively. Haven labelled is not easily usable with every function, but creating new ‘_lab’ variables for all these variables also does not seem like a suitable fix. The question extends to how to best import all valuable information from the CBS SPSS datasets into R while minimising memory use.
-tips how to work with huge datafiles and avoid "cannot allocate a vector sized...." problem. A question: is active CBS RA access required to participate?
-How to use CBS data in R in an efficient way.
-I feel that when working with datasets I have on one hand the urge to save under a new dataset name a lot, so that if I make an error and have to start again I only ""drop"" a little back in my code and can start over without having to go back to the beginning and spent a lot of time rerunning my code (which takes time with big datasets).
-On the other hand, especially when working with big dataset I try to keep as few datasets open as possible, to save ""space/working memory"". So with that in mind I rather not save my dataset every time under a new name and would rather remove (rm() in r) any dataset as soon as I don't use them in the rest of the file anymore. But then I run the risk that if I make an error I have to ""go back"" very far in my code and rerun a lot of things. Is there advice how to best deal with this tension.
-
-SPECIFIC DATA QUESTIONS
-How to work efficiently with the network data at CBS?
-Parallel computation and use of spatial data
-Questions regarding specific datasets such as spolisbus
\ No newline at end of file
diff --git a/solutions/solution_chunked.R b/solutions/solution_chunked.R
new file mode 100644
index 0000000..b9553fe
--- /dev/null
+++ b/solutions/solution_chunked.R
@@ -0,0 +1,71 @@
+# Solution 2: manual chunks & online statistics to compute
+# mean and variance in a streaming way
+library(tidyverse)
+library(haven)
+
+# location of the SPOLIS file (same as in solution_duckdb.R)
+spolis_loc <- "fake_cbs_data/Spolis/SPOLISBUS2022V2.sav"
+
+# function to get a data chunk, with only required columns
+get_chunk <- function(start_pos = 0L, chunksize = 1e6) {
+  read_spss(
+    spolis_loc,
+    n_max = chunksize,
+    skip = start_pos,
+    col_select = c(SBASISLOON, SBASISUREN, SCONTRACTSOORT)
+  )
+}
+
+# function to compute n, the sum, and the sum of squares per contract type
+compute_stats <- function(df) {
+  df |>
+    mutate(hourlywage = SBASISLOON / pmax(SBASISUREN, 1)) |>
+    summarize(
+      sum = sum(hourlywage),
+      ssq = sum(hourlywage^2),
+      n = n(),
+      .by = SCONTRACTSOORT
+    )
+}
+
+# loop over chunks, add to result every time
+cur_pos <- 0L
+chunk <- get_chunk(cur_pos)
+result <- compute_stats(chunk)
+while (nrow(chunk) != 0) {
+  cur_pos <- cur_pos + nrow(chunk)
+  cat("Row:", cur_pos, "\r")
+  chunk <- get_chunk(cur_pos)
+  result <- bind_rows(result, compute_stats(chunk))
+}
+write_rds(result, "processed_data/chunked_result.rds")
+
+# we need to do one extra aggregation step to combine the per-chunk statistics
+output <-
+  result |>
+  summarize(
+    sum = sum(sum),
+    ssq = sum(ssq),
+    n = sum(n),
+    .by = SCONTRACTSOORT
+  ) |>
+  mutate(
+    mean = sum / n,
+    var = ssq / n - (sum / n)^2,
+    sd = sqrt(var),
+    sem = sd / sqrt(n),
+    lwr = mean - 1.96 * sem,
+    upr = mean + 1.96 * sem
+  )
+
+# create plot!
+output |>
+  ggplot(aes(x = as_factor(SCONTRACTSOORT, levels = "labels"), y = mean, ymax = upr, ymin = lwr)) +
+  geom_pointrange() +
+  labs(
+    x = "Contract type",
+    y = "Average wage",
+    title = "Average wage per unit time for different contract types."
+  ) +
+  theme_linedraw()
+
diff --git a/solutions/solution_duckdb.R b/solutions/solution_duckdb.R
new file mode 100644
index 0000000..681f74a
--- /dev/null
+++ b/solutions/solution_duckdb.R
@@ -0,0 +1,64 @@
+# Solution 1: duckdb to the rescue!
+library(tidyverse)
+library(haven)
+library(duckdb)
+library(dbplyr)
+
+# first, read the whole table into a duckdb
+# database. Do this in chunks to ensure low
+# RAM usage.
+spolis_loc <- "fake_cbs_data/Spolis/SPOLISBUS2022V2.sav"
+
+drv <- duckdb("processed_data/spolis.duckdb")
+dbc <- dbConnect(drv)
+
+cur_pos <- 0L
+chunk_size <- 1e6
+cur_df <- read_spss(
+  file = spolis_loc, n_max = chunk_size, skip = cur_pos,
+  col_select = c(RINPERSOON, RINPERSOONS, SCONTRACTSOORT, SBASISLOON, SBASISUREN)
+)
+dbWriteTable(dbc, "income", cur_df, append = TRUE)
+while (nrow(cur_df) != 0) {
+  cur_pos <- cur_pos + nrow(cur_df)
+  cat("Row:", cur_pos, "\r")
+  cur_df <- read_spss(
+    file = spolis_loc, n_max = chunk_size, skip = cur_pos,
+    col_select = c(RINPERSOON, RINPERSOONS, SCONTRACTSOORT, SBASISLOON, SBASISUREN)
+  )
+  dbWriteTable(dbc, "income", cur_df, append = TRUE)
+}
+
+
+
+# connect to the table we just created
+income_tbl <- tbl(dbc, "income")
+
+income_tbl |>
+  summarize(
+    mean = mean(SBASISLOON / pmax(1, SBASISUREN)),
+    stdev = sd(SBASISLOON / pmax(1, SBASISUREN)),
+    n = n(),
+    .by = SCONTRACTSOORT
+  ) |>
+  mutate(
+    stderr = stdev / sqrt(n),
+    lower = mean - 1.96 * stderr,
+    upper = mean + 1.96 * stderr
+  ) |>
+  ggplot(aes(
+    x = as_factor(SCONTRACTSOORT),
+    y = mean,
+    ymax = upper,
+    ymin = lower
+  )) +
+  geom_pointrange() +
+  labs(
+    x = "Contract type",
+    y = "Average wage",
+    title = "Average wage per unit time for different contract types."
+  ) +
+  theme_linedraw()
+
+dbDisconnect(dbc)
+duckdb_shutdown(drv)
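
A possible third approach, hinted at by the "arrow & dplyr?" line in the deleted notes.txt, is to query a Parquet copy of SPOLIS with the arrow package. The sketch below is untested and rests on two assumptions: the SPOLIS table has already been written to Parquet (the path processed_data/spolis.parquet is hypothetical), and the installed arrow version provides dplyr bindings for pmax(), mean(), and sd().

# Sketch: arrow + dplyr on a (hypothetical) Parquet copy of SPOLIS
library(dplyr)
library(arrow)

# open_dataset() reads only metadata up front; rows are streamed during the
# query, so the full table never has to fit in RAM
spolis_ds <- open_dataset("processed_data/spolis.parquet")

spolis_ds |>
  mutate(hourlywage = SBASISLOON / pmax(SBASISUREN, 1)) |>
  group_by(SCONTRACTSOORT) |>
  summarize(mean = mean(hourlywage), stdev = sd(hourlywage), n = n()) |>
  collect() |>
  mutate(
    stderr = stdev / sqrt(n),
    lower = mean - 1.96 * stderr,
    upper = mean + 1.96 * stderr
  )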