diff --git a/notes.txt b/notes.txt
deleted file mode 100644
index f2f3dae..0000000
--- a/notes.txt
+++ /dev/null
@@ -1,34 +0,0 @@
-Structure:
-2 hrs of interactive instruction
-1 hr group consultation / exercise
-
-To do:
-- create git repo
-x book location
-- ask microdata team for joining consultation
-- group people https://docs.google.com/spreadsheets/d/1fOWyx2WHRVasjVY2aJgAb-q9x9apAycrT0jctYrk7bk/edit#gid=2123602320
-x send email to Kasia
-
-fundamentals
-- infrastructure
-- information about files
-- additional (meta) data available (postal codes etc.)
-
-project structure & reproducibility
-- separate preprocessing and analysis
-- simple folder structure, R project, relative paths
-- simulated data
-- code import/export is free
-- working offline / collaborating
-- changing versions: put paths on top of your file
-
-code efficiency
-- import/export/writing to disk
-- separate scripts, each works by itself
-- reading spolis in chunks
-- arrow & dplyr?
-
-consultation in groups
-- consult each other
-- additional files (network analysis in python?)
-- exercise with 10GB dataset
diff --git a/questions.txt b/questions.txt
deleted file mode 100644
index dc8eb72..0000000
--- a/questions.txt
+++ /dev/null
@@ -1,40 +0,0 @@
-FUNDAMENTALS
-General tips on how to start using CBS Microdata in policy research
-I have no topics yet.
-No questions at the moment
-Not sure yet, I will get access to our CBS microdata on April 1st.
-How to navigate the CBS environment, and how to use multiple CBS data files, and other fundamentals of using CBS data in R
-Structure of CBS data, access and procedures.
-No specific topic to discuss.
-Introduction to working with CBS data and the general structure of the data and how to combine different sources of data.
-Project linkage and organization between LISS and CBS
-
-
-PROJECT STRUCTURE & REPRODUCIBILITY
-(1) How to efficiently organize codes/data/results for different papers within a CBS-project
-(2) How to create reproducible and efficient codings (Stata) for the whole pipeline (raw - modified / generate - analyses)
-I would like to learn how to ensure reproducibility in the CBS environment, find folder structures that are flexible and suitable for long-term projects, and create efficient data pipelines.
-I am mainly interested in the research reproducibility part of the workshop. Depending on what I learn there, I might have some more questions related to R coding practices, environment and state management and data wrangling.
-Efficient data pipelines under the constraint that the data is always in the remote access environment, while researchers may also want to work offline
-version control in RA environment (working with git not possible?), improving efficiency in R-coding, preparing replication packages for projects with microdata
-(1) how to use version control in the RA environment
-(2) How to make reproducible scripts that meet journal guidelines. I'm struggling a lot with memory limits and scripts that run very long (i.e. aggregate SPOLISBUS from monthly to annual), and I need to produce a "one-click" script for a journal submission.
-
-CODE EFFICIENCY
-data import/export
-efficient coding to work around RA constraints
-Efficiency in handling the large datafiles such as Polis. And tips for R.
-Efficient handling of these large micro datasets
-Handling big datasets in R
-improve reproducibility and efficiency in data pipelines; databases.
-Python-specific best practises, if possible
-The speed of data processing; the harmonization of different bestanden
-How to work with CBS’s labeled SPSS strings in R most effectively. Haven labelled is not easily usable with every function, but creating new ‘_lab’ variables for all these variables also does not seem like a suitable fix. The question extends to how to best import all valuable information from the CBS SPSS datasets into R while minimising memory use.
-tips how to work with huge datafiles and avoid "cannot allocate a vector sized...." problem. A question: is active CBS RA access required to participate?
-How to use CBS data in R in an efficient way.
-I feel that when working with datasets I have on one hand the urge to save under a new dataset name a lot, so that if I make an error and have to start again I only ""drop"" a little back in my code and can start over without having to go back to the beginning and spent a lot of time rerunning my code (which takes time with big datasets).
-On the other hand, especially when working with big dataset I try to keep as few datasets open as possible, to save ""space/working memory"". So with that in mind I rather not save my dataset every time under a new name and would rather remove (rm() in r) any dataset as soon as I don't use them in the rest of the file anymore. But then I run the risk that if I make an error I have to ""go back"" very far in my code and rerun a lot of things. Is there advice how to best deal with this tension.
-
-SPECIFIC DATA QUESTIONS
-How to work efficiently with the network data at CBS?
-Parallel computation and use of spatial data
-Questions regarding specific datasets such as spolisbus
\ No newline at end of file
diff --git a/solutions/solution_chunked.R b/solutions/solution_chunked.R
new file mode 100644
index 0000000..b9553fe
--- /dev/null
+++ b/solutions/solution_chunked.R
@@ -0,0 +1,71 @@
+# Solution 2: manual chunks & online statistics to compute
+# mean and variance in a streaming way
+library(tidyverse)
+library(haven)
+
+# location of the SPOLIS file (same as in solution_duckdb.R)
+spolis_loc <- "fake_cbs_data/Spolis/SPOLISBUS2022V2.sav"
+
+# function to get a data chunk, with only required columns
+get_chunk <- function(start_pos = 0L, chunksize = 1e6) {
+  read_spss(
+    spolis_loc,
+    n_max = chunksize,
+    skip = start_pos,
+    col_select = c(SBASISLOON, SBASISUREN, SCONTRACTSOORT)
+  )
+}
+
+# function to compute n, the sum, and the sum of squares per contract type
+compute_stats <- function(df) {
+  df |>
+    mutate(hourlywage = SBASISLOON / pmax(SBASISUREN, 1)) |>
+    summarize(
+      sum = sum(hourlywage),
+      ssq = sum(hourlywage^2),
+      n = n(),
+      .by = SCONTRACTSOORT
+    )
+}
+
+# loop over chunks, add to result every time
+cur_pos <- 0L
+chunk <- get_chunk(cur_pos)
+result <- compute_stats(chunk)
+while (nrow(chunk) != 0) {
+  cur_pos <- cur_pos + nrow(chunk)
+  cat("Row:", cur_pos, "\r")
+  chunk <- get_chunk(cur_pos)
+  result <- bind_rows(result, compute_stats(chunk))
+}
+write_rds(result, "processed_data/chunked_result.rds")
+
+# we need to do one extra aggregation step to combine the per-chunk statistics
+output <-
+  result |>
+  summarize(
+    sum = sum(sum),
+    ssq = sum(ssq),
+    n = sum(n),
+    .by = SCONTRACTSOORT
+  ) |>
+  mutate(
+    mean = sum / n,
+    var = ssq / n - (sum / n)^2,
+    sd = sqrt(var),
+    sem = sd / sqrt(n),
+    lwr = mean - 1.96 * sem,
+    upr = mean + 1.96 * sem
+  )
+
+# create plot!
+output |>
+  ggplot(aes(x = as_factor(SCONTRACTSOORT, levels = "labels"), y = mean, ymax = upr, ymin = lwr)) +
+  geom_pointrange() +
+  labs(
+    x = "Contract type",
+    y = "Average wage",
+    title = "Average wage per unit time for different contract types."
+  ) +
+  theme_linedraw()
+
diff --git a/solutions/solution_duckdb.R b/solutions/solution_duckdb.R
new file mode 100644
index 0000000..681f74a
--- /dev/null
+++ b/solutions/solution_duckdb.R
@@ -0,0 +1,64 @@
+# Solution 1: duckdb to the rescue!
+library(tidyverse)
+library(haven)
+library(duckdb)
+library(dbplyr)
+
+# first, read the whole table into a duckdb
+# database. Do this in chunks to ensure low
+# RAM usage.
+spolis_loc <- "fake_cbs_data/Spolis/SPOLISBUS2022V2.sav"
+
+drv <- duckdb("processed_data/spolis.duckdb")
+dbc <- dbConnect(drv)
+
+cur_pos <- 0L
+chunk_size <- 1e6
+cur_df <- read_spss(
+  file = spolis_loc, n_max = chunk_size, skip = cur_pos,
+  col_select = c(RINPERSOON, RINPERSOONS, SCONTRACTSOORT, SBASISLOON, SBASISUREN)
+)
+dbWriteTable(dbc, "income", cur_df, append = TRUE)
+while (nrow(cur_df) != 0) {
+  cur_pos <- cur_pos + nrow(cur_df)
+  cat("Row:", cur_pos, "\r")
+  cur_df <- read_spss(
+    file = spolis_loc, n_max = chunk_size, skip = cur_pos,
+    col_select = c(RINPERSOON, RINPERSOONS, SCONTRACTSOORT, SBASISLOON, SBASISUREN)
+  )
+  dbWriteTable(dbc, "income", cur_df, append = TRUE)
+}
+
+
+
+# connect to the table we just created
+income_tbl <- tbl(dbc, "income")
+
+income_tbl |>
+  summarize(
+    mean = mean(SBASISLOON / pmax(1, SBASISUREN)),
+    stdev = sd(SBASISLOON / pmax(1, SBASISUREN)),
+    n = n(),
+    .by = SCONTRACTSOORT
+  ) |>
+  mutate(
+    stderr = stdev / sqrt(n),
+    lower = mean - 1.96 * stderr,
+    upper = mean + 1.96 * stderr
+  ) |>
+  ggplot(aes(
+    x = as_factor(SCONTRACTSOORT),
+    y = mean,
+    ymax = upper,
+    ymin = lower
+  )) +
+  geom_pointrange() +
+  labs(
+    x = "Contract type",
+    y = "Average wage",
+    title = "Average wage per unit time for different contract types."
+  ) +
+  theme_linedraw()
+
+dbDisconnect(dbc)
+duckdb_shutdown(drv)
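
A possible third approach, hinted at by the "arrow & dplyr?" line in the deleted notes.txt, is to query a Parquet copy of SPOLIS with the arrow package. The sketch below is untested and rests on two assumptions: the SPOLIS table has already been written to Parquet (the path processed_data/spolis.parquet is hypothetical), and the installed arrow version provides dplyr bindings for pmax(), mean(), and sd().

# Sketch: arrow + dplyr on a (hypothetical) Parquet copy of SPOLIS
library(dplyr)
library(arrow)

# open_dataset() reads only metadata up front; rows are streamed during the
# query, so the full table never has to fit in RAM
spolis_ds <- open_dataset("processed_data/spolis.parquet")

spolis_ds |>
  mutate(hourlywage = SBASISLOON / pmax(SBASISUREN, 1)) |>
  group_by(SCONTRACTSOORT) |>
  summarize(mean = mean(hourlywage), stdev = sd(hourlywage), n = n()) |>
  collect() |>
  mutate(
    stderr = stdev / sqrt(n),
    lower = mean - 1.96 * stderr,
    upper = mean + 1.96 * stderr
  )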