Skip to content
This repository has been archived by the owner on Nov 2, 2024. It is now read-only.

introduce & utilize duckdb #28

Open
jenna-jordan opened this issue Feb 5, 2024 · 7 comments
Open

introduce & utilize duckdb #28

jenna-jordan opened this issue Feb 5, 2024 · 7 comments
Labels
help-wanted Requires knowledge or skills the core maintainer lacks in-content Things in lessons request-addition Please add this
Milestone

Comments

@jenna-jordan
Copy link

duckdb is similar to sqlite in that it is a database as a file, but it does require an install process. The key advantage of duckdb over sqlite is (at least for the purpose of this lesson, I think) the datatypes - you have many more datatypes than with sqlite and better handling of those datatypes.

I think a good point to introduce duckdb may be around query 070. At this point you have just finished demonstrating some ways to get around sqlite's suboptimal handling of dates, and you could demonstrate how those same queries could be handled in duckdb using proper datetime datatypes and functions. Then, you could also use duckdb to demonstrate how to query JSON data. So, I propose replacing the "yet another database" lab_log.db with a duckdb database, and demo extracting out the datetimes from the JSON and then querying those datetimes.

Then, starting with query 088, it might be interesting to contrast how to interact with a sqlite database using python vs a duckdb database.

Install instructions for duckdb could be linked to in the setup, and likewise it may be helpful to link to install instructions for python & jupyter notebooks in the setup.

@gvwilson gvwilson added in-content Things in lessons request-addition Please add this discuss Discussion thread labels Feb 6, 2024
@gvwilson
Copy link
Owner

gvwilson commented Feb 6, 2024

@jenna-jordan I like the idea of introducing people to DuckDB, but would like to work through a few possible issues - feedback greatly appreciated.

  1. Doing it mid-tutorial and then switching back to SQLite feels awkward: if we switch, should we stay switched?
  2. Do we motivate it as "SQLite's date handling is clumsy, here's a better database?" In which case, people will (quite reasonably) ask why we haven't been using DuckDB all along. We can answer that in the intro ("We're going to start with SQLite because you almost certainly already have it, then switch later to show you that SQL is a cross-database technology").
  3. If DuckDB has better (i.e., proper) handling of multi-statement transactions? SQLite's shortcomings here were my biggest disappointment in writing this tutorial; if we can do a proper multi-statement "failed in the middle so nothing happened at all" transaction, that would be compelling.
  4. I think the Python components stay the same, but I haven't tried DuckDB with Polars - any input there?
    Thanks again.

@rkern
Copy link
Contributor

rkern commented Feb 6, 2024

Re 2: One way to motivate this is that I can think of 2 distinct learner personas for this document:

A. I am a Data Scientist who has been given a SQL database full of data. I need enough SQL knowledge to get out the subsets that I am interested in (possibly doing some in-database transformations along the way).

B. I am a Data Scientist who has been given a pile of tabular data in some format or other. I want to use the power of a relational database to do the slicing and dicing and data transformations that would be more awkward or inefficient in my programming language of choice.

For Persona A, SQLite is a good pedagogical choice. It's probably not the flavor SQL database they have been given access to, but it probably works enough like that flavor for the lessons to transfer over, given its ease of deployment to students. DuckDB might otherwise be a fine choice, but it's a little more specialized and offers affordances that are a little too friendly to transfer over to the database they are actually using.

For Persona B, SQLite is also a fine choice (especially with #22). But DuckDB is purpose-built for this persona, and has a lot of affordances for ingesting those piles of tabular data and round-tripping between the programming language and SQL in addition to just being generally more efficient and scalable for analytical workloads.

Re 3: You're talking about lesson 80? DuckDB doesn't have the ON CONFLICT ROLLBACK extension to begin with, so that clause's failure to work like we expect is less of an issue (but note that the INSERT OR ROLLBACK variant will do what we expect within an explicit transaction; just the clause within a CHECK constraint behaves wonky, AFAICT). For someone using either DuckDB or SQLite from Python or whatever, it doesn't matter as much since one usually doesn't bother with using these SQL clauses. We catch the exception in Python and issue a conn.rollback() or conn.commit() from there (or more usually, let the context handler do it), and that conn.rollback() will work across multiple statements fine. AFAIK, standard SQL doesn't really have something like a try: except: else: suite that you can just write, fire, and forget. You BEGIN TRANSACTION; do your operations, then when an error happens, in response, you personally write ROLLBACK;; otherwise you write COMMIT;. If you want to make that choice automatically, that's in Python, and the DB API 2.0 implementations for both DuckDB and SQLite work basically the same.

Re 4: About the same as Pandas. In general, there are more affordances in DuckDB that could be worth getting into if you wanted to spend the time on it. The idiomatic, most efficient ways to move data in and out of the SQL realm are unique to DuckDB.

@jenna-jordan
Copy link
Author

I think the concern about switching to a different database engine partway through is very valid. The only solution I've come up with is a very big lift, but potentially a good thing to get the foundation set up for and then rely on contributors to fill it out.

What if there were different versions of the site/queries for the different flavors of SQL? You could select your preferred flavor from the start, and then you'd see the queries/site written for that version of SQL. In a more advanced version, you could have side by side comparison to see how the query is expressed in two SQL flavors of choice. Contributors could write the SQL & explanation for their favorite SQL flavor, and slowly grow the repository of queries in many flavors over time, with SQLite remaining the core/base version.

But the effort this might require may not be worth it for what you're trying to accomplish with this site.

@defuneste
Copy link

One point that was not mentioned that go in favor of SQLite (and I will definitely try to replicate this tutorial with duckDB) is that SQLite open you to a lot of other file formats. In the geospatial world both geopackage and mbtiles are build on top of it (and widely used).

@chrisgebert
Copy link

Maybe this suggestion suffers too much from the curse of knowledge problem, but I'm wondering if a possible solution is to make use of duckdb's ATTACH functionality to attach and query other databases. Specifically I'm imagining: creating a new duckdb database around topic 70 as @jenna-jordan mentioned, and then attaching the penguins.db SQLite database to it by topic 76 when it is used again.

This may also benefit Persona A from @rkern's comment if they've been given access to a Postgres or MySQL database. But I can appreciate too that duckdb (and this functionality in particular) may seem too specialized for a general purpose introduction tutorial to SQL.

@gvwilson gvwilson added this to the someday milestone Feb 16, 2024
@jenna-jordan
Copy link
Author

I've been thinking further about how to best introduce/integrate duckdb, postgres, etc into the tutorial. I think it might be a useful framework to establish a way to integrate optional modules into the lesson. So, you'd have the core module, which includes all of the queries that illustrate the core concepts that anybody should be aware of and competent in when interacting with databases, and you'd want the setup to be as easy as possible for this (so, SQLite). But then, you could have additional modules that focus on some particular advanced topic. In the tutorial as is, the queries from 90 up (those that use python) would be a good collection to pull out into their own module. They require a distinct setup in order to work. Similarly, a set of queries using duckdb could be their own module. At the start of the tutorial, you could allow the independent learner or instructor to select which modules to include - the core module would always be selected, and some modules would build on each other, so you would have a tree-like structure of modules where the delivered set of queries is some ordered subset of that tree. This would allow for more modular development of topics, and the linear nature that the queries are presented in would make more sense if learners could exclude topics they weren't interested in. So for example, you could have a python branch and an R branch for programmatically interacting with databases. Each module could have their own setup instructions and set of resources. Other modules might focus on GIS, access management, a particular DBMS (duckdb, postgres), etc.

I'm not sure how to actually implement that type of functionality in the website, though.

@gvwilson
Copy link
Owner

I've just moved the Python material into a separate page called pages/python.md that produces /python/ in the site. We could add another page duckdb, start filling in content, and then figure out where to put installation instructions etc.

@gvwilson gvwilson added help-wanted Requires knowledge or skills the core maintainer lacks and removed discuss Discussion thread labels Apr 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
help-wanted Requires knowledge or skills the core maintainer lacks in-content Things in lessons request-addition Please add this
Projects
None yet
Development

No branches or pull requests

5 participants