-
Notifications
You must be signed in to change notification settings - Fork 48
introduce & utilize duckdb #28
Comments
@jenna-jordan I like the idea of introducing people to DuckDB, but would like to work through a few possible issues - feedback greatly appreciated.
|
Re 2: One way to motivate this is that I can think of 2 distinct learner personas for this document: A. I am a Data Scientist who has been given a SQL database full of data. I need enough SQL knowledge to get out the subsets that I am interested in (possibly doing some in-database transformations along the way). B. I am a Data Scientist who has been given a pile of tabular data in some format or other. I want to use the power of a relational database to do the slicing and dicing and data transformations that would be more awkward or inefficient in my programming language of choice. For Persona A, SQLite is a good pedagogical choice. It's probably not the flavor SQL database they have been given access to, but it probably works enough like that flavor for the lessons to transfer over, given its ease of deployment to students. DuckDB might otherwise be a fine choice, but it's a little more specialized and offers affordances that are a little too friendly to transfer over to the database they are actually using. For Persona B, SQLite is also a fine choice (especially with #22). But DuckDB is purpose-built for this persona, and has a lot of affordances for ingesting those piles of tabular data and round-tripping between the programming language and SQL in addition to just being generally more efficient and scalable for analytical workloads. Re 3: You're talking about lesson 80? DuckDB doesn't have the Re 4: About the same as Pandas. In general, there are more affordances in DuckDB that could be worth getting into if you wanted to spend the time on it. The idiomatic, most efficient ways to move data in and out of the SQL realm are unique to DuckDB. |
I think the concern about switching to a different database engine partway through is very valid. The only solution I've come up with is a very big lift, but potentially a good thing to get the foundation set up for and then rely on contributors to fill it out. What if there were different versions of the site/queries for the different flavors of SQL? You could select your preferred flavor from the start, and then you'd see the queries/site written for that version of SQL. In a more advanced version, you could have side by side comparison to see how the query is expressed in two SQL flavors of choice. Contributors could write the SQL & explanation for their favorite SQL flavor, and slowly grow the repository of queries in many flavors over time, with SQLite remaining the core/base version. But the effort this might require may not be worth it for what you're trying to accomplish with this site. |
One point that was not mentioned that go in favor of SQLite (and I will definitely try to replicate this tutorial with duckDB) is that SQLite open you to a lot of other file formats. In the geospatial world both geopackage and mbtiles are build on top of it (and widely used). |
Maybe this suggestion suffers too much from the curse of knowledge problem, but I'm wondering if a possible solution is to make use of duckdb's This may also benefit Persona A from @rkern's comment if they've been given access to a Postgres or MySQL database. But I can appreciate too that duckdb (and this functionality in particular) may seem too specialized for a general purpose introduction tutorial to SQL. |
I've been thinking further about how to best introduce/integrate duckdb, postgres, etc into the tutorial. I think it might be a useful framework to establish a way to integrate optional modules into the lesson. So, you'd have the core module, which includes all of the queries that illustrate the core concepts that anybody should be aware of and competent in when interacting with databases, and you'd want the setup to be as easy as possible for this (so, SQLite). But then, you could have additional modules that focus on some particular advanced topic. In the tutorial as is, the queries from 90 up (those that use python) would be a good collection to pull out into their own module. They require a distinct setup in order to work. Similarly, a set of queries using duckdb could be their own module. At the start of the tutorial, you could allow the independent learner or instructor to select which modules to include - the core module would always be selected, and some modules would build on each other, so you would have a tree-like structure of modules where the delivered set of queries is some ordered subset of that tree. This would allow for more modular development of topics, and the linear nature that the queries are presented in would make more sense if learners could exclude topics they weren't interested in. So for example, you could have a python branch and an R branch for programmatically interacting with databases. Each module could have their own setup instructions and set of resources. Other modules might focus on GIS, access management, a particular DBMS (duckdb, postgres), etc. I'm not sure how to actually implement that type of functionality in the website, though. |
I've just moved the Python material into a separate page called |
duckdb is similar to sqlite in that it is a database as a file, but it does require an install process. The key advantage of duckdb over sqlite is (at least for the purpose of this lesson, I think) the datatypes - you have many more datatypes than with sqlite and better handling of those datatypes.
I think a good point to introduce duckdb may be around query 070. At this point you have just finished demonstrating some ways to get around sqlite's suboptimal handling of dates, and you could demonstrate how those same queries could be handled in duckdb using proper datetime datatypes and functions. Then, you could also use duckdb to demonstrate how to query JSON data. So, I propose replacing the "yet another database" lab_log.db with a duckdb database, and demo extracting out the datetimes from the JSON and then querying those datetimes.
Then, starting with query 088, it might be interesting to contrast how to interact with a sqlite database using python vs a duckdb database.
Install instructions for duckdb could be linked to in the setup, and likewise it may be helpful to link to install instructions for python & jupyter notebooks in the setup.
The text was updated successfully, but these errors were encountered: