Placeholder: Decide what researchers need to import in order to write a dataset definition #527

inglesp · 2022-06-03T19:53:19Z

(Was Make it trivial for researchers to import exactly and only what they need to write dataset definitions in Shortcut.)

evansd · 2022-08-16T15:13:06Z

I'm going to treat this as the question: from where do researchers import the EventFrames they need? (The only other components required are Dataset, case, when which should be relatively straightforward.)

I'd like to advocate for separate modules for each backend, and for each set of tables that might be implemented by several backends. That is, we'd have several modules like:

databuilder.tables.tpp
databuilder.tables.emis
databuilder.tables.graphnet

And we'd also have modules like:

databuilder.tables.core_ehr

Which contains just those tables, and just the columns on those tables, which we'd require to be implemented by any backend claiming "Core EHR" compatibility.

The idea is that once you've decided what backend or backends your study is targeting you don't have a separate question of working out what tables and columns you're allowed to use in your study — dataset validation and autocomplete take care of the rest.

Of course, there would be a lot of commonality between these modules. I anticipate that behind the scenes a lot of these tables would be defined together in some base module and then each individual backend would import just the tables it needs.

There's a further question of to handle studies which are designed to be run against multiple backends but need to do different slightly different things for each backend. I've got thoughts about how we could handle this but I think that's worthy of a separate discussion.

evansd · 2022-08-16T16:03:24Z

Further thought: we should probably establish a pattern in dataset definitions along the lines of:

from databuilder.tables import core_ehr as tables

So that it's possible to change the schema the dataset is built from just by changing the import.

inglesp · 2022-08-16T16:18:33Z

If we encourage:

from databuilder.tables import core_ehr as tables

then would it mean that users had to qualify all their table variables, eg tables.clinical_events?

Alternatively, if we encouraged users to do:

from databuilder.tables.tpp import clinical_events

then they'd still be able to change the schema just by changing the import.

inglesp · 2022-08-16T16:19:03Z

Meant to say: the proposal feels sound to me.

evansd · 2022-08-16T16:21:05Z

then they'd still be able to change the schema just by changing the import

Oh yes, good point. Your suggestion is a better one.

benbc · 2022-08-16T16:25:55Z

These are excellent thoughts. Here are mine.

I'm not certain that we need core_ehr. Or at least it's v adjacent to the goals of contracts generally. So while it might be a good idea I suggest that we leave it for now and go with just the backend imports until we see a need for something else.
We need to ensure that researchers see a unified surface when looking at docs and the API. Right now we have "contracts" as the entry point for the docs and "tables" here in the code. We need to straighten that out. (And of course we need to incorporate the previous Data Builder work with the data curation worldview which may imply further changes.)
I do think we need to think a bit about Dataset, case and when (and maybe others later) just to make sure that the experience is lovely. In particular I think we want a module with a good name that exports those and only those so it's really clear to researchers what universe of things they should be considering.
I love the idea of being able to type tables. or tpp. and seeing what all the valid options are. But the verbosity of tables.clinical_events.take(tables.clinical_events.date > "some date") etc ad nauseam probably makes it impractical. I guess we expect researchers to be looking at the docs so it's natural to start typing the name of the table they know exists and have it auto-complete to confirm they're right.

evansd · 2022-08-16T17:11:39Z

Thanks @benbc.

On 1, I'm very happy to punt this and not include it for now. I don't feel like the "Core EHR" thing was my idea but I can't now remember where it came from.

On 2, I was just discussing with Peter the idea of unifying ehrQL schemas and contracts. I think following #663 this would be reasonably straightforward. It would be a case of letting the Series descriptor accept the arguments we currently use in contract definitions. All except type would (for now) be ignored by ehrQL but accessible for introspection by the contracts machinery.

So, taking the example from databuilder.contracts.universal it would look something like:

@construct
class patients(PatientFrame):
    sex = Series(
        str,
        choices=["female", "male", "intersex", "unknown"],
        description="Patient's sex as defined by the options: male, female, intersex, unknown.",
        implementation_notes_to_add_to_description=(
            'Specify how this has been determined, e.g. "sex at birth", or "current sex".'
        ),
        constraints=[NotNullConstraint()],
    )
    date_of_birth = Series(
        datetime.date,
        description=(
            "Patient's year and month of birth, provided in format YYYY-MM-01. "
            "The day will always be the first of the month."
        ),
        constraints=[FirstOfMonthConstraint(), NotNullConstraint()],
    )

That would be both directly importable and usable by ehrQL, and contain all the metadata currently encoded in the contract. I don't think we really benefit from having them as two separate things, do we?

On 3, that's a good point about the cleanliness aspect. I've wondered before about renaming databuilder.query_language to databuilder.ehrql. But maybe we should keep the former and use the latter as the clean public namespace?

On 4, I share your disappointment but I think Peter's right that importing the table names is overall better. I did wonder about import databuilder... as t but even with a single character namespace it's still quite syntactically noisy.

benbc · 2022-08-17T09:10:48Z

Your suggestion for databuilder.ehrql sounds good. I suppose we'd encourage (through examples) just from databuilder.ehrql import * at the top of every data definition.

benbc · 2022-08-17T09:15:34Z

I don't feel v strongly about unifying the table and contract types in the codebase -- although fewer things is generally better than more, obvs. But I feel strongly that it needs to feel unified to researchers, e.g. the vocabulary of contracts and tables either needs unifying or there needs to be a really clear semantic distinction.

We also need to consider the needs of backend implementers as consumers -- for them the concepts of "contracts" and "tables are clearly different. But researchers needs will ultimately trump theirs.

evansd · 2022-08-18T09:29:02Z

Can I bikeshed the databuilder.tables namespace a bit? "Tables" has the advantage of being a relatively simple, familiar term but I think it has some disadvantages too:

It doesn't tie up with the ehrQL vocabulary where we use "frames".
It doesn't capture the fact that the primary entities in this namespace (tpp, emis, core_ehr, whatever) are collections of tables which belong together.

My tentative suggestion here would be databuilder.schemas (no Jon, not schemata) but I'm not particularly wedded to that. My thinking is that we're going to end up wanting to talk about e.g "the TPP schema" anyway, so we may as well use the term consistently.

inglesp · 2022-08-18T09:55:35Z

Is there a one-one correspondence with whatever these things are, and things that quack like tables?

benbc · 2022-08-18T10:04:15Z

I think that schemas would be okay, but I don't really buy either of your disadvantages.

I'm not convinced that "frames" is a term that we should expect researchers to know. It's not going to be in the code they write, is it? I think of it more as an implementation abstraction than part of the surface. (Also it includes things that aren't tables as far as researchers are concerned -- e.g. some of the rows in a table.)
I think tables.tpp is a reasonable answer to the question "where are the TPP tables?"

evansd · 2022-08-18T10:21:17Z

That's fair. In my head, we were going to have to teach people what frames are, and the different flavours they come in, but quite possibly we don't — I haven't though that much about the pedagogy here.

I think all I'm really keen on is that we use terms deliberately and carefully, and I was worried we were introducing the term "tables" accidentally alongside "frames", "contracts" and "schemas". Docs can be rewritten but names we bake into the API are obviously harder to change so I wanted to make sure we got it right.

I'm happy to stick with tables if that's going to be term we use elsewhere — it certainly has the advantage of familiarity.

StevenMaude · 2022-08-18T15:07:14Z

I'm not convinced that "frames" is a term that we should expect researchers to know. It's not going to be in the code they write, is it?

Maybe not in the code. But, for what it's worth, it does appear:

in tracebacks when you try and do the wrong thing (ask me how I know this)
in the current test-based reference

It might be that this is all better hidden in future.

Here and now, a user might encounter the term. So it's possibly helpful to at least know what is being referred to.

evansd mentioned this issue Jun 22, 2022

Critical path to running CIPHA booster effectiveness study in Graphnet #565

Closed

inglesp added the CIPHA Work needed for the CIPHA booster effectiveness study label Jul 20, 2022

inglesp moved this to Next in Data Team Sep 7, 2022

inglesp added this to Data Team Sep 7, 2022

inglesp assigned evansd Sep 7, 2022

evansd moved this from Next to In Progress in Data Team Sep 7, 2022

evansd mentioned this issue Sep 8, 2022

Add public namespaces #705

Merged

evansd moved this from In Progress to Under Review in Data Team Sep 8, 2022

evansd closed this as completed in #705 Sep 9, 2022

Repository owner moved this from Under Review to Done in Data Team Sep 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Placeholder: Decide what researchers need to import in order to write a dataset definition #527

Placeholder: Decide what researchers need to import in order to write a dataset definition #527

inglesp commented Jun 3, 2022

evansd commented Aug 16, 2022

evansd commented Aug 16, 2022

inglesp commented Aug 16, 2022

inglesp commented Aug 16, 2022

evansd commented Aug 16, 2022

benbc commented Aug 16, 2022 •

edited

Loading

evansd commented Aug 16, 2022

benbc commented Aug 17, 2022

benbc commented Aug 17, 2022

evansd commented Aug 18, 2022

inglesp commented Aug 18, 2022

benbc commented Aug 18, 2022

evansd commented Aug 18, 2022

StevenMaude commented Aug 18, 2022

Placeholder: Decide what researchers need to import in order to write a dataset definition #527

Placeholder: Decide what researchers need to import in order to write a dataset definition #527

Comments

inglesp commented Jun 3, 2022

evansd commented Aug 16, 2022

evansd commented Aug 16, 2022

inglesp commented Aug 16, 2022

inglesp commented Aug 16, 2022

evansd commented Aug 16, 2022

benbc commented Aug 16, 2022 • edited Loading

evansd commented Aug 16, 2022

benbc commented Aug 17, 2022

benbc commented Aug 17, 2022

evansd commented Aug 18, 2022

inglesp commented Aug 18, 2022

benbc commented Aug 18, 2022

evansd commented Aug 18, 2022

StevenMaude commented Aug 18, 2022

benbc commented Aug 16, 2022 •

edited

Loading