Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Placeholder: Decide what researchers need to import in order to write a dataset definition #527

Closed
inglesp opened this issue Jun 3, 2022 · 14 comments · Fixed by #705
Closed
Assignees
Labels
CIPHA Work needed for the CIPHA booster effectiveness study

Comments

@inglesp
Copy link
Contributor

inglesp commented Jun 3, 2022

(Was Make it trivial for researchers to import exactly and only what they need to write dataset definitions in Shortcut.)

@inglesp inglesp added the CIPHA Work needed for the CIPHA booster effectiveness study label Jul 20, 2022
@evansd
Copy link
Contributor

evansd commented Aug 16, 2022

I'm going to treat this as the question: from where do researchers import the EventFrames they need? (The only other components required are Dataset, case, when which should be relatively straightforward.)

I'd like to advocate for separate modules for each backend, and for each set of tables that might be implemented by several backends. That is, we'd have several modules like:

databuilder.tables.tpp
databuilder.tables.emis
databuilder.tables.graphnet

And we'd also have modules like:

databuilder.tables.core_ehr

Which contains just those tables, and just the columns on those tables, which we'd require to be implemented by any backend claiming "Core EHR" compatibility.

The idea is that once you've decided what backend or backends your study is targeting you don't have a separate question of working out what tables and columns you're allowed to use in your study — dataset validation and autocomplete take care of the rest.

Of course, there would be a lot of commonality between these modules. I anticipate that behind the scenes a lot of these tables would be defined together in some base module and then each individual backend would import just the tables it needs.

There's a further question of to handle studies which are designed to be run against multiple backends but need to do different slightly different things for each backend. I've got thoughts about how we could handle this but I think that's worthy of a separate discussion.

@evansd
Copy link
Contributor

evansd commented Aug 16, 2022

Further thought: we should probably establish a pattern in dataset definitions along the lines of:

from databuilder.tables import core_ehr as tables

So that it's possible to change the schema the dataset is built from just by changing the import.

@inglesp
Copy link
Contributor Author

inglesp commented Aug 16, 2022

If we encourage:

from databuilder.tables import core_ehr as tables

then would it mean that users had to qualify all their table variables, eg tables.clinical_events?

Alternatively, if we encouraged users to do:

from databuilder.tables.tpp import clinical_events

then they'd still be able to change the schema just by changing the import.

@inglesp
Copy link
Contributor Author

inglesp commented Aug 16, 2022

Meant to say: the proposal feels sound to me.

@evansd
Copy link
Contributor

evansd commented Aug 16, 2022

then they'd still be able to change the schema just by changing the import

Oh yes, good point. Your suggestion is a better one.

@benbc
Copy link
Contributor

benbc commented Aug 16, 2022

These are excellent thoughts. Here are mine.

  1. I'm not certain that we need core_ehr. Or at least it's v adjacent to the goals of contracts generally. So while it might be a good idea I suggest that we leave it for now and go with just the backend imports until we see a need for something else.

  2. We need to ensure that researchers see a unified surface when looking at docs and the API. Right now we have "contracts" as the entry point for the docs and "tables" here in the code. We need to straighten that out. (And of course we need to incorporate the previous Data Builder work with the data curation worldview which may imply further changes.)

  3. I do think we need to think a bit about Dataset, case and when (and maybe others later) just to make sure that the experience is lovely. In particular I think we want a module with a good name that exports those and only those so it's really clear to researchers what universe of things they should be considering.

  4. I love the idea of being able to type tables. or tpp. and seeing what all the valid options are. But the verbosity of tables.clinical_events.take(tables.clinical_events.date > "some date") etc ad nauseam probably makes it impractical. I guess we expect researchers to be looking at the docs so it's natural to start typing the name of the table they know exists and have it auto-complete to confirm they're right.

@evansd
Copy link
Contributor

evansd commented Aug 16, 2022

Thanks @benbc.

On 1, I'm very happy to punt this and not include it for now. I don't feel like the "Core EHR" thing was my idea but I can't now remember where it came from.


On 2, I was just discussing with Peter the idea of unifying ehrQL schemas and contracts. I think following #663 this would be reasonably straightforward. It would be a case of letting the Series descriptor accept the arguments we currently use in contract definitions. All except type would (for now) be ignored by ehrQL but accessible for introspection by the contracts machinery.

So, taking the example from databuilder.contracts.universal it would look something like:

@construct
class patients(PatientFrame):
    sex = Series(
        str,
        choices=["female", "male", "intersex", "unknown"],
        description="Patient's sex as defined by the options: male, female, intersex, unknown.",
        implementation_notes_to_add_to_description=(
            'Specify how this has been determined, e.g. "sex at birth", or "current sex".'
        ),
        constraints=[NotNullConstraint()],
    )
    date_of_birth = Series(
        datetime.date,
        description=(
            "Patient's year and month of birth, provided in format YYYY-MM-01. "
            "The day will always be the first of the month."
        ),
        constraints=[FirstOfMonthConstraint(), NotNullConstraint()],
    )

That would be both directly importable and usable by ehrQL, and contain all the metadata currently encoded in the contract. I don't think we really benefit from having them as two separate things, do we?


On 3, that's a good point about the cleanliness aspect. I've wondered before about renaming databuilder.query_language to databuilder.ehrql. But maybe we should keep the former and use the latter as the clean public namespace?


On 4, I share your disappointment but I think Peter's right that importing the table names is overall better. I did wonder about import databuilder... as t but even with a single character namespace it's still quite syntactically noisy.

@benbc
Copy link
Contributor

benbc commented Aug 17, 2022

Your suggestion for databuilder.ehrql sounds good. I suppose we'd encourage (through examples) just from databuilder.ehrql import * at the top of every data definition.

@benbc
Copy link
Contributor

benbc commented Aug 17, 2022

I don't feel v strongly about unifying the table and contract types in the codebase -- although fewer things is generally better than more, obvs. But I feel strongly that it needs to feel unified to researchers, e.g. the vocabulary of contracts and tables either needs unifying or there needs to be a really clear semantic distinction.

We also need to consider the needs of backend implementers as consumers -- for them the concepts of "contracts" and "tables are clearly different. But researchers needs will ultimately trump theirs.

@evansd
Copy link
Contributor

evansd commented Aug 18, 2022

Can I bikeshed the databuilder.tables namespace a bit? "Tables" has the advantage of being a relatively simple, familiar term but I think it has some disadvantages too:

  • It doesn't tie up with the ehrQL vocabulary where we use "frames".
  • It doesn't capture the fact that the primary entities in this namespace (tpp, emis, core_ehr, whatever) are collections of tables which belong together.

My tentative suggestion here would be databuilder.schemas (no Jon, not schemata) but I'm not particularly wedded to that. My thinking is that we're going to end up wanting to talk about e.g "the TPP schema" anyway, so we may as well use the term consistently.

@inglesp
Copy link
Contributor Author

inglesp commented Aug 18, 2022

Is there a one-one correspondence with whatever these things are, and things that quack like tables?

@benbc
Copy link
Contributor

benbc commented Aug 18, 2022

I think that schemas would be okay, but I don't really buy either of your disadvantages.

  • I'm not convinced that "frames" is a term that we should expect researchers to know. It's not going to be in the code they write, is it? I think of it more as an implementation abstraction than part of the surface. (Also it includes things that aren't tables as far as researchers are concerned -- e.g. some of the rows in a table.)
  • I think tables.tpp is a reasonable answer to the question "where are the TPP tables?"

@evansd
Copy link
Contributor

evansd commented Aug 18, 2022

That's fair. In my head, we were going to have to teach people what frames are, and the different flavours they come in, but quite possibly we don't — I haven't though that much about the pedagogy here.

I think all I'm really keen on is that we use terms deliberately and carefully, and I was worried we were introducing the term "tables" accidentally alongside "frames", "contracts" and "schemas". Docs can be rewritten but names we bake into the API are obviously harder to change so I wanted to make sure we got it right.

I'm happy to stick with tables if that's going to be term we use elsewhere — it certainly has the advantage of familiarity.

@StevenMaude
Copy link
Contributor

I'm not convinced that "frames" is a term that we should expect researchers to know. It's not going to be in the code they write, is it?

Maybe not in the code. But, for what it's worth, it does appear:

It might be that this is all better hidden in future.

Here and now, a user might encounter the term. So it's possibly helpful to at least know what is being referred to.

@inglesp inglesp moved this to Next in Data Team Sep 7, 2022
@inglesp inglesp added this to Data Team Sep 7, 2022
@evansd evansd moved this from Next to In Progress in Data Team Sep 7, 2022
@evansd evansd moved this from In Progress to Under Review in Data Team Sep 8, 2022
Repository owner moved this from Under Review to Done in Data Team Sep 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CIPHA Work needed for the CIPHA booster effectiveness study
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants