Placeholder: Decide what researchers need to import in order to write a dataset definition #527
Comments
I'm going to treat this as the question: from where do researchers import the I'd like to advocate for separate modules for each backend, and for each set of tables that might be implemented by several backends. That is, we'd have several modules like:
And we'd also have modules like:
Each of these would contain just those tables, and just the columns on those tables, which we'd require to be implemented by any backend claiming "Core EHR" compatibility. The idea is that once you've decided what backend or backends your study is targeting, you don't have a separate question of working out what tables and columns you're allowed to use in your study; dataset validation and autocomplete take care of the rest. Of course, there would be a lot of commonality between these modules. I anticipate that behind the scenes a lot of these tables would be defined together in some base module and then each individual backend would import just the tables it needs. There's a further question of how to handle studies which are designed to be run against multiple backends but need to do slightly different things for each backend. I've got thoughts about how we could handle this but I think that's worthy of a separate discussion. |
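To make the "base module plus per-backend selection" idea above concrete, here is a minimal sketch. It is not databuilder code: the `Table` class, its `restrict` helper, the `example_backend` name and the column names other than `date_of_birth` and `sex` are all hypothetical, and `SimpleNamespace` objects stand in for what would be real Python modules.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Tuple


@dataclass(frozen=True)
class Table:
    """Hypothetical stand-in for a table definition: a name plus column names."""

    name: str
    columns: Tuple[str, ...]

    def restrict(self, *columns: str) -> "Table":
        """Return a copy of this table exposing only the given columns."""
        unknown = [c for c in columns if c not in self.columns]
        if unknown:
            raise ValueError(f"{self.name} has no columns {unknown}")
        return Table(self.name, tuple(columns))


# "Base module": every table is defined once, with every known column.
base = SimpleNamespace(
    patients=Table("patients", ("date_of_birth", "sex", "ethnicity")),
    clinical_events=Table("clinical_events", ("date", "code", "numeric_value")),
)

# "Core EHR module": only the tables and columns any compatible backend must provide.
core_ehr = SimpleNamespace(
    patients=base.patients.restrict("date_of_birth", "sex"),
)

# A backend-specific module (assumed name) re-exports just the tables it supports.
example_backend = SimpleNamespace(
    patients=base.patients,
    clinical_events=base.clinical_events,
)
```

Under these assumptions, a study written against `core_ehr` simply cannot see `clinical_events` or `patients.ethnicity`, which is the property that would let validation and autocomplete answer the "what am I allowed to use?" question.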
Further thought: we should probably establish a pattern in dataset definitions along the lines of:
    from databuilder.tables import core_ehr as tables
so that it's possible to change the schema the dataset is built from just by changing the import. |
If we encourage:
then would it mean that users had to qualify all their table variables, eg Alternatively, if we encouraged users to do:
then they'd still be able to change the schema just by changing the import. |
Meant to say: the proposal feels sound to me. |
Oh yes, good point. Your suggestion is a better one. |
These are excellent thoughts. Here are mine.
|
Thanks @benbc. On 1, I'm very happy to punt this and not include it for now. I don't feel like the "Core EHR" thing was my idea but I can't now remember where it came from. On 2, I was just discussing with Peter the idea of unifying ehrQL schemas and contracts. I think following #663 this would be reasonably straightforward. It would be a case of letting the So, taking the example from @construct
    class patients(PatientFrame):
        sex = Series(
            str,
            choices=["female", "male", "intersex", "unknown"],
            description="Patient's sex as defined by the options: male, female, intersex, unknown.",
            implementation_notes_to_add_to_description=(
                'Specify how this has been determined, e.g. "sex at birth", or "current sex".'
            ),
            constraints=[NotNullConstraint()],
        )
        date_of_birth = Series(
            datetime.date,
            description=(
                "Patient's year and month of birth, provided in format YYYY-MM-01. "
                "The day will always be the first of the month."
            ),
            constraints=[FirstOfMonthConstraint(), NotNullConstraint()],
        )
That would be both directly importable and usable by ehrQL, and contain all the metadata currently encoded in the contract. I don't think we really benefit from having them as two separate things, do we? On 3, that's a good point about the cleanliness aspect. I've wondered before about renaming On 4, I share your disappointment but I think Peter's right that importing the table names is overall better. I did wonder about |
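As a rough illustration of what "one definition serving as both schema and contract" could look like, here is a self-contained toy version. The class names mirror the example above, but these are simplified stand-ins written for this sketch, not the real databuilder/ehrQL classes; `accepts` and `failing_columns` are hypothetical helpers, and `PatientFrame` is omitted.

```python
import datetime
from dataclasses import dataclass, field
from typing import List, Optional


class NotNullConstraint:
    def check(self, value) -> bool:
        return value is not None


class FirstOfMonthConstraint:
    def check(self, value) -> bool:
        # Nulls are left to NotNullConstraint; only real dates are checked here.
        return value is None or value.day == 1


@dataclass
class Series:
    """Toy column definition carrying the metadata a contract would hold."""

    type_: type
    description: str = ""
    choices: Optional[List[str]] = None
    constraints: List[object] = field(default_factory=list)

    def accepts(self, value) -> bool:
        if value is not None and not isinstance(value, self.type_):
            return False
        if value is not None and self.choices is not None and value not in self.choices:
            return False
        return all(c.check(value) for c in self.constraints)


class patients:
    sex = Series(
        str,
        choices=["female", "male", "intersex", "unknown"],
        constraints=[NotNullConstraint()],
    )
    date_of_birth = Series(
        datetime.date,
        constraints=[FirstOfMonthConstraint(), NotNullConstraint()],
    )


def failing_columns(table, row: dict) -> List[str]:
    """Contract-style check: names of columns whose value in `row` is rejected."""
    columns = {k: v for k, v in vars(table).items() if isinstance(v, Series)}
    return [name for name, series in columns.items() if not series.accepts(row.get(name))]
```

In this sketch the same `patients` class that a researcher would import is also what a backend test could call `failing_columns` on, so there is only one place where types, choices and constraints are written down.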
Your suggestion for |
I don't feel v strongly about unifying the table and contract types in the codebase, although fewer things is generally better than more, obvs. But I feel strongly that it needs to feel unified to researchers, e.g. the vocabulary of contracts and tables either needs unifying or there needs to be a really clear semantic distinction. We also need to consider the needs of backend implementers as consumers: for them the concepts of "contracts" and "tables" are clearly different. But researchers' needs will ultimately trump theirs. |
Can I bikeshed the
My tentative suggestion here would be |
Is there a one-one correspondence with whatever these things are, and things that quack like tables? |
I think that
|
That's fair. In my head, we were going to have to teach people what frames are, and the different flavours they come in, but quite possibly we don't; I haven't thought that much about the pedagogy here. I think all I'm really keen on is that we use terms deliberately and carefully, and I was worried we were introducing the term "tables" accidentally alongside "frames", "contracts" and "schemas". Docs can be rewritten but names we bake into the API are obviously harder to change so I wanted to make sure we got it right. I'm happy to stick with |
Maybe not in the code. But, for what it's worth, it does appear:
It might be that this is all better hidden in future. Here and now, a user might encounter the term. So it's possibly helpful to at least know what is being referred to. |
(Was "Make it trivial for researchers to import exactly and only what they need to write dataset definitions" in Shortcut.)