Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better autocomplete behaviour in ehrQL #506

Open
evansd opened this issue May 27, 2022 · 9 comments
Open

Better autocomplete behaviour in ehrQL #506

evansd opened this issue May 27, 2022 · 9 comments
Milestone

Comments

@evansd
Copy link
Contributor

evansd commented May 27, 2022

The dream of having a fully statically typed query language almost certainly isn't possible due to the limitations of Python's type system and the complexities of the query model. But we should still try to make the editing experience as helpful as possible.

Our target here should be VS Code with the default Python extension installed. That's what many users have locally, but it's also what's provided in online IDEs like Gitpod and Github Codespaces.

I think it's acceptable here for the autocomplete behaviour here to be a bit "optimistic" in that it completes method names that probably exist on the object in question, even if it turns out that they don't always. At least, I think that's a lot better than not showing anything because we can't be certain exactly what methods exist. Especially if we combine that with decent error reporting as in #505.

I don't know exactly how we'd achieve that though. Have some type which has the all the methods and then use that as the type signature on everything?

It would also be really good to get column name completion if that's at all possible.

@benbc
Copy link
Contributor

benbc commented May 30, 2022

I had a spike query language which used parameterized types to give some of what we'd want here. I eventually ran into a limitation that type variables and parameterized types don't compose as you'd like in Python.

It's possible that this will be a bit easier now we've got parallel hierarchies and mixins in the ehrQL definition.

@evansd
Copy link
Contributor Author

evansd commented Aug 8, 2022

The dream of having a fully statically typed query language almost certainly isn't possible due to the limitations of Python's type system and the complexities of the query model

I would like to tentatively retract this statement.

Following on from the changes in #663 I have a proof-of-concept for how autocomplete and type checking could be made to work in ehrQL. In fact it works in the purely in-browser version of VSCode, which was slightly surprising to me.

You should be able to open it directly in the Github editor here. Or you can also visit vscode.dev and copy-paste the example from:
https://github.com/opensafely-core/databuilder/blob/evansd/autocomplete-demo/autocomplete_demo.py

In either case, you should be prompted to install the Python extension and should click "Install anyway" when it warns you that functionality will be limited in the browser. If you've just installed it then you might need to wait 10 seconds or so before autocomplete becomes available.

You should find that the appropriate columns are offered as completions on each frame and, for each of those, the methods offered are just the appropriate ones for the type and dimension of the series.

The key trick here was using a descriptor to access series from frames and then adding lots of signature @overloads to the descriptor's __get__ method which say things like "given an initial type of X and an EventFrame, return a series of type XEventSeries".

There's obviously a combinatorial explosion of overloads here as we add more types, and as we start extending this approach to the rest of the ehrQL methods.

I think the only sane way of maintaining these would be to generate them programmatically i.e. have some script which introspects ehrQL/the Query Model and spits out the relevant signatures. Possibly they can live in a standalone typing stubs file. Or maybe all the concrete series classes should be generated this way, and just the mixins and base classes written by hand.

In any case, I think the only thing that needs immediate thought is making sure that the API surface we're currently exposing to users doesn't prevent us providing autocomplete later. I'm hopeful that the changes in #663 are sufficient here.

@benbc
Copy link
Contributor

benbc commented Aug 15, 2022

Wow.

@inglesp
Copy link
Contributor

inglesp commented Aug 16, 2022

Wow.

Seconded.

I think the only sane way of maintaining these would be to generate them programmatically

We're going to end up with a file called ehrql.xml, aren't we?

@evansd
Copy link
Contributor Author

evansd commented Aug 18, 2022

I've realised there's still an unsolved problem here, which is how to handle operations on frames which return different types of frame while retaining column name autocomplete.

The basic EventFrame->EventFrame methods (take and drop) should be fine as they just return their own type:

def take(self: T, condition: Whatever) -> T:
    ...

But for first_for_patient() we want to return a PatientFrame which has the same columns as an EventFrame. And as yet I've no idea how to do that statically.

@benbc
Copy link
Contributor

benbc commented Aug 18, 2022

@evansd Type parameters? There is a limit to their usefulness because the don't compose fully with type variables in Python, but they may do enough.

@benbc benbc added this to Data Team Dec 6, 2022
evansd added a commit that referenced this issue Mar 14, 2023
We define tables in ehrQL using subclasses of `EventFrame` and
`PatientFrame` e.g.
```py
@table
class event(EventFrame):
  date = Series(datetime.date)
  code = Series(SNOMEDCTCode)
```

This gives us two big advantages:
 * auto-complete for column names;
 * the option to define table-specific [helper][1] methods.

Previously however, as soon as you performed any kind of operation on
one of these tables (i.e. called a method) you'd get back a plain
`EventFrame` or `PatientFrame` with no column auto-completion and no
helper methods.

This PR ensures that we return the appropriate frame subclass from all
methods. This also allows us to remove the `__getattr__` magic from the
`BaseFrame` class.

**The nasty bits**

Returning an appropriate type in all cases requires two bits of trickery
in the form of dynamic class compilation:

 1. When we call `sort_by()` we need the result to have, as well as its
    existing methods, the `get_first/last_for_patient()` methods. So we
    construct a subclass which mixes the necessary methods.

 2. When we call one of `get_first/last_for_patient()` methods we need
    to get back a `PatientFrame`. This should have all the columns
    defined on the original frame, but none of the methods. We
    introspect the class definition to extract all the columns and
    construct a new `PatientFrame` with those columns included.

**Static auto-complete**

The above gives us auto-complete in a dynamic context like an IPython
session when code is actually executed. We also get a limited form of
static (type-based) auto-complete in VSCode. Previously, this worked
only on the original frame and this PR extends this so that it persists
through `where/except_where` calls. However it won't persist through
`sort_by` or `get_first/last_for_patient()`.

After reasonably extensive investigation (which I need to write up in
[this ticket][2]) I don't _think_ our ideal behaviour is acheivable in
Pylance (VSCode's type checker) as things currently stand. But I don't
think anything in this PR makes things worse in that regard.

[1]: #1021
[2]: #506
evansd added a commit that referenced this issue Mar 14, 2023
We define tables in ehrQL using subclasses of `EventFrame` and
`PatientFrame` e.g.
```py
@table
class event(EventFrame):
  date = Series(datetime.date)
  code = Series(SNOMEDCTCode)
```

This gives us two big advantages:
 * auto-complete for column names;
 * the option to define [table-specific helper][1] methods.

Previously however, as soon as you performed any kind of operation on
one of these tables (i.e. called a method) you'd get back a plain
`EventFrame` or `PatientFrame` with no column auto-completion and no
helper methods.

This PR ensures that we return the appropriate frame subclass from all
methods. This also allows us to remove the `__getattr__` magic from the
`BaseFrame` class.

**The nasty bits**

Returning an appropriate type in all cases requires two bits of trickery
in the form of dynamic class compilation:

 1. When we call `sort_by()` we need the result to have, as well as its
    existing methods, the `get_first/last_for_patient()` methods. So we
    construct a subclass which mixes the necessary methods.

 2. When we call one of `get_first/last_for_patient()` methods we need
    to get back a `PatientFrame`. This should have all the columns
    defined on the original frame, but none of the methods. We
    introspect the class definition to extract all the columns and
    construct a new `PatientFrame` with those columns included.

**Static auto-complete**

The above gives us auto-complete in a dynamic context like an IPython
session when code is actually executed. We also get a limited form of
static (type-based) auto-complete in VSCode. Previously, this worked
only on the original frame and this PR extends this so that it persists
through `where/except_where` calls. However it won't persist through
`sort_by` or `get_first/last_for_patient()`.

After reasonably extensive investigation (which I need to write up in
[this ticket][2]) I don't _think_ our ideal behaviour is acheivable in
Pylance (VSCode's type checker) as things currently stand. But I don't
think anything in this PR makes things worse in that regard.

[1]: #1021
[2]: #506
evansd added a commit that referenced this issue Mar 14, 2023
We define tables in ehrQL using subclasses of `EventFrame` and
`PatientFrame` e.g.
```py
@table
class event(EventFrame):
  date = Series(datetime.date)
  code = Series(SNOMEDCTCode)
```

This gives us two big advantages:
 * auto-complete for column names;
 * the option to define [table-specific helper][1] methods.

Previously however, as soon as you performed any kind of operation on
one of these tables (i.e. called a method) you'd get back a plain
`EventFrame` or `PatientFrame` with no column auto-completion and no
helper methods.

This PR ensures that we return the appropriate frame subclass from all
methods. This also allows us to remove the `__getattr__` magic from the
`BaseFrame` class.

**The nasty bits**

Returning an appropriate type in all cases requires two bits of trickery
in the form of dynamic class compilation:

 1. When we call `sort_by()` we need the result to have, as well as its
    existing methods, the `get_first/last_for_patient()` methods. So we
    construct a subclass which mixes the necessary methods.

 2. When we call one of `get_first/last_for_patient()` methods we need
    to get back a `PatientFrame`. This should have all the columns
    defined on the original frame, but none of the methods. We
    introspect the class definition to extract all the columns and
    construct a new `PatientFrame` with those columns included.

**Static auto-complete**

The above gives us auto-complete in a dynamic context like an IPython
session when code is actually executed. We also get a limited form of
static (type-based) auto-complete in VSCode. Previously, this worked
only on the original frame and this PR extends this so that it persists
through `where/except_where` calls. However it won't persist through
`sort_by` or `get_first/last_for_patient()`.

After reasonably extensive investigation (which I need to write up in
[this ticket][2]) I don't _think_ our ideal behaviour is acheivable in
Pylance (VSCode's type checker) as things currently stand. But I don't
think anything in this PR makes things worse in that regard.

[1]: #1021
[2]: #506
evansd added a commit that referenced this issue Mar 14, 2023
We define tables in ehrQL using subclasses of `EventFrame` and
`PatientFrame` e.g.
```py
@table
class event(EventFrame):
  date = Series(datetime.date)
  code = Series(SNOMEDCTCode)
```

This gives us two big advantages:
 * auto-complete for column names;
 * the option to define [table-specific helper][1] methods.

Previously however, as soon as you performed any kind of operation on
one of these tables (i.e. called a method) you'd get back a plain
`EventFrame` or `PatientFrame` with no column auto-completion and no
helper methods.

This PR ensures that we return the appropriate frame subclass from all
methods. This also allows us to remove the `__getattr__` magic from the
`BaseFrame` class.

**The nasty bits**

Returning an appropriate type in all cases requires two bits of trickery
in the form of dynamic class compilation:

 1. When we call `sort_by()` we need the result to have, as well as its
    existing methods, the `get_first/last_for_patient()` methods. So we
    construct a subclass which mixes in the necessary methods.

 2. When we call one of the `get_first/last_for_patient()` methods we
    need to get back a `PatientFrame`. This should have all the columns
    defined on the original frame, but none of the methods. We introspect
    the class definition to extract all the columns and construct a new
    `PatientFrame` with those columns included.

**Static auto-complete**

The above gives us auto-complete in a dynamic context like an IPython
session where code is actually executed. We also get a limited form of
static (type-based) auto-complete in VSCode. Previously, this worked
only on the original frame and this PR extends this so that it persists
through `where/except_where` calls. However it won't persist through
`sort_by` or `get_first/last_for_patient()`.

After reasonably extensive investigation (which I need to write up in
[this ticket][2]) I don't _think_ our ideal behaviour is acheivable in
Pylance (VSCode's type checker) as things currently stand. But I don't
think anything in this PR makes things worse in that regard.

[1]: #1021
[2]: #506
evansd added a commit that referenced this issue Mar 14, 2023
We define tables in ehrQL using subclasses of `EventFrame` and
`PatientFrame` e.g.
```py
@table
class event(EventFrame):
  date = Series(datetime.date)
  code = Series(SNOMEDCTCode)
```

This gives us two big advantages:
 * auto-complete for column names;
 * the option to define [table-specific helper][1] methods.

Previously however, as soon as you performed any kind of operation on
one of these tables (i.e. called a method) you'd get back a plain
`EventFrame` or `PatientFrame` with no column auto-completion and no
helper methods.

This PR ensures that we return the appropriate frame subclass from all
methods. This also allows us to remove the `__getattr__` magic from the
`BaseFrame` class.

**The nasty bits**

Returning an appropriate type in all cases requires two bits of trickery
in the form of dynamic class compilation:

 1. When we call `sort_by()` we need the result to have, as well as its
    existing methods, the `get_first/last_for_patient()` methods. So we
    construct a subclass which mixes in the necessary methods.

 2. When we call one of the `get_first/last_for_patient()` methods we
    need to get back a `PatientFrame`. This should have all the columns
    defined on the original frame, but none of the methods. We introspect
    the class definition to extract all the columns and construct a new
    `PatientFrame` with those columns included.

**Static auto-complete**

The above gives us auto-complete in a dynamic context like an IPython
session where code is actually executed. We also get a limited form of
static (type-based) auto-complete in VSCode. Previously, this worked
only on the original frame and this PR extends this so that it persists
through `where/except_where` calls. However it won't persist through
`sort_by` or `get_first/last_for_patient()`.

After reasonably extensive investigation (which I need to write up in
[this ticket][2]) I don't _think_ our ideal behaviour is acheivable in
Pylance (VSCode's type checker) as things currently stand. But I don't
think anything in this PR makes things worse in that regard.

[1]: #1021
[2]: #506
evansd added a commit that referenced this issue Mar 14, 2023
We define tables in ehrQL using subclasses of `EventFrame` and
`PatientFrame` e.g.
```py
@table
class event(EventFrame):
  date = Series(datetime.date)
  code = Series(SNOMEDCTCode)
```

This gives us two big advantages:
 * auto-complete for column names;
 * the option to define [table-specific helper][1] methods.

Previously however, as soon as you performed any kind of operation on
one of these tables (i.e. called a method) you'd get back a plain
`EventFrame` or `PatientFrame` with no column auto-completion and no
helper methods.

This PR ensures that we return the appropriate frame subclass from all
methods. This also allows us to remove the `__getattr__` magic from the
`BaseFrame` class.

**The nasty bits**

Returning an appropriate type in all cases requires two bits of trickery
in the form of dynamic class compilation:

 1. When we call `sort_by()` we need the result to have, as well as its
    existing methods, the `get_first/last_for_patient()` methods. So we
    construct a subclass which mixes in the necessary methods.

 2. When we call one of the `get_first/last_for_patient()` methods we
    need to get back a `PatientFrame`. This should have all the columns
    defined on the original frame, but none of the methods. We introspect
    the class definition to extract all the columns and construct a new
    `PatientFrame` with those columns included.

**Static auto-complete**

The above gives us auto-complete in a dynamic context like an IPython
session where code is actually executed. We also get a limited form of
static (type-based) auto-complete in VSCode. Previously, this worked
only on the original frame and this PR extends this so that it persists
through `where/except_where` calls. However it won't persist through
`sort_by` or `get_first/last_for_patient()`.

After reasonably extensive investigation (which I need to write up in
[this ticket][2]) I don't _think_ our ideal behaviour is acheivable in
Pylance (VSCode's type checker) as things currently stand. But I don't
think anything in this PR makes things worse in that regard.

[1]: #1021
[2]: #506
evansd added a commit that referenced this issue Mar 14, 2023
We define tables in ehrQL using subclasses of `EventFrame` and
`PatientFrame` e.g.
```py
@table
class event(EventFrame):
  date = Series(datetime.date)
  code = Series(SNOMEDCTCode)
```

This gives us two big advantages:
 * auto-complete for column names;
 * the option to define [table-specific helper][1] methods.

Previously however, as soon as you performed any kind of operation on
one of these tables (i.e. called a method) you'd get back a plain
`EventFrame` or `PatientFrame` with no column auto-completion and no
helper methods.

This PR ensures that we return the appropriate frame subclass from all
methods. This also allows us to remove the `__getattr__` magic from the
`BaseFrame` class.

**The nasty bits**

Returning an appropriate type in all cases requires two bits of trickery
in the form of dynamic class compilation:

 1. When we call `sort_by()` we need the result to have, as well as its
    existing methods, the `get_first/last_for_patient()` methods. So we
    construct a subclass which mixes in the necessary methods.

 2. When we call one of the `get_first/last_for_patient()` methods we
    need to get back a `PatientFrame`. This should have all the columns
    defined on the original frame, but none of the methods. We introspect
    the class definition to extract all the columns and construct a new
    `PatientFrame` with those columns included.

**Static auto-complete**

The above gives us auto-complete in a dynamic context like an IPython
session where code is actually executed. We also get a limited form of
static (type-based) auto-complete in VSCode. Previously, this worked
only on the original frame and this PR extends this so that it persists
through `where/except_where` calls. However it won't persist through
`sort_by` or `get_first/last_for_patient()`.

After reasonably extensive investigation (which I need to write up in
[this ticket][2]) I don't _think_ our ideal behaviour is acheivable in
Pylance (VSCode's type checker) as things currently stand. But I don't
think anything in this PR makes things worse in that regard.

[1]: #1021
[2]: #506
@inglesp inglesp added this to the P3 milestone Oct 30, 2023
@rw251
Copy link
Contributor

rw251 commented Dec 20, 2024

A lot of this has now been done in #2337 . Here are the outstanding things:

  • first_for_patient() and last_for_patient() might not be possible without a refactor as noted above:

But for first_for_patient() we want to return a PatientFrame which has the same columns as an EventFrame. And as yet I've no idea how to do that statically.

  • DateDifference. It would be straightforward to add a return type of DateDifference for the __sub__ and __rsub__ methods on DateFunctions. But what you really want is to then know whether properties .days, .weeks etc are IntPatientSeries or IntEventSeries
  • Nothing has been done for .case(when().then(), otherwise=).
  • Nothing has been done for to_category() (an alias for map_values()). They're almost always going to return StrXXXSeries, which would be easy to hint (same as most existing hints), but maybe your category could be an int, float, bool etc? Maybe we can assume to_category() returns strings, but the more generic map_values() can return anything
  • maximum_of() and minimum_of() currently only give type hints if the first argument is a Series. It doesn't work otherwise because the casting uses the type of the first Series argument, so e.g. maximum_of("2024-01-01", date_series), and maximum_of(30, float_series) will both work because the first argument casts to the series type - but it makes hints tricky
  • Finally nothing has been done to automate this by extracting the signatures into a stub file.

@rw251
Copy link
Contributor

rw251 commented Jan 7, 2025

A couple of other minor things that I've spotted:

  • patients.is_alive_on() is not working because the type hints here:
    @overload
    def is_null(self: "PatientSeries", other) -> "BoolPatientSeries": ...
    @overload
    def is_null(self: "EventSeries", other) -> "BoolEventSeries": ...
    def is_null(self):
    """
    Return a boolean series which is True for each value in this series which is
    NULL, and False otherwise.
    """
    return _apply(qm.Function.IsNull, self)
    @overload
    def is_not_null(self: "PatientSeries", other) -> "BoolPatientSeries": ...
    @overload
    def is_not_null(self: "EventSeries", other) -> "BoolEventSeries": ...
    def is_not_null(self):
    incorrectly have an other parameter so they aren't actually matching as expected.
  • patients.age_on() is also not working, but I think because the pattern (date - DatePatientSeries) isn't currently type hinted.

@rw251
Copy link
Contributor

rw251 commented Jan 16, 2025

Another thing. When you define a dataset the autocomplete shows define_population() and a couple of others, but it doesn't autocomplete for any columns set. E.g.

dataset.sex = patients.sex
dataset.

sex doesn't appear in the autocomplete dropdown

UPDATE
This closed issue on pylance suggests that dynamically added props on a class instance will never be supported. So short of a large refactor of how we define a dataset this seems unlikely to happen. However, although sex wont appear in the dropdown, the language server still knows what dataset.sex is, so dataset.sex. gives the correct autocomplete behaviour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

4 participants