Cache csv download data and shas #1746

rebkwok · 2023-11-15T17:42:04Z

Fixes #1744
And hopefully also fixes #1740

The opensafely CLI, study github actions and jobserver now all call the new codelist check endpoing /api/v1/check. Mostly this is a pretty fast call, but for very large dm+d codelists, and/or studies that have a lot of codelists, especially dm+d ones, it can be slow, and it seems to be the cause of tech-support issue #1740

We can cache these shas so we don't have to go through the VMP mapping checks every time. dm+d is release once a week, so at most we need to re-cache weekly. There are under 700 dm+d codelist versions that would need to be re-cached, and locally this is quick to do, so it can be done as part of the weekly import.

Since caching the shas involves fetching the CSV data for downloading, we can also cache that. It's caching with a key that includes the function parameters so that we can cache different calls (e.g. OSI will always pass fixed_headers=True)

inglesp

I think this is ok, but there's a risk of a (harmless, as far as I can tell) race when a codelist is downloaded at the same time as the caches are being built after import.

Do we need to rebuild the caches after import? Or can we do the caching on demand?

rebkwok · 2023-11-16T13:58:28Z

I think this is ok, but there's a risk of a (harmless, as far as I can tell) race when a codelist is downloaded at the same time as the caches are being built after import.

Do we need to rebuild the caches after import? Or can we do the caching on demand?

We don't need to rebuild the caches after import, but I was concerned that if we don't, we'll still sometimes get the behaviour we saw yesterday on OpenCodelists, where the check endpoint made everything go slow. Given that it takes a minimal time to re-cache 600+ dm+d codelist versions though, it's possible that dm+d isn't actually the problem, it's more studies that have huge numbers of codelists in general.

However, there's also a chance of a race if two users try to download a codelist that needs cached at the same time (also CI runs/job-server checks, since the check endpoint will also re-cache if necessary). The import runs automatically at about 2am, so there's probably less risk of a race with users if it re-caches then.

I could make the re-cache after import an optional flag, so it does it in the automated 2am import, but not if we re-run an import manually (more likely to be during working hours)

inglesp · 2023-11-16T14:02:29Z

I could make the re-cache after import an optional flag,

I don't think it's worth adding the complexity for this.

rebkwok added 4 commits November 15, 2023 15:31

Return an error in the check response for non-downloadable dmd codelists

b021dde

Cache the codelist download shas

e0cc3f7

Cache the csv download data

27e1d5c

Cache CSV download data after dm+d imports

e397766

inglesp approved these changes Nov 16, 2023

View reviewed changes

rebkwok merged commit 1bedeec into main Nov 16, 2023
6 checks passed

rebkwok deleted the cache-csv-data branch November 16, 2023 14:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache csv download data and shas #1746

Cache csv download data and shas #1746

rebkwok commented Nov 15, 2023

inglesp left a comment

rebkwok commented Nov 16, 2023

inglesp commented Nov 16, 2023

Cache csv download data and shas #1746

Cache csv download data and shas #1746

Conversation

rebkwok commented Nov 15, 2023

inglesp left a comment

Choose a reason for hiding this comment

rebkwok commented Nov 16, 2023

inglesp commented Nov 16, 2023