Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache csv download data and shas #1746

Merged
merged 4 commits into from
Nov 16, 2023
Merged

Cache csv download data and shas #1746

merged 4 commits into from
Nov 16, 2023

Conversation

rebkwok
Copy link
Contributor

@rebkwok rebkwok commented Nov 15, 2023

Fixes #1744
And hopefully also fixes #1740

The opensafely CLI, study github actions and jobserver now all call the new codelist check endpoing /api/v1/check. Mostly this is a pretty fast call, but for very large dm+d codelists, and/or studies that have a lot of codelists, especially dm+d ones, it can be slow, and it seems to be the cause of tech-support issue #1740

We can cache these shas so we don't have to go through the VMP mapping checks every time. dm+d is release once a week, so at most we need to re-cache weekly. There are under 700 dm+d codelist versions that would need to be re-cached, and locally this is quick to do, so it can be done as part of the weekly import.

Since caching the shas involves fetching the CSV data for downloading, we can also cache that. It's caching with a key that includes the function parameters so that we can cache different calls (e.g. OSI will always pass fixed_headers=True)

Copy link
Contributor

@inglesp inglesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ok, but there's a risk of a (harmless, as far as I can tell) race when a codelist is downloaded at the same time as the caches are being built after import.

Do we need to rebuild the caches after import? Or can we do the caching on demand?

@rebkwok
Copy link
Contributor Author

rebkwok commented Nov 16, 2023

I think this is ok, but there's a risk of a (harmless, as far as I can tell) race when a codelist is downloaded at the same time as the caches are being built after import.

Do we need to rebuild the caches after import? Or can we do the caching on demand?

We don't need to rebuild the caches after import, but I was concerned that if we don't, we'll still sometimes get the behaviour we saw yesterday on OpenCodelists, where the check endpoint made everything go slow. Given that it takes a minimal time to re-cache 600+ dm+d codelist versions though, it's possible that dm+d isn't actually the problem, it's more studies that have huge numbers of codelists in general.

However, there's also a chance of a race if two users try to download a codelist that needs cached at the same time (also CI runs/job-server checks, since the check endpoint will also re-cache if necessary). The import runs automatically at about 2am, so there's probably less risk of a race with users if it re-caches then.

I could make the re-cache after import an optional flag, so it does it in the automated 2am import, but not if we re-run an import manually (more likely to be during working hours)

@inglesp
Copy link
Contributor

inglesp commented Nov 16, 2023

I could make the re-cache after import an optional flag,

I don't think it's worth adding the complexity for this.

@rebkwok rebkwok merged commit 1bedeec into main Nov 16, 2023
6 checks passed
@rebkwok rebkwok deleted the cache-csv-data branch November 16, 2023 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cache CSV download shas OpenCodelists Freshping checks are flapping
2 participants