DataCube caching only effective after reshuffle #986

Open
jdries opened this issue Jan 8, 2025 · 2 comments

Comments


jdries commented Jan 8, 2025

Consider a simple case of cube caching:

cube1 -> costly process -> cube2 -> processA
cube2 -> processB

Geopyspark takes care that the RDD of cube2 is reused.
However, caching is only effective if both processA and processB start with a 'reshuffle' operation, or if 'costly process' itself contains a reshuffle and the actually costly part happens before it.

If, due to later reshuffling, the costly part ends up in stages that are unique to each branch, it is recomputed once per branch.

To solve this, we would ideally need to know which cubes are actually reused, which would allow us to, for instance, force a reshuffle at cache points.
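To make the recomputation concrete, here is a minimal toy model in plain Python. It is not Spark or Geopyspark code: the `Node` class, the `shuffle` flag and the helper names are all hypothetical, and the model only mimics one behaviour, namely that the output of a shuffle is kept and can be reused by a later job, while steps without such a boundary are re-evaluated by every job that depends on them.

```python
class Node:
    """A lazily evaluated step in a toy lineage (hypothetical, not Spark's API)."""

    def __init__(self, name, fn, parent=None, shuffle=False):
        self.name = name
        self.fn = fn
        self.parent = parent
        self.shuffle = shuffle        # True marks a stage boundary
        self._has_output = False      # set once the shuffle output exists
        self._output = None
        self.runs = 0                 # how often this step was actually executed

    def compute(self):
        # A shuffle boundary keeps its output, so later jobs skip everything upstream.
        if self.shuffle and self._has_output:
            return self._output
        data = self.parent.compute() if self.parent else None
        self.runs += 1
        result = self.fn(data)
        if self.shuffle:
            self._output, self._has_output = result, True
        return result


def count_costly_runs(force_shuffle):
    """Run processA and processB as two separate jobs; return how often the costly step ran."""
    cube1 = Node("cube1", lambda _: [1, 2, 3])
    costly = Node("costly process", lambda xs: [x * x for x in xs], parent=cube1)
    cube2 = costly
    if force_shuffle:
        # Assumption: a forced reshuffle directly after the costly step.
        cube2 = Node("reshuffle", lambda xs: list(xs), parent=costly, shuffle=True)
    process_a = Node("processA", lambda xs: sum(xs), parent=cube2)
    process_b = Node("processB", lambda xs: max(xs), parent=cube2)
    process_a.compute()  # first job
    process_b.compute()  # second job
    return costly.runs


print(count_costly_runs(force_shuffle=False))  # costly step runs once per branch
print(count_costly_runs(force_shuffle=True))   # costly step runs once in total
```

In this sketch the costly step runs twice without the boundary and once with it, which is the effect a forced reshuffle at a cache point is meant to achieve.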

jdries added a commit to Open-EO/openeo-geotrellis-extensions that referenced this issue Jan 8, 2025

jdries commented Jan 8, 2025

A forced shuffle is possible:

Regular to_scl_dilation_mask: stage 33 computes the SCL dilation and the filter steps done by masking. The dilation part is repeated here, because it was already done when determining the mask keys:

image

Forced reshuffle after to_scl_dilation: the SCL dilation steps are no longer repeated, as seen in stage 42, which now consists of only a few steps:

image


jdries commented Jan 8, 2025

@HansVRP they actually have access to the cubes they need, but it is the stage boundaries that govern the automatic caching by Spark itself, and stage boundaries are introduced by reshuffles.

Indeed, a better solution might be to force a reshuffle whenever we see that a node is reused. The dry run would be the place to detect this, but this doesn't really exist yet, and we also don't have the machinery in place to then enforce the reshuffle. So this is a larger task to implement properly.
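As a rough illustration of the detection part only, the sketch below counts how many distinct nodes consume each node of an openEO-style process graph through `{"from_node": ...}` references. The function names and the example graph (including its process ids) are hypothetical, and this is not the existing dry-run code. Nested `process_graph` callbacks are skipped, on the assumption that their `from_node` references point to nodes inside the callback.

```python
from collections import Counter


def find_from_nodes(value):
    """Yield every node id referenced via {"from_node": ...} in a nested argument value."""
    if isinstance(value, dict):
        if "from_node" in value:
            yield value["from_node"]
        for key, child in value.items():
            # Skip callbacks: their node ids live in a separate, nested graph.
            if key not in ("from_node", "process_graph"):
                yield from find_from_nodes(child)
    elif isinstance(value, list):
        for child in value:
            yield from find_from_nodes(child)


def reused_nodes(process_graph):
    """Return the sorted ids of nodes that are consumed by more than one other node."""
    counts = Counter()
    for node in process_graph.values():
        # Count each consumer once per producer, even if it references it in several arguments.
        counts.update(set(find_from_nodes(node.get("arguments", {}))))
    return sorted(node_id for node_id, n in counts.items() if n > 1)


# Hypothetical graph mirroring the example from the issue description.
example_graph = {
    "cube1": {"process_id": "load_cube", "arguments": {}},
    "cube2": {"process_id": "costly_process", "arguments": {"data": {"from_node": "cube1"}}},
    "a": {"process_id": "processA", "arguments": {"data": {"from_node": "cube2"}}},
    "b": {"process_id": "processB", "arguments": {"data": {"from_node": "cube2"}}},
    "merge": {
        "process_id": "merge",
        "arguments": {"left": {"from_node": "a"}, "right": {"from_node": "b"}},
        "result": True,
    },
}

print(reused_nodes(example_graph))
```

The ids returned here would be the candidates for a forced reshuffle; actually enforcing it on the backend is the part that still needs machinery.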

jdries added a commit to Open-EO/openeo-geotrellis-extensions that referenced this issue Jan 8, 2025