DataCube caching only effective after reshuffle #986
Comments
A forced shuffle is possible. With the regular to_scl_dilation_mask, stage 33 computes the SCL dilation, and the filter steps are done by masking, so the dilation part is repeated because it was already done when determining the mask keys. With a forced reshuffle after to_scl_dilation, the SCL dilation steps are no longer repeated, as seen in stage 42, which is now only a few steps.
@HansVRP they actually have access to the cubes they needed, but it is the stage boundaries that govern Spark's own automatic caching, and stage boundaries are introduced by reshuffles. Indeed, a better solution might be to force a reshuffle whenever we see that a node is reused. The dry run would be the place to detect this, but it doesn't really exist yet, and we also don't have the machinery in place to then enforce the reshuffle. So this is a larger task to implement properly.
…also provides significant speed-up of the test itself Open-EO/openeo-geopyspark-driver#986
Consider a simple case of cube caching:
Geopyspark will take care that the RDD of cube2 is reused.
However, caching is only effective if both processA and processB start with a 'reshuffle' operation, or if there was a reshuffle as part of 'costly process' and the actually costly part happened before that reshuffle.
If, due to later reshuffling, the costly part ends up in 'unique' stages, it will be recomputed multiple times.
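The recomputation effect can be illustrated with a toy model in plain Python. This is only a sketch and does not use Spark: `LazyCube` and `with_boundary` are hypothetical names invented here, where `with_boundary` stands in for a reshuffle whose output is written once and then reused by every downstream consumer. `costly_process`, processA and processB are the placeholders from the description above.

```python
from typing import Callable, List

# Counts how often the expensive step actually runs.
calls = {"costly": 0}


def costly_process(value: int) -> int:
    calls["costly"] += 1
    return value * value


class LazyCube:
    """Toy stand-in for an RDD: stores a recipe and re-runs it on every evaluation."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute

    def map(self, func: Callable[[int], int]) -> "LazyCube":
        # Lazy: nothing is computed until collect() is called.
        return LazyCube(lambda: [func(v) for v in self._compute()])

    def with_boundary(self) -> "LazyCube":
        """Toy stand-in for a reshuffle: upstream output is produced once and reused."""
        stored: List[List[int]] = []

        def compute() -> List[int]:
            if not stored:
                stored.append(self._compute())
            return stored[0]

        return LazyCube(compute)

    def collect(self) -> List[int]:
        return self._compute()


def count_costly_calls(force_boundary: bool) -> int:
    calls["costly"] = 0
    cube1 = LazyCube(lambda: [1, 2, 3])
    cube2 = cube1.map(costly_process)
    if force_boundary:
        cube2 = cube2.with_boundary()
    cube3 = cube2.map(lambda v: v + 1)  # processA (placeholder)
    cube4 = cube2.map(lambda v: v * 2)  # processB (placeholder)
    # Both branches are evaluated, as they would be when merging the cubes.
    _merged = [a + b for a, b in zip(cube3.collect(), cube4.collect())]
    return calls["costly"]


without_boundary = count_costly_calls(force_boundary=False)
with_forced_boundary = count_costly_calls(force_boundary=True)
```

In this model the costly step runs once per element for each consumer when there is no boundary, and once per element in total when a boundary sits directly after it. Real Spark scheduling is more involved, so treat this only as an intuition for why the position of the reshuffle matters.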
To solve this, we would ideally need to know which cubes are actually reused, which would allow us to, for instance, force a reshuffle at cache points.
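Detecting reused cubes could, for example, be done on the process graph itself. The following is a minimal sketch, assuming the flat openEO process graph layout in which arguments reference other nodes through `{"from_node": "<node id>"}`. The function names, the node ids and the collection id are hypothetical, and this is not how the driver currently works.

```python
from collections import Counter
from typing import Any, Dict, Iterator, Set


def iter_from_nodes(value: Any) -> Iterator[str]:
    """Yield every node id referenced via {"from_node": ...} inside an argument value."""
    if isinstance(value, dict):
        if "process_graph" in value:
            # Child process graphs (callbacks) have their own node ids; skip them.
            return
        if isinstance(value.get("from_node"), str):
            yield value["from_node"]
        else:
            for item in value.values():
                yield from iter_from_nodes(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_from_nodes(item)


def find_reused_nodes(process_graph: Dict[str, Dict[str, Any]]) -> Set[str]:
    """Return ids of nodes consumed by more than one other node (candidate cache points)."""
    consumers: Counter = Counter()
    for node in process_graph.values():
        # Use a set so a node referenced twice by the same consumer counts once.
        consumers.update(set(iter_from_nodes(node.get("arguments", {}))))
    return {node_id for node_id, count in consumers.items() if count > 1}


# Hypothetical graph mirroring the case above: "costly" feeds both "a" and "b".
example_graph = {
    "load": {"process_id": "load_collection", "arguments": {"id": "EXAMPLE_COLLECTION"}},
    "costly": {"process_id": "costly_process", "arguments": {"data": {"from_node": "load"}}},
    "a": {"process_id": "processA", "arguments": {"data": {"from_node": "costly"}}},
    "b": {"process_id": "processB", "arguments": {"data": {"from_node": "costly"}}},
    "merge": {
        "process_id": "merge_cubes",
        "arguments": {"cube1": {"from_node": "a"}, "cube2": {"from_node": "b"}},
        "result": True,
    },
}

reused = find_reused_nodes(example_graph)
```

The returned ids mark where a forced reshuffle might pay off. Actually enforcing the reshuffle at those points is a separate step that this sketch does not cover.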