Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove copy statements in sea ice metrics #1041

Merged
merged 5 commits into from
Feb 2, 2024
Merged

Remove copy statements in sea ice metrics #1041

merged 5 commits into from
Feb 2, 2024

Conversation

acordonez
Copy link
Collaborator

@acordonez acordonez commented Feb 1, 2024

Remove copy statements to reduce memory needs in sea ice metrics.

With local testing this reduced max memory usage for the Arctic section from 12 -> 8GB, so it seems to help.

Edit: The new changes I'm pushing get the max memory to about 4Gb for me on gates, on the command line. I've refactored the code a bit and the metrics calculation order has changed a bit in the multiple realizations case. I'm getting slightly different MSE for the climatology for some realizations, and it's not clear why. The difference is usually in the 0.1 or 0.01 place. The figures look very similar but you can see some small shifts in some of the realization spread. I can keep digging into this but it's going to take me at least another day to see if this is a bug or the natural result of the refactor.

Edit2: Some more explanation of the refactor. Previously I would loop through all realizations and save the total ice extent timeseries for each realization. At the end of the loop, I'd get the model mean, calculate the overall total extent and climatologies, and get the metrics.
The new method gets the total mean extent and climatological extent for each realization, then gets the average of those quantities over all models and calculates all the metrics. For the total extent the difference is quite small, but for the climatology this seems to make more of a difference for the MSE.

@acordonez
Copy link
Collaborator Author

@lee1043 @durack1 I don't see a way to test this branch in Jason's prototype PMP binderhub link, but it runs on Gates and should reduce some of the memory usage.

@durack1
Copy link
Collaborator

durack1 commented Feb 1, 2024

@jasonb5 just looping you in on this thread, so we can try and sync concurrent changes to get to the final binder config that works seamlessly.

Any suggestions for @acordonez to get these changes merged into your working update-binderhub branch?

@durack1
Copy link
Collaborator

durack1 commented Feb 1, 2024

@acordonez I just took a quick peek at code, and if you wedge this somewhere, you could call this to report resource usage throughout code execution:

>>> import gc, resource
>>> import numpy as np
>>> def getResources():
...     memGb = "{:05.2f}".format(np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
...     pyObj = "{:08d}".format(len(gc.get_objects()))
...     print("Resources:", memGb, "Gb;", pyObj, "pyObjs in memory")
... 
>>> getResources()
Resources: 00.03 Gb; 00021319 pyObjs in memory

I just wedged this into the driver file as a test, I can push this to your branch if you like?

@acordonez
Copy link
Collaborator Author

@durack1 I'm doing something like that in my code locally, just don't want to commit it.

@lee1043
Copy link
Contributor

lee1043 commented Feb 1, 2024

@acordonez, just curious what if use open_mfdataset instead of open_dataset at L385? open_mfdataset should work for a single file as well. Can you remind me what .compute actually does?

@acordonez
Copy link
Collaborator Author

acordonez commented Feb 1, 2024

@lee1043 Deleted my last comment about the .compute() statement, it actually doesn't seem to make a difference when looking at multiple runs.

@acordonez
Copy link
Collaborator Author

@lee1043 using open_mfdataset increases my max memory use back up, to 11GB. So it's better to use open_dataset if there's only one file to open in terms of memory.

@acordonez
Copy link
Collaborator Author

Masking out the eight different regions is where the memory usage jumps from 3 to 7.5GB, so even though I'm not making new copies directly the program still uses more space for that.

@durack1
Copy link
Collaborator

durack1 commented Feb 1, 2024

@lee1043 using open_mfdataset increases my max memory use back up, to 11GB. So it's better to use open_dataset if there's only one file to open in terms of memory.

a plus ~3Gb/~50% delta using a different function, woah, this is bad!

It seems to me that excessive memory usage is an issue that a number of folks have encountered:
https://stackoverflow.com/questions/73535939/how-to-lower-ram-usage-using-xarray-open-mfdataset-and-the-quantile-function
https://stackoverflow.com/questions/60902571/how-can-i-optimize-a-code-that-is-using-xarray-for-better-performance?rq=3
https://stackoverflow.com/questions/71843487/why-does-manipulating-xarrays-takes-up-so-much-memory-permanently?rq=3

The pangeo comment (below) seems to suggest that using open_dataset uses lazy loading, which is more efficient than the mf call, maybe that's why the 11 vs 8 Gb example?

https://discourse.pangeo.io/t/technical-question-about-reading-file-by-chunks-and-xarray-dask-netcdf/2688

@durack1
Copy link
Collaborator

durack1 commented Feb 1, 2024

@acordonez, just curious what if use open_mfdataset instead of open_dataset at L385? open_mfdataset should work for a single file as well. Can you remind me what .compute actually does?

@lee1043 I think this requests that the lazy loading is actually implemented, so in the case that you choose open_dataset it continues to queue calls to interact with data until the last minute, which means the code executes very fast (no IO) until is requires the load to get to the next step. If you issue .compute I believe this then forces the process, and the IO. I believe .load does a similar thing

Turns out I was a little offbase, see pydata/xarray#6837 for a better description

lee1043 added a commit that referenced this pull request Feb 1, 2024
Update xcdat_openxml.py to use `open_dataset` instead of `open_mfdataset` to save memory, considering discussions at #1041 (comment)
@acordonez
Copy link
Collaborator Author

@durack1 @lee1043 I've just been testing the Arctic section of the code; but with rearranging some steps and heavily using the del statement, I've been able to get the max memory usage down to 5.8GB. That's roughly half of what it was originally. I don't know if we can get it much lower than that. I need to clean up the code before I can push it but it could be ready sometime this afternoon.

@durack1
Copy link
Collaborator

durack1 commented Feb 1, 2024

@durack1 @lee1043 I've just been testing the Arctic section of the code; but with rearranging some steps and heavily using the del statement, I've been able to get the max memory usage down to 5.8GB. That's roughly half of what it was originally. I don't know if we can get it much lower than that. I need to clean up the code before I can push it but it could be ready sometime this afternoon.

This sounds like great progress. If we can get that merged into main then hopefully @jasonb5 should be able to rebase his branch, we can test, and hopefully get this env and binder notebook example working. If we can get these changes done today, it will allow me to ping the LANL folks for them to test too.

And just to check, this 5.8Gb is for the whole notebook, even down to the multi-model examples at the bottom?

@durack1
Copy link
Collaborator

durack1 commented Feb 1, 2024

@sashakames just looping you on this thread, as this real-world example of resource usage is probably something that would be useful in the ESGF2-US project discussions

@acordonez
Copy link
Collaborator Author

@durack1 No this is 5.8 GB for just the Arctic section of the driver code, tested on the command line.

@durack1
Copy link
Collaborator

durack1 commented Feb 1, 2024

@durack1 No this is 5.8 GB for just the Arctic section of the driver code, tested on the command line.

Ah ha, ok. It might be useful to run through the whole thing, so we can ascertain the MAXIMUM memory usage, and if the nimbus machine can support it, bump each binder environment (likely ~40 concurrently) from 10240 to whatever this max usage is, plus a buffer

@durack1
Copy link
Collaborator

durack1 commented Feb 1, 2024

@acordonez just in case it's useful, I just found https://github.com/pythonprofilers/memory_profiler/ - which appears to be alive on conda-forge (here the noarch version looks like the latest release 0.61)

@durack1
Copy link
Collaborator

durack1 commented Feb 1, 2024

@acordonez here are a couple of other tidbits:

pydata/xarray#3200 (comment) - using xr.set_options(file_cache_maxsize=1) ensured memory was released
pydata/xarray#7018 (comment) - using explicit dimension chunking to reduce memory usage (presumably this gets around the netcdf memory leak, e.g. ds = xr.open_dataset(fldr_in_grib, engine="cfgrib", chunks={"latitude":875}, backend_kwargs={'indexpath': ''}))
And another some others with no solutions pydata/xarray#7397, pydata/xarray#7404

@acordonez
Copy link
Collaborator Author

@durack1 Thanks. I'm working on a general fix for the regions since that made a big difference for the Arctic, then I can see if any of the profilers show other spots for improvement.

@durack1
Copy link
Collaborator

durack1 commented Feb 1, 2024

@durack1 Thanks. I'm working on a general fix for the regions since that made a big difference for the Arctic, then I can see if any of the profilers show other spots for improvement.

Perfect! I think while it's a little painful, we're learning alot here.

The comment I made about suppressing warnings, that could also be caught quickly (e.g. here), although we'll have to be careful about using this when we could be suppressing really useful debugging info from users that are going to want to play outside of the basic notebook template

import warnings
warnings.simplefilter("ignore")

@durack1
Copy link
Collaborator

durack1 commented Feb 2, 2024

@acordonez just a little python hack, you can use var()["varName"] = something*somethingElse as a way of dynamically assigning variables (so looping through a list of variable names). This approach has saved me ALOT of code repeats in years gone by - it also means you can load up specifics, so the region names, along with their boundary limits in a dictionary and iterate that way

@acordonez
Copy link
Collaborator Author

@lee1043 @durack1 I've pushed my fix to lower the memory usage of the regional subsetting. This did result in some refactoring of the model loops, and there are some small changes to the resulting metrics values. See my main comment as well. I've pushed a rerun version of the notebook.

@durack1 Adding this to the sea ice driver did not suppress warnings for me. I can try some other things later but need to run now.

import warnings
warnings.simplefilter("ignore")

Copy link
Collaborator

@durack1 durack1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@acordonez this looks great!

@lee1043
Copy link
Contributor

lee1043 commented Feb 2, 2024

@acordonez thank you for the update! The numbers in JSON are consistent before and after the change, they are not different at least until 3rd decimal point.

@durack1
Copy link
Collaborator

durack1 commented Feb 2, 2024

@acordonez thank you for the update! The numbers in JSON are consistent before and after the change, they are not different at least until 3rd decimal point.

@lee1043 I did spot a couple of changes (there are likely a couple of extras that I missed)

  • 0.0878 (old) vs 0.1074 (new) - this is MIROC6-np-r1i1p1f1-OSI-SAF-monthly_clim
  • 83.8321 vs 83.5771 - MIROC6-antarctic-r2i1p1f1-OSI-SAF-monthly_clim also io, ca
  • 0.6767 vs 0.5282 - E3SM-1-0-sp-r1i2p2f1-OSI-SAF-monthly_clim, also arctic

@acordonez
Copy link
Collaborator Author

@acordonez thank you for the update! The numbers in JSON are consistent before and after the change, they are not different at least until 3rd decimal point.

@lee1043 I did spot a couple of changes (there are likely a couple of extras that I missed)

* 0.0878 (old) vs 0.1074 (new) - this is MIROC6-np-r1i1p1f1-OSI-SAF-monthly_clim

* 83.8321 vs 83.5771 - MIROC6-antarctic-r2i1p1f1-OSI-SAF-monthly_clim also io, ca

* 0.6767 vs 0.5282 - E3SM-1-0-sp-r1i2p2f1-OSI-SAF-monthly_clim, also arctic

I think @durack1 caught the worst changes. I edited the main comment to add more details about how the calculations changed.

@durack1
Copy link
Collaborator

durack1 commented Feb 2, 2024

@lee1043 bummer, I just ran the code (pre this PR merge) and maxed out memory even with the 16384 MB limit, so this refactoring is required for the notebook to work with the update config.

Once this is merged, I'll take another try.

Edit: I just tried launching this branch through binder, and it's now out of sync with main and is missing the cartopy library, so it will need to be rebased before it can be tested with a functioning binder instance

Edit2: bummer, I think I just managed to overwrite sea_ice_driver.py with the updated file, and it still bombed at 15.90 GB as a last entry. I think we will have to work with this trimmed down timeseries 1990 to 2010 might work. I just tried the period over which models and obs have complete coverage

sea_ice_driver.py -p sea_ice_param.py --osyear 1990 --oeyear 2015 --msyear 1990 --meyear 2015

and this seemed to work, maxing out around ~12.5 GB through the Antarctic section. I have now tried this ~4 times or so and it seems to work everytime

Edit3: ok, great, I tested this again couple of times, with kernel restarts each time. It works from start to end seamlessly with the temporal limitation from 1990 through 2015. I'll send a copy of the updated notebook via email, which has the most trivial edits over the top of what is the current main version - seems like once we have the notebook updated with the additional args, and this PR merged, we'll be good to get the LANL folks to test this out on their end tomorrow - Yay!

@durack1
Copy link
Collaborator

durack1 commented Feb 2, 2024

Also if we add

import logging
logging.getLogger('xcdat').setLevel(logging.ERROR)

to all files calling xcdat, then we suppress those annoying warnings, voila - this cleans up the output for users a lot. It seems to work in the sector and line plotting scripts, but dropping it into pcmdi_metrics.io.base didn't seem to catch it for the obs files

@acordonez
Copy link
Collaborator Author

Also if we add

import logging
logging.getLogger('xcdat').setLevel(logging.ERROR)

to all files calling xcdat, then we suppress those annoying warnings, voila - this cleans up the output for users a lot. It seems to work in the sector and line plotting scripts, but dropping it into pcmdi_metrics.io.base didn't seem to catch it for the obs files

@durack1 if you got this working on your end would you be able to open a PR for it? The suppress warnings statement has not worked for me inside the PMP code.

@acordonez
Copy link
Collaborator Author

@durack1 @lee1043 Should I merge this PR now? It sounds like some other ideas have come up in the chat that aren't related to this specific fix, which we can address in other PRs.

@lee1043
Copy link
Contributor

lee1043 commented Feb 2, 2024

@acordonez yes please. By merging this PR we can start testing the binderhub, and additional edits can be handled by a separate PR.

@acordonez acordonez merged commit 972984c into main Feb 2, 2024
5 checks passed
@lee1043 lee1043 added this to the 3.3.1 milestone Feb 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants