Remove copy statements in sea ice metrics #1041
Conversation
@jasonb5 just looping you in on this thread so we can try to sync concurrent changes and get to a final binder config that works seamlessly. Any suggestions for @acordonez on getting these changes merged into your working …?
@acordonez I just took a quick peek at the code, and if you wedge this somewhere, you could call it to report resource usage throughout code execution:

```python
>>> import gc, resource
>>> import numpy as np
>>> def getResources():
...     memGb = "{:05.2f}".format(np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
...     pyObj = "{:08d}".format(len(gc.get_objects()))
...     print("Resources:", memGb, "Gb;", pyObj, "pyObjs in memory")
...
>>> getResources()
Resources: 00.03 Gb; 00021319 pyObjs in memory
```

I just wedged this into the driver file as a test; I can push it to your branch if you like?
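As a rough, hypothetical illustration of how such a checkpoint could be wedged into a driver (the loop and model list below are made up; the helper is the same one defined above), calling it after each expensive step shows where the peak grows:

```python
import gc, resource
import numpy as np

def getResources():
    # Same helper as above: report peak RSS (in GB) and the live Python object count.
    memGb = "{:05.2f}".format(np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) / 1.e6)
    pyObj = "{:08d}".format(len(gc.get_objects()))
    print("Resources:", memGb, "Gb;", pyObj, "pyObjs in memory")

getResources()                        # baseline, before any data is loaded
for model in ["model_a", "model_b"]:  # hypothetical model list
    data = [0.0] * 1_000_000          # stand-in for loading data and computing metrics
    getResources()                    # ru_maxrss only ever grows, so this tracks the running peak
    del data
```

Note that the `resource` module (like the original snippet) is Unix-only.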
@durack1 I'm doing something like that in my code locally, I just don't want to commit it.
@acordonez, just curious, what if you use …?
@lee1043 I deleted my last comment about the `.compute()` statement; it doesn't actually seem to make a difference when looking at multiple runs.
@lee1043 using `open_mfdataset` brings my max memory use back up to 11 GB, so in terms of memory it's better to use `open_dataset` when there's only one file to open.
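For what it's worth, a minimal sketch of that choice (the file name below is hypothetical): xcdat exposes both entry points, and for a single file `open_dataset` was the cheaper option here.

```python
import xcdat

# Hypothetical single input file.
path = "siconc_SImon_MODEL_historical_r1i1p1f1_gn_185001-201412.nc"

ds = xcdat.open_dataset(path)      # single file: the lower-memory option observed above
# ds = xcdat.open_mfdataset(path)  # multi-file interface: peaked around 11 GB in this test
```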
Masking out the eight different regions is where the memory usage jumps from 3 to 7.5 GB, so even though I'm not making new copies directly, the program still uses more space for that step.
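A toy illustration of why that can happen (sizes and names below are made up): each regional `.where()` call returns a new array the same size as the input, so holding several regional masks at once multiplies the footprint even though nothing is copied explicitly.

```python
import numpy as np
import xarray as xr

# Hypothetical sea ice concentration field: 120 time steps on a 1x1 degree grid.
siconc = xr.DataArray(
    np.random.rand(120, 180, 360),
    dims=("time", "lat", "lon"),
    coords={"lat": np.linspace(-89.5, 89.5, 180)},
)

arctic = siconc.where(siconc.lat > 65)      # new full-size array, not a view
antarctic = siconc.where(siconc.lat < -55)  # another full-size array

print(siconc.nbytes / 1e6, "MB per array")  # ~62 MB each; three such arrays are now alive at once
```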
A plus ~3 GB / ~50% delta using a different function, whoa, this is bad! It seems to me that excessive memory usage is an issue a number of folks have encountered. The Pangeo comment (below) seems to suggest that using …
@lee1043 I think this requests that the lazy loading actually be implemented, so in the case that you choose … Edit: turns out I was a little off base; see pydata/xarray#6837 for a better description.
Update xcdat_openxml.py to use `open_dataset` instead of `open_mfdataset` to save memory, considering discussions at #1041 (comment)
@durack1 @lee1043 I've just been testing the Arctic section of the code, but by rearranging some steps and using the `del` statement heavily, I've been able to get the max memory usage down to 5.8 GB. That's roughly half of what it was originally; I don't know if we can get it much lower than that. I need to clean up the code before I can push it, but it could be ready sometime this afternoon.
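A minimal sketch of that `del` pattern (the region definitions and names below are hypothetical): keep only the small reduced result for each region and drop the full-size masked intermediate before moving on, so the peak stays closer to one working copy at a time.

```python
import gc
import numpy as np
import xarray as xr

# Hypothetical sea ice concentration field.
siconc = xr.DataArray(
    np.random.rand(120, 180, 360),
    dims=("time", "lat", "lon"),
    coords={"lat": np.linspace(-89.5, 89.5, 180)},
)

regions = {"arctic": siconc.lat > 65, "antarctic": siconc.lat < -55}  # made-up region masks
extents = {}
for name, condition in regions.items():
    masked = siconc.where(condition)                # full-size intermediate
    extents[name] = masked.sum(dim=("lat", "lon"))  # small time series we keep
    del masked                                      # release the big array before the next region
gc.collect()                                        # optionally nudge collection along
```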
This sounds like great progress. If we can get that merged into …, even better. And just to check, this 5.8 GB is for the whole notebook, even down to the multi-model examples at the bottom?
@sashakames just looping you in on this thread, as this real-world example of resource usage is probably something that would be useful in the ESGF2-US project discussions.
@durack1 No, this is 5.8 GB for just the Arctic section of the driver code, tested on the command line.
Ah ha, ok. It might be useful to run through the whole thing so we can ascertain the MAXIMUM memory usage and, if the nimbus machine can support it, bump each binder environment (likely ~40 concurrently) from 10240 MB to whatever this max usage is, plus a buffer.
@acordonez just in case it's useful, I just found https://github.com/pythonprofilers/memory_profiler/, which appears to be alive on conda-forge (here the …).
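For reference, a minimal sketch of how memory_profiler is typically used, assuming it is installed (e.g. from conda-forge); the decorated function below is a hypothetical stand-in for a memory-heavy driver step.

```python
from memory_profiler import profile

@profile
def heavy_step():
    # Hypothetical stand-in for a memory-hungry section of the driver.
    data = [float(i) for i in range(1_000_000)]
    return sum(data)

if __name__ == "__main__":
    heavy_step()
```

Running the script prints a line-by-line memory report for the decorated function (it can also be launched as `python -m memory_profiler script.py`).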
@acordonez here are a couple of other tidbits: pydata/xarray#3200 (comment) - using …
@durack1 Thanks. I'm working on a general fix for the regions, since that made a big difference for the Arctic; then I can see if any of the profilers show other spots for improvement.
Perfect! I think that while it's a little painful, we're learning a lot here. The comment I made about suppressing warnings could also be handled quickly (e.g. here), although we'll have to be careful about using this, since we could be suppressing really useful debugging info from users who are going to want to play outside of the basic notebook template:

```python
import warnings
warnings.simplefilter("ignore")
```
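If the blanket filter feels too aggressive, one hedged alternative is to scope the suppression with `warnings.catch_warnings()` so only a chosen block is silenced; the `noisy_step` function below is a hypothetical stand-in for a warning-prone driver call.

```python
import warnings

def noisy_step():
    # Hypothetical stand-in for a driver step that triggers library warnings.
    warnings.warn("example warning from a dependency")
    return 42

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    value = noisy_step()  # warning silenced inside this block only

noisy_step()              # outside the block, the previous filter state is restored
```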
@acordonez just a little Python hack: you can use …
@lee1043 @durack1 I've pushed my fix to lower the memory usage of the regional subsetting. This did result in some refactoring of the model loops, and there are some small changes to the resulting metrics values. See my main comment as well. I've pushed a rerun version of the notebook. @durack1 Adding this to the sea ice driver did not suppress warnings for me. I can try some other things later but need to run now.
|
@acordonez this looks great!
@acordonez thank you for the update! The numbers in the JSON are consistent before and after the change; they don't differ until at least the third decimal place.
@lee1043 I did spot a couple of changes (there are likely a couple of extras that I missed):
|
I think @durack1 caught the worst changes. I edited the main comment to add more details about how the calculations changed.
@lee1043 bummer, I just ran the code (pre this PR merge) and maxed out memory even with the 16384 MB limit, so this refactoring is required for the notebook to work with the updated config. Once this is merged, I'll take another try.

Edit: I just tried launching this branch through binder, and it's now out of sync with …

Edit2: bummer, I think I just managed to overwrite …

```
sea_ice_driver.py -p sea_ice_param.py --osyear 1990 --oeyear 2015 --msyear 1990 --meyear 2015
```

and this seemed to work, maxing out around ~12.5 GB through the Antarctic section. I have now tried this ~4 times or so and it seems to work every time.

Edit3: ok, great, I tested this again a couple more times, with kernel restarts each time. It works from start to end seamlessly with the temporal limitation from 1990 through 2015. I'll send a copy of the updated notebook via email, which has only the most trivial edits over the top of the current …
Also, if we add

```python
import logging
logging.getLogger('xcdat').setLevel(logging.ERROR)
```

to all files calling xcdat, then we suppress those annoying warnings, voila, this cleans up the output for users a lot. It seems to work in the sector and line plotting scripts, but dropping it into …
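A small sketch of how narrowly that targets things: only the "xcdat" logger is silenced, while other loggers keep their normal levels (the second logger name below is purely illustrative).

```python
import logging

logging.basicConfig(level=logging.INFO)

# Silence only the xcdat logger; everything else is untouched.
logging.getLogger("xcdat").setLevel(logging.ERROR)

logging.getLogger("xcdat").warning("hidden")                     # suppressed
logging.getLogger("some_other_module").warning("still visible")  # still printed
```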
@durack1 if you got this working on your end, would you be able to open a PR for it? The warning-suppression statement has not worked for me inside the PMP code.
@acordonez yes, please. By merging this PR we can start testing the binderhub, and additional edits can be handled in a separate PR.
Remove copy statements to reduce memory needs in sea ice metrics.
In local testing this reduced max memory usage for the Arctic section from 12 GB to 8 GB, so it seems to help.
Edit: The new changes I'm pushing get the max memory down to about 4 GB for me on gates, on the command line. I've refactored the code a bit, and the order of the metrics calculation has changed in the multiple-realizations case. I'm getting slightly different MSE for the climatology for some realizations, and it's not clear why; the difference is usually in the tenths or hundredths place. The figures look very similar, but you can see some small shifts in the realization spread. I can keep digging into this, but it will take me at least another day to determine whether this is a bug or the natural result of the refactor.
Edit2: Some more explanation of the refactor. Previously I would loop through all realizations and save the total ice extent timeseries for each realization. At the end of the loop, I'd get the model mean, calculate the overall total extent and climatologies, and get the metrics.
The new method gets the total mean extent and climatological extent for each realization, then averages those quantities over all models and calculates all the metrics. For the total extent the difference is quite small, but for the climatology this seems to make more of a difference in the MSE.
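As a toy illustration (made-up numbers, not the PMP code) of one way the two orderings can diverge: with identical time coverage, the climatology of the realization mean equals the mean of per-realization climatologies, but a missing month in one realization breaks that equivalence, and the difference then carries into any MSE computed from it.

```python
import numpy as np

# Two realizations of a monthly "extent" series covering two 3-month "years";
# realization 2 is missing one month.
r1 = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
r2 = np.array([2.0, np.nan, 4.0, 2.0, 3.0, 4.0])

def monthly_clim(x):
    # Average the two "years" month by month, ignoring missing values.
    return np.nanmean(x.reshape(2, 3), axis=0)

old_order = monthly_clim(np.nanmean(np.stack([r1, r2]), axis=0))                # mean across realizations, then climatology
new_order = np.nanmean(np.stack([monthly_clim(r1), monthly_clim(r2)]), axis=0)  # climatology per realization, then mean

print(old_order)  # [1.5  2.25 3.5 ]
print(new_order)  # [1.5 2.5 3.5]  -> the second month differs, so an MSE built on it differs too
```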