Memory leak - xr.open_dataset() not releasing memory. #7404

deepgabani8 opened this issue Dec 28, 2022 · 7 comments

@deepgabani8

What happened?

Let's take this sample netCDF file, ECMWF_ERA-40_subset.nc.

Observe that the memory is not released even after deleting ds.

Code

import os
import psutil
import xarray as xr
from memory_profiler import profile

@profile
def main():
    path = 'ECMWF_ERA-40_subset.nc'
    print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    ds = xr.open_dataset(path)
    del ds
    print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

if __name__ == '__main__':
    print(f"Start: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    main()
    print(f"End: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

Console logs

Start: 186.5859375 MiB
Before opening file: 187.25 MiB
After opening file: 308.09375 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6    187.2 MiB    187.2 MiB           1   @profile
     7                                         def main():
     8    187.2 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
     9    187.2 MiB      0.0 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    10    308.1 MiB    120.8 MiB           1       ds = xr.open_dataset(path)
    11    308.1 MiB      0.0 MiB           1       del ds
    12    308.1 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")


End: 308.09375 MiB

I am using xarray==0.20.2 and gdal==3.5.1.
Sister issue: ecmwf/cfgrib#325 (comment)

What did you expect to happen?

Ideally, memory consumed by the xarray dataset should be released when the dataset is closed/deleted.

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-22-cloud-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 0.20.2
pandas: 1.3.5
numpy: 1.19.5
scipy: 1.7.3
netCDF4: 1.6.0
pydap: None
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.12.0
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: 0.9.10.1
iris: None
bottleneck: None
dask: 2022.02.0
distributed: 2022.02.0
matplotlib: 3.5.2
cartopy: 0.20.3
seaborn: 0.11.2
numbagg: None
fsspec: 2022.7.1
cupy: None
pint: None
sparse: None
setuptools: 59.8.0
pip: 22.2.2
conda: 22.9.0
pytest: None
IPython: 7.33.0
sphinx: None

deepgabani8 added the bug and needs triage labels on Dec 28, 2022
@keewis
Collaborator

keewis commented Dec 28, 2022

I'm not sure how memory_profiler calculates memory usage, but I suspect this happens because Python's garbage collector is not required to run immediately after the del.

Can you try manually triggering the garbage collector?

import gc
import os
import psutil
import xarray as xr
from memory_profiler import profile

@profile
def main():
    path = 'ECMWF_ERA-40_subset.nc'
    gc.collect()
    print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    ds = xr.open_dataset(path)
    del ds
    gc.collect()
    print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

if __name__ == '__main__':
    print(f"Start: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    main()
    print(f"End: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

@deepgabani8
Author

It still shows similar memory consumption.

Start: 185.6015625 MiB
Before opening file: 186.24609375 MiB
After opening file: 307.1328125 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.0 MiB    186.0 MiB           1   @profile
     8                                         def main():
     9    186.0 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.0 MiB      0.0 MiB           1       gc.collect()
    11    186.2 MiB      0.2 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.1 MiB    120.9 MiB           1       ds = xr.open_dataset(path)
    13    307.1 MiB      0.0 MiB           1       del ds
    14    307.1 MiB      0.0 MiB           1       gc.collect()
    15    307.1 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")


End: 307.1328125 MiB

@shoyer
Member

shoyer commented Dec 28, 2022

If you care about memory usage, you should explicitly close files after you use them, e.g., by calling ds.close() or by using a context manager. Does that work for you?
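
For reference, a minimal sketch of that pattern (same sample file as above; both forms use public xarray API):

import xarray as xr

path = 'ECMWF_ERA-40_subset.nc'

# Explicit close: release the underlying file handle when done.
ds = xr.open_dataset(path)
# ... work with ds ...
ds.close()

# Or let a context manager close the file automatically on exit.
with xr.open_dataset(path) as ds:
    ...  # work with ds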

@deepgabani8
Author

Thanks @shoyer, but closing the dataset explicitly also doesn't seem to release the memory.

Start: 185.5078125 MiB
Before opening file: 186.28515625 MiB
After opening file: 307.75390625 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.1 MiB    186.1 MiB           1   @profile
     8                                         def main():
     9    186.1 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.1 MiB      0.0 MiB           1       gc.collect()
    11    186.3 MiB      0.2 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.8 MiB    121.5 MiB           1       ds = xr.open_dataset(path)
    13    307.8 MiB      0.0 MiB           1       ds.close()
    14    307.8 MiB      0.0 MiB           1       del ds
    15    307.8 MiB      0.0 MiB           1       gc.collect()
    16    307.8 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")


End: 307.75390625 MiB

I also tried the context manager, but I see the same memory consumption.

Start: 185.5625 MiB
Before opening file: 186.36328125 MiB
After opening file: 307.265625 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.2 MiB    186.2 MiB           1   @profile
     8                                         def main():
     9    186.2 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.2 MiB      0.0 MiB           1       gc.collect()
    11    186.4 MiB      0.2 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.3 MiB    120.9 MiB           1       with xr.open_dataset(path) as ds:
    13    307.3 MiB      0.0 MiB           1           ds.close()
    14    307.3 MiB      0.0 MiB           1           del ds
    15    307.3 MiB      0.0 MiB           1       gc.collect()
    16    307.3 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")


End: 307.265625 MiB

@DocOtak
Contributor

DocOtak commented Dec 29, 2022

I've personally seen a lot of what looks like memory reuse in numpy and related libraries. I don't think any of this happens explicitly, but I've never investigated. My expectation is that if memory were not being released, opening and closing the dataset in a loop would steadily increase memory usage; it didn't on the recent library versions I have. A sketch of that loop test is below, followed by its profiler output.
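
For reference, a sketch of the loop test, reconstructed from the profiler output below (same sample file as the original report):

import os
import psutil
import xarray as xr
from memory_profiler import profile

@profile
def main():
    path = 'ECMWF_ERA-40_subset.nc'
    print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    # Repeatedly open and close the same file; RSS should stay roughly flat
    # if memory is being released between iterations.
    for i in range(1000):
        with xr.open_dataset(path) as ds:
            ...
    print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

if __name__ == '__main__':
    print(f"Start: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    main()
    print(f"End: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")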

Start: 89.71875 MiB
Before opening file: 90.203125 MiB
After opening file: 96.6875 MiB
Filename: test.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     90.2 MiB     90.2 MiB           1   @profile
     7                                         def main():
     8     90.2 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
     9     90.2 MiB      0.0 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    10     96.7 MiB     -0.1 MiB        1001       for i in range(1000):
    11     96.7 MiB      6.4 MiB        1000           with xr.open_dataset(path) as ds:
    12     96.7 MiB     -0.1 MiB        1000             ...
    13     96.7 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")


End: 96.6875 MiB
Versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.13 (default, Jul 23 2022, 17:00:57)
[Clang 13.1.6 (clang-1316.0.21.2.5)]
python-bits: 64
OS: Darwin
OS-release: 22.1.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.0

xarray: 2022.11.0
pandas: 1.4.3
numpy: 1.23.5
scipy: None
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.5.3
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 56.0.0
pip: 22.0.4
conda: None
pytest: 6.2.5
IPython: 8.4.0
sphinx: 5.1.1

@deepgabani8
Author

deepgabani8 commented Jan 3, 2023

Thanks @DocOtak for the observation.

This holds only when iterating over the same file; there I observe the same behavior. Here is the memory usage against the iterations:
[plot: memory usage vs. iterations when repeatedly opening the same file]

When I tried to validate this by iterating over different files instead, memory usage gradually increased. Here is that memory usage (a sketch of this kind of multi-file loop follows the plot):
[plot: memory usage vs. iterations when opening a different file each time]
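
A sketch of the kind of multi-file loop described above; the directory and file names are hypothetical, and the point is only that each iteration opens a different netCDF file:

import gc
import glob
import os
import psutil
import xarray as xr

def rss_mib():
    # Resident set size of the current process, in MiB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

# Hypothetical directory containing many distinct netCDF files.
paths = sorted(glob.glob('data/*.nc'))

for i, path in enumerate(paths):
    with xr.open_dataset(path) as ds:
        ...  # no work needed; opening alone is enough to show the growth
    gc.collect()
    print(f"after file {i}: {rss_mib():.1f} MiB")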

dcherian removed the bug and needs triage labels on Jan 15, 2023
@rachtsingh

I can confirm a similar issue: opening a large number of files in a row causes memory usage to increase linearly (in my case, from 17 GB to 27 GB while I watched). This means I can't write long-running jobs, because the growth eventually causes a system failure from memory exhaustion.

I'm actually uncertain why the job doesn't get OOM-killed before the system fails (that's mine to fix with ulimits or cgroups). We're accessing GRIB files using cfgrib (all of which have an index) on secondary SSD storage; a rough sketch of that access pattern is below.
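
For context, a rough sketch of that access pattern; the GRIB directory is hypothetical, and engine='cfgrib' selects the cfgrib backend mentioned above:

import glob
import os
import psutil
import xarray as xr

def rss_mib():
    # Resident set size of the current process, in MiB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

# Hypothetical long-running job reading many indexed GRIB files from SSD storage.
for path in sorted(glob.glob('/mnt/ssd/grib/*.grib')):
    with xr.open_dataset(path, engine='cfgrib') as ds:
        ...  # process the dataset
    print(f"{path}: {rss_mib():.1f} MiB")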
