Memory leak - xr.open_dataset() not releasing memory. #7404

deepgabani8 opened this issue Dec 28, 2022 · 7 comments

@deepgabani8

What happened?

Let's take this sample netCDF file, ECMWF_ERA-40_subset.nc.

Observe that the memory is not released even after deleting ds.

Code

import os
import psutil
import xarray as xr
from memory_profiler import profile

@profile
def main():
    path = 'ECMWF_ERA-40_subset.nc'
    print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    ds = xr.open_dataset(path)
    del ds
    print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

if __name__ == '__main__':
    print(f"Start: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    main()
    print(f"End: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

Console logs

Start: 186.5859375 MiB
Before opening file: 187.25 MiB
After opening file: 308.09375 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6    187.2 MiB    187.2 MiB           1   @profile
     7                                         def main():
     8    187.2 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
     9    187.2 MiB      0.0 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    10    308.1 MiB    120.8 MiB           1       ds = xr.open_dataset(path)
    11    308.1 MiB      0.0 MiB           1       del ds
    12    308.1 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")


End: 308.09375 MiB

I am using xarray==0.20.2 and gdal==3.5.1.
Sister issue: ecmwf/cfgrib#325 (comment)

What did you expect to happen?

Ideally, memory consumed by the xarray dataset should be released when the dataset is closed/deleted.

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-22-cloud-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 0.20.2
pandas: 1.3.5
numpy: 1.19.5
scipy: 1.7.3
netCDF4: 1.6.0
pydap: None
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.12.0
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: 0.9.10.1
iris: None
bottleneck: None
dask: 2022.02.0
distributed: 2022.02.0
matplotlib: 3.5.2
cartopy: 0.20.3
seaborn: 0.11.2
numbagg: None
fsspec: 2022.7.1
cupy: None
pint: None
sparse: None
setuptools: 59.8.0
pip: 22.2.2
conda: 22.9.0
pytest: None
IPython: 7.33.0
sphinx: None

deepgabani8 added the bug and needs triage labels on Dec 28, 2022
@keewis
Collaborator

keewis commented Dec 28, 2022

I'm not sure how memory_profiler calculates memory usage, but I suspect this happens because Python's garbage collector is not required to run immediately after the del.

Can you try manually triggering the garbage collector?

import gc
import os
import psutil
import xarray as xr
from memory_profiler import profile

@profile
def main():
    path = 'ECMWF_ERA-40_subset.nc'
    gc.collect()
    print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    ds = xr.open_dataset(path)
    del ds
    gc.collect()
    print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

if __name__ == '__main__':
    print(f"Start: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    main()
    print(f"End: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

@deepgabani8
Author

It still shows similar memory consumption.

Start: 185.6015625 MiB
Before opening file: 186.24609375 MiB
After opening file: 307.1328125 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.0 MiB    186.0 MiB           1   @profile
     8                                         def main():
     9    186.0 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.0 MiB      0.0 MiB           1       gc.collect()
    11    186.2 MiB      0.2 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.1 MiB    120.9 MiB           1       ds = xr.open_dataset(path)
    13    307.1 MiB      0.0 MiB           1       del ds
    14    307.1 MiB      0.0 MiB           1       gc.collect()
    15    307.1 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")


End: 307.1328125 MiB

@shoyer
Member

shoyer commented Dec 28, 2022

If you care about memory usage, you should explicitly close files after you use them, e.g., by calling ds.close() or by using a context manager. Does that work for you?
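
For reference, a minimal sketch of that pattern (same sample file as above; both forms use public xarray API):

import xarray as xr

path = 'ECMWF_ERA-40_subset.nc'

# Explicit close: release the underlying file handle when done.
ds = xr.open_dataset(path)
# ... work with ds ...
ds.close()

# Or let a context manager close the file automatically on exit.
with xr.open_dataset(path) as ds:
    ...  # work with ds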

@deepgabani8
Author

Thanks @shoyer, but closing the dataset explicitly also doesn't seem to release the memory.

Start: 185.5078125 MiB
Before opening file: 186.28515625 MiB
After opening file: 307.75390625 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.1 MiB    186.1 MiB           1   @profile
     8                                         def main():
     9    186.1 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.1 MiB      0.0 MiB           1       gc.collect()
    11    186.3 MiB      0.2 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.8 MiB    121.5 MiB           1       ds = xr.open_dataset(path)
    13    307.8 MiB      0.0 MiB           1       ds.close()
    14    307.8 MiB      0.0 MiB           1       del ds
    15    307.8 MiB      0.0 MiB           1       gc.collect()
    16    307.8 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")


End: 307.75390625 MiB

I also tried the context manager, but I see the same memory consumption.

Start: 185.5625 MiB
Before opening file: 186.36328125 MiB
After opening file: 307.265625 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.2 MiB    186.2 MiB           1   @profile
     8                                         def main():
     9    186.2 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.2 MiB      0.0 MiB           1       gc.collect()
    11    186.4 MiB      0.2 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.3 MiB    120.9 MiB           1       with xr.open_dataset(path) as ds:
    13    307.3 MiB      0.0 MiB           1           ds.close()
    14    307.3 MiB      0.0 MiB           1           del ds
    15    307.3 MiB      0.0 MiB           1       gc.collect()
    16    307.3 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")


End: 307.265625 MiB

@DocOtak
Contributor

DocOtak commented Dec 29, 2022

I've personally seen a lot of what looks like memory reuse in numpy and related libraries. I don't think any of this happens explicitly, but I've never investigated. My expectation is that if memory were not being released, opening and closing the dataset in a loop would steadily increase memory usage; it didn't on the recent library versions I have. A sketch of that loop test is below, followed by its profiler output.
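
For reference, a sketch of the loop test, reconstructed from the profiler output below (same sample file as the original report):

import os
import psutil
import xarray as xr
from memory_profiler import profile

@profile
def main():
    path = 'ECMWF_ERA-40_subset.nc'
    print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    # Repeatedly open and close the same file; RSS should stay roughly flat
    # if memory is being released between iterations.
    for i in range(1000):
        with xr.open_dataset(path) as ds:
            ...
    print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

if __name__ == '__main__':
    print(f"Start: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    main()
    print(f"End: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")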

Start: 89.71875 MiB
Before opening file: 90.203125 MiB
After opening file: 96.6875 MiB
Filename: test.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     90.2 MiB     90.2 MiB           1   @profile
     7                                         def main():
     8     90.2 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
     9     90.2 MiB      0.0 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    10     96.7 MiB     -0.1 MiB        1001       for i in range(1000):
    11     96.7 MiB      6.4 MiB        1000           with xr.open_dataset(path) as ds:
    12     96.7 MiB     -0.1 MiB        1000             ...
    13     96.7 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")


End: 96.6875 MiB
Versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.13 (default, Jul 23 2022, 17:00:57)
[Clang 13.1.6 (clang-1316.0.21.2.5)]
python-bits: 64
OS: Darwin
OS-release: 22.1.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.0

xarray: 2022.11.0
pandas: 1.4.3
numpy: 1.23.5
scipy: None
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.5.3
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 56.0.0
pip: 22.0.4
conda: None
pytest: 6.2.5
IPython: 8.4.0
sphinx: 5.1.1

@deepgabani8
Author

deepgabani8 commented Jan 3, 2023

Thanks @DocOtak for the observation.

This holds only when iterating over the same file; there I observe the same behavior. Here is the memory usage against the iterations:
[plot: memory usage vs. iterations when repeatedly opening the same file]

When I tried to validate this by iterating over different files instead, memory usage gradually increased. Here is that memory usage (a sketch of this kind of multi-file loop follows the plot):
[plot: memory usage vs. iterations when opening a different file each time]
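
A sketch of the kind of multi-file loop described above; the directory and file names are hypothetical, and the point is only that each iteration opens a different netCDF file:

import gc
import glob
import os
import psutil
import xarray as xr

def rss_mib():
    # Resident set size of the current process, in MiB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

# Hypothetical directory containing many distinct netCDF files.
paths = sorted(glob.glob('data/*.nc'))

for i, path in enumerate(paths):
    with xr.open_dataset(path) as ds:
        ...  # no work needed; opening alone is enough to show the growth
    gc.collect()
    print(f"after file {i}: {rss_mib():.1f} MiB")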

dcherian removed the bug and needs triage labels on Jan 15, 2023
@rachtsingh

I can confirm a similar issue: opening a large number of files in a row causes memory usage to increase linearly (in my case, from 17 GB to 27 GB while I watched). This means I can't write long-running jobs, because the growth eventually causes a system failure from memory exhaustion.

I'm actually uncertain why the job doesn't get OOM-killed before the system fails (that's mine to fix with ulimits or cgroups). We're accessing GRIB files using cfgrib (all of which have an index) on secondary SSD storage; a rough sketch of that access pattern is below.
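
For context, a rough sketch of that access pattern; the GRIB directory is hypothetical, and engine='cfgrib' selects the cfgrib backend mentioned above:

import glob
import os
import psutil
import xarray as xr

def rss_mib():
    # Resident set size of the current process, in MiB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

# Hypothetical long-running job reading many indexed GRIB files from SSD storage.
for path in sorted(glob.glob('/mnt/ssd/grib/*.grib')):
    with xr.open_dataset(path, engine='cfgrib') as ds:
        ...  # process the dataset
    print(f"{path}: {rss_mib():.1f} MiB")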
