Memory leak - xr.open_dataset() not releasing memory. #7404
I'm not sure why. Can you try manually triggering the garbage collector?

```python
import gc
import os

import psutil
import xarray as xr
from memory_profiler import profile


@profile
def main():
    path = 'ECMWF_ERA-40_subset.nc'
    gc.collect()
    print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    ds = xr.open_dataset(path)
    del ds
    gc.collect()
    print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")


if __name__ == '__main__':
    print(f"Start: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    main()
    print(f"End: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
```
It still shows similar memory consumption.

```
Start: 185.6015625 MiB
Before opening file: 186.24609375 MiB
After opening file: 307.1328125 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.0 MiB    186.0 MiB           1   @profile
     8                                         def main():
     9    186.0 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.0 MiB      0.0 MiB           1       gc.collect()
    11    186.2 MiB      0.2 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.1 MiB    120.9 MiB           1       ds = xr.open_dataset(path)
    13    307.1 MiB      0.0 MiB           1       del ds
    14    307.1 MiB      0.0 MiB           1       gc.collect()
    15    307.1 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 307.1328125 MiB
```
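One way to check whether that retained ~120 MiB is still referenced by Python objects, as opposed to being held by the C allocator or C libraries, is to compare RSS against `tracemalloc`, which counts only Python-level allocations. A diagnostic sketch (not from the thread; note that allocations made inside C libraries such as libnetcdf will not show up in tracemalloc):

```python
import gc
import os
import tracemalloc

import psutil
import xarray as xr

tracemalloc.start()

ds = xr.open_dataset('ECMWF_ERA-40_subset.nc')  # same sample file as above
ds.close()
del ds
gc.collect()

current, peak = tracemalloc.get_traced_memory()
rss = psutil.Process(os.getpid()).memory_info().rss
# If `current` is small while RSS stays high, the memory has been freed
# at the Python level and is merely being retained by the allocator.
print(f"tracemalloc current: {current / 1024 ** 2:.1f} MiB, peak: {peak / 1024 ** 2:.1f} MiB")
print(f"RSS: {rss / 1024 ** 2:.1f} MiB")
```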
If you care about memory usage, you should explicitly close files after you use them, e.g., by calling `ds.close()` or by using a context manager.
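For concreteness, the two usual closing idioms look like this (a minimal sketch using the same sample file; the context-manager form closes the file automatically on exit):

```python
import xarray as xr

# Explicit close:
ds = xr.open_dataset('ECMWF_ERA-40_subset.nc')
# ... work with ds ...
ds.close()

# Equivalent, with the file closed automatically on exit:
with xr.open_dataset('ECMWF_ERA-40_subset.nc') as ds:
    ...  # work with ds
```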
Thanks @shoyer, but closing the dataset explicitly also doesn't seem to release the memory.

```
Start: 185.5078125 MiB
Before opening file: 186.28515625 MiB
After opening file: 307.75390625 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.1 MiB    186.1 MiB           1   @profile
     8                                         def main():
     9    186.1 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.1 MiB      0.0 MiB           1       gc.collect()
    11    186.3 MiB      0.2 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.8 MiB    121.5 MiB           1       ds = xr.open_dataset(path)
    13    307.8 MiB      0.0 MiB           1       ds.close()
    14    307.8 MiB      0.0 MiB           1       del ds
    15    307.8 MiB      0.0 MiB           1       gc.collect()
    16    307.8 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 307.75390625 MiB
```

I also tried the context manager, but memory consumption is the same.

```
Start: 185.5625 MiB
Before opening file: 186.36328125 MiB
After opening file: 307.265625 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.2 MiB    186.2 MiB           1   @profile
     8                                         def main():
     9    186.2 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.2 MiB      0.0 MiB           1       gc.collect()
    11    186.4 MiB      0.2 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.3 MiB    120.9 MiB           1       with xr.open_dataset(path) as ds:
    13    307.3 MiB      0.0 MiB           1           ds.close()
    14    307.3 MiB      0.0 MiB           1       del ds
    15    307.3 MiB      0.0 MiB           1       gc.collect()
    16    307.3 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 307.265625 MiB
```
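Two xarray-level knobs worth ruling out here, although neither was tried in this thread: `open_dataset(..., cache=False)` avoids keeping lazily loaded variable data in memory as NumPy arrays, and `xr.set_options(file_cache_maxsize=...)` shrinks the global LRU cache of open file handles. A hedged sketch:

```python
import xarray as xr

# xarray keeps recently used file handles in a global LRU cache
# (128 entries by default); shrinking it forces handles to be
# evicted and closed sooner.
xr.set_options(file_cache_maxsize=1)

# cache=False avoids caching lazily loaded variable data in memory
# as NumPy arrays once it has been read.
with xr.open_dataset('ECMWF_ERA-40_subset.nc', cache=False) as ds:
    print(ds.nbytes)
```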
I've personally seen a lot of what looks like memory reuse in numpy and related libraries. I don't think any of this happens explicitly, but I have never investigated. I would expect that, if memory were not being released, opening and closing the dataset in a loop would increase memory usage; it didn't on the recent library versions I have.

```
Start: 89.71875 MiB
Before opening file: 90.203125 MiB
After opening file: 96.6875 MiB
Filename: test.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     90.2 MiB     90.2 MiB           1   @profile
     7                                         def main():
     8     90.2 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
     9     90.2 MiB      0.0 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    10     96.7 MiB     -0.1 MiB        1001       for i in range(1000):
    11     96.7 MiB      6.4 MiB        1000           with xr.open_dataset(path) as ds:
    12     96.7 MiB     -0.1 MiB        1000               ...
    13     96.7 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 96.6875 MiB
```
Thanks @DocOtak for the observation. That holds only when iterating over the same file; there I observe the same behavior (memory-usage plot against iterations attached in the original comment). But when I tried to validate this by iterating over *different* files, memory gradually increased (again, plot attached in the original comment).
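A small script in the spirit of that experiment, for reproducing the different-files case; the file names and array sizes below are made up for illustration:

```python
import os

import numpy as np
import psutil
import xarray as xr


def rss_mib() -> float:
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2


# Write a handful of distinct files (hypothetical names/sizes),
# then open each one exactly once and watch whether RSS climbs.
paths = []
for i in range(10):
    path = f"leak_test_{i}.nc"
    xr.Dataset({"x": ("t", np.random.rand(1_000_000))}).to_netcdf(path)
    paths.append(path)

for path in paths:
    with xr.open_dataset(path) as ds:
        ds.load()  # force the data to actually be read
    print(f"{path}: {rss_mib():.1f} MiB")
```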
I can confirm a similar issue, where opening a large number of files in a row causes memory usage to increase linearly (in my case, while watching, from 17 GB to 27 GB). This means I can't write long-running jobs, because memory usage eventually causes a system failure. I'm actually uncertain why the job doesn't get OOM-killed before the system fails (my problem to fix with ulimits or cgroups). We're accessing GRIB files via cfgrib (all of which have an index) on secondary SSD storage.
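For long-running jobs like this, one thing worth testing is whether the growth is glibc retaining freed memory rather than a true leak. On Linux with glibc (an assumption; this is not portable), `malloc_trim` asks the allocator to return free heap pages to the kernel. A diagnostic sketch that could be called periodically between files:

```python
import ctypes
import gc
import os

import psutil


def trim_and_report() -> None:
    gc.collect()
    # Linux/glibc only: ask malloc to release free arenas back to the
    # kernel. If RSS drops noticeably here, the apparent leak was
    # allocator caching rather than live objects.
    ctypes.CDLL("libc.so.6").malloc_trim(0)
    rss = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    print(f"RSS after malloc_trim: {rss:.1f} MiB")
```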
What happened?
Let's take this sample netCDF file. Observe that the memory has not been cleared even after deleting `ds`.
Code
Console logs
I am using xarray==0.20.2 and gdal==3.5.1.
Sister issue: ecmwf/cfgrib#325 (comment)
What did you expect to happen?
Ideally, memory consumed by the xarray dataset should be released when the dataset is closed/deleted.
Minimal Complete Verifiable Example
No response
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment
```
xarray: 0.20.2
pandas: 1.3.5
numpy: 1.19.5
scipy: 1.7.3
netCDF4: 1.6.0
pydap: None
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.12.0
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: 0.9.10.1
iris: None
bottleneck: None
dask: 2022.02.0
distributed: 2022.02.0
matplotlib: 3.5.2
cartopy: 0.20.3
seaborn: 0.11.2
numbagg: None
fsspec: 2022.7.1
cupy: None
pint: None
sparse: None
setuptools: 59.8.0
pip: 22.2.2
conda: 22.9.0
pytest: None
IPython: 7.33.0
sphinx: None
```