
GPU out of memory and hdf5 file not read for the crashed simulations #5459

Open

Tissot11 opened this issue Nov 14, 2024 · 16 comments

@Tissot11

Using the same input deck as in #5131, I could not finish a simulation because the GPU ran out of memory. Setting amrex.abort_on_out_of_gpu_memory = 0 did not help. However, the stdout file generated by WarpX reports significantly lower memory usage than the memory available. I attach the err and out files. You can see in the out file that WarpX reports only 650 GB of memory usage, which is far lower than the total memory of 32 GPUs with 40 GB each.

errWarpX-2735006.txt
outWarpX-2735006.txt

Since this simulation did not finish, I tried reading the data using an OpenPMD time series, but it cannot read the files. Is this expected? In my experience with other codes, I can read whatever data was written before a simulation crashed. Do I need to compile HDF5 with some other flags?

@Tissot11 added the bug (Something isn't working) label on Nov 14, 2024
@n01r
Member

n01r commented Nov 14, 2024

Hi @Tissot11,

  • Could you run a test simulation only until step 6000 (assuming that it would always crash shortly after that)? Then you will get an AMReX report on memory usage.

Perhaps @atmyers could have a look at this memory report then, once you have it.

Your problem seems to be periodic in one direction. For further test runs you could reduce the size in that direction and use fewer resources.

What is the error when you are trying to read your data?
The OpenPMDTimeSeries may fail to initialize if there is an unfinished file (i.e., the crash happened while the file was still being written). Although, if I am not mistaken, that was fixed and it should now only give a warning. Right, @RemiLehe?

  • @Tissot11, please post which version of openpmd_viewer you are using.

If, however, you are trying to access the unfinished file itself, then that data might be corrupted. I only know of ADIOS2 being able to produce readable files even if the writing process crashes (and possibly only when certain options are activated).
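
(Editor's addition, not part of the original comment: a minimal sketch of the checks suggested above, assuming Python with openpmd_viewer installed and a diagnostics directory DIAGS/, as in the paths quoted later in this thread.)

# Print the openpmd_viewer version and try to open the series; if an
# unfinished/corrupted file is present, print the backend error instead
# of crashing. "DIAGS/" is an assumed path for illustration.
import openpmd_viewer
from openpmd_viewer import OpenPMDTimeSeries

print("openpmd_viewer version:", openpmd_viewer.__version__)

try:
    ts = OpenPMDTimeSeries("DIAGS/")
    print("readable iterations:", ts.iterations)
except Exception as err:
    print("failed to open the series:", err)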

@Tissot11
Author

I attach the out file of a finished simulation. In the AMReX report I somehow see lower memory usage, and at the bottom of the file I see about 600 GB, while the device memory is about 29 GiB. Is the device memory figure meant for a single GPU?

outWarpX-2757117.txt

I am using openpmd_viewer version 1.10.0, installed from conda-forge.

This is the message I get while reading a file from the crashed simulation:

Error: Read Error in backend HDF5
Object type:	File
Error type:	Inaccessible
Further description:	Failed to open HDF5 file /home/hk-project-obliques/hd_ff296/WarpXSimulations/2D-MA30_MS33_th75_rT216/d25_mi100-IP-NFlux/DIAGS/openpmd.h5


HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 0:
  #000: H5F.c line 620 in H5Fopen(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #001: H5VLcallback.c line 3501 in H5VL_file_open(): failed to iterate over available VOL connector plugins
    major: Virtual Object Layer
    minor: Iteration failed
  #002: H5PLpath.c line 578 in H5PL__path_table_iterate(): can't iterate over plugins in plugin path '(null)'
    major: Plugin for dynamically loaded library
    minor: Iteration failed
  #003: H5PLpath.c line 620 in H5PL__path_table_iterate_process_path(): can't open directory: /usr/local/hdf5/lib/plugin
    major: Plugin for dynamically loaded library
    minor: Can't open directory or file
  #004: H5VLcallback.c line 3351 in H5VL__file_open(): open failed
    major: Virtual Object Layer
    minor: Can't open object
  #005: H5VLnative_file.c line 97 in H5VL__native_file_open(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #006: H5Fint.c line 1990 in H5F_open(): unable to read superblock
    major: File accessibility
    minor: Read failed
  #007: H5Fsuper.c line 405 in H5F__super_read(): file signature not found
    major: File accessibility
    minor: Not an HDF5 file
[AbstractIOHandlerImpl] IO Task OPEN_FILE failed with exception. Clearing IO queue and passing on the exception.

In my experience with other codes, I can read the HDF5 files of a crashed simulation as well. I have only HDF5, not ADIOS2, configured with WarpX.
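
(Editor's addition: a small sketch, assuming the series is written as one HDF5 file per iteration, that flags files whose superblock was never flushed; such files fail h5py.is_hdf5(), which matches the "file signature not found" error above. Moving the flagged files aside may allow the complete iterations to be read.)

# Scan the diagnostics directory for truncated HDF5 files.
# The directory name is taken from the error message above.
import glob
import h5py

for path in sorted(glob.glob("DIAGS/*.h5")):
    if not h5py.is_hdf5(path):
        print("truncated / not a valid HDF5 file:", path)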

@atmyers
Member

atmyers commented Nov 15, 2024

Could you also attach the Backtrace.68 file, referred to in errWarpX-2735006.txt?

@Tissot11
Author

Sorry, I had deleted the files from the previous run. But I have just encountered another crash due to a memory error. I attach the err, out, and Backtrace (converted to txt) files.

errWarpX-2759955.txt
outWarpX-2759955.txt
Backtrace.42.txt

@atmyers
Member

atmyers commented Nov 15, 2024

I see. So the code is running out of memory in AddPlasmaFlux.

One thing to note is that, while the total memory across all GPUs may be enough to store your particles, the simulation will crash if any single GPU runs out of memory. Do you have a sense of how many particles are on each box prior to running out of memory?

Also, the reason AddPlasmaFlux takes more memory than AddPlasma is that in AddPlasmaFlux we add the particles to a temporary container, briefly doubling the newly added particles in memory. Perhaps we could add an option to clear them from the temporary container on the fly.
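
(Editor's addition: a back-of-the-envelope estimate, not from the thread, of why this temporary doubling can matter on a 40 GB device; the bytes-per-particle figure and the particle counts are illustrative assumptions.)

# Rough per-GPU memory estimate for flux injection vs. regular injection.
bytes_per_particle = 100          # assumed: a handful of real components + ids
gpu_memory = 40e9                 # 40 GB devices, as reported above
resident = 3.2e8                  # hypothetical particles already on one GPU
injected = 5e7                    # hypothetical particles injected in one step

# AddPlasma: newly added particles exist once.
add_plasma = (resident + injected) * bytes_per_particle
# AddPlasmaFlux: newly added particles briefly exist twice
# (temporary container plus destination container).
add_plasma_flux = (resident + 2 * injected) * bytes_per_particle

print(f"AddPlasma peak:     {add_plasma / 1e9:.1f} GB of {gpu_memory / 1e9:.0f} GB")
print(f"AddPlasmaFlux peak: {add_plasma_flux / 1e9:.1f} GB of {gpu_memory / 1e9:.0f} GB")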

@Tissot11
Author

Tissot11 commented Nov 15, 2024

OK, it would be great to avoid these memory errors. This was also my observation, and I asked in #5131 about using these two mechanisms for particle injection and the memory usage in each case. To be honest, I probably cannot calculate the number of particles in each box during the simulation runtime.

@Tissot11
Author

Any idea when this would be implemented in WarpX?

@n01r
Member

n01r commented Nov 25, 2024

@Tissot11, just out of curiosity: if it is feasible and you have the resources, what happens when you run your setup in WarpX's CPU-only mode and compare with your previous results?

@n01r
Member

n01r commented Nov 25, 2024

Maybe you would not want to run the full simulation on CPUs alone, and the speed advantage of GPUs is why you chose WarpX. I am just generally curious whether the WarpX CPU mode could produce first results for you that you could compare to what you already have, or whether there are general roadblocks.

@Tissot11
Author

I can try WarpX on CPUs for testing, but you're right that GPU acceleration is the reason to switch to WarpX. I guess the problem is probably with the boundary conditions and the injector itself. I have tried continuous injection and I would like to try the NFlux method. I will probably try it tomorrow with CPU runs and see if I can reproduce the results.

@Tissot11
Author

I tried running the job on 384 GPUs to avoid the out-of-memory errors. The job still crashed, this time not because of an out-of-memory error but something else. I attach the out, err, and Backtrace files.

Backtrace.226.txt
errWarpX-2796001.txt
outWarpX-2796001.txt

It seems that I cannot run my jobs, since there appear to be several issues involving NFluxPerCell, openpmd_viewer, and also the boundary conditions. Please let me know whether these issues are a priority for you and fixable in the near future. Then I would be interested in providing feedback and would seriously consider using WarpX for my research work.

@n01r
Member

n01r commented Dec 3, 2024

Hi @Tissot11, thank you for testing this.
It seems you encountered an illegal memory access, which is concerning.
I will bring this up in our developer meeting tomorrow morning.

Were you also able to test the CPU run?

@atmyers
Member

atmyers commented Dec 3, 2024

Hi @Tissot11 - we'd like to understand and fix these issues. Could you share the exact inputs file that is causing these issues for you? Apologies if you shared it already. The file inputDeckCI.txt from Issue #5131 does not seem to use the NFluxPerCell option.

@Tissot11
Author

Tissot11 commented Dec 4, 2024

It was essentially the same input deck, except using AddPlasmaFlux. Anyway, I attach it here now.

inputDeckAPF.txt

Thanks for looking into this! Please let me know your thoughts as soon as possible.

@n01r
Member

n01r commented Dec 6, 2024

One thing I see from the output is that the run time is strongly dominated by writing the output. It should be possible to optimize that, but particularly for testing whether the run completes, one can deactivate the full output.

If you look below, the time steps take 0.13 seconds, but between 30 and 40 seconds whenever there is output, which happens every 10 steps.

STEP 1859 starts ...
STEP 1859 ends. TIME = 2.907459751e-12 DT = 1.563991259e-15
Evolve time = 6580.870238 s; This step = 0.133235889 s; Avg. per step = 3.540005507 s

STEP 1860 starts ...
--- INFO    : re-sorting particles
--- INFO    : Writing openPMD file DIAGS/001860
STEP 1860 ends. TIME = 2.909023742e-12 DT = 1.563991259e-15
Evolve time = 6620.257175 s; This step = 39.38693745 s; Avg. per step = 3.559278051 s
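
(Editor's addition: a small sketch for quantifying this, assuming the stdout lines follow the format quoted above and that an output step is preceded by a "Writing openPMD file" line; the file name is one of the attachments above.)

# Split the total step time into output and non-output steps.
import re

step_times = []
pending_output = False
with open("outWarpX-2796001.txt") as f:
    for line in f:
        if "Writing openPMD file" in line:
            pending_output = True
        m = re.search(r"This step = ([0-9.eE+-]+) s", line)
        if m:
            step_times.append((float(m.group(1)), pending_output))
            pending_output = False

total = sum(t for t, _ in step_times)
io_time = sum(t for t, is_io in step_times if is_io)
print(f"total: {total:.1f} s, output steps: {io_time:.1f} s ({100 * io_time / total:.0f}%)")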

You could compile the code in Debug mode and rerun without the full output, so we get more detailed information on where in the code the crash happens.
It would be interesting to see whether the crash happens reproducibly at the same time step.

I also see that load balancing is active. Is there any chance, @atmyers, that load balancing could misbehave with the particle injection? The crash happens at step 1871, though, and load balancing happens every 100 time steps.

One could also turn on the LoadBalanceCosts reduced diagnostics. That would give us some information on how heavy the load is on the GPUs that create new particles.
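
(Editor's addition: once the LoadBalanceCosts reduced diagnostic is enabled in the inputs, a minimal sketch for peeking at its output; the file path and exact column layout are assumptions, based on reduced diagnostics writing a plain-text table with a commented header line.)

# Print the column names and the last recorded row of the LoadBalanceCosts output.
path = "diags/reducedfiles/LBC.txt"   # assumed location/name; depends on the inputs
with open(path) as f:
    lines = f.readlines()

header = [l for l in lines if l.startswith("#")]
data = [l for l in lines if not l.startswith("#")]
print("columns:", header[-1].strip() if header else "(no header found)")
print("last row:", data[-1].strip() if data else "(no data recorded yet)")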

@Tissot11
Author

Tissot11 commented Dec 7, 2024

I did increase the frequency of data writing (for better statistics), but the actual data written until the crash was < 2 GB. This is far smaller than the 500 GB that I ended up writing before because of the confusion with the Full diagnostic. Even then, WarpX was much faster (with continuous_injection). So perhaps AddPlasmaFlux needs optimising.

This is actually a test simulation; I intend to run simulations with roughly 20 times longer durations and perhaps 10 times larger domain sizes. This is why I need something like WarpX.

Please feel free to turn off the data saving and see where the crash occurs.
