
GPU out of memory and hdf5 file not read for the crashed simulations #5459

Open

Tissot11 opened this issue Nov 14, 2024 · 16 comments

@Tissot11

Using the same input deck as in #5131, I could not finish a simulation because the GPU ran out of memory. Setting amrex.abort_on_out_of_gpu_memory = 0 did not help. However, the stdout file generated by WarpX reports significantly lower memory usage than the memory available. I attach the err and out files. You can see in the out file that WarpX reports only 650 GB of memory usage, which is far lower than the total memory of 32 GPUs with 40 GB each.

errWarpX-2735006.txt
outWarpX-2735006.txt

Since this simulation did not finish, I tried reading the data using an OpenPMD time series, but it cannot read the files. Is this expected? In my experience with other codes, I can read whatever data was written before a simulation crashed. Do I need to compile HDF5 with some other flags?

@Tissot11 added the bug (Something isn't working) label on Nov 14, 2024
@n01r
Member

n01r commented Nov 14, 2024

Hi @Tissot11,

  • Could you run a test simulation only until step 6000 (assuming that it would always crash shortly after that)? Then you will get an AMReX report on memory usage.

Perhaps @atmyers could have a look at this memory report then, once you have it.

Your problem seems to be periodic in one direction. For further test runs you could reduce the size in that direction and use fewer resources.

What is the error when you are trying to read your data?
The OpenPMDTimeSeries may fail to initialize if there is an unfinished file (i.e., the crash happened while the file was still being written). Although, if I am not mistaken, that was fixed and it should now only give a warning. Right, @RemiLehe?

  • @Tissot11, please post which version of openpmd_viewer you are using.

If, however, you are trying to access the unfinished file itself, then that data might be corrupted. I only know of ADIOS2 being able to produce readable files even if the writing process crashes (and possibly only when certain options are activated).
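
(Editor's addition, not part of the original comment: a minimal sketch of the checks suggested above, assuming Python with openpmd_viewer installed and a diagnostics directory DIAGS/, as in the paths quoted later in this thread.)

# Print the openpmd_viewer version and try to open the series; if an
# unfinished/corrupted file is present, print the backend error instead
# of crashing. "DIAGS/" is an assumed path for illustration.
import openpmd_viewer
from openpmd_viewer import OpenPMDTimeSeries

print("openpmd_viewer version:", openpmd_viewer.__version__)

try:
    ts = OpenPMDTimeSeries("DIAGS/")
    print("readable iterations:", ts.iterations)
except Exception as err:
    print("failed to open the series:", err)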

@Tissot11
Author

I attach the out file of a finished simulation. In the AMReX report I somehow see lower memory usage, and at the bottom of the file I see about 600 GB, while the device memory is about 29 GiB. Is the device memory figure meant for a single GPU?

outWarpX-2757117.txt

I am using openpmd_viewer version 1.10.0, installed from conda-forge.

This is the message I get while reading a file from the crashed simulation:

Error: Read Error in backend HDF5
Object type:	File
Error type:	Inaccessible
Further description:	Failed to open HDF5 file /home/hk-project-obliques/hd_ff296/WarpXSimulations/2D-MA30_MS33_th75_rT216/d25_mi100-IP-NFlux/DIAGS/openpmd.h5


HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 0:
  #000: H5F.c line 620 in H5Fopen(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #001: H5VLcallback.c line 3501 in H5VL_file_open(): failed to iterate over available VOL connector plugins
    major: Virtual Object Layer
    minor: Iteration failed
  #002: H5PLpath.c line 578 in H5PL__path_table_iterate(): can't iterate over plugins in plugin path '(null)'
    major: Plugin for dynamically loaded library
    minor: Iteration failed
  #003: H5PLpath.c line 620 in H5PL__path_table_iterate_process_path(): can't open directory: /usr/local/hdf5/lib/plugin
    major: Plugin for dynamically loaded library
    minor: Can't open directory or file
  #004: H5VLcallback.c line 3351 in H5VL__file_open(): open failed
    major: Virtual Object Layer
    minor: Can't open object
  #005: H5VLnative_file.c line 97 in H5VL__native_file_open(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #006: H5Fint.c line 1990 in H5F_open(): unable to read superblock
    major: File accessibility
    minor: Read failed
  #007: H5Fsuper.c line 405 in H5F__super_read(): file signature not found
    major: File accessibility
    minor: Not an HDF5 file
[AbstractIOHandlerImpl] IO Task OPEN_FILE failed with exception. Clearing IO queue and passing on the exception.

In my experience with other codes, I can read the HDF5 files of a crashed simulation as well. I have only HDF5, not ADIOS2, configured with WarpX.
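
(Editor's addition: a small sketch, assuming the series is written as one HDF5 file per iteration, that flags files whose superblock was never flushed; such files fail h5py.is_hdf5(), which matches the "file signature not found" error above. Moving the flagged files aside may allow the complete iterations to be read.)

# Scan the diagnostics directory for truncated HDF5 files.
# The directory name is taken from the error message above.
import glob
import h5py

for path in sorted(glob.glob("DIAGS/*.h5")):
    if not h5py.is_hdf5(path):
        print("truncated / not a valid HDF5 file:", path)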

@atmyers
Member

atmyers commented Nov 15, 2024

Could you also attach the Backtrace.68 file, referred to in errWarpX-2735006.txt?

@Tissot11
Author

Sorry, I had deleted the files from the previous run. But I have just encountered another crash due to a memory error. I attach the err, out, and Backtrace (converted to txt) files.

errWarpX-2759955.txt
outWarpX-2759955.txt
Backtrace.42.txt

@atmyers
Member

atmyers commented Nov 15, 2024

I see. So the code is running out of memory in AddPlasmaFlux.

One thing to note is that, while the total memory across all GPUs may be enough to store your particles, the simulation will crash if any single GPU runs out of memory. Do you have a sense of how many particles are on each box prior to running out of memory?

Also, the reason AddPlasmaFlux takes more memory than AddPlasma is that in AddPlasmaFlux we add the particles to a temporary container, briefly doubling the newly added particles in memory. Perhaps we could add an option to clear them from the temporary container on the fly.
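
(Editor's addition: a back-of-the-envelope estimate, not from the thread, of why this temporary doubling can matter on a 40 GB device; the bytes-per-particle figure and the particle counts are illustrative assumptions.)

# Rough per-GPU memory estimate for flux injection vs. regular injection.
bytes_per_particle = 100          # assumed: a handful of real components + ids
gpu_memory = 40e9                 # 40 GB devices, as reported above
resident = 3.2e8                  # hypothetical particles already on one GPU
injected = 5e7                    # hypothetical particles injected in one step

# AddPlasma: newly added particles exist once.
add_plasma = (resident + injected) * bytes_per_particle
# AddPlasmaFlux: newly added particles briefly exist twice
# (temporary container plus destination container).
add_plasma_flux = (resident + 2 * injected) * bytes_per_particle

print(f"AddPlasma peak:     {add_plasma / 1e9:.1f} GB of {gpu_memory / 1e9:.0f} GB")
print(f"AddPlasmaFlux peak: {add_plasma_flux / 1e9:.1f} GB of {gpu_memory / 1e9:.0f} GB")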

@Tissot11
Author

Tissot11 commented Nov 15, 2024

OK, it would be great to avoid these memory errors. This was also my observation, and I asked in #5131 about using these two mechanisms for particle injection and the memory usage in each case. To be honest, I probably cannot calculate the number of particles in each box during the simulation runtime.

@Tissot11
Author

Any idea when this would be implemented in WarpX?

@n01r
Member

n01r commented Nov 25, 2024

@Tissot11, just out of curiosity: if it is feasible and you have the resources, what happens when you run your setup in WarpX's CPU-only mode and compare with your previous results?

@n01r
Member

n01r commented Nov 25, 2024

Maybe you would not want to run the full simulation on CPUs alone, and the speed advantage of GPUs is why you chose WarpX. I am just generally curious whether the WarpX CPU mode could produce first results for you that you could compare to what you already have, or whether there are general roadblocks.

@Tissot11
Author

I can try WarpX on CPUs for testing, but you're right that GPU acceleration is the reason to switch to WarpX. I guess the problem is probably with the boundary conditions and the injector itself. I have tried continuous injection and I would like to try the NFlux method. I will probably try it tomorrow with CPU runs and see if I can reproduce the results.

@Tissot11
Author

I tried running the job on 384 GPUs to avoid the out-of-memory errors. The job still crashed, this time not because of an out-of-memory error but something else. I attach the out, err, and Backtrace files.

Backtrace.226.txt
errWarpX-2796001.txt
outWarpX-2796001.txt

It seems that I cannot run my jobs, since there appear to be several issues involving NFluxPerCell, openpmd_viewer, and also the boundary conditions. Please let me know whether these issues are a priority for you and fixable in the near future. Then I would be interested in providing feedback and would seriously consider using WarpX for my research work.

@n01r
Member

n01r commented Dec 3, 2024

Hi @Tissot11, thank you for testing this.
It seems you encountered an illegal memory access, which is concerning.
I will bring this up in our developer meeting tomorrow morning.

Were you also able to test the CPU run?

@atmyers
Member

atmyers commented Dec 3, 2024

Hi @Tissot11 - we'd like to understand and fix these issues. Could you share the exact inputs file that is causing these issues for you? Apologies if you shared it already. The file inputDeckCI.txt from Issue #5131 does not seem to use the NFluxPerCell option.

@Tissot11
Author

Tissot11 commented Dec 4, 2024

It was essentially the same input deck, except using AddPlasmaFlux. Anyway, I attach it here now.

inputDeckAPF.txt

Thanks for looking into this! Please let me know your thoughts as soon as possible.

@n01r
Member

n01r commented Dec 6, 2024

One thing I see from the output is that the run time is strongly dominated by writing the output. It should be possible to optimize that, but particularly for testing whether the run completes, one can deactivate the full output.

If you look below, the time steps take 0.13 seconds, but between 30 and 40 seconds whenever there is output, which happens every 10 steps.

STEP 1859 starts ...
STEP 1859 ends. TIME = 2.907459751e-12 DT = 1.563991259e-15
Evolve time = 6580.870238 s; This step = 0.133235889 s; Avg. per step = 3.540005507 s

STEP 1860 starts ...
--- INFO    : re-sorting particles
--- INFO    : Writing openPMD file DIAGS/001860
STEP 1860 ends. TIME = 2.909023742e-12 DT = 1.563991259e-15
Evolve time = 6620.257175 s; This step = 39.38693745 s; Avg. per step = 3.559278051 s
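
(Editor's addition: a small sketch for quantifying this, assuming the stdout lines follow the format quoted above and that an output step is preceded by a "Writing openPMD file" line; the file name is one of the attachments above.)

# Split the total step time into output and non-output steps.
import re

step_times = []
pending_output = False
with open("outWarpX-2796001.txt") as f:
    for line in f:
        if "Writing openPMD file" in line:
            pending_output = True
        m = re.search(r"This step = ([0-9.eE+-]+) s", line)
        if m:
            step_times.append((float(m.group(1)), pending_output))
            pending_output = False

total = sum(t for t, _ in step_times)
io_time = sum(t for t, is_io in step_times if is_io)
print(f"total: {total:.1f} s, output steps: {io_time:.1f} s ({100 * io_time / total:.0f}%)")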

You could compile the code in Debug mode and rerun without the full output, so we get more detailed information on where in the code the crash happens.
It would be interesting to see whether the crash happens reproducibly at the same time step.

I also see that load balancing is active. Is there any chance, @atmyers, that load balancing could misbehave with the particle injection? The crash happens at step 1871, though, and load balancing happens every 100 time steps.

One could also turn on the LoadBalanceCosts reduced diagnostics. That would give us some information on how heavy the load is on the GPUs that create new particles.
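
(Editor's addition: once the LoadBalanceCosts reduced diagnostic is enabled in the inputs, a minimal sketch for peeking at its output; the file path and exact column layout are assumptions, based on reduced diagnostics writing a plain-text table with a commented header line.)

# Print the column names and the last recorded row of the LoadBalanceCosts output.
path = "diags/reducedfiles/LBC.txt"   # assumed location/name; depends on the inputs
with open(path) as f:
    lines = f.readlines()

header = [l for l in lines if l.startswith("#")]
data = [l for l in lines if not l.startswith("#")]
print("columns:", header[-1].strip() if header else "(no header found)")
print("last row:", data[-1].strip() if data else "(no data recorded yet)")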

@Tissot11
Author

Tissot11 commented Dec 7, 2024

I did increase the frequency of data writing (for better statistics), but the actual data written until the crash was < 2 GB. This is far smaller than the 500 GB that I ended up writing before because of the confusion with the Full diagnostic. Even then, WarpX was much faster (with continuous_injection). So perhaps AddPlasmaFlux needs optimising.

This is actually a test simulation; I intend to run simulations with roughly 20 times longer durations and perhaps 10 times larger domain sizes. This is why I need something like WarpX.

Please feel free to turn off the data saving and see where the crash occurs.
