GPU out of memory and hdf5 file not read for the crashed simulations #5459
Hi @Tissot11,
Perhaps @atmyers could have a look at this memory report then, once you have it. Your problem seems to be periodic in one direction; for further test runs you could reduce the size in that direction and use fewer resources. What is the error you get when trying to read your data?
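As an illustration of shrinking the setup along the periodic direction, here is a sketch using standard WarpX/AMReX input parameters; the axis choice and all values are made up and would need to be adapted to the actual deck:

amr.n_cell = 128 128 256                    # e.g. halve the cell count along the periodic axis
geometry.dims = 3
geometry.prob_lo = -10.e-6 -10.e-6 -20.e-6
geometry.prob_hi =  10.e-6  10.e-6  20.e-6  # shrink the extent consistently with n_cell
boundary.field_lo = periodic periodic pec   # keep the boundary types unchanged
boundary.field_hi = periodic periodic pec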
If, however, you are trying to read the unfinished file itself, then that data might be corrupted. I only know of ADIOS2 being able to produce readable files even if the writing process crashes (and possibly only when certain options are activated).
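If readable output from a crashed run is important, one option might be to switch the openPMD backend from HDF5 to ADIOS2 in the diagnostics section of the inputs file. The option names below follow the WarpX documentation; the diagnostic name diag1 and the interval are placeholders:

diagnostics.diags_names = diag1
diag1.format = openpmd
diag1.openpmd_backend = bp   # ADIOS2 (.bp) instead of h5; completed steps often remain readable after a crash
diag1.intervals = 10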
I attach an out file of a finished simulation. In the AMReX report I somehow see lower memory usage, and at the bottom of the file I see about 600 GB, while the device memory is about 29 GiB. Is the device memory meant for a single GPU? I am using […]. This is the message I get while reading a file from the crashed simulation:
In my experience with other codes, I can read the HDF5 files of a crashed simulation as well. I have no […]
Could you also attach the […]?
Sorry, I had deleted the files from the previous run. But I have just encountered another run that crashed due to a memory error. I attach the err, out, and Backtrace (converted to txt) files.
I see. So the code is running out of memory in AddPlasmaFlux. One thing to note: while the total memory across all GPUs may be enough to store your particles, the simulation will crash if any single GPU runs out of memory. Do you have a sense of how many particles are on each box prior to running out of memory? Also, the reason why […]
Ok, this would be great for avoiding these memory errors. It matches my own observation, and I asked in #5131 about using these two mechanisms for particle injection and the memory usage in each case. To be honest, I probably cannot calculate the number of particles in each box during the simulation runtime.
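Not per box, but a total (per-species) macroparticle count over time can be obtained cheaply with a reduced diagnostic, which may at least show how fast the particle number grows before the crash. A sketch using the documented ParticleNumber reduced diagnostic; the name pnum is arbitrary:

warpx.reduced_diags_names = pnum
pnum.type = ParticleNumber   # writes the total macroparticle number per species each interval
pnum.intervals = 10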
Any idea when this would be implemented in WarpX?
@Tissot11, just out of curiosity: if feasible and you have the resources, what happens when you run your setup in WarpX's CPU-only mode and compare with your previous results?
Maybe you would not want to run the full simulation on CPUs only, and the speed advantage of GPUs is why you chose WarpX. I am just generally curious whether the WarpX CPU mode could produce first results for you that you could compare to what you already have, or whether there are general roadblocks.
I can try WarpX on CPU for testing, but you're right that GPU acceleration is the reason for switching to WarpX. I guess the problem probably lies with the boundary conditions and the injector itself. I have tried continuous injection, and I would like to try the NFlux method. I will probably try it tomorrow with CPU runs and see if I can reproduce the results.
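For reference, switching a species from continuous injection to flux injection could look roughly as follows; the parameter names are those documented for NFluxPerCell in WarpX, but the species name electrons and all values are placeholders:

electrons.injection_style = NFluxPerCell
electrons.num_particles_per_cell = 2
electrons.surface_flux_pos = -20.e-6   # position of the injection plane along the flux axis
electrons.flux_normal_axis = z
electrons.flux_direction = +1          # inject in the +z direction
electrons.flux_profile = constant
electrons.flux = 1.e24                 # placeholder flux, in particles per m^2 per s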
I tried running the job on 384 GPUs to avoid the out-of-memory error (Backtrace.226.txt). It seems that I cannot run my jobs, since there seem to be several issues involved with […]
Hi @Tissot11, thank you for testing this. Were you also able to test the CPU run? |
It was essentially the same input deck except for […]. Thanks for looking into this! Please let me know your thoughts as soon as possible.
One thing that I see from the output is that the time is strongly dominated by writing the output. It should be possible to optimize that, but particularly for testing that the run goes through, one can deactivate this full output. If you look below, the time steps take 0.13 seconds, but between 30 and 40 seconds whenever there is an output, which happens every 10 steps.
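For a pure crash-reproduction run, the full output could be switched off or thinned out in the inputs file. A sketch using standard WarpX diagnostics options (the diagnostic name diag1 and the interval are placeholders):

# Option A: no full diagnostics at all for the test run (empty list)
diagnostics.diags_names =

# Option B: keep the diagnostic but write far less often
diagnostics.diags_names = diag1
diag1.intervals = 1000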
You could compile the code in debug mode. I also see that load balancing is active. Is there any chance, @atmyers, that load balancing could misbehave with the particle injection? The crash happens on step 1871, though, and load balancing happens every 100 time steps. One could also turn on the […]
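If one wants to test the load-balancing hypothesis directly, the cadence can be changed in the inputs file; a sketch with the documented WarpX options (values illustrative):

algo.load_balance_intervals = 0              # 0 disables load balancing, for an A/B comparison
#algo.load_balance_intervals = 100           # the cadence mentioned above
algo.load_balance_costs_update = Heuristic   # or Timers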
I did increase the frequency of data writing (for better statistics), but the actual data written until the crash was < 2 GB. This is far smaller than the 500 GB that I ended up writing because of the confusion with […]. This is actually a test simulation; I intend to run this simulation for 20 times longer durations and on perhaps 10 times larger domain sizes. This is why I need something like […]. Please feel free to turn off the data saving and see where the crash occurs.
Using the same input deck as in #5131, I could not finish a simulation because the GPU ran out of memory. Setting
amrex.abort_on_out_of_gpu_memory = 0
did not help. However, the stdout file generated by WarpX reports significantly lower memory usage than the memory available: in the out file, WarpX reports only about 650 GB of memory usage, which is far lower than the total memory of 32 GPUs with 40 GB each. I attach the err and out files:
errWarpX-2735006.txt
outWarpX-2735006.txt
When this simulation did not finish, I tried reading the data using the openPMD time series, but it cannot read the files. Is this expected? In my experience with other codes, I should be able to read whatever data was written before the simulation crashed. Do I need to compile HDF5 with some other flags?
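For reference, a minimal sketch of this kind of read with openpmd-viewer; the diagnostics path ./diags/diag1 is a placeholder, and with an HDF5 series whose last iteration was only partially written, the open or read step is where an error would typically surface:

from openpmd_viewer import OpenPMDTimeSeries

# Open the (possibly incomplete) openPMD series written by WarpX
ts = OpenPMDTimeSeries('./diags/diag1')   # placeholder path
print(ts.iterations)                      # iterations found on disk

# An early, fully written iteration may still be readable even if the run crashed later
Ez, info = ts.get_field(field='E', coord='z', iteration=int(ts.iterations[0]))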