MPI_Gatherv/MPI_Scatterv displacements overflow in frame/collect_on_comm.c #2156
Comments
@negin513 I created this issue and pointed to your PR with proposed fix coming.
On Cray-EX systems under
Here's a full stack trace from
Describe the bug
The functions `col_on_comm()` & `dst_on_comm()` in `frame/collect_on_comm.c` use `MPI_CHAR` as the underlying datatype in `MPI_{Gather,Scatter}v` operations. This means the required displacements, `displace[]`, are in terms of bytes. Because these displacements are passed to MPI as C `int` values, large problems on large local communicators can overflow the displacement offsets, which manifests as an MPI communication failure, typically with a very obscure error message. This seems to occur more frequently with large local communicators, typical of high-core-count nodes.
To Reproduce
We have boiled this down to a 6-rank example that is available on NSF NCAR/Derecho at `/glade/work/negins/consulting/RC-26919/high-res`, with a PR to be submitted with a proposed fix.
Expected behavior
`*typesize` so the displacements are smaller (elements instead of bytes),
Additional context
Related to #1333
We think this is also the underlying issue with https://forum.mmm.ucar.edu/threads/cxil_map-write-error-with-real-exe.19321/