Determine MPI Data Types in col_on_comm() & dst_on_comm() to prevent displacements overflow. (Fix for #2156) #2157

Open

benkirk wants to merge 4 commits into develop

Conversation

@benkirk benkirk commented Jan 17, 2025

Determine MPI Data Types in col_on_comm() & dst_on_comm() to prevent displacements overflow.

TYPE: bug fix

KEYWORDS: prevent displacements overflow in MPI_Gatherv() and MPI_Scatterv() operations

SOURCE: Benjamin Kirk & Negin Sobhani (NSF NCAR / CISL)

DESCRIPTION OF CHANGES:
Problem:
The MPI_Gatherv() and MPI_Scatterv() operations require integer displacements into the communication buffers. Historically everything has been passed as MPI_CHAR, so these displacements count bytes and are larger than otherwise necessary. For large domains the displace[] offsets can exceed the maximum int, wrapping to negative values.
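
To make the failure mode concrete, here is a minimal stand-alone sketch (illustrative rank and buffer sizes, not taken from WRF) showing how byte-valued displacements blow past INT_MAX while the same offsets expressed in data-type extents stay well within range:

/* Overflow sketch with assumed sizes: 1024 ranks of 600000 4-byte words.
 * MPI_Gatherv()/MPI_Scatterv() displacements are plain ints; when the data
 * is described as MPI_CHAR they count bytes and can exceed INT_MAX.       */
#include <limits.h>
#include <stdio.h>

int main(void)
{
    const long nranks         = 1024;     /* assumed communicator size      */
    const long words_per_rank = 600000;   /* assumed per-rank element count */
    const long typesize       = 4;        /* e.g. 4-byte reals              */

    long byte_offset = 0, word_offset = 0;
    for (long r = 0; r < nranks; ++r) {
        byte_offset += words_per_rank * typesize;  /* MPI_CHAR displacements */
        word_offset += words_per_rank;             /* extent displacements   */
    }

    printf("final byte offset = %ld  (fits in int: %s)\n",
           byte_offset, byte_offset <= INT_MAX ? "yes" : "no");
    printf("final word offset = %ld  (fits in int: %s)\n",
           word_offset, word_offset <= INT_MAX ? "yes" : "no");
    return 0;
}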

Solution:
This change introduces additional error checking and then uses MPI_Type_match_size() (available since MPI-2.0) to determine a suitable MPI_Datatype for the input *typesize. As a result, the displace[] offsets are expressed in data-type extents rather than bytes and are far less likely to overflow.
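
As a rough sketch of this approach (hypothetical helper and function names, not the verbatim patch to frame/collect_on_comm.c), the idea is to map *typesize onto a predefined MPI datatype and then pass counts and displacements in elements of that type:

#include <mpi.h>

/* Hypothetical helper mirroring the idea used in col_on_comm()/dst_on_comm():
 * ask MPI for a predefined datatype of exactly 'typesize' bytes.
 * MPI_Type_match_size() has been available since MPI-2.0; the actual patch
 * adds more error checking than is shown here.                             */
static MPI_Datatype match_datatype(int typesize)
{
    MPI_Datatype dtype = MPI_DATATYPE_NULL;
    if (MPI_Type_match_size(MPI_TYPECLASS_REAL, typesize, &dtype) != MPI_SUCCESS)
        dtype = MPI_BYTE;   /* fall back to byte-wise displacements */
    return dtype;
}

/* Usage sketch: counts[] and displace[] are now expressed in data-type
 * extents, so they grow 'typesize' times more slowly than byte offsets.    */
void gather_sketch(void *inbuf, int nelems, void *outbuf,
                   int *counts, int *displace,
                   int typesize, int root, MPI_Comm comm)
{
    MPI_Datatype dtype = match_datatype(typesize);
    MPI_Gatherv(inbuf, nelems, dtype,
                outbuf, counts, displace, dtype,
                root, comm);
}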

ISSUE: Fixes #2156

LIST OF MODIFIED FILES:
M frame/collect_on_comm.c

TESTS CONDUCTED:
Previously failing cases now run.

RELEASE NOTE:
Determine MPI Data Types in col_on_comm() & dst_on_comm() to prevent displacements overflow.

@benkirk benkirk requested a review from a team as a code owner January 17, 2025 20:52
@islas islas changed the base branch from master to develop January 17, 2025 21:20

benkirk commented Jan 17, 2025

Just for awareness, I can't see the output of the failed WRF-BUILD-2690; I get a timeout accessing
https://ncar_jenkins.scalacomputing.com/job/WRF-Feature-Regression-Test/2690/console

dudhia commented Jan 17, 2025 via email

Fixes runtime failures caught by CI in the underlying MPI_Gatherv().
Of course dtype needs to be MPI_Datatype, not an int.  This error sneaked through MPICH-based tests but not OpenMPI.
Hopefully this change will address previous CI failures.
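
For context on why the int/MPI_Datatype mix-up slipped past MPICH but not Open MPI: MPICH represents MPI handles such as MPI_Datatype as plain integers, so the code happened to work, whereas Open MPI represents them as pointers to opaque structs, so storing one in an int corrupts the handle and the later MPI_Gatherv() fails at run time. A minimal illustration of the pitfall (not the actual WRF change):

#include <mpi.h>

void declare_dtype_example(int typesize)
{
    /* int dtype;   <- works by coincidence with MPICH (integer handles),
     *                 but corrupts the handle under Open MPI (struct-pointer
     *                 handles), breaking the subsequent MPI_Gatherv().      */
    MPI_Datatype dtype;   /* portable: declare handles as MPI_Datatype */
    MPI_Type_match_size(MPI_TYPECLASS_REAL, typesize, &dtype);
    (void)dtype;          /* dtype would be passed to MPI_Gatherv() here */
}
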
@weiwangncar

The regression test results:

Test Type              | Expected | Received | Failed
=======================|==========|==========|=======
Number of Tests        |       23 |       24 |
Number of Builds       |       60 |       57 |
Number of Simulations  |      158 |      150 |      0
Number of Comparisons  |       95 |       86 |      0

Failed Simulations are: 
None
Which comparisons are not bit-for-bit: 
None

@weiwangncar

@benkirk Thanks for the fix! I tested it for a few cases we could run before and it is working now.

benkirk commented Jan 21, 2025

Thanks for the success report @weiwangncar, happy to help!

Successfully merging this pull request may close these issues.

MPI_Gatherv/MPI_Scatterv displacements overflow in frame/collect_on_comm.c