Tpetra: CrsGraph Segfault With Zero Owned Rows #12806
Comments
|
It looks like
There is a later comment that says "no need to fill-complete the nonlocals graph" (Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp, lines 3010 to 3011 in bee69a6).
The CrsGraph's checkInternalState function expects the conditions shown in Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp, lines 1943 to 1950 in bee69a6.
So I wonder if this is a case of trying to do something that is only allowed on fill-complete graphs? |
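To make the fill-complete question concrete, here is a toy model, not Tpetra code. `ToyMap`, `ToyGraph`, and `numColumnGids` are hypothetical names, `std::shared_ptr` stands in for Tpetra's reference-counted pointer, and the sketch assumes that the column map only comes into existence at fill-complete time. Under those assumptions, any call that goes through the column map of a graph that was never fill-completed dereferences a null pointer unless it is guarded.

```cpp
#include <cstddef>
#include <memory>
#include <optional>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT Tpetra classes.
struct ToyMap {
  std::vector<long long> gids;
};

class ToyGraph {
public:
  // Built from a row map only, so the column map starts out null.
  explicit ToyGraph(std::shared_ptr<const ToyMap> rowMap)
      : rowMap_(std::move(rowMap)) {}

  // Record a column GID; duplicates are collapsed by the set.
  void insertGlobalIndex(long long col) { cols_.insert(col); }

  // Assumption of this sketch: fill-complete is the step that builds
  // the column map from the column GIDs seen so far.
  void fillComplete() {
    colMap_ = std::make_shared<const ToyMap>(
        ToyMap{std::vector<long long>(cols_.begin(), cols_.end())});
  }

  bool isFillComplete() const { return colMap_ != nullptr; }
  std::shared_ptr<const ToyMap> getColMap() const { return colMap_; }

private:
  std::shared_ptr<const ToyMap> rowMap_;
  std::shared_ptr<const ToyMap> colMap_;  // null until fillComplete()
  std::set<long long> cols_;
};

// Guarded query: returns std::nullopt rather than dereferencing a null
// column map. The unguarded form, g.getColMap()->gids.size(), would be
// undefined behavior on a graph that has not been fill-completed.
std::optional<std::size_t> numColumnGids(const ToyGraph& g) {
  const std::shared_ptr<const ToyMap> colMap = g.getColMap();
  if (!colMap) return std::nullopt;
  return colMap->gids.size();
}
```

In this model, `numColumnGids` returns an empty optional before `fillComplete()` and the number of distinct column GIDs afterwards. Whether the real fix belongs in the caller (a null check) or in the graph (always having a column map) is a design question this sketch does not answer.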
@JaredCrean2: @tjfulle volunteered to take a look at this. |
@JaredCrean2 - I started looking at this today. When I run the first executable, I see:
$ mpirun -n 3 ./bug
line = 0 3 0 1 2
inserting global indices:
0, 3
0, 0
0, 1
0, 2
line = 1 3 0 1 2
inserting global indices:
1, 3
1, 0
1, 1
1, 2
line = 2 3 4 5 0 1 2
inserting global indices:
2, 3
2, 4
2, 5
2, 0
2, 1
2, 2
line = 3 3 4 5 0 1 2
inserting global indices:
3, 3
3, 4
3, 5
3, 0
3, 1
3, 2
line = 4 3 4 5 2
inserting global indices:
4, 3
4, 4
4, 5
4, 2
line = 5 3 4 5 2
inserting global indices:
5, 3
5, 4
5, 5
5, 2
and no segfault. What should I expect to see? |
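Reading the output above, each `line = ...` record appears to hold a row GID followed by the column GIDs inserted into that row (for example, `4 3 4 5 2` yields the pairs 4,3 4,4 4,5 4,2). A minimal sketch of that expansion, assuming this reading of the format is right; `expandLine` is a hypothetical helper and is not taken from the attached reproducer:

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Expand one whitespace-separated record "row col col ..." into
// (row, col) pairs, in the order the columns appear.
// Returns an empty vector if the record has no leading row GID.
std::vector<std::pair<long long, long long>> expandLine(const std::string& line) {
  std::istringstream in(line);
  std::vector<std::pair<long long, long long>> pairs;

  long long row = 0;
  if (!(in >> row)) return pairs;  // empty or malformed record

  long long col = 0;
  while (in >> col) pairs.emplace_back(row, col);
  return pairs;
}
```

Each returned pair corresponds to one `row, col` line printed under "inserting global indices:".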
|
Ahh, ok. I think I had it backwards. Thanks!
Cheers,
…-- Tim --
From: Jared Crean ***@***.***>
Date: Wednesday, March 20, 2024 at 3:00 PM
To: trilinos/Trilinos ***@***.***>
Cc: Fuller, Timothy Jesse ***@***.***>, Assign ***@***.***>
Subject: [EXTERNAL] Re: [trilinos/Trilinos] Tpetra: CrsGraph Segfault With Zero Owned Rows (Issue #12806)
bug is the one that needs 16 procs. bug2 is the simpler one that only needs 3 or more.
mpirun -n 3 ./bug2 segfaults for me on cee-build030.
|
Bug Report
@csiefer2
Description
While working on Sparc, I found a case where Tpetra::CrsGraph reliably segfaults. The necessary condition seems to be a process that has zero owned rows, even if it has a non-zero number of used gids. I poked around in Totalview, and what I found is:
- In globalAssemble, Tpetra creates a second CrsGraph to represent the off-process entries
- Tpetra::Details::packCrsGraphNew() gets called, which does sourceGraph.getColMap()->getLocalMap()
- sourceGraph.getColMap() is the nullptr, causing getLocalMap() to segfault.
I'm reasonably sure this isn't a bug in how I am setting up the graph because sourceGraph is the second CrsGraph created by Tpetra, and it calls a constructor that does not allocate a column map, only a row map.
Steps to Reproduce
The attached files contain two examples that cause the segfault. The second one, bug2, is simpler. The setup is that the first n-1 procs form a tridiagonal matrix, each proc owning 1 row, while the last proc does not own any rows but uses the gid owned by the second-to-last proc.
To build:
develop branch on Feb 16.)
cd ./tpetra_bug
mkdir -v ./build
cd ./build
../do_cmake.sh
mpirun -np 3 ./bug2
# this can be run on any number of procs >= 3
If you want to run the first example, do:
8. mpirun -np 16 ./bug
# this can only be run on 16 procs
One additional request: if Tpetra could be made to work when a process has zero owned and zero used gids (as long as the required collective MPI calls are made), that would be useful. The first example creates a subcommunicator that includes only procs that have used gids. This was an attempt to work around the bug I originally saw in Sparc, where a process had zero owned and zero used gids and segfaulted. If I could use the regular communicator and just have some procs that don't contribute to the matrix, that would make the code simpler.
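For readers without the attachment, the bug2 layout described under Steps to Reproduce can be sketched without MPI or Trilinos. This is only one reading of that description: `ownedGids` and `usedGids` are hypothetical helpers, "uses" is taken to mean "references that gid in an inserted entry", and the tridiagonal stencil is clipped to the rows that actually exist (0 through n-2).

```cpp
#include <vector>

using GID = long long;

// Rows owned by `rank` out of `nprocs` ranks (nprocs >= 3 in the report):
// ranks 0..nprocs-2 own exactly one row each, the last rank owns none.
std::vector<GID> ownedGids(int rank, int nprocs) {
  if (rank == nprocs - 1) return {};
  return {static_cast<GID>(rank)};
}

// GIDs referenced by `rank`:
//  - an owning rank references the tridiagonal stencil of its row,
//    clipped to the existing rows 0..nprocs-2;
//  - the last rank references only the gid owned by the
//    second-to-last rank.
std::vector<GID> usedGids(int rank, int nprocs) {
  const GID lastRow = nprocs - 2;
  if (rank == nprocs - 1) return {lastRow};

  std::vector<GID> used;
  for (GID g = static_cast<GID>(rank) - 1; g <= rank + 1; ++g) {
    if (g >= 0 && g <= lastRow) used.push_back(g);
  }
  return used;
}
```

With 3 procs, ranks 0 and 1 each own one row and reference gids 0 and 1, while rank 2 owns nothing and references gid 1. That last rank is the "zero owned rows, non-zero used gids" case the report identifies as the trigger.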