Tpetra: CrsGraph Segfault With Zero Owned Rows #12806

Open
JaredCrean2 opened this issue Mar 12, 2024 · 6 comments
Assignees
Labels
pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests

Comments

@JaredCrean2

Bug Report

@csiefer2

Description

While working on Sparc, I found a case where Tpetra::CrsGraph reliably segfaults. The necessary condition seems to be a process that has zero owned rows, even if it has a non-zero number of used gids. I poked around in Totalview, and this is what I found:

  • During globalAssemble, Tpetra creates a second CrsGraph to represent the off-process entries.
  • Eventually, Tpetra::Details::packCrsGraphNew() gets called, which does sourceGraph.getColMap()->getLocalMap().
  • The problem is that sourceGraph.getColMap() returns a null pointer, so the call to getLocalMap() segfaults.

I'm reasonably sure this isn't a bug in how I am setting up the graph, because sourceGraph is the second CrsGraph created by Tpetra, and it is built with a constructor that allocates only a row map, not a column map.
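
The failure described above is a null smart pointer being dereferenced. As a rough standalone sketch of how such a call could be guarded, the snippet below checks the column map before using it. `LocalMap`, `ColMap`, `Graph`, and `localMapOrThrow` are hypothetical stand-ins invented for illustration (with `std::shared_ptr` standing in for `Teuchos::RCP`); they are not Tpetra classes, and this is not a proposed fix for Tpetra itself.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration only; these are NOT Tpetra classes.
struct LocalMap {
  int numLocalElements = 0;
};

struct ColMap {
  LocalMap local;
  LocalMap getLocalMap() const { return local; }
};

struct Graph {
  // May legitimately be null, e.g. before the graph is fill-complete.
  std::shared_ptr<const ColMap> colMap;
  std::shared_ptr<const ColMap> getColMap() const { return colMap; }
};

// Check the column map before dereferencing it, so that a missing map
// produces an exception with context instead of a segfault.
LocalMap localMapOrThrow(const Graph& g, const std::string& caller) {
  const std::shared_ptr<const ColMap> colMap = g.getColMap();
  if (!colMap) {
    throw std::logic_error(caller + ": source graph has no column map");
  }
  return colMap->getLocalMap();
}
```

A guard like this only changes how the problem is reported; whether the nonlocal graph should have a column map at that point is a separate question.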

Steps to Reproduce

The attached files contain two examples that cause the segfault. The second one, bug2, is simpler. The setup is that the first n-1 procs form a tridiagonal matrix, each proc owning 1 row, while the last proc does not own any rows but uses the gid owned by the second-to-last proc.
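
To make the bug2 layout concrete, here is a small standalone sketch of which gids each rank would own and use. It assumes that row i lives on rank i and that each owning rank references the usual tridiagonal columns i-1, i, i+1; the attached bug2 source is not shown here and may number things differently. `ownedGids` and `usedGids` are hypothetical helper names, not part of Tpetra.

```cpp
#include <cassert>
#include <vector>

// Global row IDs owned by `rank` under the assumed layout:
// ranks 0 .. nprocs-2 own one row each (gid == rank); the last rank owns none.
std::vector<long long> ownedGids(int rank, int nprocs) {
  assert(nprocs >= 3 && rank >= 0 && rank < nprocs);
  if (rank == nprocs - 1) {
    return {};
  }
  return {static_cast<long long>(rank)};
}

// Global IDs referenced by `rank` under the assumed layout.
std::vector<long long> usedGids(int rank, int nprocs) {
  assert(nprocs >= 3 && rank >= 0 && rank < nprocs);
  const long long numRows = nprocs - 1;
  if (rank == nprocs - 1) {
    // The last rank owns nothing but uses the gid owned by the
    // second-to-last rank.
    return {numRows - 1};
  }
  // Tridiagonal stencil, clipped to the valid gid range [0, numRows).
  std::vector<long long> gids;
  for (long long g = rank - 1; g <= rank + 1; ++g) {
    if (g >= 0 && g < numRows) {
      gids.push_back(g);
    }
  }
  return gids;
}
```

With 3 procs this gives two owned rows in total, and the last rank has an empty owned list but a non-empty used list, which is the condition described above.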

To build:

  1. Build Tpetra (I am using the Sparc Trilinos, which last pulled in the Trilinos develop branch on Feb 16.)
  2. Untar the files: tpetra_bug.tar.gz
  3. cd ./tpetra_bug
  4. mkdir -v ./build
  5. cd ./build
  6. ../do_cmake.sh
  7. mpirun -np 3 ./bug2 # this can be run on any number of procs >= 3

If you want to run the first example, do:

  8. mpirun -np 16 ./bug # this can only be run on 16 procs

One additional request: if Tpetra could be made to work when a process has zero owned and zero used gids (as long as the required collective MPI calls are made), that would be useful. The first example creates a subcommunicator that includes only procs that have used gids. This was an attempt to work around the bug I originally saw in Sparc, where a process had zero owned and zero used gids and segfaulted. If I could use the regular communicator and just have some procs that don't contribute to the matrix, that would make the code simpler.

@JaredCrean2 JaredCrean2 added the type: bug The primary issue is a bug in Trilinos code or tests label Mar 12, 2024
@cwpearson
Contributor

const LMT local_col_map = sourceGraph.getColMap ()->getLocalMap ();

@cwpearson
Contributor

It looks like globalAssemble uses the CrsGraph(rowMap, numEntriesPerRow) constructor, which does not set the colMap_, and puts the graph in a resume-fill state.

rcp(new crs_graph_type(nonlocalRowMap, numEntPerNonlocalRow()));

There is a later comment that says "no need to fill-complete the nonlocals graph"

// There's no need to fill-complete the nonlocals graph.
// We just use it as a temporary container for the Export.

The CrsGraph's checkInternalState function expects colMap_ to be set, but only after fill-complete:

TEUCHOS_TEST_FOR_EXCEPTION_CLASS_FUNC
  (this->isFillComplete () &&
   (this->colMap_.is_null () ||
    this->rangeMap_.is_null () ||
    this->domainMap_.is_null ()),
   std::logic_error,
   "Graph is full complete, but at least one of {column, range, domain} "
   "Map is null." << suffix);

So I wonder if this is a case of trying to do something that is only allowed on fill-complete graphs?
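
The state rule behind that question can be modelled in a few lines. The sketch below is a toy class, not Tpetra code: `Map` and `ToyGraph` are hypothetical names, and `fillComplete()` here simply fabricates the missing maps. It only illustrates that an invariant of the shape quoted above is satisfied by a graph with a null column map as long as the graph is not fill-complete, so the check alone would not catch a caller that reads the column map too early.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical placeholder type; NOT a Tpetra class.
struct Map {};

class ToyGraph {
 public:
  // Like the row-map-only constructor discussed above: no column map is set.
  explicit ToyGraph(std::shared_ptr<const Map> rowMap)
      : rowMap_(std::move(rowMap)) {}

  // Toy version of fill-complete: this is where the remaining maps appear.
  void fillComplete() {
    assert(rowMap_ != nullptr);
    colMap_ = std::make_shared<Map>();
    domainMap_ = rowMap_;
    rangeMap_ = rowMap_;
    fillComplete_ = true;
  }

  bool isFillComplete() const { return fillComplete_; }
  std::shared_ptr<const Map> getColMap() const { return colMap_; }

  // Same shape as the quoted check: the maps are required only once the
  // graph is fill-complete.
  bool invariantHolds() const {
    return !(fillComplete_ && (!colMap_ || !rangeMap_ || !domainMap_));
  }

 private:
  std::shared_ptr<const Map> rowMap_;
  std::shared_ptr<const Map> colMap_;
  std::shared_ptr<const Map> domainMap_;
  std::shared_ptr<const Map> rangeMap_;
  bool fillComplete_ = false;
};
```

In this model, a freshly constructed `ToyGraph` passes `invariantHolds()` while `getColMap()` is still null, which matches the situation described for the nonlocals graph.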

@csiefer2
Member

@JaredCrean2: @tjfulle volunteered to take a look at this.

@tjfulle tjfulle self-assigned this Mar 19, 2024
@tjfulle
Contributor

tjfulle commented Mar 20, 2024

@JaredCrean2 - I started looking at this today. When I run the first executable, I see:

$ mpirun -n 3 ./bug
line = 0 3 0 1 2
inserting global indices:
  0, 3
  0, 0
  0, 1
  0, 2
line = 1 3 0 1 2
inserting global indices:
  1, 3
  1, 0
  1, 1
  1, 2
line = 2 3 4 5 0 1 2
inserting global indices:
  2, 3
  2, 4
  2, 5
  2, 0
  2, 1
  2, 2
line = 3 3 4 5 0 1 2
inserting global indices:
  3, 3
  3, 4
  3, 5
  3, 0
  3, 1
  3, 2
line = 4 3 4 5 2
inserting global indices:
  4, 3
  4, 4
  4, 5
  4, 2
line = 5 3 4 5 2
inserting global indices:
  5, 3
  5, 4
  5, 5
  5, 2

and no segfault. What should I expect to see?

@JaredCrean2
Author

bug is the one that needs 16 procs. bug2 is the simpler one that only needs 3 or more.

mpirun -n 3 ./bug2 segfaults for me on cee-build030.

@tjfulle
Contributor

tjfulle commented Mar 20, 2024 via email

@jhux2 jhux2 added this to Tpetra Aug 12, 2024
@jhux2 jhux2 moved this to Needs Triage in Tpetra Aug 12, 2024
Projects
Status: Needs Triage
Development

No branches or pull requests

4 participants