Teuchos: TeuchosCore_RCP_PerformanceTests_basic_MPI_1 test randomly failing maxRcpRawObjAccessRatio check #13728

Open
achauphan opened this issue Jan 16, 2025 · 6 comments
Labels
pkg: Teuchos (Issues primarily dealing with the Teuchos Package), type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@achauphan
Contributor

achauphan commented Jan 16, 2025

Bug Report

@trilinos/teuchos

Description

The TeuchosCore_RCP_PerformanceTests_basic_MPI_1 test has been unstable, randomly failing across multiple observed PRs. In such cases, this test is the only failure across the entire set of PR builds and is usually unrelated to any of the changes made in those PRs. When this test randomly fails and takes down an otherwise passing set of builds, a developer typically adds a RETEST label in the hope that the test passes the next time, which wastes testing resources.

A general query of all instances of this test failing (in the last 4 months):

(See more refined query filter below).

Of those failures, here are a few examples where this was the only failing test and it flipped to a passing result after a retest on the same merge commit hash:

The above were cases where the merge commit hash was the same between builds and the test result flipped between runs. There are also suspected cases where this test is randomly failing; however, because other changes were pushed to the PR between builds, the merge commit hash changed, and we cannot conclude that the test is randomly failing by looking at the merge commit hash alone. There are likely many examples of this type, but here is one suspected case.

NOTE: There is no easy way through CDash to directly identify that two PR builds from the same PR were tested against the same SHA of the branch under test. One way is to look at the configure output for each build and observe the merge commit SHA printed by TriBITS. Finding two sets of builds tested on the same SHA, where a test failed in one and passed in the other, indicates a randomly failing test.

Example configure output:

Trilinos repos versions:
--------------------------------------------------------------------------------
*** Base Git Repo: Trilinos
59b721cc145 [Thu Jan 2 07:56:22 2025 -0700] <[email protected]>
Merge commit '8abaee48b00ce8791d885b7e591323d02e29a9de' into develop
    *** Parent 1:
    85752f68cf8 [Mon Dec 30 13:44:28 2024 +0100] <[email protected]>
    Merge pull request #13696 from maxfirmbach/Amesos2-epetra-onnode-reindexing-nonc
    *** Parent 2:
    8abaee48b00 [Mon Dec 23 22:25:32 2024 +0000] <49699333+dependabot[bot]@users.noreply.github.com>
    Bump github/codeql-action from 3.27.9 to 3.28.0
 --------------------------------------------------------------------------------
@achauphan added the type: bug label Jan 16, 2025
@cgcgcg added the pkg: Teuchos label Jan 16, 2025
@ndellingwood
Contributor

ndellingwood commented Jan 16, 2025

This has been failing randomly (but infrequently) for quite some time. I first noticed and mentioned it in #11921 (comment), which correlated with the change of the default to Teuchos_ENABLE_THREAD_SAFE=ON to enforce thread safety (PR #11946).

Edit: added a note that the observed random occurrence is infrequent

@ndellingwood
Contributor

Previously reported issues: #8648, #6429. Is this a test whose results can be impacted by multiple tests executing simultaneously on the same resource?

@bartlettroscoe
Member

Previously reported issues: #8648, #6429. Is this a test whose results can be impacted by multiple tests executing simultaneously on the same resource?

#6429 was a different check failing within this test. However, #8648 does appear to be an exact duplicate of this issue. The fix seems clear but it looks like I failed to follow up on that (see #8648 (comment)). (There have been a lot of random failures like this over the years.)

Just an FYI to improve this and future bug reports like it: you want to use the "Test Output" filter to better analyze random failures like this. The best query to show the current failure is this query (click "Show Matching Output"), which includes the matching test output regex maxRcpRawObjAccessRatio.*FAILED.*RCP_Performance_UnitTests.cpp and has 14 hits showing output like:

 finalRcpRawRatio = 14.5932 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/workspace/PR_clang/Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:516
 
 [FAILED]  (0.000319 sec) RCP_dereferenceOverhead_UnitTest
 Location: /scratch/trilinos/workspace/PR_clang/Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:377
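
For context, the check being tripped here is a timing-ratio assertion: the test times object access through a raw pointer and through an RCP, forms the ratio, and fails when that ratio exceeds maxRcpRawObjAccessRatio (13.5 by default). Below is a minimal, self-contained sketch of that kind of check, using std::shared_ptr purely as a stand-in for Teuchos::RCP; the actual test code and thresholds are in packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp.

// Illustrative sketch only (not the actual Trilinos test code).
// Compile without aggressive optimization for the timing to be meaningful.
#include <chrono>
#include <iostream>
#include <memory>

int main() {
  const double maxRcpRawObjAccessRatio = 13.5;  // default being tripped in the output above
  const long numLoops = 10000000;

  auto sp = std::make_shared<long>(0);  // stand-in for an RCP-managed object
  long* rawPtr = sp.get();

  using Clock = std::chrono::steady_clock;

  // Time object access through the raw pointer.
  auto t0 = Clock::now();
  for (long i = 0; i < numLoops; ++i) { *rawPtr += 1; }
  const double rawTime = std::chrono::duration<double>(Clock::now() - t0).count();

  // Time object access through the smart pointer (the dereference overhead being measured).
  auto t1 = Clock::now();
  for (long i = 0; i < numLoops; ++i) { *sp += 1; }
  const double spTime = std::chrono::duration<double>(Clock::now() - t1).count();

  // The check fails when the measured ratio exceeds the allowed maximum.
  const double finalRcpRawRatio = spTime / rawTime;
  std::cout << "finalRcpRawRatio = " << finalRcpRawRatio
            << " <= maxRcpRawObjAccessRatio = " << maxRcpRawObjAccessRatio
            << (finalRcpRawRatio <= maxRcpRawObjAccessRatio ? " : passed" : " : FAILED")
            << std::endl;
  return (finalRcpRawRatio <= maxRcpRawObjAccessRatio) ? 0 : 1;
}

Because both loops are wall-clock timed, anything else competing for the CPU while the test runs can inflate the measured ratio, which is consistent with the earlier question about multiple tests executing simultaneously on the same resource.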

The query without the match on the output given above shows 18 tests. If you filter out the tests that match the output regex maxRcpRawObjAccessRatio.*FAILED.*RCP_Performance_UnitTests.cpp using this query and look at the output manually, you can see that all of those failed due to an inability to load libcuda.so:

/scratch/trilinos/workspace/PR_cuda-uvm/pull_request_test/packages/teuchos/core/test/MemoryManagement/TeuchosCore_RCP_PerformanceTests.exe: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

You can verify that by running this query, which greps the output for libcuda.so. So that is a completely different type of random test failure that has nothing to do with this maxRcpRawObjAccessRatio.*FAILED.*RCP_Performance_UnitTests.cpp failure. NOTE: If you look for all failed tests showing libcuda.so, you get a ton of failed tests in PR builds, as shown in this query, which shows 111591 failed tests. But if you look over those builds, you see that a bunch of tests fail. That should likely be a separate Trilinos issue because it is impacting many PRs as well! (I will run the tool create_trilinos_github_test_failure_issue_driver.sh.)

@bartlettroscoe changed the title from "Teuchos: TeuchosCore_RCP_PerformanceTests_basic_MPI_1 test randomly failing" to "Teuchos: TeuchosCore_RCP_PerformanceTests_basic_MPI_1 test randomly failing maxRcpRawObjAccessRatio check" Jan 16, 2025
@bartlettroscoe
Member

bartlettroscoe commented Jan 16, 2025

FYI: I ran:

../Trilinos/commonTools/framework/github_issue_creator/create_trilinos_github_test_failure_issue_driver.sh -u "https://trilinos-cdash.sandia.gov/queryTests.php?project=Trilinos&filtercount=4&showfilters=1&filtercombine=and&field1=testname&compare1=61&value1=TeuchosCore_RCP_PerformanceTests_basic_MPI_1&field2=status&compare2=61&value2=Failed&field3=buildstarttime&compare3=84&value3=now&field4=testoutput&compare4=97&value4=maxRcpRawObjAccessRatio.*FAILED.*RCP_Performance_UnitTests.cpp"
tribitsDir = '/home/rabartl/Trilinos.base/Trilinos/cmake/tribits'

which created the following GitHub Issue text ...

Next Action Status

Description

As shown in this query (click "Show Matching Output" in the upper right), the tests:

  • TeuchosCore_RCP_PerformanceTests_basic_MPI_1

in the unique GenConfig builds:

  • -11.0.1-openmpi-4.0.5_release
  • rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables
  • rhel8_sems-intel-2021.3-sems-openmpi-4.1.6_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables

started failing on testing day 2024-10-01.

The specific set of CDash builds impacted were:

  • PR-13484-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-611
  • PR-13530-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-683
  • PR-13531-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-674
  • PR-13554-test-rhel8_sems-intel-2021.3-sems-openmpi-4.1.6_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-669
  • PR-13567-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-907
  • PR-13589-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-861
  • PR-13633-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-935
  • PR-13679-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-984
  • PR-13679-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-989
  • PR-13693-test-rhel8_sems-intel-2021.3-sems-openmpi-4.1.6_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-919
  • PR-13713-test-rhel8_sems-intel-2021.3-sems-openmpi-4.1.6_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-948
  • PR-13715-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1040
  • PR-13715-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1061
  • clang-11.0.1-openmpi-4.0.5_release-debug_shared

<Add details about what is failing and what the failures look like. Make sure to include strings that are easy to match with GitHub Issue searches.>

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

See:

If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.

@bartlettroscoe
Member

FYI, if you run this query, which includes the matching test output regex maxRcpRawObjAccessRatio.*FAILED.*RCP_Performance_UnitTests.cpp, click "Show Matching Output", and then copy and paste the output, grep for maxRcpRawObjAccessRatio, and sort, you see the failures fall in the range 13.6 to 17.8:

 finalRcpRawRatio = 13.6435 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 13.8132 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 14.1935 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 14.4877 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/workspace/PR_clang/Trilinos/packages/teuchos/core/test/MemoryManagemen
 finalRcpRawRatio = 14.5932 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/workspace/PR_clang/Trilinos/packages/teuchos/core/test/MemoryManagemen
 finalRcpRawRatio = 14.803 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 15.0302 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 15.0909 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/workspace/PR_clang/Trilinos/packages/teuchos/core/test/MemoryManagemen
 finalRcpRawRatio = 15.8463 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/jenkins/ascic166/workspace/Nightly/Trilinos_nightly_pipeline/Trilinos/
 finalRcpRawRatio = 16.3204 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 16.6284 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:67
 finalRcpRawRatio = 16.7577 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 17.0477 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/jenkins/ascic166/workspace/PR_intel/Trilinos/packages/teuchos/core/tes
 finalRcpRawRatio = 17.8221 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:67

So if we increase maxRcpRawObjAccessRatio from 13.5 to, say, 20.0, that should solve the problem. I will put in that PR ASAP.
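
For reference, the shape of the fix is just a looser default tolerance that remains adjustable, so dedicated performance builds can still tighten it. The sketch below is illustrative only: the option name and the Teuchos::CommandLineProcessor wiring are assumptions, not the actual change; the real change is to the default value used in RCP_Performance_UnitTests.cpp.

// Sketch only (not the actual Trilinos change): raise the default tolerance
// and keep it adjustable. The option name here is hypothetical; the real test
// registers its options through the Teuchos unit-test harness.
#include <iostream>
#include "Teuchos_CommandLineProcessor.hpp"

int main(int argc, char* argv[]) {
  double maxRcpRawObjAccessRatio = 20.0;  // relaxed default (previously 13.5)

  Teuchos::CommandLineProcessor clp;
  clp.setOption("max-rcp-raw-obj-access-ratio", &maxRcpRawObjAccessRatio,
    "Maximum allowed ratio of RCP object access time to raw-pointer access time.");
  clp.parse(argc, argv);

  std::cout << "Using maxRcpRawObjAccessRatio = " << maxRcpRawObjAccessRatio << "\n";
  // ... timing loops would run here and compare finalRcpRawRatio to this value ...
  return 0;
}

With the default at 20.0, every observed failure above (13.6 to 17.8) would have passed, while a genuinely broken implementation would still be caught.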

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jan 16, 2025
…8648, trilinos#13728)

That should be high enough to avoid every random failure of this check ever
observed in Trilinos PR testing.

It is debatable if a test such as this should be run in all builds or in just
dedicated performance builds.  (The default timing ratios are very loose.)  We
just want to make sure these tests are not broken in every build so that this
test will be able to run in performance builds.

Signed-off-by: Roscoe A. Bartlett <[email protected]>
@bartlettroscoe
Member

@achauphan and @ndellingwood, the fixing PR is #13729. Please review and approve.

trilinos-autotester added a commit that referenced this issue Jan 16, 2025
…-default-maxRcpRawObjAccessRatio

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: b'Increase default maxRcpRawObjAccessRatio from 13.5 to 20.0 (#8648, #13728)'
PR Author: bartlettroscoe