Teuchos: TeuchosCore_RCP_PerformanceTests_basic_MPI_1 test randomly failing maxRcpRawObjAccessRatio check #13728

Open
achauphan opened this issue Jan 16, 2025 · 6 comments
Labels
pkg: Teuchos (Issues primarily dealing with the Teuchos Package), type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@achauphan
Contributor

achauphan commented Jan 16, 2025

Bug Report

@trilinos/teuchos

Description

The TeuchosCore_RCP_PerformanceTests_basic_MPI_1 test has been unstable, randomly failing across multiple observed PRs. In such cases, this test is the only failure across the entire set of PR builds and is usually unrelated to any of the changes made in those PRs. When this test randomly fails and takes down an otherwise passing set of builds, a developer typically adds a RETEST label in the hope that the test passes the next time, which wastes testing resources.

A general query of all instances of this test failing (in the last 4 months):

(See more refined query filter below).

Of those failures, here are a few examples where this was the only failing test and it flipped to a passing result after a retest on the same merge commit hash:

The above were cases where the merge commit hash was the same between builds and the test result flipped between runs. There are also suspected cases where this test is randomly failing; however, because other changes were pushed to the PR between builds, the merge commit hash changed, and we cannot conclude that the test is randomly failing by looking at the merge commit hash alone. There are likely many examples of this type, but here is one suspected case.

NOTE: There is no easy way through CDash to directly identify that two PR builds from the same PR were tested against the same SHA of the branch under test. One way is to look at the configure output for each build and observe the merge commit SHA printed by TriBITS. Finding two sets of builds tested on the same SHA, where a test failed in one and passed in the other, indicates a randomly failing test.

Example configure output:

Trilinos repos versions:
--------------------------------------------------------------------------------
*** Base Git Repo: Trilinos
59b721cc145 [Thu Jan 2 07:56:22 2025 -0700] <[email protected]>
Merge commit '8abaee48b00ce8791d885b7e591323d02e29a9de' into develop
    *** Parent 1:
    85752f68cf8 [Mon Dec 30 13:44:28 2024 +0100] <[email protected]>
    Merge pull request #13696 from maxfirmbach/Amesos2-epetra-onnode-reindexing-nonc
    *** Parent 2:
    8abaee48b00 [Mon Dec 23 22:25:32 2024 +0000] <49699333+dependabot[bot]@users.noreply.github.com>
    Bump github/codeql-action from 3.27.9 to 3.28.0
 --------------------------------------------------------------------------------
@achauphan added the type: bug label Jan 16, 2025
@cgcgcg added the pkg: Teuchos label Jan 16, 2025
@ndellingwood
Contributor

ndellingwood commented Jan 16, 2025

This has been failing randomly (but infrequently) for quite some time. I first noticed and mentioned it in #11921 (comment), which correlated with the change of the default to Teuchos_ENABLE_THREAD_SAFE=ON to enforce thread safety (PR #11946).

Edit: added a note that the observed random occurrence is infrequent

@ndellingwood
Contributor

Previously reported issues: #8648, #6429. Is this a test whose results can be impacted by multiple tests executing simultaneously on the same resource?

@bartlettroscoe
Member

Previously reported issues: #8648, #6429. Is this a test whose results can be impacted by multiple tests executing simultaneously on the same resource?

#6429 was a different check failing within this test. However, #8648 does appear to be an exact duplicate of this issue. The fix seems clear but it looks like I failed to follow up on that (see #8648 (comment)). (There have been a lot of random failures like this over the years.)

Just an FYI to improve this and future bug reports like it: you want to use the "Test Output" filter to better analyze random failures like this. The best query to show the current failure is this query (click "Show Matching Output"), which includes the matching test output regex maxRcpRawObjAccessRatio.*FAILED.*RCP_Performance_UnitTests.cpp and has 14 hits showing output like:

 finalRcpRawRatio = 14.5932 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/workspace/PR_clang/Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:516
 
 [FAILED]  (0.000319 sec) RCP_dereferenceOverhead_UnitTest
 Location: /scratch/trilinos/workspace/PR_clang/Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:377
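
For context, the check being tripped here is a timing-ratio assertion: the test times object access through a raw pointer and through an RCP, forms the ratio, and fails when that ratio exceeds maxRcpRawObjAccessRatio (13.5 by default). Below is a minimal, self-contained sketch of that kind of check, using std::shared_ptr purely as a stand-in for Teuchos::RCP; the actual test code and thresholds are in packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp.

// Illustrative sketch only (not the actual Trilinos test code).
// Compile without aggressive optimization for the timing to be meaningful.
#include <chrono>
#include <iostream>
#include <memory>

int main() {
  const double maxRcpRawObjAccessRatio = 13.5;  // default being tripped in the output above
  const long numLoops = 10000000;

  auto sp = std::make_shared<long>(0);  // stand-in for an RCP-managed object
  long* rawPtr = sp.get();

  using Clock = std::chrono::steady_clock;

  // Time object access through the raw pointer.
  auto t0 = Clock::now();
  for (long i = 0; i < numLoops; ++i) { *rawPtr += 1; }
  const double rawTime = std::chrono::duration<double>(Clock::now() - t0).count();

  // Time object access through the smart pointer (the dereference overhead being measured).
  auto t1 = Clock::now();
  for (long i = 0; i < numLoops; ++i) { *sp += 1; }
  const double spTime = std::chrono::duration<double>(Clock::now() - t1).count();

  // The check fails when the measured ratio exceeds the allowed maximum.
  const double finalRcpRawRatio = spTime / rawTime;
  std::cout << "finalRcpRawRatio = " << finalRcpRawRatio
            << " <= maxRcpRawObjAccessRatio = " << maxRcpRawObjAccessRatio
            << (finalRcpRawRatio <= maxRcpRawObjAccessRatio ? " : passed" : " : FAILED")
            << std::endl;
  return (finalRcpRawRatio <= maxRcpRawObjAccessRatio) ? 0 : 1;
}

Because both loops are wall-clock timed, anything else competing for the CPU while the test runs can inflate the measured ratio, which is consistent with the earlier question about multiple tests executing simultaneously on the same resource.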

The query without the match on the output given above shows 18 tests. If you filter out the tests that match the output regex maxRcpRawObjAccessRatio.*FAILED.*RCP_Performance_UnitTests.cpp using this query and look at the output manually, you can see that all of those failed due to an inability to load libcuda.so:

/scratch/trilinos/workspace/PR_cuda-uvm/pull_request_test/packages/teuchos/core/test/MemoryManagement/TeuchosCore_RCP_PerformanceTests.exe: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

You can verify that by running this query, which greps the output for libcuda.so. So that is a completely different type of random test failure that has nothing to do with this maxRcpRawObjAccessRatio.*FAILED.*RCP_Performance_UnitTests.cpp failure. NOTE: If you look for all failed tests showing libcuda.so, you get a ton of failed tests in PR builds, as shown in this query, which shows 111591 failed tests. But if you look over those builds, you see that a bunch of tests fail. That should likely be a separate Trilinos issue because it is impacting many PRs as well! (I will run the tool create_trilinos_github_test_failure_issue_driver.sh.)

@bartlettroscoe changed the title from "Teuchos: TeuchosCore_RCP_PerformanceTests_basic_MPI_1 test randomly failing" to "Teuchos: TeuchosCore_RCP_PerformanceTests_basic_MPI_1 test randomly failing maxRcpRawObjAccessRatio check" Jan 16, 2025
@bartlettroscoe
Member

bartlettroscoe commented Jan 16, 2025

FYI: I ran:

../Trilinos/commonTools/framework/github_issue_creator/create_trilinos_github_test_failure_issue_driver.sh -u "https://trilinos-cdash.sandia.gov/queryTests.php?project=Trilinos&filtercount=4&showfilters=1&filtercombine=and&field1=testname&compare1=61&value1=TeuchosCore_RCP_PerformanceTests_basic_MPI_1&field2=status&compare2=61&value2=Failed&field3=buildstarttime&compare3=84&value3=now&field4=testoutput&compare4=97&value4=maxRcpRawObjAccessRatio.*FAILED.*RCP_Performance_UnitTests.cpp"
tribitsDir = '/home/rabartl/Trilinos.base/Trilinos/cmake/tribits'

which created the following GitHub Issue text ...

Next Action Status

Description

As shown in this query (click "Show Matching Output" in the upper right), the tests:

  • TeuchosCore_RCP_PerformanceTests_basic_MPI_1

in the unique GenConfig builds:

  • -11.0.1-openmpi-4.0.5_release
  • rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables
  • rhel8_sems-intel-2021.3-sems-openmpi-4.1.6_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables

started failing on testing day 2024-10-01.

The specific set of CDash builds impacted were:

  • PR-13484-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-611
  • PR-13530-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-683
  • PR-13531-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-674
  • PR-13554-test-rhel8_sems-intel-2021.3-sems-openmpi-4.1.6_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-669
  • PR-13567-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-907
  • PR-13589-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-861
  • PR-13633-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-935
  • PR-13679-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-984
  • PR-13679-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-989
  • PR-13693-test-rhel8_sems-intel-2021.3-sems-openmpi-4.1.6_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-919
  • PR-13713-test-rhel8_sems-intel-2021.3-sems-openmpi-4.1.6_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-948
  • PR-13715-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1040
  • PR-13715-test-rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1061
  • clang-11.0.1-openmpi-4.0.5_release-debug_shared

<Add details about what is failing and what the failures look like. Make sure to include strings that are easy to match with GitHub Issue searches.>

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

See:

If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.

@bartlettroscoe
Member

FYI, if you run this query, which includes the matching test output regex maxRcpRawObjAccessRatio.*FAILED.*RCP_Performance_UnitTests.cpp, click "Show Matching Output", and then copy and paste the output, grep for maxRcpRawObjAccessRatio, and sort, you see the failures fall in the range 13.6 to 17.8:

 finalRcpRawRatio = 13.6435 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 13.8132 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 14.1935 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 14.4877 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/workspace/PR_clang/Trilinos/packages/teuchos/core/test/MemoryManagemen
 finalRcpRawRatio = 14.5932 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/workspace/PR_clang/Trilinos/packages/teuchos/core/test/MemoryManagemen
 finalRcpRawRatio = 14.803 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 15.0302 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 15.0909 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/workspace/PR_clang/Trilinos/packages/teuchos/core/test/MemoryManagemen
 finalRcpRawRatio = 15.8463 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/jenkins/ascic166/workspace/Nightly/Trilinos_nightly_pipeline/Trilinos/
 finalRcpRawRatio = 16.3204 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 16.6284 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:67
 finalRcpRawRatio = 16.7577 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:51
 finalRcpRawRatio = 17.0477 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/jenkins/ascic166/workspace/PR_intel/Trilinos/packages/teuchos/core/tes
 finalRcpRawRatio = 17.8221 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> ../Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:67

So if we increase maxRcpRawObjAccessRatio from 13.5 to, say, 20.0, that should solve the problem. I will put in that PR ASAP.
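
For reference, the shape of the fix is just a looser default tolerance that remains adjustable, so dedicated performance builds can still tighten it. The sketch below is illustrative only: the option name and the Teuchos::CommandLineProcessor wiring are assumptions, not the actual change; the real change is to the default value used in RCP_Performance_UnitTests.cpp.

// Sketch only (not the actual Trilinos change): raise the default tolerance
// and keep it adjustable. The option name here is hypothetical; the real test
// registers its options through the Teuchos unit-test harness.
#include <iostream>
#include "Teuchos_CommandLineProcessor.hpp"

int main(int argc, char* argv[]) {
  double maxRcpRawObjAccessRatio = 20.0;  // relaxed default (previously 13.5)

  Teuchos::CommandLineProcessor clp;
  clp.setOption("max-rcp-raw-obj-access-ratio", &maxRcpRawObjAccessRatio,
    "Maximum allowed ratio of RCP object access time to raw-pointer access time.");
  clp.parse(argc, argv);

  std::cout << "Using maxRcpRawObjAccessRatio = " << maxRcpRawObjAccessRatio << "\n";
  // ... timing loops would run here and compare finalRcpRawRatio to this value ...
  return 0;
}

With the default at 20.0, every observed failure above (13.6 to 17.8) would have passed, while a genuinely broken implementation would still be caught.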

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jan 16, 2025
…8648, trilinos#13728)

That should be high enough to avoid every random failure of this check ever
observed in Trilinos PR testing.

It is debatable if a test such as this should be run in all builds or in just
dedicated performance builds.  (The default timing ratios are very loose.)  We
just want to make sure these tests are not broken in every build so that this
test will be able to run in performance builds.

Signed-off-by: Roscoe A. Bartlett <[email protected]>
@bartlettroscoe
Member

@achauphan and @ndellingwood, the fixing PR is #13729. Please review and approve.

trilinos-autotester added a commit that referenced this issue Jan 16, 2025
…-default-maxRcpRawObjAccessRatio

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: b'Increase default maxRcpRawObjAccessRatio from 13.5 to 20.0 (#8648, #13728)'
PR Author: bartlettroscoe