Random test timeouts bringing down PR build & test iterations 2023 #12391
Comments
FYI: I created the following internal Trilinos HelpDesk issue asking about rerunning timing out tests:
@bartlettroscoe What was the random failure for "MueLu_ParameterListInterpreterTpetra_MPI_4"? How often does that happen?
I proposed a
@bartlettroscoe, #12316 went through about 3 weeks ago and should have taken care of the Tempus_BackwardEuler timeouts. It is believed that these timeouts were due to debug builds and machine loads. Have you seen timeouts from this in the past ~three weeks?
Ok, looking closer at the table above, those timeouts occurred on 9/14 and 9/11, and #12316 went through on 9/23. Hopefully this fixed it, but let me know if they pop up again.
@sebrowne, I don't think that would be effective because there may be persistently failing tests that will never pass. I think that if the Trilinos Framework CTest -S driver reran the failing tests just once, that would likely resolve 95% of random test failures (including random test timeouts) and avoid having to run all of the builds from scratch again. In addition, the Trilinos PR testing system desperately needs some automated monitoring set up to detect random failures so they can be addressed. Right now Trilinos is flying blind; only when someone does manual CDash queries like I have done above do you see the issues. Simply rerunning randomly failing tests is a good step forward, but it does not address the underlying problems with the Trilinos test suite.

NOTE: Random timeouts are a much easier problem because you can just rerun them in serial and they will pass. That is why I scoped this issue to focus on just random timeouts. (Also note that some timeouts are not random but are due to changes on the PR topic branch.)

Can we discuss this topic at the next Trilinos Developers meeting or during the TUG developer day?
I will look at Tempus_IMEX_RK_Partitioned_Staggered_FSA_Partitioned_IMEX_RK_1st_Order and see if I can break it down into smaller jobs. Otherwise we may need to disable it for debug builds.
@cgcgcg and @ccober6, those are all questions that can be answered by running CDash queries, like I did above. But note that there is insufficient data on CDash to fully analyze what is happening. That is, you don't know the exact version of the 'develop' branch and the PR topic branches that are being tested just by looking at CDash; you also have to look at the Trilinos PR comments. It is not clear how hard it would be to write a tool that could do that (since you would have to parse the HTML pages).
@sebrowne, I would be curious what the argument could be against running a small number of timing out tests again in serial at the end of a PR build.
For the one time MueLu_ParameterListInterpreterTpetra_MPI_4 timed out in the last month, we were getting this:
This looks like a system-level MPI problem to me. I don't think this is something we, as Trilinos developers, need to worry about.
Race conditions. A lot of race conditions show up only when we put a load on the CUDA cards with multiple tests. I've got many examples from the ATDM apps where the tests pass fine when run by themselves, but running them with other tests triggers the race condition. Rerunning failed tests by themselves would allow a much higher chance of getting a race condition into the code base. Personally, I'd rather hit retest once in a while than try to figure out which commit might be causing a race condition. That was the thinking.
@csiefer2, that is not the point of this issue #12391. The point of this issue is that we should not have the Trilinos PR testing system blow away all of the builds and start from scratch for one single test timeout due to a system issue (like overloading the machine, or some silly MPI communication problems).
The ability to restart single jobs to rerun ONLY those jobs is one of the considerations for the GitHub Actions based CI that we are working towards. That may help address some of the concern, at least with respect to the overhead of re-running a more-appropriate set of jobs.
@sebrowne, you should be able to do that from within the same CTest -S invocation with another call to
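A minimal sketch of what such a second ctest_test() call could look like inside a CTest -S script is shown below. This is only an illustration, not the actual Trilinos PR driver; it assumes the failed-test list that CTest writes to Testing/Temporary/LastTestsFailed.log (the same file that `ctest --rerun-failed` reads), and the parallelism levels are arbitrary.

```cmake
# Hedged sketch: after the main test pass, rerun only the tests that failed,
# one at a time, so machine load cannot cause new timeouts.
ctest_test(PARALLEL_LEVEL 16 RETURN_VALUE first_rv)

set(failed_log "${CTEST_BINARY_DIRECTORY}/Testing/Temporary/LastTestsFailed.log")
if(NOT first_rv EQUAL 0 AND EXISTS "${failed_log}")
  file(STRINGS "${failed_log}" failed_lines)
  set(failed_regexes "")
  foreach(line IN LISTS failed_lines)
    # Each line looks like "123:SomePackage_SomeTest_MPI_4"; keep just the name.
    string(REGEX REPLACE "^[0-9]+:" "" test_name "${line}")
    list(APPEND failed_regexes "^${test_name}$")
  endforeach()
  list(JOIN failed_regexes "|" include_regex)
  # Second ctest_test() call: rerun just those tests serially and append the
  # results to the same dashboard submission.
  ctest_test(INCLUDE "${include_regex}" PARALLEL_LEVEL 1 APPEND)
endif()
```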
@rppawlo, only one of the timeouts shown above was a CUDA build. Also, in production runs of these APP codes, it is almost always one MPI process per node, so this is typically not a production problem (just a testing problem, and therefore harder to get people excited about). Also, Trilinos uses the test resource allocation method that Kitware added to CTest (see here), which limits how many tests can share a GPU at once. And for production runs, Kokkos contains automatic logic for spreading out the GPU work.
@rppawlo, that would be great, if people actually looked into these random failures and stopped the merge of the PR until they addressed the issue to make sure that they were not injecting a new race condition. But no one (except for me, as far as I can tell) actually does that. Consider what happened in the PRs listed above when these tests were encountered:
The point is that almost no one bothers looking at the failures or reporting them so that they don't bring down other PR testing iterations in the future. (But kudos to @csiefer2 for doing that and reporting it to @ccober6.) So beyond just rerunning a small number of timing out tests automatically (which is the purpose of this issue #12391), we need to set up some automated monitoring to catch random failures like the ones above (and other more common non-timeout random failures).
One can argue that the current Trilinos PR testing system is training developers to ignore random test failures that may be injecting race conditions into the 'develop' branch (because there are so many random failures that are not related to the PR branch, and it is so easy to ignore the test results and just set
@rppawlo, it just occurred to me, that if Trilinos really wants to get serious about detecting race conditions, the best way to do that is to run a CDash monitoring tool that automatically generates a new Trilinos GitHub issue when it detects the following:
In theory, it is easy to determine if a test is likely to be randomly failing:
In that situation, if the test is not randomly failing, the only ways that it could go from failing to passing (where there were no changes on the PR topic branch) are:
By only flagging a single test that randomly fails twice, we avoid random system failures (like random MPI_init() failures that are independent of the test) that kill tests indiscriminately (where it would be very unlikely that the same test would be affected twice).
@csiefer2 FWIW, in April of 2022 the Xyce team was experiencing errors like this when running on our ascic development machines. This was also observed by Charon analysts when running on cee-compute nodes.
In the end, I found Erik Illescas (Org. 9327), who recommended that we set environment variables to address this:
It did resolve this type of runtime error, at least for Xyce. David Collins also suggested a method for passing in the same variables at execution time:
Tempus just bit me again :) #12393
@csiefer2, that does not look like a Tempus issue. It appears to be a network problem; the test output itself ends with `Summary: total = 2, run = 2, passed = 2, failed = 0` and `End Result: TEST PASSED`.
@sebrowne Can we get Trilinos to try the magic OMPI_MCA_btl flags?
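If the Trilinos PR drivers did try such flags, one place to set them would be the CTest -S script itself. Below is a purely hypothetical sketch; the particular MCA value shown (excluding Open MPI's vader shared-memory BTL) is only a placeholder and is not the setting recommended earlier in this thread.

```cmake
# Hypothetical illustration only: export an Open MPI MCA setting so that every
# test launched by this CTest -S driver inherits it. The value below is a
# placeholder, NOT the flags recommended in this thread.
set(ENV{OMPI_MCA_btl} "^vader")   # use all BTLs except the vader shared-memory BTL

ctest_test(PARALLEL_LEVEL 16)
```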
So with rerunning tests, any race condition (not just ones that show up with overloaded CUDA cards) can just be ignored by our PR system by getting lucky. This seems dangerous, but I guess it depends on how often race conditions get introduced. Personally, I'd rather try to catch the code before it gets into the develop branch rather than rely on post-push testing that could take a month to identify the problem. With the introduction of Kokkos, we've had some really nasty race conditions that were hard to track down and burned a lot of manpower. Once a bug is in develop, it potentially impacts other developers and downstream applications (with multiple teams spending time triaging). I realize we can't catch them all, and I think that your script for checking failures should be done regardless. Just trying to minimize broken code in the develop branch. Guess we can talk about this more at TUG.

Many of the random failures look like the hang @hkthorn showed. I've seen these in EMPIRE and local workstation testing. Thanks for that recommendation Heidi!

Side note: We use the CMake resource management tools in EMPIRE, but for unit testing allocations we purposely overload the cards with the resource tool. Unit tests are so small they have no chance of saturating a GPU, so we allocate 3 MPI processes to each CUDA card during unit testing. This brought the EMPIRE unit test time on CUDA down from 3 hours (using CUDA resources with one MPI process per CUDA card) to 1 hour 10 minutes (with 3 MPI procs per CUDA card).
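As a concrete illustration of that over-subscription trick, here is a hedged sketch using CTest's resource-allocation feature; the test name, executable, and slot counts are invented, and this is not EMPIRE's or Trilinos' actual configuration.

```cmake
# Hedged sketch (CMake >= 3.16): a 3-rank unit test where each rank claims one
# "gpus" slot, declared in a package's CMakeLists.txt. Names are invented.
add_executable(my_unit_tests my_unit_tests.cpp)
add_test(NAME MyPackage_SmallUnitTest_MPI_3
         COMMAND mpiexec -np 3 $<TARGET_FILE:my_unit_tests>)
set_tests_properties(MyPackage_SmallUnitTest_MPI_3 PROPERTIES
  RESOURCE_GROUPS "3,gpus:1"   # three resource groups (one per rank), one slot each
  PROCESSORS 3)

# A resource spec file that deliberately gives each card 3 slots, so up to
# three of these 1-slot ranks get packed onto one GPU:
#   { "version": { "major": 1, "minor": 0 },
#     "local": [ { "gpus": [ { "id": "0", "slots": 3 },
#                            { "id": "1", "slots": 3 } ] } ] }
# It is passed with 'ctest --resource-spec-file resources.json' or via the
# RESOURCE_SPEC_FILE argument of ctest_test() in a CTest -S driver.
```

Note that CTest only does the bookkeeping; each test still has to read the CTEST_RESOURCE_GROUP_* environment variables it is handed and bind its ranks to the corresponding cards (e.g., by setting CUDA_VISIBLE_DEVICES).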
See #11391 for an introduction of the resource manager into PR testing. I've never gotten around to diagnosing the STK unit test failure, but we should probably do that too (what you said about EMPIRE seemed to track with Trilinos testing as well in terms of time reduction).
Yes, I'll schedule work for it ASAP. I'll have to research and understand which features we're explicitly enabling, and I'd like to compare them with SIERRA's launch options for curiosity's sake.
@rppawlo, but if it is a rare race condition, it will most likely not cause a test failure in the PR that introduces the problem and it will get merged anyway. So you will not even detect it until some time later (like what happened with that Kokkos CUDA Scan bug years ago).
The CDash monitoring process I describe above can look at active PRs before they are even merged to 'develop'. And it will do a better job than humans (because it will actually be done instead of being ignored). But to be clear, the majority of people are just not manually looking at and reporting these random test failures in their PRs. They are just setting
Trilinos could do that as well based on the test categories BASIC, CONTINUOUS, NIGHTLY, and HEAVY, but that would require Trilinos tests to be marked correctly based on time and resource usage. (But we would need profiling to know how many resources each test uses in order to know if it should be marked BASIC.)
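For context, this is roughly what such markings look like with TriBITS's tribits_add_test() CATEGORIES argument; the test names and settings below are invented, not real Trilinos tests.

```cmake
# Hedged sketch of category markings with TriBITS (names and settings invented).
# A cheap test intended to run in every PR testing iteration:
tribits_add_test(MyFastUnitTest
  NUM_MPI_PROCS 4
  CATEGORIES BASIC
  STANDARD_PASS_OUTPUT)

# A long-running test intended only for nightly and heavy testing, not PRs:
tribits_add_test(MyExpensiveConvergenceTest
  NUM_MPI_PROCS 16
  CATEGORIES NIGHTLY HEAVY
  STANDARD_PASS_OUTPUT)
```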
Here is another random timeout taking out an entire set of Trilinos PR builds (building every Trilinos package and test suite) :-( This time, it was the test:
in the build:
Yeah, it is on my list. :(
Sorry, wasn't trying to nag, just adding information.
The PR iteration #12442 (comment) also experienced a random test timeout in Piro_TempusSolver_SensitivitySinCosUnitTests_MPI_4.
Not a problem. It is just a little frustrating when these tests run in under 10 seconds (in optimized mode) on my laptop, but in a debug build on a platform under load they run longer than 10 minutes!
This should be fixed with #12484.
Not a timeout, but here is another random test failure bringing down an entire PR test iteration:
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. |
CC: @ccober6, @sebrowne, @trilinos/framework
Description
Anecdotal evidence seems to suggest that random test failures, including random timeouts, are bringing down PR testing iterations fairly regularly. When this happens, all of the builds need to be run again from scratch, wasting computing resources, blocking PR testing iterations for other PRs, and delaying the merge of PRs.
For example, this query over the last two months suggests that random test timeouts took out PR testing iterations for the following PRs:
Note that the test timeouts for PRs #12050 and #12297 shown in that query don't appear to be random. Filtering out those PRs yields this reduced query over the last two months, which shows the 7 randomly failing tests:
NOTE: Further analysis would be needed to confirm that all of these tests were random timeouts. But I believe that a tool could be written to automatically determine if a timeout (or any test failure) was random. It would actually not be that hard to do.
Suggested solution
The simple solution would seem to be for the ctest -S driver to just rerun the failing tests again, in serial, to avoid the timeouts. For example, CTest directly supports this with the `--repeat after-timeout:<n>` command-line argument and the `ctest_test()` argument `REPEAT AFTER_TIMEOUT:<n>`.
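To make the suggestion concrete, here is a minimal CTest -S driver sketch showing where that option would go. It is a hedged illustration, not the actual Trilinos PR driver: the paths, generator, parallel level, and retry count are placeholders, and REPEAT requires CMake/CTest 3.17 or newer.

```cmake
# Minimal CTest -S driver sketch (placeholders, not the real Trilinos PR driver).
cmake_minimum_required(VERSION 3.17)  # REPEAT support in ctest_test()

set(CTEST_SOURCE_DIRECTORY "$ENV{WORKSPACE}/Trilinos")  # hypothetical workspace layout
set(CTEST_BINARY_DIRECTORY "$ENV{WORKSPACE}/build")
set(CTEST_CMAKE_GENERATOR  "Ninja")

ctest_start(Experimental)
ctest_configure()
ctest_build()

# If a test fails only because it timed out, give it up to 2 total attempts
# before reporting it as failed; ordinary failures are still reported right away.
ctest_test(PARALLEL_LEVEL 16 REPEAT AFTER_TIMEOUT:2)

ctest_submit()
```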