Random test timeouts bringing down PR build & test iterations 2023 #12391
Comments
FYI: I created the following internal Trilinos HelpDesk issue asking about rerunning timing out tests:
@bartlettroscoe What was the random failure for "MueLu_ParameterListInterpreterTpetra_MPI_4"? How often does that happen?
I proposed a
@bartlettroscoe, #12316 went through about 3 weeks ago and should have taken care of the Tempus_BackwardEuler timeouts. It is believed that these timeouts were due to debug builds and machine loads. Have you seen timeouts from this in the past ~three weeks?
Ok, looking closer at the table above, those timeouts occurred on 9/14 and 9/11, and #12316 went through on 9/23. Hopefully this fixed it, but let me know if they pop up again.
@sebrowne, I don't think that would be effective because there may be persistently failing tests that will never pass. I think that if the Trilinos Framework CTest -S driver reran the failing tests just once, that would likely resolve 95% of random test failures (including random test timeouts) and avoid having to run all of the builds from scratch again. In addition, the Trilinos PR testing system desperately needs some automated monitoring set up to detect random failures so they can be addressed. Right now Trilinos is flying blind; only when someone does manual CDash queries like I have done above do you see the issues. Simply rerunning randomly failing tests is a good step forward, but it does not address the underlying problems with the Trilinos test suite.

NOTE: Random timeouts are a much easier problem because you can just rerun them in serial and they will pass. That is why I scoped this issue to focus on just random timeouts. (Also note that some timeouts are not random but are due to changes on the PR topic branch.)

Can we discuss this topic at the next Trilinos Developers meeting or during the TUG developer day?
I will look at Tempus_IMEX_RK_Partitioned_Staggered_FSA_Partitioned_IMEX_RK_1st_Order and see if I can break it down into smaller jobs. Otherwise we may need to disable it for debug builds.
@cgcgcg and @ccober6, those are all questions that can be answered by running CDash queries, like I did above. But note that there is insufficient data on CDash to fully analyze what is happening. That is, you don't know the exact version of the 'develop' branch and the PR topic branches that are being tested just by looking at CDash; you also have to look at the Trilinos PR comments. It is not clear how hard it would be to write a tool that could do that (since you would have to parse the HTML pages).
@sebrowne, I would be curious what the argument could be against running a small number of timing out tests again in serial at the end of a PR build.
For the one time MueLu_ParameterListInterpreterTpetra_MPI_4 timed out in the last month, we were getting this:
This looks like a system-level MPI problem to me. I don't think this is something we, as Trilinos developers, need to worry about.
Race conditions. A lot of race conditions show up only when we put a load on the CUDA cards with multiple tests. I've got many examples from the ATDM apps where the tests pass fine when run by themselves, but running them with other tests triggers the race condition. Rerunning failed tests by themselves would allow a much higher chance of getting a race condition into the code base. Personally, I'd rather hit retest once in a while than try to figure out which commit might be causing a race condition. That was the thinking.
@csiefer2, that is not the point of this issue #12391. The point of this issue is that we should not have the Trilinos PR testing system blow away all of the builds and start from scratch for one single test timeout due to a system issue (like overloading the machine, or some silly MPI communication problems).
The ability to restart single jobs to rerun ONLY those jobs is one of the considerations for the GitHub Actions based CI that we are working towards. That may help address some of the concern, at least with respect to the overhead of re-running a more-appropriate set of jobs.
@sebrowne, you should be able to do that from within the same CTest -S invocation with another call to
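A minimal sketch of what such a second ctest_test() call could look like inside a CTest -S script is shown below. This is only an illustration, not the actual Trilinos PR driver; it assumes the failed-test list that CTest writes to Testing/Temporary/LastTestsFailed.log (the same file that `ctest --rerun-failed` reads), and the parallelism levels are arbitrary.

```cmake
# Hedged sketch: after the main test pass, rerun only the tests that failed,
# one at a time, so machine load cannot cause new timeouts.
ctest_test(PARALLEL_LEVEL 16 RETURN_VALUE first_rv)

set(failed_log "${CTEST_BINARY_DIRECTORY}/Testing/Temporary/LastTestsFailed.log")
if(NOT first_rv EQUAL 0 AND EXISTS "${failed_log}")
  file(STRINGS "${failed_log}" failed_lines)
  set(failed_regexes "")
  foreach(line IN LISTS failed_lines)
    # Each line looks like "123:SomePackage_SomeTest_MPI_4"; keep just the name.
    string(REGEX REPLACE "^[0-9]+:" "" test_name "${line}")
    list(APPEND failed_regexes "^${test_name}$")
  endforeach()
  list(JOIN failed_regexes "|" include_regex)
  # Second ctest_test() call: rerun just those tests serially and append the
  # results to the same dashboard submission.
  ctest_test(INCLUDE "${include_regex}" PARALLEL_LEVEL 1 APPEND)
endif()
```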
@rppawlo, only one of the timeouts shown above was a CUDA build. Also, in production runs of these APP codes, it is almost always one MPI process per node, so this is typically not a production problem (just a testing problem, and therefore harder to get people excited about). Also, Trilinos uses the test resource allocation method that Kitware added to CTest (see here), which limits how many tests can share a GPU at once. And for production runs, Kokkos contains automatic logic for spreading out the GPU work.
@rppawlo, that would be great, if people actually looked into these random failures and stopped the merge of the PR until they addressed the issue to make sure that they were not injecting a new race condition. But no one (except for me, as far as I can tell) actually does that. Consider what happened in the PRs listed above when these tests were encountered:
The point is that almost no one bothers looking at the failures or reporting them so that they don't bring down other PR testing iterations in the future. (But kudos to @csiefer2 for doing that and reporting it to @ccober6.) So beyond just rerunning a small number of timing out tests automatically (which is the purpose of this issue #12391), we need to set up some automated monitoring to catch random failures like the ones above (and other more common non-timeout random failures).
One can argue that the current Trilinos PR testing system is training developers to ignore random test failures that may be injecting race conditions into the 'develop' branch (because there are so many random failures that are not related to the PR branch, and it is so easy to ignore the test results and just set
@rppawlo, it just occurred to me, that if Trilinos really wants to get serious about detecting race conditions, the best way to do that is to run a CDash monitoring tool that automatically generates a new Trilinos GitHub issue when it detects the following:
In theory, it is easy to determine if a test is likely to be randomly failing:
In that situation, if the test is not randomly failing, the only ways that it could go from failing to passing (where there were no changes on the PR topic branch) are:
By only flagging a single test that randomly fails twice, we avoid random system failures (like random MPI_init() failures that are independent of the test) that kill tests indiscriminately (where it would be very unlikely that the same test would be affected twice).
@csiefer2 FWIW, in April of 2022 the Xyce team was experiencing errors like this when running on our ascic development machines. This was also observed by Charon analysts when running on cee-compute nodes.
In the end, I found Erik Illescas (Org. 9327), who recommended that we set environment variables to address this:
It did resolve this type of runtime error, at least for Xyce. David Collins also suggested a method for passing in the same variables at execution time:
Tempus just bit me again :) #12393
@csiefer2, that does not look like a Tempus issue. It appears to be a network problem; the test output itself ends with `Summary: total = 2, run = 2, passed = 2, failed = 0` and `End Result: TEST PASSED`.
@sebrowne Can we get Trilinos to try the magic OMPI_MCA_btl flags?
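If the Trilinos PR drivers did try such flags, one place to set them would be the CTest -S script itself. Below is a purely hypothetical sketch; the particular MCA value shown (excluding Open MPI's vader shared-memory BTL) is only a placeholder and is not the setting recommended earlier in this thread.

```cmake
# Hypothetical illustration only: export an Open MPI MCA setting so that every
# test launched by this CTest -S driver inherits it. The value below is a
# placeholder, NOT the flags recommended in this thread.
set(ENV{OMPI_MCA_btl} "^vader")   # use all BTLs except the vader shared-memory BTL

ctest_test(PARALLEL_LEVEL 16)
```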
So with rerunning tests, any race condition (not just ones that show up with overloaded CUDA cards) can just be ignored by our PR system by getting lucky. This seems dangerous, but I guess it depends on how often race conditions get introduced. Personally, I'd rather try to catch the code before it gets into the develop branch rather than rely on post-push testing that could take a month to identify the problem. With the introduction of Kokkos, we've had some really nasty race conditions that were hard to track down and burned a lot of manpower. Once a bug is in develop, it potentially impacts other developers and downstream applications (with multiple teams spending time triaging). I realize we can't catch them all, and I think that your script for checking failures should be done regardless. Just trying to minimize broken code in the develop branch. Guess we can talk about this more at TUG.

Many of the random failures look like the hang @hkthorn showed. I've seen these in EMPIRE and local workstation testing. Thanks for that recommendation Heidi!

Side note: We use the CMake resource management tools in EMPIRE, but for unit testing allocations we purposely overload the cards with the resource tool. Unit tests are so small they have no chance of saturating a GPU, so we allocate 3 MPI processes to each CUDA card during unit testing. This brought the EMPIRE unit test time on CUDA down from 3 hours (using CUDA resources with one MPI process per CUDA card) to 1 hour 10 minutes (with 3 MPI procs per CUDA card).
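As a concrete illustration of that over-subscription trick, here is a hedged sketch using CTest's resource-allocation feature; the test name, executable, and slot counts are invented, and this is not EMPIRE's or Trilinos' actual configuration.

```cmake
# Hedged sketch (CMake >= 3.16): a 3-rank unit test where each rank claims one
# "gpus" slot, declared in a package's CMakeLists.txt. Names are invented.
add_executable(my_unit_tests my_unit_tests.cpp)
add_test(NAME MyPackage_SmallUnitTest_MPI_3
         COMMAND mpiexec -np 3 $<TARGET_FILE:my_unit_tests>)
set_tests_properties(MyPackage_SmallUnitTest_MPI_3 PROPERTIES
  RESOURCE_GROUPS "3,gpus:1"   # three resource groups (one per rank), one slot each
  PROCESSORS 3)

# A resource spec file that deliberately gives each card 3 slots, so up to
# three of these 1-slot ranks get packed onto one GPU:
#   { "version": { "major": 1, "minor": 0 },
#     "local": [ { "gpus": [ { "id": "0", "slots": 3 },
#                            { "id": "1", "slots": 3 } ] } ] }
# It is passed with 'ctest --resource-spec-file resources.json' or via the
# RESOURCE_SPEC_FILE argument of ctest_test() in a CTest -S driver.
```

Note that CTest only does the bookkeeping; each test still has to read the CTEST_RESOURCE_GROUP_* environment variables it is handed and bind its ranks to the corresponding cards (e.g., by setting CUDA_VISIBLE_DEVICES).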
See #11391 for an introduction of the resource manager into PR testing. I've never gotten around to diagnosing the STK unit test failure, but we should probably do that too (what you said about EMPIRE seemed to track with Trilinos testing as well in terms of time reduction).
Yes, I'll schedule work for it ASAP. I'll have to research and understand which features we're explicitly enabling, and I'd like to compare them with SIERRA's launch options for curiosity's sake.
@rppawlo, but if it is a rare race condition, it will most likely not cause a test failure in the PR that introduces the problem and it will get merged anyway. So you will not even detect it until some time later (like what happened with that Kokkos CUDA Scan bug years ago).
The CDash monitoring process I describe above can look at active PRs before they are even merged to 'develop'. And it will do a better job than humans (because it will actually be done instead of being ignored). But to be clear, the majority of people are just not manually looking at and reporting these random test failures in their PRs. They are just setting
Trilinos could do that as well based on the test categories BASIC, CONTINUOUS, NIGHTLY, and HEAVY, but that would require Trilinos tests to be marked correctly based on time and resource usage. (But we would need profiling to know how many resources each test uses in order to know if it should be marked BASIC.)
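For context, this is roughly what such markings look like with TriBITS's tribits_add_test() CATEGORIES argument; the test names and settings below are invented, not real Trilinos tests.

```cmake
# Hedged sketch of category markings with TriBITS (names and settings invented).
# A cheap test intended to run in every PR testing iteration:
tribits_add_test(MyFastUnitTest
  NUM_MPI_PROCS 4
  CATEGORIES BASIC
  STANDARD_PASS_OUTPUT)

# A long-running test intended only for nightly and heavy testing, not PRs:
tribits_add_test(MyExpensiveConvergenceTest
  NUM_MPI_PROCS 16
  CATEGORIES NIGHTLY HEAVY
  STANDARD_PASS_OUTPUT)
```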
Here is another random timeout taking out an entire set of Trilinos PR builds (building every Trilinos package and test suite) :-( This time, it was the test:
in the build:
Yeah, it is on my list. :(
Sorry, wasn't trying to nag, just adding information.
The PR iteration #12442 (comment) also experienced a random test timeout in Piro_TempusSolver_SensitivitySinCosUnitTests_MPI_4.
Not a problem. It is just a little frustrating when these tests run in under 10 seconds (in optimized mode) on my laptop, but in a debug build on a platform under load they run longer than 10 minutes!
This should be fixed with #12484.
Not a timeout, but here is another random test failure bringing down an entire PR test iteration:
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. |
CC: @ccober6, @sebrowne, @trilinos/framework
Description
Anecdotal evidence seems to suggest that random test failures, including random timeouts, are bringing down PR testing iterations fairly regularly. When this happens, all of the builds need to be run again from scratch, wasting computing resources, blocking PR testing iterations for other PRs, and delaying the merge of PRs.
For example, this query over the last two months suggests that random test timeouts took out PR testing iterations for the following PRs:
Note that the test timeouts for PRs #12050 and #12297 shown in that query don't appear to be random. Filtering out those PRs yields this reduced query over the last two months, which shows the 7 randomly failing tests:
NOTE: Further analysis would be needed to confirm that all of these tests were random timeouts. But I believe that a tool could be written to automatically determine if a timeout (or any test failure) was random. It would actually not be that hard to do.
Suggested solution
The simple solution would seem to be for the ctest -S driver to just rerun the failing tests again, in serial, to avoid the timeouts. For example, CTest directly supports this with the `--repeat after-timeout:<n>` command-line argument and the `ctest_test()` argument `REPEAT AFTER_TIMEOUT:<n>`.
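To make the suggestion concrete, here is a minimal CTest -S driver sketch showing where that option would go. It is a hedged illustration, not the actual Trilinos PR driver: the paths, generator, parallel level, and retry count are placeholders, and REPEAT requires CMake/CTest 3.17 or newer.

```cmake
# Minimal CTest -S driver sketch (placeholders, not the real Trilinos PR driver).
cmake_minimum_required(VERSION 3.17)  # REPEAT support in ctest_test()

set(CTEST_SOURCE_DIRECTORY "$ENV{WORKSPACE}/Trilinos")  # hypothetical workspace layout
set(CTEST_BINARY_DIRECTORY "$ENV{WORKSPACE}/build")
set(CTEST_CMAKE_GENERATOR  "Ninja")

ctest_start(Experimental)
ctest_configure()
ctest_build()

# If a test fails only because it timed out, give it up to 2 total attempts
# before reporting it as failed; ordinary failures are still reported right away.
ctest_test(PARALLEL_LEVEL 16 REPEAT AFTER_TIMEOUT:2)

ctest_submit()
```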