Teuchos: TeuchosCore_RCP_PerformanceTests_basic_MPI_1 test randomly failing maxRcpRawObjAccessRatio check #13728
This has been failing randomly (but infrequently) for quite some time. I first noticed and mentioned it in #11921 (comment), which correlated with the change to the default …

Edit: added a note that the observed random occurrence is infrequent.
#6429 was a different check failing within this test. However, #8648 does appear to be an exact duplicate of this issue. The fix seems clear, but it looks like I failed to follow up on it (see #8648 (comment)). (There have been a lot of random failures like this over the years.)

Just an FYI to improve this and future such bug reports: you want to use the "Test Output" filter to better analyze random failures like this. The best query to show the current failure is this query (click "Show Matching Output"), which includes the matching test output regex …
The query that does not contain the matching filter for the output given above shows 18 tests. If you filter out the tests that match the output regex …
You can verify that by running this query, which greps the output for …
FYI: I ran:
which created the following GitHub Issue text …

Next Action Status

Description

As shown in this query (click "Show Matching Output" in the upper right), the tests:
in the unique GenConfig builds:
started failing on testing day 2024-10-01. The specific set of CDash builds impacted were:
<Add details about what is failing and what the failures look like. Make sure to include strings that are easy to match with GitHub Issue searches.>

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

See:

If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.
FYI, if you run this query (click "Show Matching Output"), which includes the matching test output regex …
So if we increase maxRcpRawObjAccessRatio …
…8648, trilinos#13728)

That should be high enough to avoid every random failure of this check ever observed in Trilinos PR testing. It is debatable if a test such as this should be run in all builds or in just dedicated performance builds. (The default timing ratios are very loose.) We just want to make sure these tests are not broken in every build so that this test will be able to run in performance builds.

Signed-off-by: Roscoe A. Bartlett <[email protected]>
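To make the fix concrete, here is a minimal standalone sketch of the kind of raw-vs-smart-pointer timing-ratio check involved and why loosening the maximum ratio reduces random failures on loaded machines. This is not the actual Teuchos test code; the names (maxSmartRawObjAccessRatio, timeAccessLoop) and loop structure are illustrative assumptions.

```cpp
// Hypothetical sketch (not the real Teuchos test): time object access through
// a raw pointer and through a smart pointer, then require that the measured
// ratio stays below a configurable maximum, analogous to maxRcpRawObjAccessRatio.
#include <chrono>
#include <cstdio>
#include <memory>

// Hypothetical threshold; the fixing PR raises the real test's default so that
// normal machine-load noise no longer trips the check.
static const double maxSmartRawObjAccessRatio = 100.0;  // deliberately loose

struct Obj { int x = 0; };

template <typename Ptr>
double timeAccessLoop(const Ptr& p, long iters) {
  const auto t0 = std::chrono::steady_clock::now();
  long sum = 0;
  for (long i = 0; i < iters; ++i) sum += p->x + static_cast<int>(i);
  const auto t1 = std::chrono::steady_clock::now();
  std::printf("checksum=%ld\n", sum);  // keep the loop from being optimized away
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  const long iters = 10000000;
  Obj obj;
  Obj* rawPtr = &obj;
  std::shared_ptr<Obj> smartPtr(&obj, [](Obj*) { /* non-owning */ });

  const double rawTime   = timeAccessLoop(rawPtr, iters);
  const double smartTime = timeAccessLoop(smartPtr, iters);
  const double ratio = smartTime / rawTime;

  std::printf("ratio = %g (max allowed = %g)\n", ratio, maxSmartRawObjAccessRatio);
  // A busy PR test node can inflate 'ratio' even when nothing is wrong, which
  // is why a too-tight maximum fails randomly; a looser maximum still catches
  // tests that are badly broken.
  return (ratio <= maxSmartRawObjAccessRatio) ? 0 : 1;
}
```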
@achauphan and @ndellingwood, the fixing PR is #13729. Please review and approve.
Bug Report
@trilinos/teuchos
Description
The TeuchosCore_RCP_PerformanceTests_basic_MPI_1 test has been unstable and randomly failing across multiple observed PRs. In such cases, this test is the only failing result across the entire set of PR builds and is usually unrelated to any of the changes made in those PRs. The result of this test randomly failing and taking down an otherwise-passing set of builds is a developer adding a RETEST label in hopes of the test passing, wasting resources.

A general query of all instances of this test passing (in the last 4 months):
(See more refined query filter below).
Of those failures, here are a few examples where this was the only failing test and where it flipped to a passing result after a retest on the same merge commit hash:
The above were cases where the merge commit hash between builds was the same and the test result flipped between runs. There are also suspected cases where this test is randomly failing; however, due to other changes in the PR between builds, the merge commit hash changes, and we cannot conclude that the test is randomly failing based only on the merge commit hash. There are likely many such examples, but here is one suspected case.
NOTE: There is no easy way in CDash to directly identify that two PR builds from a set of PR builds were tested against the same SHA of the branch under test. One way is to look at the configure output for each build and observe the merge commit SHA printed by TriBITS. Identifying two sets of builds tested on the same SHA, where a test failed in one and the same test passed in the other, indicates a randomly failing test.
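As a rough illustration of that manual comparison, here is a small sketch that scans two saved configure-output text files for 40-character hex strings (git SHAs) and reports whether the first SHA found in each matches. This is a hypothetical helper, not an existing Trilinos/TriBITS or CDash tool, and it assumes the merge commit SHA is the first SHA-like string in the saved output.

```cpp
// Hypothetical helper: compare the first git-SHA-looking string found in two
// saved configure-output files, to check whether two builds used the same
// merge commit. Illustrative only; not part of Trilinos or TriBITS.
#include <fstream>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>

static std::string firstSha(const std::string& path) {
  std::ifstream in(path);
  const std::string text((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
  static const std::regex shaRe("[0-9a-f]{40}");  // full 40-char git SHA
  std::smatch m;
  return std::regex_search(text, m, shaRe) ? m.str() : std::string();
}

int main(int argc, char** argv) {
  if (argc != 3) {
    std::cerr << "usage: compare_sha <configure_out_1> <configure_out_2>\n";
    return 2;
  }
  const std::string a = firstSha(argv[1]);
  const std::string b = firstSha(argv[2]);
  std::cout << "build 1 SHA: " << a << "\nbuild 2 SHA: " << b << "\n";
  std::cout << ((!a.empty() && a == b) ? "same merge commit SHA\n"
                                       : "different (or missing) SHA\n");
  return 0;
}
```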
Example configure output: