Framework: Switch CUDA AT2 build to be non-UVM and enable tests #13439
The CUDA tests look good, with four exceptions, detailed here: https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=211376 @trilinos/intrepid2 I see that the failing test was previously set to RUN_SERIAL for CUDA builds; I can do the same here if that's still what we want. If any developers from the tagged teams can provide insight into the four failing tests (they fail reliably), it would be much appreciated! I can turn them off, but I wanted to do at least some due diligence and see what the community thinks. |
Yes, please. The |
@cgcgcg - would you mind taking a look at the panzer/mini-em failure here? Looks to be a linear solver issue similar to what you have fixed in the past. |
I see this message in the output of the failing Stratimikos and Panzer tests:
|
That looks like issues with |
@sebrowne Do we set |
We do not. I did do some debuggery and that particular error went away when I disabled the |
New results with the Kokkos option, my disabling of the smcuda BTL, and running the Intrepid2 test serially: https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=217579 I'm seeing the same tests fail (except for the Intrepid2 one), but perhaps in a more straightforward way? I see NaN errors from Belos. |
@sebrowne Thanks for adding the option. Seems like the message went away. I'll have another look to see what's wrong. |
Yes, it does, but I'm going to leave standing up the UVM build "for real" for a later date. I've disabled X11 for the non-UVM build here. |
CUDA 11 had internal compiler errors for these four source files for the container with 11.4.2 installed. Note that it's now LESS necessary to have the CUDA11 block in config-specs.ini, but I left it so that we can re-evaluate the need to run some tests serially, and whether or not to disable the ROL test, when we move to CUDA12. Signed-off-by: Samuel E. Browne <[email protected]>
I want to increase the testing slots per GPU from 2, since I think we're underutilizing our testing resources. The AT2 GPU machines only have 2 A100 GPUs each, so I want to try out using something more like 8 slots per GPU. Signed-off-by: Samuel E. Browne <[email protected]>
Values chosen:
- Max of 56 build cores: machines have 56 physical cores
- 112 test parallelism: machines have 56 x 2-threaded cores
- 8 slots per GPU: I was manually testing out different values to see if I could get the GPUs to both be 100% utilized every time I checked during a test suite, and even this many slots seemed to leave some wiggle room.
Signed-off-by: Samuel E. Browne <[email protected]>
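For anyone checking the arithmetic behind these values, here is a minimal sketch. The `MachineResources` class and its field names are hypothetical and only illustrate the numbers quoted in the commit messages (56 physical cores, 2 threads per core, 2 A100 GPUs, 8 slots per GPU). It does not reflect the actual Trilinos configuration files or how the test driver allocates GPU slots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineResources:
    """Hypothetical description of one AT2 GPU machine (names are illustrative)."""
    physical_cores: int
    threads_per_core: int
    gpus: int
    slots_per_gpu: int

    @property
    def build_cores(self) -> int:
        # Build parallelism is capped at the number of physical cores.
        return self.physical_cores

    @property
    def test_parallelism(self) -> int:
        # Test parallelism counts every hardware thread.
        return self.physical_cores * self.threads_per_core

    @property
    def total_gpu_slots(self) -> int:
        # Number of GPU-using tests that may run at once across all GPUs.
        return self.gpus * self.slots_per_gpu


# Values quoted in the commit messages above.
old = MachineResources(physical_cores=56, threads_per_core=2, gpus=2, slots_per_gpu=2)
new = MachineResources(physical_cores=56, threads_per_core=2, gpus=2, slots_per_gpu=8)

print(f"build cores:      {new.build_cores}")
print(f"test parallelism: {new.test_parallelism}")
print(f"GPU slots:        {old.total_gpu_slots} -> {new.total_gpu_slots}")
```

Under these assumptions, going from 2 to 8 slots per GPU raises the number of concurrent GPU test slots per machine from 4 to 16, while the CPU-side test parallelism stays at 112.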
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection is Not Necessary for this Pull Request. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING
Test Names: PR_gcc-openmpi-openmp, PR_gcc, PR_gcc-openmpi_debug, PR_clang, Trilinos_PR_python3, PR_cuda, PR_intel, PR_cuda-uvm
Pull Request Author: sebrowne |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED. Pull Request Auto Testing has PASSED
Test Names: PR_gcc-openmpi-openmp, PR_gcc, PR_gcc-openmpi_debug, PR_clang, Trilinos_PR_python3, PR_cuda, PR_intel, PR_cuda-uvm |
Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
This is now ready for review and merge. |
Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ achauphan ]! |
Status Flag 'Pull Request AutoTester' - Pull Request will be Automerged |
Merge on Pull Request# 13439: IS A SUCCESS - Pull Request successfully merged |
@trilinos/framework
Motivation
Want to align the CUDA AT2 build with the old AutoTester one.
Related Issues
https://sems-atlassian-son.sandia.gov/jira/browse/TRILFRAME-673