{2023.06}[foss/2023a] CUDA 12.1.1 (rebuild) #720

Merged on Sep 25, 2024 (4 commits)

casparvl
Collaborator

@casparvl casparvl commented Sep 18, 2024

eessi-bot bot commented Sep 18, 2024

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software

eessi-bot bot commented Sep 18, 2024

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat

Instance boegel-bot-deucalion is configured to build for:

  • architectures: aarch64/a64fx
  • repositories: eessi.io-2023.06-software

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 18, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

eessi-bot bot commented Sep 18, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot
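
The "expanded format" lines above suggest that the bot maps the short argument names `repo:`, `arch:` and `accel:` to `repository:`, `architecture:` and `accelerator:`. A minimal sketch of that mapping, assuming nothing about how the bot actually implements it (the function name `expand_bot_command` is hypothetical):

```shell
# Hypothetical helper, not the bot's real code: rewrite the short argument
# prefixes of a bot command into their long forms.
expand_bot_command() {
    local word out=""
    for word in "$@"; do
        case "${word}" in
            repo:*)  word="repository:${word#repo:}" ;;
            arch:*)  word="architecture:${word#arch:}" ;;
            accel:*) word="accelerator:${word#accel:}" ;;
        esac
        # join the words with single spaces
        out="${out:+${out} }${word}"
    done
    printf '%s\n' "${out}"
}

expand_bot_command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
# prints: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
```

Arguments that already use the long form pass through unchanged, since they do not start with one of the short prefixes.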

eessi-bot bot commented Sep 18, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_720/18982

date job status comment
Sep 18 19:34:51 UTC 2024 submitted job id 18982 awaits release by job manager
Sep 18 19:35:14 UTC 2024 released job awaits launch by Slurm scheduler
Sep 18 19:42:22 UTC 2024 running job 18982 is running
Sep 18 19:43:25 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-18982.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Sep 18 19:43:25 UTC 2024 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-18982.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

== Running parse hook for CUDA-12.1.1.eb...
== CUDA/12.1.1 is already installed (module found), skipping
== No easyconfigs left to be built.
== Build succeeded for 0 out of 0

We'll probably need to put it in a rebuild easystack file, since the module already exists in the CPU prefix and EasyBuild therefore skips the installation.
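
To illustrate why the build is skipped: with both the accel prefix and the CPU prefix on the `MODULEPATH`, a lookup for `CUDA/12.1.1` already succeeds via the CPU prefix. The sketch below is a simplified stand-in that only scans directories for a `.lua` file; EasyBuild performs its own check through the modules tool, and the helper name `find_module_file` and the temporary directory layout are assumptions made for the demo.

```shell
# Hypothetical helper: print the first <name>/<version>.lua found on MODULEPATH.
find_module_file() {
    local module_name="$1" dir
    local IFS=':'
    # unquoted on purpose, so the value is split on ':'
    for dir in ${MODULEPATH:-}; do
        if [ -f "${dir}/${module_name}.lua" ]; then
            printf '%s\n' "${dir}/${module_name}.lua"
            return 0
        fi
    done
    return 1
}

# Demo layout (made up): the module only exists in the CPU prefix.
demo_root=$(mktemp -d)
mkdir -p "${demo_root}/zen2/modules/all/CUDA" \
         "${demo_root}/zen2/accel/nvidia/cc80/modules/all"
touch "${demo_root}/zen2/modules/all/CUDA/12.1.1.lua"

# Overrides MODULEPATH for this demo shell only.
MODULEPATH="${demo_root}/zen2/accel/nvidia/cc80/modules/all:${demo_root}/zen2/modules/all"
find_module_file CUDA/12.1.1   # prints the path under the CPU prefix
```

Because the lookup succeeds, a plain install sees the module as present, which matches the "already installed (module found), skipping" message in the log.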

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 18, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

eessi-bot bot commented Sep 18, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot

eessi-bot bot commented Sep 18, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_720/18985

date job status comment
Sep 18 19:54:43 UTC 2024 submitted job id 18985 awaits release by job manager
Sep 18 19:55:30 UTC 2024 released job awaits launch by Slurm scheduler
Sep 18 19:56:32 UTC 2024 running job 18985 is running
Sep 18 19:57:33 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-18985.out
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Sep 18 19:57:33 UTC 2024 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-18985.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

Removal fails: we also have to update the EESSI-remove-software.sh script to make sure the CPU prefix is on the MODULEPATH as well, i.e. it needs this section from EESSI-install-software.sh:

# if an accelerator target is specified, we need to make sure that the CPU-only modules are also still available
if [ ! -z ${EESSI_ACCELERATOR_TARGET} ]; then
    CPU_ONLY_MODULES_PATH=$(echo $EASYBUILD_INSTALLPATH | sed "s@/accel/${EESSI_ACCELERATOR_TARGET}@@g")/modules/all
    if [ -d ${CPU_ONLY_MODULES_PATH} ]; then
        module use ${CPU_ONLY_MODULES_PATH}
    else
        fatal_error "Derived path to CPU-only modules does not exist: ${CPU_ONLY_MODULES_PATH}"
    fi
fi
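
As a quick check of what the `sed` expression in that section produces, here is a small sketch that wraps the same substitution in a function (the name `cpu_only_modules_path` is made up for illustration) and applies it to the accel install path that appears in this PR's job logs:

```shell
# Hypothetical wrapper around the substitution used in the section above:
# strip "/accel/<target>" from the install path and append modules/all.
cpu_only_modules_path() {
    local install_path="$1" accel_target="$2"
    printf '%s/modules/all\n' \
        "$(echo "${install_path}" | sed "s@/accel/${accel_target}@@g")"
}

cpu_only_modules_path \
    /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80 \
    nvidia/cc80
# prints: /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all
```

Using `@` as the `sed` delimiter avoids having to escape the slashes in the accelerator target.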

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 18, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

eessi-bot bot commented Sep 18, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot

eessi-bot bot commented Sep 18, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_720/18986

date job status comment
Sep 18 20:23:20 UTC 2024 submitted job id 18986 awaits release by job manager
Sep 18 20:23:38 UTC 2024 released job awaits launch by Slurm scheduler
Sep 18 20:24:40 UTC 2024 running job 18986 is running
Sep 18 20:25:41 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-18986.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Sep 18 20:25:41 UTC 2024 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-18986.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

casparvl commented Sep 18, 2024

Ouch, the EESSI-remove-software.sh script is really too naive for this new setup:

for app in ${rebuild_apps}; do
    app_dir=${EASYBUILD_INSTALLPATH}/software/${app}
    app_module=${EASYBUILD_INSTALLPATH}/modules/all/${app}.lua
    echo_yellow "Removing ${app_dir} and ${app_module}..."
    rm -rf ${app_dir}
    rm -rf ${app_module}

It simply assumes that the module lives under EASYBUILD_INSTALLPATH, so this is what it tries to remove:

Removing /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software/CUDA/12.1.1 and /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all/CUDA/12.1.1.lua...

We could make that more intelligent, but note that this only matters in the specific case where we want to rebuild something that currently lives in the CPU prefix into the accel prefix.
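
One possible shape for "more intelligent", as a rough sketch only: before removing anything, work out which prefix actually holds the module by checking the accel prefix first and falling back to the CPU prefix. The function name `locate_module_prefix` and the temporary demo layout are hypothetical. The sketch only reports a location; it deletes nothing and does not decide whether the CPU copy should be touched at all.

```shell
# Hypothetical helper: print the prefix (accel or CPU) that contains the
# module file for the given app, or return 1 if neither has it.
locate_module_prefix() {
    local app="$1" install_path="$2" accel_target="$3"
    # CPU prefix = install path with a trailing /accel/<target> stripped
    local cpu_prefix="${install_path%/accel/${accel_target}}"
    local prefix
    for prefix in "${install_path}" "${cpu_prefix}"; do
        if [ -f "${prefix}/modules/all/${app}.lua" ]; then
            printf '%s\n' "${prefix}"
            return 0
        fi
    done
    return 1
}

# Demo layout (made up): CUDA is only installed in the CPU prefix.
demo=$(mktemp -d)
accel_prefix="${demo}/zen2/accel/nvidia/cc80"
mkdir -p "${accel_prefix}/modules/all" "${demo}/zen2/modules/all/CUDA"
touch "${demo}/zen2/modules/all/CUDA/12.1.1.lua"

locate_module_prefix CUDA/12.1.1 "${accel_prefix}" nvidia/cc80   # prints ${demo}/zen2
```

A removal step could then build `app_dir` and `app_module` from the reported prefix instead of always using `EASYBUILD_INSTALLPATH`.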

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 18, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

eessi-bot bot commented Sep 18, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot

eessi-bot bot commented Sep 18, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_720/18988

date job status comment
Sep 18 21:02:17 UTC 2024 submitted job id 18988 awaits release by job manager
Sep 18 21:02:47 UTC 2024 released job awaits launch by Slurm scheduler
Sep 18 21:07:49 UTC 2024 running job 18988 is running
Sep 18 21:08:50 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-18988.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Sep 18 21:08:50 UTC 2024 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-18988.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 18, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

eessi-bot bot commented Sep 18, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot bot commented Sep 19, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

eessi-bot bot commented Sep 19, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot

eessi-bot bot commented Sep 19, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_720/19104

date job status comment
Sep 19 07:14:54 UTC 2024 submitted job id 19104 awaits release by job manager
Sep 19 07:15:38 UTC 2024 released job awaits launch by Slurm scheduler
Sep 19 07:16:40 UTC 2024 running job 19104 is running

@casparvl
Collaborator Author

Weird, it seems consistent here, but not in the other PRs...

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 19, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

eessi-bot bot commented Sep 19, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot

eessi-bot bot commented Sep 19, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_720/19110

date job status comment
Sep 19 08:23:35 UTC 2024 submitted job id 19110 awaits release by job manager
Sep 19 08:24:01 UTC 2024 released job awaits launch by Slurm scheduler
Sep 19 08:25:06 UTC 2024 running job 19110 is running

@bedroge
Collaborator

bedroge commented Sep 23, 2024

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 23, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from bedroge

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

eessi-bot bot commented Sep 23, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from bedroge

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot bot commented Sep 23, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_720/19569

date job status comment
Sep 23 07:30:13 UTC 2024 submitted job id 19569 awaits release by job manager
Sep 23 07:30:39 UTC 2024 released job awaits launch by Slurm scheduler
Sep 23 07:36:43 UTC 2024 running job 19569 is running
Sep 23 08:15:37 UTC 2024 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-19569.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1727077925.tar.gz
size: 2067 MiB (2167679201 bytes)
entries: 5518
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all:
  CUDA/12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software:
  CUDA/12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80:
  no other files in tarball
Sep 23 08:15:37 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19569.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@ocaisa
Member

ocaisa commented Sep 24, 2024

@casparvl I've included this PR in #735, where I symlink to a host CUDA without accel/nvidia/cc80 in the path. Including that subdirectory in the path is just duplication and would require additional changes in our GPU scripts.

@boegel boegel added the 2023.06-software.eessi.io label (2023.06 version of software.eessi.io) Sep 25, 2024
@boegel boegel changed the title from "{2023.06}[foss/2023a] CUDA 12.1.1" to "{2023.06}[foss/2023a] CUDA 12.1.1 (rebuild)" Sep 25, 2024
@bedroge bedroge merged commit dfe71c7 into EESSI:2023.06-software.eessi.io Sep 25, 2024
35 checks passed

eessi-bot bot commented Sep 25, 2024

PR merged! Moved ['/project/def-users/SHARED/jobs/2024.09/pr_720/18982', '/project/def-users/SHARED/jobs/2024.09/pr_720/18985', '/project/def-users/SHARED/jobs/2024.09/pr_720/18986', '/project/def-users/SHARED/jobs/2024.09/pr_720/18988', '/project/def-users/SHARED/jobs/2024.09/pr_720/18989', '/project/def-users/SHARED/jobs/2024.09/pr_720/19103', '/project/def-users/SHARED/jobs/2024.09/pr_720/19104', '/project/def-users/SHARED/jobs/2024.09/pr_720/19110', '/project/def-users/SHARED/jobs/2024.09/pr_720/19569'] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2024.09.25

PR merged! Moved [] to $HOME/trash_bin/EESSI/software-layer/2024.09.25

eessi-bot bot commented Sep 25, 2024

PR merged! Moved [] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2024.09.25

Labels
2023.06-software.eessi.io (2023.06 version of software.eessi.io), accel:nvidia
4 participants