Parallelization of WRF #9

Open
Peter9192 opened this issue Jul 2, 2024 · 8 comments

Comments

@Peter9192
Contributor

I was looking into the different options for parallelizing WRF; I always find it quite confusing.

From what I understand now, it works like this:

  • MPI is used for multiprocessing (DMPAR).
    The domain is decomposed into "patches". Each patch is assigned its own process (an MPI "task").
    Processes can be distributed across multiple nodes, and each process can use multiple CPU cores.
    Controlled in Slurm by --ntasks.
  • OpenMP is used for multithreading (SMPAR).
    Each "patch" is further split up into "tiles". Each tile gets its own thread.
    Threads share the same memory.
    Controlled in Slurm by --cpus-per-task.

The reason to use MPI is that each patch has fewer grid cells to process, i.e. shorter loops, i.e. faster execution. However, the overhead for communication between patches increases with the number of patches.
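As a side note, the decomposition WRF actually ended up using can be read back from the rank-0 log after a run. A minimal sketch (the exact log wording may differ between WRF versions):

```bash
# Inspect the decomposition reported in WRF's rank-0 log
grep -i "ntasks in x" rsl.out.0000        # MPI process grid: how the domain is split into patches
grep -i "number of tiles" rsl.out.0000    # OpenMP tiles per patch
```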

From this, I constructed the following test script:

#!/bin/bash
#SBATCH --job-name=wrf_experiment     # Job name
#SBATCH --partition=thin              # Request thin partition. Up to one hour goes to fast queue
#SBATCH --time=1:00:00                # Maximum runtime (D-HH:MM:SS)
#SBATCH --nodes=1                     # Number of nodes (one thin node has 128 cpu cores)
#SBATCH --ntasks=64                   # Number of tasks per node / number of patches in the domain - parallelized with MPI / DMPAR / multiprocessing
#SBATCH --cpus-per-task=2             # Number of CPU cores per task / number of tiles within each patch - parallelized with OpenMP / SMPAR / multithreading

# Note: ntasks * cpus-per-task should not exceed the total available cores on the requested nodes
# 64*2 = 128 exactly fits on one thin node.

# Load dependencies
module load 2023
module load netCDF-Fortran/4.6.1-gompi-2023a  # also loads gcc and gompi
module load Python/3.11.3-GCCcore-12.3.0

export NETCDF=$(nf-config --prefix)

# Each process can do multithreading but limited to the number of cpu cores allocated to each process
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

mpiexec $HOME/wrf-model/WRF/run/wrf.exe

Then, I did a few small sensitivity tests with my current test case. Here are the results:

1 thin node of 128 cores

| n_tasks | cpus_per_task | Timing for main on D01 after completing 1 minute | Comment |
|---------|---------------|---------------------------------------------------|---------|
| 16 | 8 | 128 s. | First try 'evenly' |
| 8 | 16 | 162 s. | Less DMPAR, more SMPAR |
| 128 | 1 | - | Crashed: <10 grid cells per patch in y-direction (7); see the note below |
| 64 | 2 | 43 s. | Apparently DMPAR scales better than SMPAR for this case |
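Regarding the crash with 128 tasks: WRF factors the number of tasks into an nproc_x × nproc_y grid (roughly square by default), and each patch needs at least ~10 grid cells in each direction. A rough back-of-the-envelope check, with placeholder domain sizes (not the actual domain used here):

```bash
# Hypothetical numbers, only to illustrate why 128 tasks was too many for this domain
e_sn=120                      # placeholder: number of north-south grid points
nproc_y=16                    # 128 tasks factored as e.g. 8 x 16
echo $(( e_sn / nproc_y ))    # ~7 rows per patch -> below the ~10-cell minimum, so WRF aborts
```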
@Peter9192
Contributor Author

Some more statistics (after finishing the job, i.e. 1 hour simulated):

Nodes: 1
Cores per node: 128
CPU Utilized: 1-11:25:40
CPU Efficiency: 49.18% of 3-00:02:08 core-walltime
Job Wall-clock time: 00:33:46
Memory Utilized: 51.87 GB
Memory Efficiency: 23.16% of 224.00 GB

So there's still room for improvement :-)

@Peter9192
Contributor Author

I just realized that I performed the above tests with WRF compiled for dmpar only, so it makes sense that DMPAR scaled better. I should try again with compile option 35 (dm+sm) instead of 34 (dmpar).
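For reference, the recompile would look roughly like this (a sketch; the option numbers are how the GNU entries appear in the configure list here and may differ elsewhere):

```bash
cd $HOME/wrf-model/WRF
./clean -a       # wipe the previous dmpar-only build and its configure.wrf
./configure      # pick 35 (GNU dm+sm) instead of 34 (GNU dmpar)
./compile em_real >& compile.log
```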

@Peter9192
Contributor Author

Peter9192 commented Aug 6, 2024

New set of tests with WRF compiled for DMPAR + SMPAR and the reference setup for high-res Amsterdam

Using the Rome partition with 128 cores per node

| nodes | n_tasks | cpus_per_task | Timing for main on D01 after completing 10 seconds | Comment |
|-------|---------|---------------|-----------------------------------------------------|---------|
| 1 | 16 | 8 | 145 s. | |
| 1 | 8 | 16 | 193 s. / 144 s. | |
| 1 | 128 | 1 | - | crashed due to too small domain |
| 1 | 1 | 128 | - | terribly slow: after 5 minutes it had not yet solved 2 seconds of simulation time |
| 1 | 64 | 2 | 48 s. | |
| 1/2 | 64 | 1 | 46 s. | faster without multithreading than with only two threads |
| 4 | 64 | 8 | 68 s. / 56 s. | more threads don't seem to scale at all |
| 8 | 64 | 16 | 91 s. | indeed, more threads don't help |
| 1/4 | 24 | 1 | 95 s. / 86 s. | sharing a node; upper range of the sweet spot according to https://forum.mmm.ucar.edu/threads/choosing-an-appropriate-number-of-processors.5082/ |
| 1 | 4 | 32 | 66 s. | using `--map-by node:PE=$OMP_NUM_THREADS --rank-by core` (see the launch sketch below) |
| 1 | 32 | 4 | 21 s. | idem |
| 1 | 64 | 2 | 25 s. | idem |
| 2 | 64 | 4 | 18 s. | idem |
| 2 | 32 | 8 | 19 s. | idem |
| 4 | 32 | 16 | 18 s. | idem |
| 4 | 64 | 8 | 21 s. | idem |
| 8 | 64 | 16 | 18 s. | idem |
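For the rows using `--map-by`, the launch line in the job script looked roughly like this (a sketch of the Open MPI invocation; modules and wrf.exe path as in the job script above):

```bash
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# Give each MPI rank a block of PE=$OMP_NUM_THREADS cores so its OpenMP threads stay on neighbouring cores
mpiexec --map-by node:PE=$OMP_NUM_THREADS --rank-by core $HOME/wrf-model/WRF/run/wrf.exe
```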

@Peter9192
Contributor Author

Note: the above timings are for the first timestep. After that, the simulation speeds up, and subsequent timings for main are roughly two times faster. So we're at roughly 1:1 simulation/run time.

@Peter9192
Contributor Author

New benchmarks, now also comparing with the Intel compilers and on Genoa nodes for a change

GNU compilers (like above) / Genoa with 24 tasks / 8 cpus per task (1 node) - no mapping/binding/ranking
Timing for main: time 2019-07-23_06:00:10 on domain   1:   64.65104 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   49.12214 elapsed seconds

GNU compilers (like above) Genoa with 24 tasks / 8 cpus per task (1 node) - `--map-by node:PE=$OMP_NUM_THREADS --rank-by core`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.50725 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.32626 elapsed seconds

GNU compilers (like above) Genoa with 24 tasks / 16 cpus per task (2 node) - `--map-by node:PE=$OMP_NUM_THREADS --rank-by core`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   21.65084 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   10.15081 elapsed seconds

Intel compilers Genoa with 24 tasks / 8 cpus per task (1 node) - no mapping/binding/ranking
Timing for main: time 2019-07-23_06:00:10 on domain   1:   16.87352 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    7.99886 elapsed seconds

Intel compilers Genoa with 24 tasks / 16 cpus per task (2 nodes) - no extra specifiers
Timing for main: time 2019-07-23_06:00:10 on domain   1:   40.71106 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   25.24836 elapsed seconds

Intel compilers Genoa with 24 tasks / 16 cpus per task (2 nodes) - `--ppn $((SLURM_NTASKS / SLURM_JOB_NUM_NODES))`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   40.94793 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   28.25645 elapsed seconds

Intel compilers Genoa with 24 tasks / 16 cpus per task (2 nodes) - `export I_MPI_PIN_DOMAIN=omp`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   33.41033 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   34.02773 elapsed seconds

Intel compilers Genoa with 24 tasks / 8 cpus per task (1 node) - `export I_MPI_PIN_DOMAIN=omp`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   16.24445 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    7.98943 elapsed seconds

The Intel version doesn't seem to be much faster out of the box, nor does it scale better without further tuning. It also seems to respond less well to my tweaking attempts with --ppn and domain pinning.
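For completeness, the Intel MPI tweaks above amount to something like this (a sketch; assumes an Intel-compiled wrf.exe in the working directory):

```bash
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export I_MPI_PIN_DOMAIN=omp    # one pinning domain of OMP_NUM_THREADS cores per rank
# Spread the ranks evenly over the allocated nodes
mpirun --ppn $((SLURM_NTASKS / SLURM_JOB_NUM_NODES)) ./wrf.exe
```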

@Peter9192
Contributor Author

Running with srun instead of mpirun/mpiexec might map to the hardware automatically and perhaps better...
https://nrel.github.io/HPC/blog/2021-06-18-srun/#3-why-not-just-use-mpiexecmpirun
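The srun variant of the job script body then looks roughly like this (a sketch; modules and NETCDF setup as in the earlier script):

```bash
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores        # pin each OpenMP thread to its own core
export OMP_PROC_BIND=spread    # or close/true; see the timings below
srun --cpus-per-task=${SLURM_CPUS_PER_TASK} $HOME/wrf-model/WRF/run/wrf.exe
```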

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 node) - srun
Timing for main: time 2019-07-23_06:00:10 on domain   1:   22.77393 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   10.48143 elapsed seconds

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 node) - srun - OMP_PLACES=cores and OMP_PROC_BIND=spread
Timing for main: time 2019-07-23_06:00:10 on domain   1:   15.32052 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.33800 elapsed seconds

GNU compilers - Genoa with 24 tasks / 8 cpus per task (1 node) - srun - OMP_PLACES=cores and OMP_PROC_BIND=spread
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.41277 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.80804 elapsed seconds

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=true
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.27320 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.14655 elapsed seconds

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=close
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.33385 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.20183 elapsed seconds

OMP_PLACES=sockets gives bad performance; `threads` is similar to `cores`. OMP_PROC_BIND=master is terribly slow.

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=true and --ntasks-per-core=1 -n $SLURM_NTASKS
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.58144 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.84186 elapsed seconds

GNU compilers - Genoa with 24 tasks / 4 cpus per task (1/2 node) - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   18.84828 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   10.00387 elapsed seconds
--> So going from 4 to 8 OpenMP threads per task does improve performance (but within a node).

GNU compilers - Genoa with 48 tasks / 8 cpus per task (2 nodes) - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   12.61810 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    4.67543 elapsed seconds

GNU compilers - Genoa with 2 nodes / 24 tasks per node / 8 cpus per task (2 nodes) - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   12.06588 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    4.35846 elapsed seconds

GNU compilers - Rome with 3 nodes / 16 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   17.40340 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    7.67457 elapsed seconds
--> Same number of cores but spread over more nodes is slower.

Intel compilers - Genoa with 2 nodes / 24 tasks per node / 8 cpus per task - srun
Failed to launch
Fixed by `export I_MPI_OFI_PROVIDER=mlx` as per [this suggestion](https://community.intel.com/t5/Intel-MPI-Library/Unable-to-run-with-Intel-MPI-on-any-fabric-setting-except-TCP/m-p/1408609)
Very slow / doesn't reach timings within one minute of job

Intel compilers - Genoa with 2 nodes / 24 tasks per node / 8 cpus per task - srun with KMP_AFFINITY=compact
Very slow / doesn't reach timings within one minute of job

@Peter9192
Contributor Author

GNU compilers - Genoa with 3 nodes / 24 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   11.59579 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    4.05039 elapsed seconds

GNU compilers - Genoa with 4 nodes / 24 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   11.19707 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    3.87675 elapsed seconds

GNU compilers - Genoa with 2 nodes / 48 tasks per node / 4 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   12.77392 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    4.69613 elapsed seconds

GNU compilers - Genoa with 1 nodes / 96 tasks per node / 2 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   15.96459 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    7.38751 elapsed seconds

@Peter9192
Contributor Author

Conclusions so far

  • WRF scales much better within one compute node than across nodes
  • For optimal MPI+OpenMP performance, the affinity needs tweaking; so far the best results were obtained with:
    • mpiexec with --map-by node:PE=$OMP_NUM_THREADS --rank-by core
    • srun with export OMP_PLACES=cores
  • So far I haven't had much success with the Intel compilers
  • Best setups so far (a job-script sketch follows below):
    • Overall efficiency: Genoa - 1 node - 24 tasks - 8 cpus/task runs 6:10 (real time : simulated time) (a 48/4 split is similar)
    • Fastest run: Genoa - 4 nodes - 96 tasks - 8 cpus/task runs just below 4:10
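Putting the most efficient settings together, the job script would look roughly like this (a sketch based on the numbers above; the genoa partition name and its 192 cores per node are assumptions about this system):

```bash
#!/bin/bash
#SBATCH --job-name=wrf_experiment
#SBATCH --partition=genoa            # assumed partition name; Genoa nodes here have 192 cores
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24         # MPI tasks (patches)
#SBATCH --cpus-per-task=8            # OpenMP threads (tiles) per task; 24 * 8 = 192 cores

module load 2023
module load netCDF-Fortran/4.6.1-gompi-2023a

export NETCDF=$(nf-config --prefix)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores

srun --cpus-per-task=${SLURM_CPUS_PER_TASK} $HOME/wrf-model/WRF/run/wrf.exe
```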

Peter9192 added a commit that referenced this issue Aug 15, 2024
* modify jobscript with faster configuration

* Explored some more options

* Update jobscript with findings from #9

* add script used for intel compiled wrf
Peter9192 added a commit that referenced this issue Aug 22, 2024
* modify jobscript with faster configuration

* Explored some more options

* Update wrf-runner

* Update jobscript with findings from #9

* add script used for intel compiled wrf

* gitignore output

* Make it work

* one is enough as long as you don't make a typo

* docs + by default, don't run wrf

* move things around

* Suggestions from code review