Parallelization of WRF #9

Open
Peter9192 opened this issue Jul 2, 2024 · 8 comments

Comments

@Peter9192
Contributor

I was looking into the different options for parallelizing WRF; I always find it quite confusing.

From what I understand now, it works like this:

  • MPI is used for multiprocessing (DMPAR).
    The domain is decomposed into "patches". Each patch is assigned its own process (an MPI "task").
    Processes can be distributed across multiple nodes, and each process can use multiple CPU cores.
    Controlled in Slurm by --ntasks.
  • OpenMP is used for multithreading (SMPAR).
    Each "patch" is further split up into "tiles". Each tile gets its own thread.
    Threads share the same memory.
    Controlled in Slurm by --cpus-per-task.

The reason to use MPI is that each patch has fewer grid cells to process, i.e. shorter loops, i.e. faster execution. However, the overhead for communication between patches increases with the number of patches.
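As a side note, the decomposition WRF actually ended up using can be read back from the rank-0 log after a run. A minimal sketch (the exact log wording may differ between WRF versions):

```bash
# Inspect the decomposition reported in WRF's rank-0 log
grep -i "ntasks in x" rsl.out.0000        # MPI process grid: how the domain is split into patches
grep -i "number of tiles" rsl.out.0000    # OpenMP tiles per patch
```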

From this, I constructed the following test script:

#!/bin/bash
#SBATCH --job-name=wrf_experiment     # Job name
#SBATCH --partition=thin              # Request thin partition. Up to one hour goes to fast queue
#SBATCH --time=1:00:00                # Maximum runtime (D-HH:MM:SS)
#SBATCH --nodes=1                     # Number of nodes (one thin node has 128 cpu cores)
#SBATCH --ntasks=64                   # Number of tasks per node / number of patches in the domain - parallelized with MPI / DMPAR / multiprocessing
#SBATCH --cpus-per-task=2             # Number of CPU cores per task / number of tiles within each patch - parallelized with OpenMP / SMPAR / multithreading

# Note: ntasks * cpus-per-task should not exceed the total available cores on the requested nodes
# 64*2 = 128 exactly fits on one thin node.

# Load dependencies
module load 2023
module load netCDF-Fortran/4.6.1-gompi-2023a  # also loads gcc and gompi
module load Python/3.11.3-GCCcore-12.3.0

export NETCDF=$(nf-config --prefix)

# Each process can do multithreading but limited to the number of cpu cores allocated to each process
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

mpiexec $HOME/wrf-model/WRF/run/wrf.exe

Then, I did a few small sensitivity tests with my current test case. Here are the results:

1 thin node of 128 cores

| n_tasks | cpus_per_task | Timing for main on D01 after completing 1 minute | Comment |
|---------|---------------|---------------------------------------------------|---------|
| 16 | 8 | 128 s. | First try 'evenly' |
| 8 | 16 | 162 s. | Less DMPAR, more SMPAR |
| 128 | 1 | - | Crashed: <10 grid cells per patch in y-direction (7); see the note below |
| 64 | 2 | 43 s. | Apparently DMPAR scales better than SMPAR for this case |
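Regarding the crash with 128 tasks: WRF factors the number of tasks into an nproc_x × nproc_y grid (roughly square by default), and each patch needs at least ~10 grid cells in each direction. A rough back-of-the-envelope check, with placeholder domain sizes (not the actual domain used here):

```bash
# Hypothetical numbers, only to illustrate why 128 tasks was too many for this domain
e_sn=120                      # placeholder: number of north-south grid points
nproc_y=16                    # 128 tasks factored as e.g. 8 x 16
echo $(( e_sn / nproc_y ))    # ~7 rows per patch -> below the ~10-cell minimum, so WRF aborts
```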
@Peter9192
Contributor Author

Some more statistics (after finishing the job, i.e. 1 hour simulated):

Nodes: 1
Cores per node: 128
CPU Utilized: 1-11:25:40
CPU Efficiency: 49.18% of 3-00:02:08 core-walltime
Job Wall-clock time: 00:33:46
Memory Utilized: 51.87 GB
Memory Efficiency: 23.16% of 224.00 GB

So there's still room for improvement :-)

@Peter9192
Contributor Author

I just realized that I performed the above tests with WRF compiled for dmpar only, so it makes sense that DMPAR scaled better. I should try again with compile option 35 (dm+sm) instead of 34 (dmpar).
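For reference, the recompile would look roughly like this (a sketch; the option numbers are how the GNU entries appear in the configure list here and may differ elsewhere):

```bash
cd $HOME/wrf-model/WRF
./clean -a       # wipe the previous dmpar-only build and its configure.wrf
./configure      # pick 35 (GNU dm+sm) instead of 34 (GNU dmpar)
./compile em_real >& compile.log
```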

@Peter9192
Contributor Author

Peter9192 commented Aug 6, 2024

New set of tests with WRF compiled for DMPAR + SMPAR and the reference setup for high-res Amsterdam

Using the Rome partition with 128 cores per node

| nodes | n_tasks | cpus_per_task | Timing for main on D01 after completing 10 seconds | Comment |
|-------|---------|---------------|-----------------------------------------------------|---------|
| 1 | 16 | 8 | 145 s. | |
| 1 | 8 | 16 | 193 s. / 144 s. | |
| 1 | 128 | 1 | - | crashed due to too small domain |
| 1 | 1 | 128 | - | terribly slow: after 5 minutes it had not yet solved 2 seconds of simulation time |
| 1 | 64 | 2 | 48 s. | |
| 1/2 | 64 | 1 | 46 s. | faster without multithreading than with only two threads |
| 4 | 64 | 8 | 68 s. / 56 s. | more threads don't seem to scale at all |
| 8 | 64 | 16 | 91 s. | indeed, more threads don't help |
| 1/4 | 24 | 1 | 95 s. / 86 s. | sharing a node; upper range of the sweet spot according to https://forum.mmm.ucar.edu/threads/choosing-an-appropriate-number-of-processors.5082/ |
| 1 | 4 | 32 | 66 s. | using `--map-by node:PE=$OMP_NUM_THREADS --rank-by core` (see the launch sketch below) |
| 1 | 32 | 4 | 21 s. | idem |
| 1 | 64 | 2 | 25 s. | idem |
| 2 | 64 | 4 | 18 s. | idem |
| 2 | 32 | 8 | 19 s. | idem |
| 4 | 32 | 16 | 18 s. | idem |
| 4 | 64 | 8 | 21 s. | idem |
| 8 | 64 | 16 | 18 s. | idem |
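For the rows using `--map-by`, the launch line in the job script looked roughly like this (a sketch of the Open MPI invocation; modules and wrf.exe path as in the job script above):

```bash
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# Give each MPI rank a block of PE=$OMP_NUM_THREADS cores so its OpenMP threads stay on neighbouring cores
mpiexec --map-by node:PE=$OMP_NUM_THREADS --rank-by core $HOME/wrf-model/WRF/run/wrf.exe
```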

@Peter9192
Contributor Author

Note: the above timings are for the first timestep. After that, the simulation speeds up, and subsequent timings for main are roughly two times faster. So we're at roughly 1:1 simulation/run time.

@Peter9192
Contributor Author

New benchmarks, now also comparing with the Intel compilers and on Genoa nodes for a change

GNU compilers (like above) / Genoa with 24 tasks / 8 cpus per task (1 node) - no mapping/binding/ranking
Timing for main: time 2019-07-23_06:00:10 on domain   1:   64.65104 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   49.12214 elapsed seconds

GNU compilers (like above) Genoa with 24 tasks / 8 cpus per task (1 node) - `--map-by node:PE=$OMP_NUM_THREADS --rank-by core`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.50725 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.32626 elapsed seconds

GNU compilers (like above) Genoa with 24 tasks / 16 cpus per task (2 node) - `--map-by node:PE=$OMP_NUM_THREADS --rank-by core`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   21.65084 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   10.15081 elapsed seconds

Intel compilers Genoa with 24 tasks / 8 cpus per task (1 node) - no mapping/binding/ranking
Timing for main: time 2019-07-23_06:00:10 on domain   1:   16.87352 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    7.99886 elapsed seconds

Intel compilers Genoa with 24 tasks / 16 cpus per task (2 nodes) - no extra specifiers
Timing for main: time 2019-07-23_06:00:10 on domain   1:   40.71106 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   25.24836 elapsed seconds

Intel compilers Genoa with 24 tasks / 16 cpus per task (2 nodes) - `--ppn $((SLURM_NTASKS / SLURM_JOB_NUM_NODES))`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   40.94793 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   28.25645 elapsed seconds

Intel compilers Genoa with 24 tasks / 16 cpus per task (2 nodes) - `export I_MPI_PIN_DOMAIN=omp`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   33.41033 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   34.02773 elapsed seconds

Intel compilers Genoa with 24 tasks / 8 cpus per task (1 node) - `export I_MPI_PIN_DOMAIN=omp`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   16.24445 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    7.98943 elapsed seconds

The Intel version doesn't seem to be much faster out of the box, nor does it scale better without further tuning. It also seems to respond less well to my tweaking attempts with --ppn and domain pinning.
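For completeness, the Intel MPI tweaks above amount to something like this (a sketch; assumes an Intel-compiled wrf.exe in the working directory):

```bash
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export I_MPI_PIN_DOMAIN=omp    # one pinning domain of OMP_NUM_THREADS cores per rank
# Spread the ranks evenly over the allocated nodes
mpirun --ppn $((SLURM_NTASKS / SLURM_JOB_NUM_NODES)) ./wrf.exe
```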

@Peter9192
Contributor Author

Running with srun instead of mpirun/mpiexec might map to the hardware automatically and perhaps better...
https://nrel.github.io/HPC/blog/2021-06-18-srun/#3-why-not-just-use-mpiexecmpirun
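The srun variant of the job script body then looks roughly like this (a sketch; modules and NETCDF setup as in the earlier script):

```bash
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores        # pin each OpenMP thread to its own core
export OMP_PROC_BIND=spread    # or close/true; see the timings below
srun --cpus-per-task=${SLURM_CPUS_PER_TASK} $HOME/wrf-model/WRF/run/wrf.exe
```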

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 node) - srun
Timing for main: time 2019-07-23_06:00:10 on domain   1:   22.77393 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   10.48143 elapsed seconds

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 node) - srun - OMP_PLACES=cores and OMP_PROC_BIND=spread
Timing for main: time 2019-07-23_06:00:10 on domain   1:   15.32052 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.33800 elapsed seconds

GNU compilers - Genoa with 24 tasks / 8 cpus per task (1 node) - srun - OMP_PLACES=cores and OMP_PROC_BIND=spread
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.41277 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.80804 elapsed seconds

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=true
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.27320 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.14655 elapsed seconds

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=close
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.33385 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.20183 elapsed seconds

OMP_PLACES=sockets gives bad performance; `threads` is similar to `cores`. OMP_PROC_BIND=master is terribly slow.

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=true and --ntasks-per-core=1 -n $SLURM_NTASKS
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.58144 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.84186 elapsed seconds

GNU compilers - Genoa with 24 tasks / 4 cpus per task (1/2 node) - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   18.84828 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   10.00387 elapsed seconds
--> So going from 4 to 8 OpenMP threads per task does improve performance (but within a node).

GNU compilers - Genoa with 48 tasks / 8 cpus per task (2 nodes) - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   12.61810 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    4.67543 elapsed seconds

GNU compilers - Genoa with 2 nodes / 24 tasks per node / 8 cpus per task (2 nodes) - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   12.06588 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    4.35846 elapsed seconds

GNU compilers - Rome with 3 nodes / 16 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   17.40340 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    7.67457 elapsed seconds
--> Same number of cores but spread over more nodes is slower.

Intel compilers - Genoa with 2 nodes / 24 tasks per node / 8 cpus per task - srun
Failed to launch
Fixed by `export I_MPI_OFI_PROVIDER=mlx` as per [this suggestion](https://community.intel.com/t5/Intel-MPI-Library/Unable-to-run-with-Intel-MPI-on-any-fabric-setting-except-TCP/m-p/1408609)
Very slow / doesn't reach timings within one minute of job

Intel compilers - Genoa with 2 nodes / 24 tasks per node / 8 cpus per task - srun with KMP_AFFINITY=compact
Very slow / doesn't reach timings within one minute of job

@Peter9192
Contributor Author

GNU compilers - Genoa with 3 nodes / 24 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   11.59579 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    4.05039 elapsed seconds

GNU compilers - Genoa with 4 nodes / 24 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   11.19707 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    3.87675 elapsed seconds

GNU compilers - Genoa with 2 nodes / 48 tasks per node / 4 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   12.77392 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    4.69613 elapsed seconds

GNU compilers - Genoa with 1 nodes / 96 tasks per node / 2 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   15.96459 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    7.38751 elapsed seconds

@Peter9192
Contributor Author

Conclusions so far

  • WRF scales much better within one compute node than across nodes
  • For optimal MPI+OpenMP performance, the affinity needs tweaking; so far the best results were obtained with:
    • mpiexec with --map-by node:PE=$OMP_NUM_THREADS --rank-by core
    • srun with export OMP_PLACES=cores
  • So far I haven't had much success with the Intel compilers
  • Best setups so far (a job-script sketch follows below):
    • Overall efficiency: Genoa - 1 node - 24 tasks - 8 cpus/task runs 6:10 (real time : simulated time) (a 48/4 split is similar)
    • Fastest run: Genoa - 4 nodes - 96 tasks - 8 cpus/task runs just below 4:10
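Putting the most efficient settings together, the job script would look roughly like this (a sketch based on the numbers above; the genoa partition name and its 192 cores per node are assumptions about this system):

```bash
#!/bin/bash
#SBATCH --job-name=wrf_experiment
#SBATCH --partition=genoa            # assumed partition name; Genoa nodes here have 192 cores
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24         # MPI tasks (patches)
#SBATCH --cpus-per-task=8            # OpenMP threads (tiles) per task; 24 * 8 = 192 cores

module load 2023
module load netCDF-Fortran/4.6.1-gompi-2023a

export NETCDF=$(nf-config --prefix)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores

srun --cpus-per-task=${SLURM_CPUS_PER_TASK} $HOME/wrf-model/WRF/run/wrf.exe
```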

Peter9192 added a commit that referenced this issue Aug 15, 2024
* modify jobscript with faster configuration

* Explored some more options

* Update jobscript with findings from #9

* add script used for intel compiled wrf
Peter9192 added a commit that referenced this issue Aug 22, 2024
* modify jobscript with faster configuration

* Explored some more options

* Update wrf-runner

* Update jobscript with findings from #9

* add script used for intel compiled wrf

* gitignore output

* Make it work

* one is enough as long as you don't make a typo

* docs + by default, don't run wrf

* move things around

* Suggestions from code review