Parallelization of WRF #9
Some more statistics (after finishing the job, i.e. 1 hour simulated):

- Nodes: 1
- Cores per node: 128
- CPU Utilized: 1-11:25:40
- CPU Efficiency: 49.18% of 3-00:02:08 core-walltime
- Job Wall-clock time: 00:33:46
- Memory Utilized: 51.87 GB
- Memory Efficiency: 23.16% of 224.00 GB

So there's still room for improvement :-)
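For reference, these figures have the same shape as the report from Slurm's `seff` utility; assuming that's where they come from, the same summary (plus the raw accounting fields behind it) can be pulled for any finished job:

```bash
# Efficiency summary for a completed job (seff ships with Slurm's contrib tools)
seff <jobid>

# Raw accounting fields behind those numbers
sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,NCPUS,MaxRSS,ReqMem
```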
I just realized that I performed the above tests with WRF compiled for dmpar only, so it makes sense that it scaled better. I should try again with compile option 35 instead of 34.
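A hedged sketch of that rebuild, assuming a recent WRF source tree and the GNU toolchain (where option 34 is usually dmpar and 35 dm+sm, i.e. MPI + OpenMP; the exact numbers depend on the configure menu of your WRF version):

```bash
cd WRF
./clean -a                             # drop the previous dmpar-only build
./configure                            # pick the dm+sm option from the menu (35 here)
./compile em_real -j 8 >& compile.log  # rebuilds wrf.exe and real.exe
```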
New set of tests with WRF compiled for DMPAR + SMPAR and the reference setup for high-res Amsterdam, using the rome partition (128 cores per node).
Note: the above timings are for the first timestep. After that, the simulation speeds up, and subsequent timings for "main" are roughly two times faster. So we're at around a 1:1 ratio of simulated time to run time.
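Those per-step numbers come from the "Timing for main" lines WRF writes to its rsl files; a quick way to inspect them (assuming a standard run directory, where rank 0 writes rsl.error.0000):

```bash
# Per-step timings as reported by MPI rank 0 (covers all domains)
grep "Timing for main" rsl.error.0000 | head

# Rough mean elapsed seconds per step, ignoring the slow first step
grep "Timing for main" rsl.error.0000 | tail -n +2 \
  | awk '{sum += $(NF-2); n++} END {if (n) print sum/n, "s/step"}'
```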
New benchmarks, now also comparing with the Intel compilers and running on Genoa nodes for a change.
The Intel version doesn't seem to be much faster out of the box, nor does it scale better without further tuning. It also seems to respond less well to my tweaking attempts with -ppn and domain pinning.
Running with srun instead of mpirun/mpiexec might map the processes to the hardware automatically, and possibly better...
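For comparison, a hedged sketch of the two launch styles; the task counts, the -ppn value, and the binding/distribution flags are placeholders to experiment with, not settings confirmed in this thread:

```bash
# Intel MPI style: rank count and ranks-per-node given explicitly
mpirun -np 64 -ppn 32 ./wrf.exe

# Slurm style: srun takes the geometry from the job allocation and
# handles placement/binding itself
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
srun --cpu-bind=cores --distribution=block:block ./wrf.exe
```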
Conclusions so far
* modify jobscript with faster configuration
* Explored some more options
* Update wrf-runner
* Update jobscript with findings from #9
* add script used for intel compiled wrf
* gitignore output
* Make it work
* one is enough as long as you don't make a typo
* docs + by default, don't run wrf
* move things around
* Suggestions from code review
I was looking into the different options for parallelizing WRF; I always find it quite confusing.
From what I understand now, it works like this:

* The domain is decomposed into "patches". Each patch is assigned its own process (an MPI "task").
  * Each process can use multiple CPU cores, and the processes can be distributed across multiple nodes.
  * Controlled in Slurm via `--ntasks`.
* Each patch is further split up into "tiles". Each tile gets its own thread.
  * Threads share the same memory.
  * Controlled in Slurm via `--cpus-per-task`.
The reason to use MPI is that each patch has fewer grid cells to process, i.e. shorter loops, i.e. faster execution. However, the overhead of communication between patches increases with the number of patches.
From this, I constructed the following test script:
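A minimal sketch of such a jobscript, assuming a standard WRF run directory and with the task/thread counts as placeholders to sweep over:

```bash
#!/bin/bash
#SBATCH --job-name=wrf-scaling
#SBATCH --partition=rome            # 128-core nodes, as in the tests above
#SBATCH --nodes=1
#SBATCH --ntasks=32                 # MPI tasks = patches (placeholder, vary this)
#SBATCH --cpus-per-task=4           # threads per task = tiles (placeholder, vary this)
#SBATCH --time=01:00:00

# One OpenMP thread per core reserved for each task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

cd "$SLURM_SUBMIT_DIR"              # assumes wrf.exe and namelist.input are here

srun ./wrf.exe

# Afterwards, check the "Timing for main" lines in rsl.error.0000
# and `seff $SLURM_JOB_ID` for the efficiency numbers.
```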
Then I did a few small sensitivity tests with my current test case. Here are the results:
1 thin node of 128 cores