[BUG/ISSUE] Significant performance degradation on multiple platforms affecting GEOS-Chem Classic and GCHP #57
Comments
Moving the entire code of |
As @jimmielin suggested, it definitely appears to be related to OpenMP. Here is my timing test result for GCHP alpha.10 with OpenMP enabled and disabled. These were 2-week timing tests at C48 with 192 cores and gfortran 8.3.
Here is the same figure I showed above, but including this new timing test with OpenMP disabled. Note that the blue X corresponds to this new timing test with OpenMP disabled. Also note that the blue X is the same timing test as the solid red point below it. There is a 40-90% increase in speed if OpenMP is disabled. I think this confirms it's related to OpenMP! |
Interesting. I'll do my HEMCO standalone tests for both OpenMP enabled and disabled. |
I am seeing significant differences in run time in HEMCO standalone due to OpenMP. The differences across compilers are negligible in comparison. For all compilers, disabling OpenMP reduces run time. Wall times for a 1-month run:
All tests used HEMCO branch |
For a 6-hour simulation in GCHPctm with 30 cores on 1 node using gfortran 9.3 and Spack-built OpenMPI 4.0.5, disabling OpenMP yields a total run time of 14m14s. With OpenMP enabled, the same run has not finished after 1h40m of wall time (it is 3h40m into the 6-hour simulation). Judging from cursory glances at the pace of the run outside of the very slow emissions timesteps, my slowdown may not be unique to HEMCO. I'll have to rerun for more details, as Cannon is about to shut down. |
Has there been any more work into this issue? I am wondering if maybe using a combination of |
I have been looking into this. It seems that testing whether the pointers LevDct1 and LevDct2 are associated inside routine GetVertIndx can incur a big performance hit. The pointers are associated in GET_CURRENT_EMISSIONS, so they can be tested there, and logical arguments can be passed to GetVertIndx indicating whether the pointers are null or not. Similarly, the check on whether the PBLHEIGHT and BXHEIGHT fields are in the HcoState can be moved out of routine GetIdx, which is called on each (I,J). I am not sure if this breaks anything, though. I will do more testing/profiling. At first glance, the COLLAPSE does not bring as big a benefit as I thought, but cleaning up these pointer-association tests seems to be the way to get more performance here. |
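To illustrate the restructuring described in the comment above, here is a minimal sketch in C rather than Fortran. It is not HEMCO code: the Fortran ASSOCIATED() test is modeled as a NULL check, and every name in it (vert_index_checked, vert_index_flagged, total_levels_checked, total_levels_flagged, lev1, lev2, NX, NY) is hypothetical. The first variant tests the optional fields on every (i, j) call; the second tests them once in the caller and passes logical flags down, which is the pattern proposed for GET_CURRENT_EMISSIONS and GetVertIndx.

```c
#include <stddef.h>

#define NX 4   /* hypothetical grid size, for illustration only */
#define NY 3

/* Variant 1: the callee tests the optional level fields on every call,
   i.e. once per (i, j) grid cell. */
static int vert_index_checked(const double *lev1, const double *lev2, int idx)
{
    int lo = (lev1 != NULL) ? (int)lev1[idx] : 1;
    int hi = (lev2 != NULL) ? (int)lev2[idx] : lo;
    return hi - lo + 1;               /* number of levels spanned */
}

/* Variant 2: the caller has already done the tests and passes flags. */
static int vert_index_flagged(const double *lev1, const double *lev2,
                              int has1, int has2, int idx)
{
    int lo = has1 ? (int)lev1[idx] : 1;
    int hi = has2 ? (int)lev2[idx] : lo;
    return hi - lo + 1;
}

/* Sum of levels over the grid, with the tests inside the loop body. */
int total_levels_checked(const double *lev1, const double *lev2)
{
    int total = 0;
    /* The pragma is ignored when OpenMP is not enabled at compile time. */
    #pragma omp parallel for collapse(2) reduction(+:total)
    for (int j = 0; j < NY; ++j)
        for (int i = 0; i < NX; ++i)
            total += vert_index_checked(lev1, lev2, j * NX + i);
    return total;
}

/* Same sum, with the tests hoisted out of the parallel loop. */
int total_levels_flagged(const double *lev1, const double *lev2)
{
    const int has1 = (lev1 != NULL);  /* evaluated once, not per cell */
    const int has2 = (lev2 != NULL);
    int total = 0;
    #pragma omp parallel for collapse(2) reduction(+:total)
    for (int j = 0; j < NY; ++j)
        for (int i = 0; i < NX; ++i)
            total += vert_index_flagged(lev1, lev2, has1, has2, j * NX + i);
    return total;
}
```

Both variants return the same totals, so the change is purely structural. Whether it gives a measurable speedup depends on the compiler and flags, since some compilers may hoist such loop-invariant tests on their own. The sketch shows only the shape of the change, not a benchmark.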
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue. |
From comment above: "Also a similar check on whether the PBLHEIGHT and BXHEIGHT fields are in the HcoState can be moved out of routine GetIdx, which is called on each (I,J). I am not sure if this breaks anything though." It turns out this does break things in the new HEMCO intermediate grid update I am bringing in (GEOS-Chem PR 681). Initialization of The problem with this is that during initialization I'm considering two different possible fixes: |
Hi Lizzie - I think we can just use I think simply replacing with a call to |
Okay, I will put it back to I'm still wondering if we could move those |
Does anyone know if this OpenMP problem in HEMCO is still an issue in GEOS-Chem 13? @jimmielin, @WilliamDowns, @LiamBindle |
Sorry, I don't. I'm not aware of any updates on it though so I suspect it still exists. |
Has there been any update on this? I'm noticing when running MEGAN offline that it doesn't scale at all with CPUs. |
There appears to be a significant hit in GEOS-Chem Classic and GCHP performance on some platforms, particularly those using gfortran, stemming from somewhere in HEMCO. This issue has been observed by @lizziel, @jimmielin, @WilliamDowns, and myself. The four of us just finished a call discussing the issue and how to proceed. Some notes from that meeting are in the collapsed block below.
Zoom call notes
Recordings of the issue
Lizzie's internal benchmarks show that HEMCO's performance in GEOS-Chem Classic with gfortran has been deteriorating. Wall times for HEMCO in GEOS-Chem Classic:
In some GCHP timing tests that Lizzie and I ran a few weeks ago, I observed a very significant drop-off in GCHP's scaling (see figure below). Note that the solid lines are Lizzie's tests with ifort and the dashed lines are my tests with gfortran. The drop-off in performance was dominated by a big slowdown in HOCI_GC_RUN(...).
Further information
Action items
We will continue discussing in this thread.