Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dynamic_cast failures on MacOS: A consequence of Apple's Command Line Tools' update. #4014

Open
colegruninger97 opened this issue Nov 21, 2024 · 15 comments

Comments

@colegruninger97
Copy link

colegruninger97 commented Nov 21, 2024

Hello,

Just a heads up: after updating to MacOS command line tools 16.1, I encountered dynamic_cast failures in IBAMR (which uses libmesh to handle finite elements) when compiler optimizations are enabled. The issue occurs in the following function:

inline void
copy_and_synch(libMesh::NumericVector<double>& v_in,
               libMesh::NumericVector<double>& v_out,
               const bool close_v_in = true,
               const bool close_v_out = true)
{
#if defined(NDEBUG)
    auto v_in_petsc = static_cast<libMesh::PetscVector<double>*>(&v_in);
    auto v_out_petsc = static_cast<libMesh::PetscVector<double>*>(&v_out);
#else
    auto v_in_petsc = dynamic_cast<libMesh::PetscVector<double>*>(&v_in);
    auto v_out_petsc = dynamic_cast<libMesh::PetscVector<double>*>(&v_out);
    TBOX_ASSERT(v_in_petsc);
    TBOX_ASSERT(v_out_petsc);
#endif
    if (close_v_in) v_in.close();
    PetscErrorCode ierr = VecCopy(v_in_petsc->vec(), v_out_petsc->vec());
    IBTK_CHKERRQ(ierr);
    if (close_v_out) v_out.close();
}

The assertion error appears to be caused by a compiler bug associated with how the updated compiler handles virtual tables for shared libraries (see: https://stackoverflow.com/questions/79192304/macos-xcode-16-breaks-dynamic-cast-for-final-types-defined-in-shared-library).

Following the stackoverflow post, I found that removing the final keyword from the PetscVector class declaration in libmesh/include/numerics/petsc_vector.h resolves the issue. I suppose this issue will crop up whenever an object from a class marked as 'final' is dynamically cast with optimizations turned on and so removing the final keyword could provide a workaround until this compiler bug is fixed.

@jwpeterson
Copy link
Member

Thanks for the detailed analysis, that sounds like it would have been very annoying to debug.

I think I'd be in favor of dropping the final from class declarations as I've always been somewhat dubious whether that provides any optimization in practice, but @roystgnr added all those, so he may feel differently. The "right" thing to do would probably be to add a configure test that checks whether the compiler/toolchain has this bug, but that would also have its own issues since we'd have to both compile and run a small test code in order to detect the bug, and I don't currently have access to a machine like the one you described.

@roystgnr
Copy link
Member

Oh, what a travesty. We still have a libmesh_final macro, for backwards compatibility with really old application codes, but looking through our logs, it was all the way back in 2018 that we decided we'd given compilers enough time to support C++11 correctly and we switched all our internal uses to plain final.

I'm loathe to make our code uglier again because a three trillion dollar company can't be bothered to properly regression test one of their core products. And obviously from a users' perspective the right thing to do here is not "work around one bit of breakage but then cross our fingers and continue using the obviously-broken product", it's "downgrade until Apple finally notices and fixes their junk".

I hate leaving users to learn about this little mess on their own in their own applications, though. That's IBAMR code breaking above, right? Do we not hit anything in libMesh make check that triggers the same breakage? If not then we could at least add some unit test that runs when LIBMESH_HAVE_RTTI and screams about this bug if it fails.

Maybe there's something else reasonable we can do on our end? ... our current test for RTTI just tries to compile some typeid code and sees if that fails, but if we were to also check for runtime failures then we could disable RTTI in those cases too, and then any code that uses libMesh::cast_ptr or cast_ref (rather than reinventing them for some reason, as in the snippet above...) would just automatically keep working. The trouble is that it's hard to set up tests for runtime behavior without breaking cross-compiling, and I don't want to break that by accident. AFAIK cross-compiling to embedded systems is just about the only real use case where you actually want to workaround missing RTTI rather than just switching to a better compiler.

@roystgnr
Copy link
Member

I think I'd be in favor of dropping the final from class declarations as I've always been somewhat dubious whether that provides any optimization in practice, but @roystgnr added all those, so he may feel differently.

In theory it changes every self-call from a final class from "bounce off the vtable" to "direct function call, can even be inline". I'd hope compilers are taking advantage of that.

I could be talked into going back to libmesh_final, though, and putting some #if APPLE_VERSION_SUCKS #else #endif wrappers in the definition of that. I just don't know if we'd been doing anyone any favors by encouraging them not to downgrade upon finding themselves with a broken and inadequately tested compiler. A nullptr dereference is a nice clean doesn't-corrupt-data way for breakage to manifest; there are definitely much worse ways to discover optimizer problems.

@colegruninger97
Copy link
Author

Hi @roystgnr, Yes, I only experienced this issue using IBAMR. When I ran make check after configuring libmesh the first time, all of the tests passed and I didn't notice any problems when configuring. I think you're right that the appropriate course of action would be to downgrade Apple's command line tools until they fix the bug. Who knows how long that will be. I just wanted to send a "heads up" just in case anyone else runs into this issue.

When I have some spare time, I'd be happy to write short test mirroring the stackoverflow post I attached above to check for this compiler issue. Just let me know.

@roystgnr
Copy link
Member

When I have some spare time, I'd be happy to write short test mirroring the stackoverflow post I attached above to check for this compiler issue. Just let me know.

I'll put up a PR shortly, but I'll beg you to test it before we merge; thanks!

@roystgnr
Copy link
Member

Hmm... actually, is this something we can work around on our side, by moving the virtual NumericVector and PetscVector destructors to numeric_vector.C and petsc_vector.C instead of leaving them inline? Those StackOverflow answers are suggesting that XCode is now generating RTTI information in the compilation unit with the definition of the first non-inline virtual function and that the problem thus only occurs for classes with no non-inline virtual functions, and that can't be true here because NumericVector and PetscVector do have non-inline virtual functions, but maybe the bug is as simple as XCode 16 generating RTTI as soon as it sees any virtual function, including our inline ~PetscVector?

On top of the test I'll try a workaround on this theory; you could look at both the test before the workaround and the test afterward to see if we get lucky here.

@roystgnr
Copy link
Member

I gave up on the workaround (too many classes that could possibly be affected depending on how accurate my vague understanding of the problem is), but if you could give the unit test in #4015 a try I'd appreciate it.

@colegruninger97
Copy link
Author

@roystgnr I tried out the unit tests implemented in #4015 and, surprisingly, they all pass. Perhaps the issue is specifically related to the "final" keyword.

`./unit_tests-opt --re Casting
Will run the following tests:
DistributedVectorTest::testCasting
LaspackVectorTest::testCasting
PetscVectorTest::testCasting
--- Running 3 tests in total.
...

OK (3 tests)`

@roystgnr
Copy link
Member

Okay, that's disturbing. There's nothing different here with regards to the final keyword; I did not remove it from any classes. Your code is failing when it tries to dynamic_cast<libMesh::PetscVector<double>*>(libMesh::NumericVector<double>* foo), and yet that's all the unit test does too.

When I have some spare time, I'd be happy to write short test mirroring the stackoverflow post

No rush, but I'm afraid I want to take you up on this now.

@colegruninger97
Copy link
Author

colegruninger97 commented Dec 9, 2024

Hi @roystgnr, sorry for the delay, but I was able to reproduce the error observed in the original stack overflow post. Here is another relevant github post which seems to be addressing the same issue: RobotLocomotion/drake#22204

@roystgnr
Copy link
Member

roystgnr commented Dec 9, 2024

Any chance you could come up with a unit test that reproduces it? I'm not sure why my attempt didn't.

Have you tried that --copt=-fno-assume-unique-vtables option?

@colegruninger97
Copy link
Author

colegruninger97 commented Dec 10, 2024

Hi @roystgnr, Sure! However, it would be helpful if you could provide me with some guidance regarding how unit tests are generally set up. I can't promise I'll be able to get to this right away, but I should be able to get to it in the next couple of days or so.

I tried using the flag --copt=-fno-assume-unique-vtables and it does fix the dynamic_cast issue.

@roystgnr
Copy link
Member

You need cppunit (either in a system path or in paths specified via configure --with-cppunit-include=foo --with-cppunit-lib=bar) to compile our unit tests, which live under the tests subdirectory. make check runs the whole swath of them right after running submodule tests and before running example programs, but LIBMESH_OPTIONS='--re Foo' make -C tests check is useful for restricting that run to just the subset of unit tests matching Foo.

@colegruninger97
Copy link
Author

Hi @roystgnr, thanks! I have cppunit and I already used it to the tests you asked me to try above. I was more curious about the workflow for writing a test and if there is anything more nuanced that I should be aware of.

@roystgnr
Copy link
Member

Oh, of course you did. Sorry; cppunit is probably our optional package with the highest ratio of "usefulness for developers" over "likelihood of already being installed" so I go into my "make sure you installed cppunit" spiel at the drop of a hat.

The only other hassle likely to apply to you is that if your test doesn't "belong" in an existing group of tests then you'll need to modify tests/Makefile.am and re-run bootstrap before automake will be convinced you want to use the new .C file you've put it in, and if you then want to put all that in a PR we want the bootstrap to be done with our "blessed" version of autotools and its output to be in a separate commit from "real" output. But if you want to base the test around the NumericVector hierarchy and you can get it to work better than my attempt did then you can put it in the same existing files for those classes, and my failed PR is probably a good guide to the few gotchas (like forgetting to register the test) that sometimes bite new cppunit test authors.

The hassle that applies to most users writing new tests is that we run our tests in a zillion different configurations, and it's easy to write something that breaks if Number==Complex or PETSc is disabled or AMR is disabled or Mesh is a DistributedMesh on 11 processors or some such, but I don't think any of that is likely to hit you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants