Use PyNVML 12 #345

Merged
merged 3 commits into from
Dec 20, 2024

jakirkham
Member

Bump pynvml from 11 to 12. This version of pynvml also now depends on nvidia-ml-py for core functionality.
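
A quick way to confirm what an environment actually resolved after this bump is to query the installed distribution metadata. The sketch below is only an illustration: the helper names `major_version` and `installed_major` are hypothetical, and it assumes the distributions are published under the names `pynvml` and `nvidia-ml-py` as written above.

```python
import re
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer component of a version string."""
    match = re.match(r"\d+", version)
    if match is None:
        raise ValueError(f"no leading integer in version {version!r}")
    return int(match.group(0))


def installed_major(dist_name: str):
    """Return the installed major version of a distribution, or None if absent."""
    try:
        return major_version(metadata.version(dist_name))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names taken from the PR description; the result is None
    # for any distribution that is not installed in this environment.
    for name in ("pynvml", "nvidia-ml-py"):
        print(name, installed_major(name))
```

Running it in the test environment prints the major version of each package, or `None` when a package is missing.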

copy-pr-bot bot commented Dec 19, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@jakirkham jakirkham added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Dec 19, 2024
@jakirkham jakirkham requested a review from rjzamora December 19, 2024 04:21
@jakirkham
Member Author

/ok to test

@jakirkham jakirkham marked this pull request as ready for review December 20, 2024 07:35
@jakirkham jakirkham requested a review from a team as a code owner December 20, 2024 07:35
@jakirkham jakirkham requested a review from jameslamb December 20, 2024 07:35
Member

@jameslamb jameslamb left a comment

One wheel-tests job is failing with segfaults.

parser.c:2305 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
parser.c:1339 UCX ERROR Invalid value for SEG_SIZE: 'invalid-size'. Expected: memory units: [b|kb|mb|gb], "inf", or "auto"
Fatal Python error: Segmentation fault

Full stack trace:
Fatal Python error: Segmentation fault

Thread 0x00007f215df596c0 (most recent call first):
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/ucxx/_lib_async/notifier_thread.py", line 37 in _notifierThread
  File "/pyenv/versions/3.10.16/lib/python3.10/threading.py", line 953 in run
  File "/pyenv/versions/3.10.16/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/pyenv/versions/3.10.16/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f216eea8b80 (most recent call first):
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/ucxx/_lib_async/endpoint.py", line 34 in _finalizer
  File "/pyenv/versions/3.10.16/lib/python3.10/weakref.py", line 591 in __call__
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/ucxx/_lib_async/listener.py", line 209 in _listener_handler_coroutine
  File "/pyenv/versions/3.10.16/lib/python3.10/asyncio/events.py", line 80 in _run
  File "/pyenv/versions/3.10.16/lib/python3.10/asyncio/base_events.py", line 1909 in _run_once
  File "/pyenv/versions/3.10.16/lib/python3.10/asyncio/base_events.py", line 603 in run_forever
  File "/pyenv/versions/3.10.16/lib/python3.10/asyncio/base_events.py", line 636 in run_until_complete
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pytest_asyncio/plugin.py", line 906 in inner
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/python.py", line 1792 in runtest
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pytest_asyncio/plugin.py", line 440 in runtest
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/runner.py", line 341 in from_call
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/main.py", line 350 in pytest_runtestloop
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/main.py", line 325 in _main
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/main.py", line 271 in wrap_session
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/main.py", line 318 in pytest_cmdline_main
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/config/__init__.py", line 169 in main
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/_pytest/config/__init__.py", line 192 in console_main
  File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/pytest/__main__.py", line 5 in <module>
  File "/pyenv/versions/3.10.16/lib/python3.10/runpy.py", line 86 in _run_code
  File "/pyenv/versions/3.10.16/lib/python3.10/runpy.py", line 196 in _run_module_as_main

(build link)

Could that be a result of these changes?
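
The layout of the trace above (a `Fatal Python error` line followed by per-thread frames listed "most recent call first") matches what Python's standard `faulthandler` module prints. As a minimal sketch for reproducing that format locally, the helper below dumps the current threads on demand; the function name `capture_traceback_dump` is hypothetical and not part of ucxx or its test suite.

```python
import faulthandler
import tempfile


def capture_traceback_dump() -> str:
    """Dump the stacks of all threads via faulthandler and return the text."""
    # faulthandler writes straight to the file descriptor, so a real file
    # (not an in-memory StringIO) is required.
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()


if __name__ == "__main__":
    print(capture_traceback_dump())
```

Calling `faulthandler.enable()` early in a process installs the handlers that emit the same kind of dump automatically on a segmentation fault.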

@jakirkham
Member Author

Given that the previous commit passed and the last one simply merges in the latest from branch-25.02, I don't think it is related.

@jakirkham
Member Author

@pentschev does the error above in James' comment look familiar to you?

@pentschev
Member

> parser.c:2305 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
> parser.c:1339 UCX ERROR Invalid value for SEG_SIZE: 'invalid-size'. Expected: memory units: [b|kb|mb|gb], "inf", or "auto"

Those are normal when we test invalid arguments; they are errors that we expect to catch.

> Fatal Python error: Segmentation fault
>
> does the error above in James' comment look familiar to you?

That is unfortunately a known issue. I thought it had been fixed a while ago and had not seen this one in particular for weeks, but it seems there's still something flaky. 😞 Nevertheless, it should not be related, so I triggered a rerun.
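
The "errors that we expect to catch" pattern mentioned for the UCX parser messages can be sketched as follows. This is an illustration only: `parse_mem_units` and `rejects` are hypothetical stand-ins that loosely follow the wording of the logged error message, not UCX's real parser or the ucxx test code.

```python
import re

# Multipliers for the units named in the logged error message.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def parse_mem_units(value: str) -> float:
    """Hypothetical parser accepting '<int><unit>', 'inf', or 'auto'."""
    text = value.strip().lower()
    if text == "inf":
        return float("inf")
    if text == "auto":
        return -1.0  # sentinel meaning "let the library decide" (assumption)
    match = re.fullmatch(r"(\d+)(b|kb|mb|gb)", text)
    if match is None:
        raise ValueError(f"Invalid value for SEG_SIZE: {value!r}")
    return float(int(match.group(1)) * _UNITS[match.group(2)])


def rejects(value: str) -> bool:
    """Return True when the parser raises for the given value."""
    try:
        parse_mem_units(value)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    # A negative test passes precisely because the error is raised.
    print(rejects("invalid-size"), rejects("8kb"))
```

A test built this way logs an error on purpose, which is why such messages can appear in the output of a passing run.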

@jameslamb jameslamb self-requested a review December 20, 2024 19:37
@jameslamb
Member

Alright, thanks @pentschev. I'll approve this then, and we can merge if CI passes.

Member

@pentschev pentschev left a comment

LGTM, thanks @jakirkham.

@jakirkham
Member Author

Thanks Peter and James! 🙏

Looks like the rerun worked 🎉

Let's keep an eye on it and we can follow up as needed

@jameslamb
Member

/merge

@rapids-bot rapids-bot bot merged commit 00970f8 into rapidsai:branch-0.42 Dec 20, 2024
60 checks passed
@jakirkham jakirkham deleted the use_pynvml_12 branch December 20, 2024 20:57
Labels: improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)
3 participants