Fix deadlock in PipeEngine._exec_recv_grads #5518
rebase master
Hi @tjruwase, could you review this, please?
@i4never - would you be able to add any unit tests?
Sure, I'm working on this.
Thanks @i4never - ping me whenever this needs review/tests/etc.
I'm using Megatron-DeepSpeed with TP/PP/DP. In my case, three tensors need to be communicated between pipeline stages:
- hidden_state (floating point, requires grad)
- attention_mask (int32, no grad)
- cached_rotray_embedding (floating point, no grad)

Only the first tensor has a grad, which satisfies the restriction imposed by PipelineEngine here:
DeepSpeed/deepspeed/runtime/pipe/engine.py
Lines 734 to 736 in 3dd7ccf
Only the grad of the first tensor is sent by the sending stage:
DeepSpeed/deepspeed/runtime/pipe/engine.py
Lines 1106 to 1109 in 3dd7ccf
But the receiving stage tries to receive more than one grad, because tensor.is_floating_point() is used to filter the outputs. In my case, cached_rotray_embedding is a floating-point tensor with no grad, so it passes the filter. The receiving stage then expects more data than was sent, which makes training hang.
DeepSpeed/deepspeed/runtime/pipe/engine.py
Lines 1206 to 1209 in 3dd7ccf
Since only one grad is sent anyway, we don't need the is_floating_point filter here.
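
The mismatch described above can be reproduced without DeepSpeed or PyTorch. The following is a toy model only, not the engine's actual code: `TensorMeta` and the three helper functions are hypothetical names introduced for illustration. It assumes the sender transmits a grad for the first tensor only, as reported above, and compares that with what a receiver would wait for under two different filters. The grad-based filter is shown purely as a point of comparison and is not claimed to be the exact change made in this PR.

```python
from dataclasses import dataclass


@dataclass
class TensorMeta:
    """Hypothetical stand-in for a tensor: only the properties relevant here."""
    name: str
    is_floating_point: bool
    requires_grad: bool


# The three tensors passed between pipeline stages in the report above.
outputs = [
    TensorMeta("hidden_state", is_floating_point=True, requires_grad=True),
    TensorMeta("attention_mask", is_floating_point=False, requires_grad=False),
    TensorMeta("cached_rotray_embedding", is_floating_point=True, requires_grad=False),
]


def grads_sent(tensors):
    """Sender side (modeled): only the first tensor's grad is transmitted."""
    return [tensors[0].name]


def grads_expected_float_filter(tensors):
    """Receiver side (modeled): one receive per floating-point output."""
    return [t.name for t in tensors if t.is_floating_point]


def grads_expected_grad_filter(tensors):
    """Alternative receiver (assumption): one receive per output that requires grad."""
    return [t.name for t in tensors if t.requires_grad]


sent = grads_sent(outputs)
expected_float = grads_expected_float_filter(outputs)
expected_grad = grads_expected_grad_filter(outputs)

# A blocking receive with no matching send never returns,
# so expecting more grads than were sent means a hang.
print("sent:", sent)
print("expected (float filter):", expected_float, "hangs:", len(expected_float) > len(sent))
print("expected (grad filter):", expected_grad, "hangs:", len(expected_grad) > len(sent))
```

Under the floating-point filter the receiver waits for two grads while only one is ever sent, which corresponds to the hang reported here. Any receiver-side rule that yields exactly the set of grads the sender transmits avoids the extra blocking receive.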