
BUG: early_step_in_backward with pipeline parallelism and len(model_parts) > 1 #777

Open
cassanof opened this issue Jan 7, 2025 · 2 comments
Labels: bug
cassanof commented Jan 7, 2025

The `__init__` method of `OptimizersInBackwardContainer` has a bug:

```python
def optim_hook(param) -> None:
    optim_dict[param].step()
    optim_dict[param].zero_grad()

The hook closure tries to capture `optim_dict`, but if `model_parts` has more than one entry, which can be the case with pipeline parallelism, every hook ends up capturing the last `optim_dict`. This throws an error during backward, because parameters from the first model part are not contained in that dict.
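A hypothetical sketch of the pattern (names and structure are assumptions for illustration, not torchtitan's actual code) shows how hooks registered inside a loop over `model_parts` all share one `optim_dict`:

```python
# Hypothetical sketch: hooks registered in a loop over model_parts all
# close over the *same* local name `optim_dict`, so after the loop every
# hook resolves it to the last part's dict.
def register_hooks(model_parts, make_optim):
    hooks = []
    for part in model_parts:
        optim_dict = {p: make_optim(p) for p in part}

        def optim_hook(param) -> None:
            # `optim_dict` is looked up at call time, not at
            # registration time -> always the last part's dict.
            optim_dict[param].step()
            optim_dict[param].zero_grad()

        for p in part:
            hooks.append((p, optim_hook))
    return hooks
```

With two model parts, invoking the hook registered for a parameter of the first part raises a `KeyError`, because by then `optim_dict` refers to the second part's dict.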

Also, the fused backward+optim code doesn't seem to handle gradient clipping.


mori360 commented Jan 7, 2025

> throwing an error on backward as the parameters in the first model part will not be contained in this dict.

I tried running with `--experimental.pipeline_parallel_degree 2 --optimizer.early_step_in_backward`, and all optimizers call `.step()` with the changes.
Could you give an example to repro the issue? Thank you!

> the fused backward+optim code doesn't seem to handle gradient clipping.

backward+optim frees each gradient during backward (to reduce memory cost), so there are no gradients left for gradient clipping.


cassanof commented Jan 7, 2025

Hi! With your configuration, will `model_parts` have `len > 1`? If not, it will work correctly; otherwise it should break.
This is because Python closures capture variables by reference to the identifier, not by value. Whenever `optim_hook` is invoked, it uses the final state of `optim_dict` after the loop completes, rather than the state at the time the hook was registered.
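This late-binding behavior is easy to demonstrate in isolation, along with the usual fix of snapshotting the value through a default argument (a minimal sketch, not torchtitan code):

```python
# Minimal repro of Python's late-binding closures.
def make_hooks(dicts):
    hooks = []
    for d in dicts:
        # BUG: `d` is resolved when the hook is *called*, so every
        # hook ends up seeing the last dict of the loop.
        hooks.append(lambda key: d[key])
    return hooks

def make_hooks_fixed(dicts):
    hooks = []
    for d in dicts:
        # FIX: `d=d` binds the current dict at definition time.
        hooks.append(lambda key, d=d: d[key])
    return hooks
```

With `make_hooks([{"a": 1}, {"b": 2}])`, the first hook raises `KeyError` for `"a"` because it sees only `{"b": 2}`; the fixed variant returns `1` as expected.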

I have trouble producing a repro because my version of torchtitan is heavily modified.

> backward+optim would free gradient during backward(to optimize memory cost) so that there's no gradient for gradient clipping

That makes sense. Wouldn't you still want to clip the gradient before stepping the optimizer? I understand you can't do grad-norm clipping, since you can't compute the global norm, but you could use traditional value clipping, where each gradient is clamped to some threshold.
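Value clipping (what `torch.nn.utils.clip_grad_value_` does in PyTorch) only clamps each gradient element to a range, so it needs no global norm and could in principle run inside a per-parameter hook before the step. A plain-Python sketch of the operation:

```python
# Plain-Python sketch of value clipping: clamp every gradient element
# to [-threshold, threshold]. Unlike norm clipping, this is purely
# element-wise, so it could run per parameter before optimizer.step().
def clip_grad_value(grad, threshold):
    return [max(-threshold, min(threshold, g)) for g in grad]
```

For example, with `threshold=1.0`, a gradient of `-5.0` becomes `-1.0` while `0.5` is left untouched.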
