[BUG] ZeRO++ training failed #6926
Describe the bug
I have 4 nodes, each with 8 A100 GPUs. To reduce inter-node communication, I used ZeRO++ training, which did accelerate the training process. However, during training the loss stayed at 11.9321 and the grad_norm stayed at 0, so the training failed.
My DeepSpeed configuration file is as follows:
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"zero_hpz_partition_size": 8,
"zero_quantized_weights": false,
"zero_quantized_gradients": false,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
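For context, here is a minimal sketch of the kind of LoRA + DeepSpeed fine-tuning setup being described. This assumes a Hugging Face Trainer with peft; the actual training script was not shared, and the model name, dataset, and LoRA hyperparameters below are placeholders, not the reporter's values.

# Minimal sketch of a LoRA + DeepSpeed fine-tuning setup (assumed, not the
# reporter's actual script). Model, dataset, and LoRA settings are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with LoRA adapters (placeholder rank and target modules).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Placeholder dataset; any tokenized causal-LM dataset would do.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# The ZeRO++ JSON config shown above is passed via `deepspeed=`; its "auto"
# fields are resolved from these TrainingArguments.
args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    num_train_epochs=1,
    logging_steps=10,
    deepspeed="ds_config.json",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

On a multi-node setup like this one, such a script would typically be launched with the DeepSpeed launcher, e.g. deepspeed --hostfile=hostfile train.py, with a hostfile listing the 4 nodes with 8 slots each.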
Where is the problem, and how should I solve it?
Comments
@HelloWorld506 - can you please share a repro script and the DeepSpeed version you are using?
@loadams hello
@loadams In addition, when I set zero_quantized_weights and zero_quantized_gradients to true, an error occurs. By the way, I use LoRA to fine-tune my model.