[BUG] ZeRO++ training failed #6926
Describe the bug
I have 4 nodes, each with 8 A100 GPUs. To reduce inter-node communication, I used ZeRO++ training, which did accelerate the training process. However, during training the loss stayed at 11.9321 and the grad_norm stayed at 0, so the training failed.
My DeepSpeed configuration file is as follows:
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"zero_hpz_partition_size": 8,
"zero_quantized_weights": false,
"zero_quantized_gradients": false,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
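For context, here is a minimal sketch of the kind of LoRA + DeepSpeed fine-tuning setup being described. This assumes a Hugging Face Trainer with peft; the actual training script was not shared, and the model name, dataset, and LoRA hyperparameters below are placeholders, not the reporter's values.

# Minimal sketch of a LoRA + DeepSpeed fine-tuning setup (assumed, not the
# reporter's actual script). Model, dataset, and LoRA settings are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with LoRA adapters (placeholder rank and target modules).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Placeholder dataset; any tokenized causal-LM dataset would do.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# The ZeRO++ JSON config shown above is passed via `deepspeed=`; its "auto"
# fields are resolved from these TrainingArguments.
args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    num_train_epochs=1,
    logging_steps=10,
    deepspeed="ds_config.json",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

On a multi-node setup like this one, such a script would typically be launched with the DeepSpeed launcher, e.g. deepspeed --hostfile=hostfile train.py, with a hostfile listing the 4 nodes with 8 slots each.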
Where is the problem, and how should I solve it?
Comments
@HelloWorld506 - can you please share a repro script and the DeepSpeed version you are using?
@loadams hello
@loadams In addition, when I set zero_quantized_weights and zero_quantized_gradients to true, an error occurs. By the way, I use LoRA to fine-tune my model.