I have multiple nodes, each with 8× 40GB A100 GPUs, and I want to train a 72B model.

When using ZeRO-3, the 72B model's parameters are partitioned across all GPUs of all nodes. Even with NVLink, the communication latency is still very high, so training is much slower than ZeRO-3 + offloading on a single node. The problem is that the more nodes there are, the slower training gets; it would be better to use only a single node.

Is there a way to make ZeRO-3 partition the model parameters only within a node, so that each node holds a complete copy of the model and only gradients are synchronized between nodes, to speed up training?
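For reference, a minimal sketch of the DeepSpeed setting that targets this: ZeRO++ hierarchical partitioning (hpZ) keeps a secondary copy of the parameter shards within each node, so the parameter all-gathers in forward/backward stay intra-node while gradients are still reduced across nodes. The partition size below assumes 8 GPUs per node; the other ZeRO++ features (quantized weights/gradients) are left off here:

```json
{
  "zero_optimization": {
    "stage": 3,
    "zero_hpz_partition_size": 8,
    "zero_quantized_weights": false,
    "zero_quantized_gradients": false
  }
}
```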
@tjruwase Hello, using ZeRO++ did indeed speed up my training, but during training the loss stayed at 11.9321 and the grad_norm stayed at 0, so training failed. What is the reason for this, and how can I resolve it?

My DeepSpeed configuration file is as follows:
```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "zero_hpz_partition_size": 8,
    "zero_quantized_weights": false,
    "zero_quantized_gradients": false,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```
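For context, the `"auto"` placeholders in this config are only resolved by the HuggingFace Transformers Trainer integration, so a setup along the following lines is assumed (the file name, batch sizes, and other hyperparameters below are illustrative placeholders, not taken from this issue):

```python
# Hypothetical wiring sketch: pass the DeepSpeed JSON above to the HF Trainer,
# which fills in the "auto" fields from its own arguments before handing the
# config to DeepSpeed.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # fills train_micro_batch_size_per_gpu
    gradient_accumulation_steps=8,   # fills gradient_accumulation_steps
    bf16=True,                       # resolves the "auto" in the bf16/fp16 sections
    deepspeed="ds_config.json",      # path to the config shown above (assumed name)
)

# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```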