I have multiple nodes, each with 8× 40GB A100 GPUs, and I want to train a 72B model.

When using ZeRO-3, the 72B model's parameters are partitioned across all GPUs of all nodes. Even with NVLink, the communication latency is still very high, so training is much slower than ZeRO-3 + offloading on a single node. The problem is that the more nodes there are, the slower training gets; it would be better to use only a single node.

Is there a way to make ZeRO-3 partition the model parameters only within a node, so that each node holds a complete copy of the model and only gradients are synchronized between nodes, to speed up training?
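For reference, a minimal sketch of the DeepSpeed setting that targets this: ZeRO++ hierarchical partitioning (hpZ) keeps a secondary copy of the parameter shards within each node, so the parameter all-gathers in forward/backward stay intra-node while gradients are still reduced across nodes. The partition size below assumes 8 GPUs per node; the other ZeRO++ features (quantized weights/gradients) are left off here:

```json
{
  "zero_optimization": {
    "stage": 3,
    "zero_hpz_partition_size": 8,
    "zero_quantized_weights": false,
    "zero_quantized_gradients": false
  }
}
```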
@tjruwase Hello, using ZeRO++ did indeed speed up my training, but during training the loss stayed at 11.9321 and the grad_norm stayed at 0, so training failed. What is the reason for this, and how can I resolve it?

My DeepSpeed configuration file is as follows:
```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "zero_hpz_partition_size": 8,
    "zero_quantized_weights": false,
    "zero_quantized_gradients": false,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```
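For context, the `"auto"` placeholders in this config are only resolved by the HuggingFace Transformers Trainer integration, so a setup along the following lines is assumed (the file name, batch sizes, and other hyperparameters below are illustrative placeholders, not taken from this issue):

```python
# Hypothetical wiring sketch: pass the DeepSpeed JSON above to the HF Trainer,
# which fills in the "auto" fields from its own arguments before handing the
# config to DeepSpeed.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # fills train_micro_batch_size_per_gpu
    gradient_accumulation_steps=8,   # fills gradient_accumulation_steps
    bf16=True,                       # resolves the "auto" in the bf16/fp16 sections
    deepspeed="ds_config.json",      # path to the config shown above (assumed name)
)

# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```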