We test several implementations of PyTorch distributed training and compare their performance. Based on the results, we recommend that users adopt Apex for distributed training on High-Flyer AIHPC.
ImageNet

We use ffrecord to aggregate the scattered files on High-Flyer AIHPC:
```python
train_data = '/public_dataset/1/ImageNet/train.ffr'
val_data = '/public_dataset/1/ImageNet/val.ffr'
```
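As a hedged sketch, the aggregated `.ffr` files can be read with ffrecord's `FileReader`. The API names below follow the ffrecord package; the use of pickle for deserialization is only an assumption about how the samples were packed, not something stated in this post.

```python
import pickle

from ffrecord import FileReader  # ffrecord's batch reader for .ffr files

# Open the aggregated training set (path from above); check_data verifies checksums.
reader = FileReader('/public_dataset/1/ImageNet/train.ffr', check_data=True)
print(reader.n)  # total number of samples in the file

# read() fetches a batch of records by index and returns raw bytes for each;
# how to decode them depends on how the dataset was packed (pickle is assumed here).
bytes_list = reader.read([0, 1, 2])
sample = pickle.loads(bytes_list[0])
```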
ResNet

We train `torchvision.models.resnet50()` with the following configuration:
- batch_size: 400
- num_nodes: 1
- gpus: 8
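The configuration above can be sketched as an Apex mixed-precision training loop, one process per GPU. This is a minimal sketch, not the exact script used for the benchmark: `train_loader` is assumed to be built from the ffrecord data above, and the hyperparameters are placeholders. The `amp` and `DistributedDataParallel` calls follow NVIDIA Apex's documented API.

```python
# Launched with one process per GPU, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=8 train.py
import argparse

import torch
import torchvision
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # placeholder hyperparameters

# opt_level='O1': patch tensor ops to run in FP16 where it is numerically safe
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
model = DDP(model)  # Apex's gradient-averaging DDP wrapper

criterion = torch.nn.CrossEntropyLoss()
for images, labels in train_loader:  # train_loader: assumed, built from the ffrecord data
    optimizer.zero_grad()
    loss = criterion(model(images.cuda()), labels.cuda())
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()  # loss scaling keeps FP16 gradients from underflowing
    optimizer.step()
```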
Apex

Apex is currently the most effective implementation of PyTorch distributed training:

- The speedup is roughly proportional to the number of GPUs.
- The higher the degree of parallelism, the lower the utilization of each individual GPU.
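These two observations can be stated precisely as a scaling-efficiency calculation: efficiency is the measured speedup divided by the ideal linear speedup. The throughput numbers below are purely illustrative, not measurements from this test.

```python
def scaling_efficiency(throughput_1gpu, throughput_ngpu, n):
    """Fraction of ideal linear speedup achieved when scaling to n GPUs."""
    speedup = throughput_ngpu / throughput_1gpu
    return speedup / n

# Illustrative (made-up) throughputs in images/sec: scaling is close to
# linear, but each GPU does slightly less useful work as parallelism grows.
print(scaling_efficiency(400.0, 3040.0, 8))  # → 0.95
```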