We test several implementations of PyTorch distributed training and compare their performance. Based on the results, we recommend that users adopt Apex for distributed training on High-Flyer AIHPC.
ImageNet

We use ffrecord to aggregate the scattered files on High-Flyer AIHPC:
```python
train_data = '/public_dataset/1/ImageNet/train.ffr'
val_data = '/public_dataset/1/ImageNet/val.ffr'
```
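As a hedged sketch, the aggregated `.ffr` files can be read with ffrecord's `FileReader`. The API names below follow the ffrecord package; the use of pickle for deserialization is only an assumption about how the samples were packed, not something stated in this post.

```python
import pickle

from ffrecord import FileReader  # ffrecord's batch reader for .ffr files

# Open the aggregated training set (path from above); check_data verifies checksums.
reader = FileReader('/public_dataset/1/ImageNet/train.ffr', check_data=True)
print(reader.n)  # total number of samples in the file

# read() fetches a batch of records by index and returns raw bytes for each;
# how to decode them depends on how the dataset was packed (pickle is assumed here).
bytes_list = reader.read([0, 1, 2])
sample = pickle.loads(bytes_list[0])
```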
ResNet

We train `torchvision.models.resnet50()` with the following configuration:
- batch_size: 400
- num_nodes: 1
- gpus: 8
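The configuration above can be sketched as an Apex mixed-precision training loop, one process per GPU. This is a minimal sketch, not the exact script used for the benchmark: `train_loader` is assumed to be built from the ffrecord data above, and the hyperparameters are placeholders. The `amp` and `DistributedDataParallel` calls follow NVIDIA Apex's documented API.

```python
# Launched with one process per GPU, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=8 train.py
import argparse

import torch
import torchvision
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # placeholder hyperparameters

# opt_level='O1': patch tensor ops to run in FP16 where it is numerically safe
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
model = DDP(model)  # Apex's gradient-averaging DDP wrapper

criterion = torch.nn.CrossEntropyLoss()
for images, labels in train_loader:  # train_loader: assumed, built from the ffrecord data
    optimizer.zero_grad()
    loss = criterion(model(images.cuda()), labels.cuda())
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()  # loss scaling keeps FP16 gradients from underflowing
    optimizer.step()
```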
Apex

Apex is currently the most effective implementation of PyTorch distributed training:

- The speedup is roughly proportional to the number of GPUs.
- The higher the degree of parallelism, the lower the utilization of each individual GPU.
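These two observations can be stated precisely as a scaling-efficiency calculation: efficiency is the measured speedup divided by the ideal linear speedup. The throughput numbers below are purely illustrative, not measurements from this test.

```python
def scaling_efficiency(throughput_1gpu, throughput_ngpu, n):
    """Fraction of ideal linear speedup achieved when scaling to n GPUs."""
    speedup = throughput_ngpu / throughput_1gpu
    return speedup / n

# Illustrative (made-up) throughputs in images/sec: scaling is close to
# linear, but each GPU does slightly less useful work as parallelism grows.
print(scaling_efficiency(400.0, 3040.0, 8))  # → 0.95
```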