The docs for `dcp.save` say:

> When saving checkpoint for FSDP’s ShardingStrategy.HYBRID_SHARD, only one of the shard_group should be calling save_state_dict and the corresponding process group needs to be passed in.

This references the old FSDP1 sharding strategy; however, it should apply equally to FSDP2. I believe it also applies to the different async saving flavors that are implemented.
Today, `torchtitan` does not do this: see `torchtitan/torchtitan/checkpoint.py`, lines 368 to 375 in d989842.
Do you agree that it should? Have you tried this with FSDP2? Do you know of any blockers?
This would considerably reduce the saving burden for jobs with a large `data_parallel_replicate_degree`.
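To make the proposal concrete, here is a minimal sketch of the rank selection I have in mind. It assumes the data-parallel ranks form a row-major `(replicate, shard)` mesh with replicate as the outer dimension; that layout is an assumption, and `shard_group_ranks` and `should_save` are hypothetical helper names, not torchtitan or PyTorch APIs. The distributed calls are left as comments because whether FSDP2 state dicts save correctly through a sub-group is exactly the open question.

```python
def shard_group_ranks(world_size: int, replicate_degree: int,
                      replica_index: int = 0) -> list[int]:
    """Return the ranks of one shard group.

    Assumes a row-major (replicate, shard) mesh, so each replica owns a
    contiguous block of `world_size // replicate_degree` ranks.
    """
    if replicate_degree <= 0 or world_size % replicate_degree != 0:
        raise ValueError("world_size must be divisible by replicate_degree")
    if not 0 <= replica_index < replicate_degree:
        raise ValueError("replica_index out of range")
    shard_degree = world_size // replicate_degree
    start = replica_index * shard_degree
    return list(range(start, start + shard_degree))


def should_save(rank: int, world_size: int, replicate_degree: int) -> bool:
    """True if `rank` belongs to the single shard group chosen to save (replica 0)."""
    return rank in shard_group_ranks(world_size, replicate_degree, replica_index=0)


# In a real job (not executed here), the idea would be roughly:
#   import torch.distributed as dist
#   import torch.distributed.checkpoint as dcp
#   saver_ranks = shard_group_ranks(dist.get_world_size(), replicate_degree)
#   pg = dist.new_group(ranks=saver_ranks)  # collective: every rank must call it
#   if dist.get_rank() in saver_ranks:
#       dcp.save(state_dict, checkpoint_id=path, process_group=pg)
```

With 8 ranks and a replicate degree of 2, only ranks 0 to 3 would write, so the number of ranks taking part in the save drops by a factor equal to the replicate degree.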
Thank you!