Saving should be aware of the dp mesh #799

Open
carmocca opened this issue Jan 21, 2025 · 0 comments

Comments

Contributor

carmocca commented Jan 21, 2025

The docs for dcp.save say:

When saving checkpoint for FSDP’s ShardingStrategy.HYBRID_SHARD, only one of the shard_group should be calling save_state_dict and the corresponding process group needs to be passed in.

This references the old FSDP1 sharding strategy; however, it should apply equally to FSDP2. I believe it also applies to the different async-saving flavors that are implemented.

Today, torchtitan does not do this:

```python
elif self.async_mode == AsyncMode.ASYNC_WITH_PINNED_MEM:
    self._async_with_pinned_memory(checkpoint_id)
elif self.async_mode == AsyncMode.ASYNC:
    self.async_future = dcp.async_save(
        self.states, checkpoint_id=checkpoint_id, process_group=self.pg
    )
else:
    dcp.save(self.states, checkpoint_id=checkpoint_id)
```

Do you agree that it should? Have you tried this with FSDP2? Do you know of any blockers?
This would decrease the saving burden considerably for jobs with a large data_parallel_replicate_degree.
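To illustrate the potential saving, here is a minimal sketch of which ranks would need to participate in the save, assuming a row-major 2D device mesh with the replicate dimension outermost (as torchtitan builds it). The function name `saving_ranks` is hypothetical, for illustration only; only one replica group's ranks hold a distinct copy of each shard, so only they need to write:

```python
# Sketch (assumption): ranks form a (replicate, shard) mesh laid out
# row-major, replicate dimension outermost. Replicas hold identical
# shards, so only one replica group needs to call dcp.save.

def saving_ranks(world_size: int, replicate_degree: int) -> list[int]:
    """Ranks of replica group 0, i.e. the only ranks that would need to
    participate in the checkpoint write under this layout assumption."""
    assert world_size % replicate_degree == 0
    shard_degree = world_size // replicate_degree
    # With the replicate dimension outermost, replica group 0 owns the
    # contiguous ranks [0, shard_degree).
    return list(range(shard_degree))

# Example: 8 GPUs with data_parallel_replicate_degree=2 gives two
# replicas of a 4-way shard; only ranks 0-3 need to write.
print(saving_ranks(8, 2))  # -> [0, 1, 2, 3]
```

With replicate degree 2, half the ranks (and half the write bandwidth) could be left out of the save entirely.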

Thank you!
