Feature/Improvements: Checkpointing #682

rwightman · 2020-11-26T20:34:55Z

While working on a training script recently, I ran into a few items related to checkpointing that I feel would improve usability.

save/load functions

The current save/restore checkpoint routines include code for managing checkpoint histories, sorting by step, restoring the latest, etc. There is no basic save/load fn that can be used without those extras. I'm about to write my own checkpoint manager, as I want to manage history based on eval metrics, and I will have to rewrite the save/load serialization bits in my own code.

It'd be nice to split the base save/load functionality from the history management.
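To make the proposed split concrete, here is a minimal sketch, not an existing Flax API. `save_bytes`, `load_bytes`, and `BestMetricHistory` are hypothetical names, and the `bytes` payload is assumed to come from something like `flax.serialization.to_bytes`. The plain save/load pair knows nothing about history, and the history manager (here ranking by an eval metric, higher is better) is layered on top.

```python
import os
import tempfile
from typing import List, Tuple


def save_bytes(path: str, data: bytes) -> None:
    """Write serialized state to `path` via a temp file and an atomic rename."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, path)


def load_bytes(path: str) -> bytes:
    """Read serialized state back; deserialization is left to the caller."""
    with open(path, "rb") as f:
        return f.read()


class BestMetricHistory:
    """Hypothetical history manager: keeps the `keep` best checkpoints by metric."""

    def __init__(self, directory: str, keep: int = 3):
        self.directory = directory
        self.keep = keep
        self.entries: List[Tuple[float, str]] = []  # (metric, path)

    def save(self, data: bytes, step: int, metric: float) -> str:
        path = os.path.join(self.directory, f"ckpt_{step}")
        save_bytes(path, data)
        self.entries.append((metric, path))
        self.entries.sort(reverse=True)  # best metric first
        for _, stale in self.entries[self.keep:]:
            os.remove(stale)  # prune everything past the top `keep`
        self.entries = self.entries[:self.keep]
        return path

    def best(self) -> str:
        """Path of the checkpoint with the highest metric seen so far."""
        return self.entries[0][1]
```

With this layering, a different retention policy (by step, by latest, by metric) only needs to replace the manager class; the file-level save/load stays untouched.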

strict flag

Working with training code based on the ImageNet example here, I have a TrainState dataclass. Adding an extra field that I set up with safe defaults (so a new training session should be backwards compatible with old state) fails because the from_state_dict fn for struct.dataclass is quite strict: it raises on any field missing from the saved state.

Adding something like the strict flag below would improve this for my use:

  def from_state_dict(x, state, strict=True):
    """Restore the state of a data class."""
    state = state.copy()  # copy the state so we can pop the restored fields.
    updates = {}
    for name in data_fields:
      if name not in state:
        if strict:
          raise ValueError(f'Missing field {name} in state dict while restoring an instance of {clz.__name__}')
        else:
          continue  # non-strict: keep the value already set on x

      value = getattr(x, name)
      value_state = state.pop(name)
      updates[name] = serialization.from_state_dict(value, value_state)
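The excerpt above depends on names from inside flax.struct (`data_fields`, `clz`, `serialization`), so it does not run on its own. As an illustration of the intended behaviour only, here is a self-contained sketch using the standard library's `dataclasses`. The `TrainState` fields and the `restore` helper are hypothetical stand-ins, and the rejection of unknown fields is an assumption about what a strict restore should still do.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class TrainState:  # hypothetical stand-in for a flax.struct.dataclass
    step: int = 0
    ema_decay: float = 0.999  # field added after old checkpoints were written


def restore(target, state, strict=True):
    """Return a copy of `target` with fields overwritten from `state`."""
    state = dict(state)  # copy so restored fields can be popped
    updates = {}
    for field in dataclasses.fields(target):
        if field.name not in state:
            if strict:
                raise ValueError(f"Missing field {field.name} in state dict")
            continue  # non-strict: keep the default already on `target`
        updates[field.name] = state.pop(field.name)
    if state:
        # Assumption: leftover keys are still an error in both modes.
        raise ValueError(f"Unknown field(s) {sorted(state)} in state dict")
    return dataclasses.replace(target, **updates)


# A checkpoint written before `ema_decay` existed.
old_checkpoint = {"step": 1200}
restored = restore(TrainState(), old_checkpoint, strict=False)
```

Here `restored` keeps `ema_decay` at its default while `step` comes from the old checkpoint; the same call with `strict=True` raises `ValueError`.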

jheek · 2020-12-07T15:43:51Z

Maintainer

The lower-level save/load functions are in flax.serialization. The checkpointing module is separate because it depends on tensorflow.GFile.

About the strict flag:
I see your point, although I think it would make more sense to add an option to flax.struct.dataclass or, perhaps even better, to flax.struct.field to declare a field as optional or to give it some kind of fallback value.
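A rough sketch of how a per-field option could look, using the standard library's `dataclasses` rather than flax.struct. The `optional_field` helper, the `"optional"` metadata key, and the `restore` function are all hypothetical and only illustrate the idea of marking individual fields instead of passing a global flag.

```python
import dataclasses


def optional_field(default):
    """Hypothetical helper: a field that may be absent from a saved state dict."""
    return dataclasses.field(default=default, metadata={"optional": True})


@dataclasses.dataclass(frozen=True)
class TrainState:  # hypothetical training state
    step: int
    ema_decay: float = optional_field(0.999)


def restore(target, state):
    """Restore fields from `state`; only fields marked optional may be missing."""
    updates = {}
    for field in dataclasses.fields(target):
        if field.name in state:
            updates[field.name] = state[field.name]
        elif not field.metadata.get("optional", False):
            raise ValueError(f"Missing field {field.name} in state dict")
        # Optional and missing: keep the fallback value already on `target`.
    return dataclasses.replace(target, **updates)


restored = restore(TrainState(step=0), {"step": 5})
```

Compared with a single strict flag, this keeps required fields strict while letting newly added fields fall back to their defaults.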

Note that you can always pass restore_checkpoint(target=None) to get the raw state dict. This way, you can manually restore parts of your training state in combination with flax.serialization.from_state_dict(...).

rwightman · 2020-12-08T22:07:16Z

Author

@jheek I didn't see any file IO level save/load in flax.serialization, just to/from bytes serialization. The idea was to break the file IO level checkpointing functionality into save/load only and history management. I realize bytes -> file isn't a big deal, but it's not something you want every user to rewrite themselves. Thinking of functionality people are used to from other libs, you might also want model zoo / download functionality added in a (non-training) checkpoint handling scope at some point.

For checkpoint restore, using dataclass fields could be a good option.

I'm aware of the current None behaviour. I was actually surprised when that first happened, and I'm not yet convinced that this use of None makes sense, given that None is used for optional semantics in Python. Looking ahead to when Flax is fully fleshed out, it seems the need for the raw state dict will be less common than, say, handling optional fields. I'd rather not write the code to handle raw state dicts if I don't have to, and most of my need to do that (at least right now) is for forward/backward compat handling and optional fields.
