
MultihostCheckpoint counter error in CLU #1225

Answered by marcvanzee
marcvanzee asked this question in Q&A

Answer by @andsteing:

It seems MultihostCheckpoint could not find overlapping checkpoint numbers in the tasks' checkpoint directories. When a task then tries to save a checkpoint, the discrepancy is discovered (save_counter is zero if no checkpoint was loaded, but latest_checkpoint is read by the TF checkpoint manager from the directory), and the process fails.

I'm not quite sure what happened, but maybe different processes wrote to the same directory?
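
To check whether the per-task directories really have no checkpoint number in common, something like the following could be run against them. This is only a rough sketch and not CLU's actual code: the function names `checkpoint_numbers` and `latest_common_checkpoint` are made up for illustration, and it assumes the TF checkpoint manager wrote files named `ckpt-<N>.index` into each directory.

```python
import os
import re
import tempfile

# Assumption: checkpoints are stored as "ckpt-<N>.index" (plus data shards).
_CKPT_RE = re.compile(r"^ckpt-(\d+)\.index$")


def checkpoint_numbers(directory):
    """Returns the set of checkpoint numbers found in `directory`."""
    if not os.path.isdir(directory):
        return set()
    numbers = set()
    for name in os.listdir(directory):
        match = _CKPT_RE.match(name)
        if match:
            numbers.add(int(match.group(1)))
    return numbers


def latest_common_checkpoint(directories):
    """Returns the highest checkpoint number present in every directory.

    Returns None if the directories share no checkpoint number.
    """
    if not directories:
        return None
    common = set.intersection(*(checkpoint_numbers(d) for d in directories))
    return max(common) if common else None


if __name__ == "__main__":
    # Hypothetical layout: one checkpoint directory per task.
    with tempfile.TemporaryDirectory() as root:
        task0 = os.path.join(root, "checkpoints-0")
        task1 = os.path.join(root, "checkpoints-1")
        os.makedirs(task0)
        os.makedirs(task1)
        for n in (1, 2):
            open(os.path.join(task0, f"ckpt-{n}.index"), "w").close()
        open(os.path.join(task1, "ckpt-3.index"), "w").close()
        # The two tasks share no checkpoint number, so this prints None.
        print(latest_common_checkpoint([task0, task1]))
```

If this returns None while the individual directories are non-empty, nothing common can be restored, but each directory still has a latest checkpoint on disk, which matches the mismatch described above.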


marcvanzee
Apr 8, 2021
Maintainer Author

Answer selected by marcvanzee