MultihostCheckpoint counter error in CLU #1225
-
Original question by @salayatana66: I got a CLU-related error after training for a while. Essentially, the multihost checkpoint failed randomly with:
If I try to restart the job from the same checkpoint, I get the same error.
Replies: 1 comment
-
Answer by @andsteing:
It seems `MultihostCheckpoint` could not find overlapping checkpoint numbers in the tasks' checkpoint directories. When the task tries to save a checkpoint, the discrepancy is discovered (`save_counter` is zero if no checkpoint is loaded, but `latest_checkpoint` is read by the TF checkpoint manager from the directory), and the process fails.

I'm not quite sure what has happened, but maybe you had different processes writing to the same directory?
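
One way to check the "no overlapping checkpoint numbers" hypothesis is to list the checkpoint numbers in each task's directory and intersect them. The sketch below is only a diagnostic aid, not CLU's own logic. It assumes that each task writes to its own directory and that checkpoints appear there as TensorFlow-style `ckpt-<N>.index` files. The helper names `checkpoint_numbers` and `common_checkpoint_numbers` are hypothetical and not part of CLU.

```python
import re
from pathlib import Path


def checkpoint_numbers(directory, checkpoint_name="ckpt"):
    """Return the set of checkpoint numbers found in one directory.

    Assumption: checkpoints show up as `<checkpoint_name>-<N>.index` files.
    """
    pattern = re.compile(rf"^{re.escape(checkpoint_name)}-(\d+)\.index$")
    numbers = set()
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            numbers.add(int(match.group(1)))
    return numbers


def common_checkpoint_numbers(directories, checkpoint_name="ckpt"):
    """Return the sorted checkpoint numbers present in every directory."""
    per_dir = [checkpoint_numbers(d, checkpoint_name) for d in directories]
    if not per_dir:
        return []
    return sorted(set.intersection(*per_dir))
```

Calling `common_checkpoint_numbers` with the list of per-task checkpoint directories and getting an empty list back would be consistent with the explanation above. If the numbers in one directory look unrelated to the others, that may point to a different process having written there.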