Training process stuck with --distribute_modules flag #27

Open
Hambaobao opened this issue Aug 21, 2024 · 2 comments

@Hambaobao

Hi, when I run:
torchrun --nproc_per_node gpu -m sae meta-llama/Meta-Llama-3-8B --distribute_modules --batch_size 1 --layers 24 25 --grad_acc_steps 8 --ctx_len 2048 --k 192 --load_in_8bit --micro_acc_steps 2
the training process gets stuck at step=8.

I debugged and traced the problem to:

dist.all_to_all([x for x in inputs], outputs)
Has anyone encountered this issue?

@norabelrose
Member

I have not had issues with it getting stuck at an early step like step 8. It has sometimes gotten stuck at the very end of training.

@htlou

htlou commented Sep 2, 2024

> I have not had issues with it getting stuck at an early step like step 8. It has sometimes gotten stuck at the very end of training.

I ran into the same problem. I tried pile-10k on gpt2, Gemma-2b, and Llama-3-7b, and every run got stuck at exactly the last step. Specifically, the program hangs on line 281 of https://github.com/EleutherAI/sae/blob/main/sae/trainer.py. At that point, the loss is tensor(0.3034, device='cuda:0', grad_fn=).
Given this, I'd like to ask:

  1. Is there a fix for this issue?
  2. Is it OK to skip the last step and harvest the SAE directly?
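
For context, one general way a collective such as dist.all_to_all can block is when the ranks do not all reach the call the same number of times, for example if one rank runs out of batches before the others. Nothing in this thread confirms that this is the cause here. The sketch below is only a standard-library simulation of that failure mode: it does not use torch, the names `simulate` and `steps_per_rank` are made up for illustration, and a `threading.Barrier` with a timeout stands in for the real collective so the "hang" shows up as an error instead of blocking forever.

```python
import threading


def simulate(steps_per_rank, timeout=0.5):
    """Run one thread per simulated rank; each calls a 'collective' once per step.

    steps_per_rank: hypothetical number of steps each rank executes.
    Returns a dict mapping rank -> "done" or "stuck at step N".
    """
    world_size = len(steps_per_rank)
    # The barrier stands in for a collective: it completes only when
    # every rank has entered it.
    barrier = threading.Barrier(world_size)
    status = {}

    def worker(rank, n_steps):
        for step in range(n_steps):
            try:
                barrier.wait(timeout=timeout)
            except threading.BrokenBarrierError:
                # A peer never arrived. A real collective without a
                # timeout would simply keep waiting here.
                status[rank] = f"stuck at step {step}"
                return
        status[rank] = "done"

    threads = [
        threading.Thread(target=worker, args=(rank, n))
        for rank, n in enumerate(steps_per_rank)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status
```

Calling `simulate([3, 3])` lets both simulated ranks finish, while `simulate([3, 2])` leaves rank 0 waiting at its final step because rank 1 has already exited. If the real hang has the same shape, comparing how many steps each rank executes before the last all_to_all call would be a reasonable first check.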
