Logits largely different from PyTorch #1063
-
If implemented correctly, there shouldn't be a significant difference in the final result. I suspect some of the operations are not exactly the same. It could be an epsilon parameter or a different memory layout of the attention tensors.
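A minimal sketch of how you might double-check both suspects on the PyTorch side, assuming a Hugging Face `GPT2LMHeadModel` checkpoint (the path is a placeholder). GPT-2's `Conv1D` stores `c_attn` weights as `(n_embd, 3 * n_embd)`, and depending on the converter these may need to be transposed; a silent layout mismatch would show up exactly as slowly drifting activations:

```python
# Sketch: print the config values and weight layout that the GGML gpt-2 example
# expects to match. Checkpoint path is a placeholder for your own model.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("path/to/your/checkpoint")
cfg = model.config

print("layer_norm_epsilon:", cfg.layer_norm_epsilon)  # the gpt-2 example assumes 1e-5
print("n_layer:", cfg.n_layer, "n_head:", cfg.n_head, "n_embd:", cfg.n_embd)

# c_attn is a Conv1D, so its weight is stored as (n_embd, 3 * n_embd); whether the
# converter transposes it must agree with what the C code multiplies with.
w = model.transformer.h[0].attn.c_attn.weight
print("c_attn weight shape:", tuple(w.shape))
```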
-
Hey Georgi, thanks for your amazing work on GGML! Yeah, that was my suspicion too, but the epsilon parameter is the same (1e-5) and the memory layout should be the same between my own model (a default GPT2LMHead) and the cerebras one. Is there anything in these operations you'd recommend I check in more detail? Here are the results I get when logging the distance after every operation (ish) for both models (sorry for the large copy/paste). It looks like it starts diverging at the second layer's KQ_soft_max, but I'm unsure why.
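For reference, this is roughly how I capture the PyTorch-side activations to diff against the GGML dumps: forward hooks on each attention block. The checkpoint path is a placeholder, and loading the corresponding GGML tensors is not shown:

```python
# Sketch: capture each layer's attention output with forward hooks so it can be
# compared against the tensor the GGML graph produces after KQ_soft_max / the
# attention projection. Module names are the standard GPT2LMHeadModel ones.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("path/to/your/checkpoint").eval()
tokenizer = GPT2Tokenizer.from_pretrained("path/to/your/checkpoint")

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # GPT2Attention returns a tuple; the first element is the hidden state.
        tensor = output[0] if isinstance(output, tuple) else output
        captured[name] = tensor.detach().float()
    return hook

for i, block in enumerate(model.transformer.h):
    block.attn.register_forward_hook(make_hook(f"layer{i}.attn"))

with torch.no_grad():
    ids = tokenizer("Hello world", return_tensors="pt").input_ids
    model(ids)

# Compare against the tensors dumped from the GGML graph (loading code not shown).
for name, t in captured.items():
    print(name, tuple(t.shape), float(t.abs().mean()))
```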
-
Hi, and thanks in advance for your time!
I've been struggling for the past couple of months with porting a GPT2LMHeadModel to GGML. I've used the `convert-cerebras-to-ggml.py` script to package my model's weights into a .bin that I can run with the `gpt-2` example scripts. However, when I compare the logits to the ones obtained with the PyTorch model, they are wildly different (cross-entropy loss of 2e5). This is not the case when I run the example with the cerebras model (cross-entropy loss of 25). The model has the correct weights and parameters. I've spent a long time comparing the hidden states across a GGML run and a PyTorch run and realized that the loss increases layer after layer, but it grows much more slowly for the cerebras model (~0.5 per layer) than for my own model (~30 per layer).

I'm kind of at a loss and in desperate need of advice on how to debug this further. Any leads? I'm happy to provide my graphs of the distances I logged over the different layers, or more information. Thanks so much in advance for your help 🙏
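For context, the metric behind the numbers above is along these lines (the exact loss I log isn't pinned down here, so treat this as an assumption; the logits file name is hypothetical):

```python
# Sketch: cross entropy between GGML logits and PyTorch logits, treating the
# PyTorch softmax as the target distribution, summed over positions.
import numpy as np
import torch
import torch.nn.functional as F

def logit_cross_entropy(ggml_logits: np.ndarray, torch_logits: torch.Tensor) -> float:
    """Both inputs are (n_tokens, n_vocab); returns the summed cross entropy."""
    q_log = F.log_softmax(torch.from_numpy(ggml_logits).float(), dim=-1)
    p = F.softmax(torch_logits.float(), dim=-1)
    return float(-(p * q_log).sum())

# Usage (file name and variables are placeholders for your own dumps):
# ggml_logits = np.load("ggml_logits.npy")
# torch_logits = model(input_ids).logits[0]
# print(logit_cross_entropy(ggml_logits, torch_logits))
```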