Logits largely different from PyTorch #1063
-
If implemented correctly, there shouldn't be a significant difference in the final result. I suspect some of the operations are not exactly the same. It could be an epsilon parameter or a different memory layout of the attention tensors.
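A minimal sketch of how you might double-check both suspects on the PyTorch side, assuming a Hugging Face `GPT2LMHeadModel` checkpoint (the path is a placeholder). GPT-2's `Conv1D` stores `c_attn` weights as `(n_embd, 3 * n_embd)`, and depending on the converter these may need to be transposed; a silent layout mismatch would show up exactly as slowly drifting activations:

```python
# Sketch: print the config values and weight layout that the GGML gpt-2 example
# expects to match. Checkpoint path is a placeholder for your own model.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("path/to/your/checkpoint")
cfg = model.config

print("layer_norm_epsilon:", cfg.layer_norm_epsilon)  # the gpt-2 example assumes 1e-5
print("n_layer:", cfg.n_layer, "n_head:", cfg.n_head, "n_embd:", cfg.n_embd)

# c_attn is a Conv1D, so its weight is stored as (n_embd, 3 * n_embd); whether the
# converter transposes it must agree with what the C code multiplies with.
w = model.transformer.h[0].attn.c_attn.weight
print("c_attn weight shape:", tuple(w.shape))
```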
-
Hey Georgi, thanks for your amazing work on GGML! Yeah, that was my suspicion too, but the epsilon parameter is the same (1e-5) and the memory layout should be the same between my own model (a default GPT2LMHead) and the cerebras one. Is there anything in these operations you'd recommend I check in more detail? Here are the results I get when logging the distance after every operation (ish) for both models (sorry for the large copy/paste). It looks like it starts diverging at the second layer's KQ_soft_max, but I'm unsure why.
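For reference, this is roughly how I capture the PyTorch-side activations to diff against the GGML dumps: forward hooks on each attention block. The checkpoint path is a placeholder, and loading the corresponding GGML tensors is not shown:

```python
# Sketch: capture each layer's attention output with forward hooks so it can be
# compared against the tensor the GGML graph produces after KQ_soft_max / the
# attention projection. Module names are the standard GPT2LMHeadModel ones.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("path/to/your/checkpoint").eval()
tokenizer = GPT2Tokenizer.from_pretrained("path/to/your/checkpoint")

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # GPT2Attention returns a tuple; the first element is the hidden state.
        tensor = output[0] if isinstance(output, tuple) else output
        captured[name] = tensor.detach().float()
    return hook

for i, block in enumerate(model.transformer.h):
    block.attn.register_forward_hook(make_hook(f"layer{i}.attn"))

with torch.no_grad():
    ids = tokenizer("Hello world", return_tensors="pt").input_ids
    model(ids)

# Compare against the tensors dumped from the GGML graph (loading code not shown).
for name, t in captured.items():
    print(name, tuple(t.shape), float(t.abs().mean()))
```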
-
Hi, and thanks in advance for your time!
I've been struggling for the past couple of months with porting a GPT2LMHeadModel to GGML. I've used the `convert-cerebras-to-ggml.py` script to package my model's weights into a .bin that I can run with the `gpt-2` example scripts. However, when I compare the logits to the ones obtained with the PyTorch model, they are wildly different (cross-entropy loss of 2e5). This is not the case when I run the example with the cerebras model (cross-entropy loss of 25). The model has the correct weights and parameters. I've spent a long time comparing the hidden states across a GGML run and a PyTorch run and realized that the loss increases layer after layer, but it grows much more slowly for the cerebras model (~0.5 per layer) than for my own model (~30 per layer).

I'm kind of at a loss and in desperate need of advice on how to debug this further. Any leads? I'm happy to provide my graphs of the distances I logged over the different layers, or more information. Thanks so much in advance for your help 🙏
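For context, the metric behind the numbers above is along these lines (the exact loss I log isn't pinned down here, so treat this as an assumption; the logits file name is hypothetical):

```python
# Sketch: cross entropy between GGML logits and PyTorch logits, treating the
# PyTorch softmax as the target distribution, summed over positions.
import numpy as np
import torch
import torch.nn.functional as F

def logit_cross_entropy(ggml_logits: np.ndarray, torch_logits: torch.Tensor) -> float:
    """Both inputs are (n_tokens, n_vocab); returns the summed cross entropy."""
    q_log = F.log_softmax(torch.from_numpy(ggml_logits).float(), dim=-1)
    p = F.softmax(torch_logits.float(), dim=-1)
    return float(-(p * q_log).sum())

# Usage (file name and variables are placeholders for your own dumps):
# ggml_logits = np.load("ggml_logits.npy")
# torch_logits = model(input_ids).logits[0]
# print(logit_cross_entropy(ggml_logits, torch_logits))
```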