Implementation question #5
Comments
I haven't tried different values, but yes I took |
https://gist.github.com/redknightlois/c4023d393eb8f92bb44b2ab582d7ec20#gistcomment-3010232 This comment on |
@Tony-Y's link shows a comment from the original author saying that they use the identity function instead of clipping. Thanks Tony! |
DeepSpeed trains BERT with LAMB and clips to [0.08, 0.5]: https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/bert-pretraining.md#reproducing-bert-training-results-with-deepspeed It's interesting, and a bit confusing, that different implementations use such different values. |
Author open sourced theirs! |
I noticed that in your implementation you've clamped the values of the `weight_norm` to a min of `0` and a max of `10`. I have seen this `10` before in other implementations and noticed that it comes from the first version of the LAMB paper. However, this number refers to the `trust_ratio` and not the `weight_norm`. Have you done any further experiments with this, or were you looking at other implementations of the paper and decided to use `10` for that reason? I also implemented LAMB with both v1 and the latest version and I didn't notice a difference. Just wanted to know if you did additional testing or were aware of this issue.
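
To make the distinction concrete, here is a minimal sketch of the three variants mentioned in this thread: clamping the `weight_norm`, clamping the `trust_ratio`, and using the identity function (no clipping). It is plain Python on lists rather than any particular implementation's code. The function name `trust_ratio`, the `mode` strings, and the fallback of `1.0` when either norm is zero are assumptions made for illustration, not taken from this repository or the paper.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def trust_ratio(weights, update, mode="identity", lo=0.0, hi=10.0):
    """Layer-wise ratio ||w|| / ||update|| under one of three clipping modes.

    mode="identity":          no clipping at all
    mode="clamp_weight_norm": clip ||w|| to [lo, hi] before dividing
    mode="clamp_trust_ratio": clip the final ratio to [lo, hi]
    """
    w_norm = l2_norm(weights)
    u_norm = l2_norm(update)

    if mode == "clamp_weight_norm":
        w_norm = clamp(w_norm, lo, hi)

    # Assumed fallback: use a ratio of 1.0 when either norm is zero.
    if w_norm == 0.0 or u_norm == 0.0:
        return 1.0

    ratio = w_norm / u_norm
    if mode == "clamp_trust_ratio":
        ratio = clamp(ratio, lo, hi)
    return ratio


weights = [300.0, 400.0]  # norm 500
update = [3.0, 4.0]       # norm 5

print(trust_ratio(weights, update))                            # 100.0
print(trust_ratio(weights, update, mode="clamp_weight_norm"))  # 2.0
print(trust_ratio(weights, update, mode="clamp_trust_ratio"))  # 10.0
```

With a large weight norm the three modes give clearly different ratios, so whether the bound of `10` is applied to the `weight_norm` or to the `trust_ratio` can matter for layers with large weights. Passing `lo=0.08, hi=0.5` with `mode="clamp_trust_ratio"` would mimic the range quoted for DeepSpeed above, assuming that range is applied to the ratio.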