Question about the framework #5
Comments
Hey! That's correct. Additionally, when training the gate vectors, you need to freeze the entire model, adapters included. At inference, once you stack the gate vectors into a linear layer, make sure to normalize them to account for the fact that these gates were trained independently and can have varying norms. In the paper, we normalized to zero mean and a standard deviation of 1.
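A minimal sketch of that post-hoc stacking step, assuming each gate vector is a tensor of shape `[input_hidden]` trained on one dataset (the function name `build_router` is just illustrative):

```python
import torch

def build_router(gate_vectors):
    """Stack independently trained gate vectors into one routing matrix and
    standardize each column to zero mean / unit std, since the gates were
    trained separately and their norms are not directly comparable."""
    w_gate = torch.stack(gate_vectors, dim=1)                    # [input_hidden, N]
    w_gate = (w_gate - w_gate.mean(dim=0, keepdim=True)) / w_gate.std(dim=0, keepdim=True)
    return w_gate  # used as frozen routing weights at inference
```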
Thank you for your response. There's one thing left: is there any ablation in the paper on training these vectors? You stated that the vectors only need a few iterations (around 100) before you use them. When I try this in my case, specifically a ReID vision task, I see that even with the same hyperparameters used during training, the accuracy increases slowly. How can I determine whether my vectors are correctly learning the routing paths? Also, if you don't mind, could you share your contact information? I'd love to discuss ideas further. Thank you once again!
Hey, for training the gates, there's no concrete objective we used to measure whether the gates are trained properly. In our paper, we trained the gates for 10% of the experts' training steps (which is 100 steps) with all hyperparameters the same as in the experts' training. The output of the sigmoid gates starts at 0.5, so the initial loss should be around the same value as at the end of expert training; make sure to double-check that the loss doesn't go much higher during gate training (if it does, try lower learning rates). As long as it stays around the same value, training for a fixed number of steps should give reasonable gates to use post hoc.
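A rough sketch of that recipe, assuming each transformer block exposes its trainable gate vector through a hypothetical `block.gate` attribute that is zero-initialized (so the sigmoid output starts at 0.5):

```python
import torch

def train_gates(model, loader, loss_fn, expert_lr, expert_steps, device="cuda"):
    # Freeze everything (backbone and adapters), then unfreeze only the gates.
    for p in model.parameters():
        p.requires_grad_(False)
    gates = [blk.gate for blk in model.blocks]       # hypothetical attribute
    for g in gates:
        g.requires_grad_(True)

    opt = torch.optim.AdamW(gates, lr=expert_lr)     # same hyperparameters as the experts
    gate_steps = max(1, expert_steps // 10)          # ~10% of the experts' training steps

    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model(x.to(device)), y.to(device))
        # The loss should stay near its value at the end of expert training;
        # if it climbs noticeably, lower the learning rate.
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step + 1 >= gate_steps:
            break
```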
Thanks a lot for the explanation, Mohammed.
This is a simplified implementation. In the case of ViT, I use `Block` during routing-vector training and `PHATGOOSEBlock` at inference. I'm wondering if I'm missing something in my code:

```python
class Block(nn.Module):
    ...

class PHATGOOSEBlock(nn.Module):
    ...
```
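For comparison, here is a minimal sketch of what the training-time block could look like under the setup discussed above (ViT-Small, hidden size 384, a frozen bottleneck adapter whose output is scaled by a sigmoid gate). The `Adapter` class and all names are illustrative assumptions, not the repository's code:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter; frozen while the gate vector is being trained."""
    def __init__(self, dim=384, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return self.up(torch.relu(self.down(x)))

class Block(nn.Module):
    """ViT block wrapper used during routing-vector training: the pretrained
    block and the expert adapter are frozen, only `self.gate` is trained."""
    def __init__(self, vit_block, adapter, dim=384):
        super().__init__()
        self.vit_block = vit_block                    # frozen pretrained ViT block
        self.adapter = adapter                        # frozen expert adapter
        self.gate = nn.Parameter(torch.zeros(dim))    # zero init -> sigmoid(0) = 0.5

    def forward(self, x):                             # x: [batch, tokens, dim]
        h = self.vit_block(x)
        g = torch.sigmoid(h @ self.gate).unsqueeze(-1)   # per-token gate in (0, 1)
        return h + g * self.adapter(h)
```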
I want to understand the overall framework technically. From my understanding and from reading the paper, you are training vectors of shape [input_hidden, 1] (for example, [384, 1] in the case of ViT-Small) for each dataset over a few iterations. Then, at inference time, you stack them into a matrix of size [384, N], where N is the number of datasets or domains. In this setup, sparse MoE is used, and self.w_gate is replaced with this matrix. If my understanding is wrong, please help me correct it; a sketch of that reading follows below.
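Under that reading, the inference-time block would look roughly like the sketch below: the N standardized gate vectors are stacked into a [384, N] matrix that plays the role of `self.w_gate`, and a sparse top-k mixture over the N frozen expert adapters is computed per token. All names are made up for illustration; this is not the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PHATGOOSEBlock(nn.Module):
    """Inference-time block: stacked, standardized gate vectors act as the
    router (taking the place of self.w_gate) over N frozen expert adapters."""
    def __init__(self, vit_block, adapters, gate_vectors, top_k=2):
        super().__init__()
        self.vit_block = vit_block
        self.adapters = nn.ModuleList(adapters)               # N frozen experts
        w = torch.stack(gate_vectors, dim=1)                  # [dim, N]
        w = (w - w.mean(0, keepdim=True)) / w.std(0, keepdim=True)
        self.register_buffer("w_gate", w)
        self.top_k = top_k

    def forward(self, x):                                     # x: [batch, tokens, dim]
        h = self.vit_block(x)
        logits = h @ self.w_gate                              # [batch, tokens, N]
        top_val, top_idx = logits.topk(self.top_k, dim=-1)
        probs = F.softmax(top_val, dim=-1)                    # renormalize over the top-k
        out = torch.zeros_like(h)
        # Dense compute, sparse combination: every adapter runs on every token,
        # but only the top-k experts contribute to each token's output.
        for e, adapter in enumerate(self.adapters):
            expert_out = adapter(h)                           # [batch, tokens, dim]
            for k in range(self.top_k):
                mask = (top_idx[..., k] == e).unsqueeze(-1).to(h.dtype)
                out = out + mask * probs[..., k:k + 1] * expert_out
        return h + out
```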