Hi, I need some clarification about the situations this method is meant to be applied to.
Am I right that this method is best used when the teacher model is much more complex than the student model? In that case we could get comparable accuracy from a student model with fewer parameters.
Or must the student and teacher use the same architecture? If so, I don't understand why we wouldn't just fine-tune the pre-trained teacher network directly.
We only experimented with student models that have the same number of parameters as the teacher models. The method should also work for scenarios with different teacher and student architectures.
The baseline does exactly that: it directly fine-tunes the pre-trained teacher model. We showed that training the student model with our method works much better than this baseline.
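For readers unfamiliar with this setup, here is a minimal sketch of feature-map distillation with a frozen teacher and a trainable student of the same architecture. The toy backbone, the layer-normalized targets, and the smooth-L1 objective are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_backbone():
    # Stand-in backbone; in practice this would be the pre-trained teacher
    # network and an identically sized student.
    return nn.Sequential(
        nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.GELU(),
        nn.Conv2d(64, 128, 3, stride=2, padding=1),
    )

teacher = make_backbone()  # pre-trained weights would be loaded here
student = make_backbone()

teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distill_step(images):
    with torch.no_grad():
        t_feat = teacher(images)                 # (B, C, H, W) teacher features
    s_feat = student(images)                     # (B, C, H, W) student features

    # Flatten to token form and normalize the teacher targets
    # (one common choice; the paper may normalize differently).
    t_tok = t_feat.flatten(2).transpose(1, 2)    # (B, H*W, C)
    s_tok = s_feat.flatten(2).transpose(1, 2)
    t_tok = F.layer_norm(t_tok, t_tok.shape[-1:])

    loss = F.smooth_l1_loss(s_tok, t_tok)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```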
@ancientmooner Hi, a small question regarding your answer: if the output feature map sizes of the teacher and student are not the same, how can the feature maps be distilled in your method?
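The authors would know the exact answer, but a common workaround when teacher and student feature maps differ in shape is a small adapter on the student side: a 1x1 convolution to match channels plus interpolation to match spatial size, applied before the distillation loss. The sketch below is an assumption about how such an adapter could look, not the method's actual handling of this case.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """Hypothetical adapter: maps student features to the teacher's
    channel count and spatial resolution before the distillation loss."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, s_feat, teacher_hw):
        s_feat = self.proj(s_feat)                      # match channel count
        return F.interpolate(s_feat, size=teacher_hw,   # match H x W
                             mode="bilinear", align_corners=False)

# Example: student outputs (B, 96, 28, 28), teacher outputs (B, 128, 14, 14).
adapter = FeatureAdapter(student_channels=96, teacher_channels=128)
s_feat = torch.randn(2, 96, 28, 28)
t_feat = torch.randn(2, 128, 14, 14)
loss = F.smooth_l1_loss(adapter(s_feat, t_feat.shape[-2:]), t_feat)
```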