RuntimeError: EncoderDecoder: VisionMambaSeg: shape '[-1, 14, 14, 192]' is invalid for input of size 37824 #108

Open
VickyHuang1113 opened this issue Jul 26, 2024 · 3 comments


VickyHuang1113 commented Jul 26, 2024

I tried to fine-tune the segmentation model using the pretrained Vim-T, but encountered the following issue while executing bash scripts/ft_vim_tiny_upernet.sh:

Position interpolate from 14x14 to 32x32   
Traceback (most recent call last):
  File "/home/vic1113/miniconda3/envs/vim_seg/lib/python3.9/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
    return obj_cls(**args)
  File "/home/vic1113/PrMamba/seg/backbone/vim.py", line 89, in __init__
    self.init_weights(pretrained)
  File "/home/vic1113/PrMamba/seg/backbone/vim.py", line 143, in init_weights
    interpolate_pos_embed(self, state_dict_model)
  File "/home/vic1113/PrMamba/vim/utils.py", line 258, in interpolate_pos_embed
    pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
RuntimeError: shape '[-1, 14, 14, 192]' is invalid for input of size 37824

This error is propagated through multiple functions, resulting in the final error:
RuntimeError: EncoderDecoder: VisionMambaSeg: shape '[-1, 14, 14, 192]' is invalid for input of size 37824.

The pretrained weight I used was vim_t_midclstok_76p1acc.pth, which seems to be the correct one. If it were the wrong one, loading should have failed with an error such as size mismatch for norm_f.weight: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([384]), but I didn't get that error.

So, I guess there might be an issue with the model settings, but I’m not sure. 37824 = (14*14 + 1) * 192, and the "+1" is the part that leads to the error. If the "+1" part is for mid cls token, should I just drop it for the segmentation model?
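To make the arithmetic above concrete, here is a minimal sketch of one way to handle the extra token: split it out of the position embedding before the reshape and interpolate only the 14x14 patch grid. The function name `interpolate_pos_embed_with_extra_token` and the `cls_index` parameter are hypothetical, and the default of placing the extra token in the middle of the sequence is an assumption based on the "midclstok" checkpoint name, not something confirmed by the repository. Whether the segmentation backbone should keep or drop that token depends on how the backbone is built.

```python
import math

import torch
import torch.nn.functional as F


def interpolate_pos_embed_with_extra_token(pos_embed, new_size, cls_index=None):
    """Resize a (1, N + 1, C) position embedding that holds one extra token.

    cls_index is the ASSUMED position of the extra (class) token. It defaults
    to the middle of the sequence, which is a guess, not a confirmed layout.
    """
    _, num_tokens, embed_dim = pos_embed.shape
    num_patches = num_tokens - 1
    orig_size = math.isqrt(num_patches)
    if orig_size * orig_size != num_patches:
        raise ValueError(f"{num_patches} patch tokens do not form a square grid")
    if cls_index is None:
        cls_index = num_patches // 2

    # Split the extra token out so that only patch tokens are reshaped.
    cls_token = pos_embed[:, cls_index:cls_index + 1]
    patch_tokens = torch.cat(
        [pos_embed[:, :cls_index], pos_embed[:, cls_index + 1:]], dim=1
    )

    # (1, N, C) -> (1, C, H, W), resize, then back to (1, N_new, C).
    grid = patch_tokens.reshape(1, orig_size, orig_size, embed_dim).permute(0, 3, 1, 2)
    grid = F.interpolate(
        grid, size=(new_size, new_size), mode="bicubic", align_corners=False
    )
    resized = grid.permute(0, 2, 3, 1).flatten(1, 2)
    return resized, cls_token


# Same element count as in the error message: (14 * 14 + 1) * 192 = 37824.
pos_embed = torch.randn(1, 14 * 14 + 1, 192)
resized, cls_token = interpolate_pos_embed_with_extra_token(pos_embed, new_size=32)
print(resized.shape, cls_token.shape)
```

The returned `cls_token` can be discarded if the segmentation backbone has no class token, or re-inserted into `resized` if it does; the sketch leaves that choice to the caller.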

Has anyone encountered this problem, or successfully fine-tuned a segmentation model?

Thank you very much!

GIT-HYQ commented Jul 30, 2024

Same issue. Have you fixed it yet?

VickyHuang1113 (Author) commented

No, I can't apply the pretrained weights to the segmentation model.
It seems the shapes of the backbones are different, and we might need to retrain it.

TiSgrc2002 commented

Is VisionMambaSeg a pretrained model? I used Vim-small+ (26M 81.6 95.4) as the pretrained model, loaded from https://huggingface.co/hustvl/Vim-small-midclstok. However, there are many mismatches; for example, the downloaded model lacks the checkpoint['meta'] entry used by mmcv.
