Questions about CLIP-ViT-L/14@336px #32

Wgkai · 2024-11-13T09:27:47Z

Thank you for your amazing work. I searched online and found that the CLIP-ViT-L/14@336px model divides an image into 14*14=196 patches, with an embedding dimension of 768. In your work, the shape of the features after the CLIP visual encoder is (576, 1024). Where does that shape come from?

jzhang38 · 2024-11-20T20:25:35Z

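It is (336/14) ** 2 = 576 patches.
The number 14 refers to the patch size in pixels, not the number of patches along each dimension.
The 1024 is the hidden width of the ViT-L vision transformer; 768 is the dimension of the final projected image embedding.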
