Hey,
I am Zhiqiu Lin, a final-year PhD student at Carnegie Mellon University working with Prof. Deva Ramanan. I came across your work and was impressed by its great performance! Congratulations!
I also wanted to share NaturalBench (NeurIPS'24 D&B) if you are looking for more reliable vision-language benchmarks:
NaturalBench (https://linzhiqiu.github.io/papers/naturalbench/) is a vision-centric VQA benchmark that challenges vision-language models with pairs of simple questions about natural imagery. Unlike prior VQA benchmarks (like MME and ScienceQA), which blind language models (e.g., GPT-3.5) can solve, NaturalBench ensures such shortcuts won’t work. We evaluated 53 state-of-the-art models, and even top models like GPT-4o and Qwen2-VL fall 50%-70% short of human accuracy (90%+), revealing significant room for improvement.
We also found that current models show strong answer biases, such as favoring “Yes” over “No” regardless of the input. Correcting these biases can boost performance by 2-3x, even for GPT-4o, making NaturalBench a valuable testbed for future debiasing techniques.
Check out my Twitter post about it here: https://x.com/ZhiqiuLin/status/1848454555341885808.
🚀 Start using NaturalBench: https://github.com/Baiqi-Li/NaturalBench
Best,
Zhiqiu