New language + vision tasks! #5292

jacob-morrison · 2021-06-29T23:31:43Z

jacob-morrison
Jun 29, 2021
Collaborator

Hi everyone!

We’ve added three new language + vision tasks to AllenNLP:

| Visual Genome QA | NLVR^2 | Flickr30k Image Retrieval |
| --- | --- | --- |
| Dataset reader | Dataset reader | Dataset reader |
| Model | Model | Model |
| Training config | Training config | Training config |

These join our previously added implementations for SNLI-VE, VQA and GQA.

Some notes about these models:

Visual Genome QA
- VGQA uses the same model as VQA and GQA. The task can be thought of as a multiple-choice question over the entire vocabulary. As with the other tasks, VGQA’s instances have these fields:
  - Image features
  - Question
  - Answer
NLVR^2
- This task is very similar to SNLI-VE. The main difference is that there are two images instead of one. Our model concatenates the ViLBERT outputs for each of the images paired with the hypothesis, then feeds them through a two-layer multilayer perceptron. Each instance consists of four fields:
  - First image’s features
  - Second image’s features
  - Hypothesis
  - Label
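
To make the NLVR^2 description concrete, here is a minimal PyTorch sketch of a classification head that concatenates two pooled encoder outputs and passes them through a two-layer MLP. This is an illustration and not the AllenNLP implementation: the class name `PairClassifierHead`, the dimensions, and the ReLU activation are all assumptions, and the encoder that produces the pooled vectors is left out.

```python
import torch
from torch import nn


class PairClassifierHead(nn.Module):
    """Hypothetical head: concatenate two pooled vectors, then a two-layer MLP."""

    def __init__(self, pooled_dim: int, hidden_dim: int, num_labels: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            # The input is twice as wide because the two vectors are concatenated.
            nn.Linear(2 * pooled_dim, hidden_dim),
            nn.ReLU(),  # assumed activation; the post does not name one
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, pooled_first: torch.Tensor, pooled_second: torch.Tensor) -> torch.Tensor:
        # Each input has shape (batch, pooled_dim): one encoder output per
        # (image, hypothesis) pair.
        combined = torch.cat([pooled_first, pooled_second], dim=-1)
        return self.mlp(combined)  # logits with shape (batch, num_labels)


# Usage with random stand-ins for the encoder outputs.
head = PairClassifierHead(pooled_dim=8, hidden_dim=16)
logits = head(torch.randn(4, 8), torch.randn(4, 8))
```

The logits can then be trained with a standard cross-entropy loss against the label field.
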
Image Retrieval
- Training and validation instances look different. Our training instances are a 4-way multiple choice:
  - Correct image, correct caption
  - Correct image, random wrong caption
  - Random wrong image, correct caption
  - Hard negative image, correct caption
- Our validation instances consist of all 1,000 validation images paired with the same caption. During validation, we score each image against the given caption, sort the images by score, and check whether the correct image is among the top k.
- Batch size is important. Our model struggled to learn with an effective batch size smaller than 128.
- Calculating hard negatives is expensive. We calculate ours while creating instances in the dataset reader, and it can be quite slow. Caching hard negatives is an easy way to save a lot of time, and we provide a cache for the Flickr30k dataset.
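
The validation procedure described above amounts to a top-k check per caption. The following is a small plain-Python sketch of that check, not the code used in AllenNLP; the function names and the tie-handling rule are assumptions made for illustration.

```python
from typing import Dict, Sequence


def rank_of_correct(scores: Sequence[float], correct_index: int) -> int:
    """Return the 1-based rank of the correct image, best score first."""
    correct_score = scores[correct_index]
    # Count images scoring strictly higher. Ties are resolved in favor of
    # the correct image, which is an assumption of this sketch.
    return 1 + sum(1 for s in scores if s > correct_score)


def recall_at_k(
    all_scores: Sequence[Sequence[float]],
    correct_indices: Sequence[int],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Fraction of captions whose correct image lands in the top k."""
    ranks = [rank_of_correct(s, c) for s, c in zip(all_scores, correct_indices)]
    return {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}


# Toy example: three captions, each scored against three candidate images.
toy_scores = [
    [0.1, 0.9, 0.3],  # correct image is index 1 (rank 1)
    [0.8, 0.2, 0.5],  # correct image is index 2 (rank 2)
    [0.4, 0.6, 0.7],  # correct image is index 0 (rank 3)
]
toy_result = recall_at_k(toy_scores, correct_indices=[1, 2, 0], ks=(1, 2, 3))
```

In the real setting each inner list would hold 1,000 scores, one per validation image.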

These are the scores this implementation achieves out of the box:

| Task | Score (accuracy) |
| --- | --- |
| VGQA | 26.5% |
| NLVR^2 | 50.8% |
| Flickr30k IR | 11.8% |
| SNLI-VE | 69.1% |
| VQA | |
| GQA | |

These scores are quite a bit below the state of the art. We believe this is due to our strategy for extracting image features. We’re extracting features using a Faster R-CNN model with a ResNet-50-FPN backbone pre-trained on COCO train2017. Conveniently, this ships with torchvision under the name fasterrcnn_resnet50_fpn, so you probably have it installed already. Unfortunately, these features are not quite good enough to achieve state-of-the-art scores on these datasets. We invite you to improve on this, to see if different features (like these) can help these models match or exceed the scores from the ViLBERT multitask training paper.

epwalsh · 2021-06-29T23:36:24Z

epwalsh
Jun 29, 2021
Maintainer

Great work @jacob-morrison!
