This repository provides the code for our project, which combines kNN-Memory and ClipCap to improve long-range dependency handling. The project builds on the ClipCap and Memorizing Transformers repositories. This work was conducted as part of the academic curriculum for the Deep Learning 2 course at the University of Amsterdam. You can read our comprehensive report here.
The project is structured as follows:
```
├── checkpoints                         (model checkpoints)
├── demos                               (demo notebooks)
├── images                              (images used in the report)
├── logs                                (training logs)
├── src                                 (source code)
│   ├── dataset                         (dataset code, including parsers)
│   ├── evaluation                      (evaluation code for metrics)
│   ├── memorizing_transformers_pytorch (Memorizing Transformers code)
│   ├── models                          (model code for kNN-Memory and ClipCap)
│   ├── generate_captions.py            (generate captions for a dataset)
│   ├── predict.py                      (predict captions for a video)
│   ├── train.py                        (train a model)
│   ├── validate.py                     (validate a model)
│   └── utils.py                        (utility functions)
├── environment.yml                     (conda environment file)
├── requirements.txt                    (pip requirements file)
├── blogpost.md                         (report)
├── pyproject.toml                      (project file)
└── README.md                           (this file)
```
The code is written in Python 3.10. Install the required packages using either `pip install -r requirements.txt`, or create a conda environment from the provided `environment.yml` file using `conda env create -f environment.yml`. Activate the environment using `conda activate knn-memory-clipcap`.
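For convenience, here is the conda-based setup as a single copy-pasteable block:

```bash
# Create the conda environment from the provided file and activate it
conda env create -f environment.yml
conda activate knn-memory-clipcap
```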
Our experiments use the ActivityNet Captions dataset. Use one of the following methods to download it; the first method is recommended, and the others are provided only for full reproducibility.
- To download the pre-processed video clips, run:

  ```bash
  cd src/data/
  wget "https://drive.google.com/u/0/uc?id=1fhZc7yM4Xja7rixz7hBLspPsYbaEQBYm&export=download&confirm=t" -O activitynet_ViT-B_32_train_first_2000.pkl
  wget "https://drive.google.com/u/0/uc?id=1vliDDQxoSdrl5ZaJ-9DZBBEc8cQYwztA&export=download&confirm=t" -O activitynet_ViT-B_32_dev_first_250.pkl
  wget "https://drive.google.com/u/0/uc?id=1C2qaf3xBXwfr-LDfygnO8GK-12DCuxxn&export=download&confirm=t" -O activitynet_ViT-B_32_validation_first_500.pkl
  wget "https://drive.google.com/u/0/uc?id=1KHAXlNhp3GoXyh1mmLCr4iuktqez92F8&export=download&confirm=t" -O activitynet_ViT-B_32_dev_all_67.pkl
  wget "https://drive.google.com/u/0/uc?id=1rsQgeIveEXyFqVicBFMmNaaZa4VO7jWZ&export=download&confirm=t" -O activitynet_ViT-B_32_train_all_540.pkl
  wget "https://drive.google.com/u/0/uc?id=18MK9omT8qNfuW69KL_WZwP2PSBhMrYdV&export=download&confirm=t" -O activitynet_ViT-B_32_validation_all_133.pkl
  ```

  Instead of `wget`, you can also download the files manually from here. The files should be placed in the `src/data/` folder. The pre-processed COCO dataset can be found there as well.
- If you want to download the entire ActivityNet Captions dataset from scratch, run:

  ```bash
  python3 src/datasets/download_dataset.py
  ```

  WARNING: this will download the entire dataset, which is about 200 GB in size.
To extract frames from the downloaded videos or your own videos, execute:

```bash
python3 src/datasets/extract_frames.py -r <path_to_videos>
```

This command creates a `frames` folder in the videos' parent directory. By default, frames are extracted at 5 fps; to modify this setting, use the `-fps` flag. The script also generates a summary CSV file in the `frames` folder, containing the video ID, frame extraction success status, and number of frames extracted.
To pre-process the dataset, run:

```bash
python3 src/dataset/parsers/parse_activitynet.py --split <split>
```

Other arguments are available; see `python3 src/dataset/parsers/parse_activitynet.py --help` for more information.
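For example, to pre-process several splits in one go (assuming the split names mirror the downloaded `.pkl` files, i.e. `train`, `dev`, and `validation`):

```bash
# Pre-process each split; the split names are an assumption based on the file names above
for split in train dev validation; do
    python3 src/dataset/parsers/parse_activitynet.py --split "$split"
done
```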
Captions for a video can be generated using the demo notebook found in `notebooks/demo.ipynb`.
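Assuming Jupyter is available in your environment, the notebook can be opened with:

```bash
jupyter notebook notebooks/demo.ipynb
```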
To train a model, run:

```bash
python src/train.py --train_path activitynet_ViT-B_32_train_first_2000.pkl --valid_path activitynet_ViT-B_32_dev_first_250.pkl --checkpoint checkpoints/coco/coco_prefix-best.pt --prefix activitynet_with_memory --only_prefix --use_video_dataset --use_memory
```

Use the `--use_memory` flag to enable kNN-Memory and the `--use_video_dataset` flag to use the video dataset. Additionally, the `--only_prefix` flag can be used to train only the prefix model. The full argument list is available using `python src/train.py --help`.
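For comparison, a baseline without kNN-Memory can presumably be trained by dropping the `--use_memory` flag and choosing a different `--prefix` (the `activitynet_without_memory` name below is only illustrative):

```bash
python src/train.py \
    --train_path activitynet_ViT-B_32_train_first_2000.pkl \
    --valid_path activitynet_ViT-B_32_dev_first_250.pkl \
    --checkpoint checkpoints/coco/coco_prefix-best.pt \
    --prefix activitynet_without_memory \
    --only_prefix \
    --use_video_dataset
```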
To evaluate a model, run:

```bash
python src/validate.py --data src/data/ --checkpoint checkpoints/activitynet_with_memory-best.pt --only_prefix --use_video_dataset --use_memory
```

The full argument list is available using `python src/validate.py --help`.
To generate captions for a dataset, run:

```bash
python src/generate_captions.py --data src/data/ --checkpoint checkpoints/activitynet_with_memory-best.pt --only_prefix --use_video_dataset --use_memory
```

This will generate two JSON files that can be used to calculate the evaluation metrics. The full argument list is available using `python src/generate_captions.py --help`.
To calculate the evaluation metrics on previously generated captions, run:

```bash
python src/evaluation/evaluate_captions.py --submission <captions_file>.json --references <reference_file>.json
```

where `<captions_file>.json` is the file generated by `src/generate_captions.py` and `<reference_file>.json` is the file containing the ground-truth captions. Our generated captions can be found in the `organized_data` folder. Running the evaluation requires Java to be installed on your device. The full argument list is available using `python src/evaluation/evaluate_captions.py --help`.
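If you are unsure whether Java is available on your machine, a quick check is:

```bash
java -version
```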
This project is conducted as part of the academic curriculum for the Deep Learning 2 course at the University of Amsterdam. We would like to thank the course staff for their support and feedback.