Replies: 9 comments 21 replies
-
I found the
-
On the TrainerCallback changes, there is one more thing to consider: previously, an epoch callback was fired with the argument
-
@epwalsh Everything works well except logging: I found some strange output in the log when I tried to use
I know it was caused by the multiple workers, and I could set it to true to avoid this warning:
I wonder whether it would affect our data_loader?
Maybe a single log message would be better (or clearer)?
-
To anyone else upgrading and wondering what happened to
-
Hey guys. Thanks a lot for this upgrade. It looks extremely promising and way closer to PyTorch. One quick question related to the vision capabilities of AllenNLP: are you planning to release a specific guide for it? I'll start reading the code in the meantime :)
-
First of all, great stuff! I just started using AllenNLP for my research and I am hopeful of great productivity with this well-designed framework. Quick question: do we need to index our Instances with the DataLoader now instead of the DatasetReader? I notice the index_with method has been moved from the DataLoader to the DataReader.
-
Thanks a lot for the guide. It made porting 1.x --> 2.0 mostly painless. One thing that I found tricky was implementing `apply_token_indexers()`:

```python
def apply_token_indexers(self, instance: Instance) -> None:
    for text_field in instance["source"].field_list:
        text_field.token_indexers = self._token_indexers
```

Is this the correct approach (it appears to work in my setup)? It might be worth documenting this somewhere if so!
-
How do I use multiple GPUs?
-
Thanks for the great guide!
However, I still get this error:
-
# Upgrade Guide
For the most part, upgrading your projects from AllenNLP 1.x to AllenNLP 2.0 will be seamless. But we've had to make several breaking changes to the data pipeline for performance and to the trainer callbacks API for simplicity and flexibility.
## The new `MultiProcessDataLoader`

Most notably, we've replaced the `PyTorchDataLoader` with a shiny new high performance `MultiProcessDataLoader`. This `DataLoader` has a very similar API to the old `PyTorchDataLoader`, but is optimized for AllenNLP experiments. It can efficiently scale up with the number of workers and load large datasets lazily faster than ever. It's even able to utilize batch samplers on the fly while loading lazily.

Upgrading your configs to work with the new data loader should be very straightforward, and only require minor changes, if any. The main thing to keep in mind is that laziness is now controlled by the `max_instances_in_memory` parameter of the data loader, and so there is no longer a `lazy` parameter in the `DatasetReader` class.

By default, `max_instances_in_memory` is set to `None`, which means all of the data is loaded in memory up front. But when `max_instances_in_memory` is set to a positive integer, no data will be read until it's actually needed either for vocab creation or training / validating, and then each worker will only load this many instances in memory at a time.
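
As a rough sketch of what that change can look like in a config (the reader name, batch size, and other numbers here are made-up placeholders, not recommended values), a 1.x fragment like this:

```jsonnet
// 1.x: laziness was a property of the dataset reader.
"dataset_reader": {
    "type": "my_reader",   // placeholder reader name
    "lazy": true
},
"data_loader": {
    "batch_size": 32
}
```

becomes something like:

```jsonnet
// 2.0: laziness is controlled by the data loader instead.
"dataset_reader": {
    "type": "my_reader"
},
"data_loader": {
    "batch_size": 32,
    "num_workers": 2,
    "max_instances_in_memory": 1024
}
```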
MultiProcessDataLoader
API docs.Changes to the
DatasetReader
classImplementing a
DatasetReader
subclass still just requires you to implement the_read()
andtext_to_instance()
methods. But if you want to take full advantage of the newMultiProcessDataLoader
by using multiple workers (i.e. by settingnum_workers > 0
), you'll have to make a couple of small changes to yourDatasetReader
subclasses.Integrating with the
MultiProcessDataLoader
First, it's important to understand how the
MultiProcessDataLoader
uses itsDatasetReader
whennum_workers
is greater than 0. Consider this example below, which is a simplified version of what happens within theallennlp train
command:When
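
A rough sketch of such an example; the reader class, data path, and keyword argument values are illustrative placeholders rather than the exact snippet from the original post:

```python
from allennlp.data import Vocabulary
from allennlp.data.data_loaders import MultiProcessDataLoader

# `MyDatasetReader` stands in for your own DatasetReader subclass
# (a sketch of one appears later in this guide).
reader = MyDatasetReader()

loader = MultiProcessDataLoader(
    reader,
    "/path/to/train.jsonl",        # placeholder data path
    batch_size=32,
    num_workers=2,                 # each worker gets its own copy of `reader`
    max_instances_in_memory=1024,  # read lazily, 1024 instances at a time
)

# Building the vocabulary also goes through the data loader now.
vocab = Vocabulary.from_instances(loader.iter_instances())
loader.index_with(vocab)

# Iterating over the loader is what actually spawns the 2 worker processes.
batch_iterator = iter(loader)
for batch in batch_iterator:
    ...  # the trainer runs forward / backward on each tensorized batch
```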
When `batch_iterator` is created, the data loader spawns 2 workers, each of which has a copy of the dataset reader. Each of those workers then essentially just calls `reader.read()`, pulling from this iterator until it has collected `max_instances_in_memory` instances, at which point it batches the instances, turns them into tensors, and then sends them through a `Queue` back to the main process. The workers repeat this until their `reader.read()` iterator is exhausted.

For this to work efficiently when `num_workers` is greater than 1, the dataset readers within each worker need to somehow coordinate with the others to partition the data so that each worker only has to process a subset of the total data. We also have this same problem in distributed training, because different training nodes will each have their own data loader which may be reading from the same data source.

### Sharding data
To solve this, we've made the `DatasetReader` class "aware" of its worker rank and node rank (for distributed training). This info can be accessed via the `get_worker_info()` and `get_distributed_info()` methods, respectively.

You can use this information within the `_read()` method of your `DatasetReader` to manually shard the data across workers, or you can simply utilize the `shard_iterable()` helper method. This method takes any type of iterator and returns another iterator over the same objects, with a filter applied that takes into account the distributed and/or multiprocess data loading context, so that only a unique shard of the original iterator is returned.

For example, suppose the `_read()` method of your 1.x `DatasetReader` looked like this:
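
Something along these lines, say, where `_read()` iterates directly over an open file handle `f` (the file format and the `text_to_instance()` call are illustrative assumptions):

```python
from typing import Iterable

from allennlp.data import Instance

def _read(self, file_path: str) -> Iterable[Instance]:
    # Inside your DatasetReader subclass: one instance per line of the file.
    with open(file_path, "r") as f:
        for line in f:
            yield self.text_to_instance(line.strip())
```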
All you have to do to upgrade to 2.0 is wrap `f` in `self.shard_iterable()` like this:
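
Continuing the sketch above, the only change is the `shard_iterable()` wrapper:

```python
from typing import Iterable

from allennlp.data import Instance

def _read(self, file_path: str) -> Iterable[Instance]:
    with open(file_path, "r") as f:
        # shard_iterable() filters `f` so that each data loader worker (and each
        # distributed node) only sees its own unique slice of the lines.
        for line in self.shard_iterable(f):
            yield self.text_to_instance(line.strip())
```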
The idea and implementation of `shard_iterable()` is pretty simple. When it's invoked on an iterator, it just checks how many other workers are currently reading data and then skips over that many objects in the original iterator before including the next one.

So suppose you are running a distributed training job on 2 GPUs, and each GPU worker node is loading data with 3 workers. Then `shard_iterable()` will know to skip over `5 = 2 * 3 - 1` objects before yielding the next:
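
Here's a plain-Python sketch of that skipping pattern. Which reader ends up with which offset is an implementation detail, but the stride is the total number of readers:

```python
num_nodes, workers_per_node = 2, 3
total_readers = num_nodes * workers_per_node  # 6 readers in total

lines = [f"line {i}" for i in range(12)]
for reader_rank in range(total_readers):
    # Each reader keeps every 6th object, skipping the 5 in between.
    shard = lines[reader_rank::total_readers]
    print(reader_rank, shard)
# 0 ['line 0', 'line 6']
# 1 ['line 1', 'line 7']
# ...
```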
Now, whether you use `shard_iterable()` or manually implement the sharding logic yourself, you'll need to make sure AllenNLP is aware that your dataset reader is handling the sharding logic. Otherwise you'll get an error if you try to use `num_workers > 0`. So you have to make sure to call `super().__init__()` with `manual_distributed_sharding` and `manual_multiprocess_sharding` set to `True`:
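
For example (the registered name `"my_reader"` and the class itself are placeholders):

```python
from allennlp.data import DatasetReader


@DatasetReader.register("my_reader")  # placeholder name
class MyDatasetReader(DatasetReader):
    def __init__(self, **kwargs) -> None:
        # Tell AllenNLP that this reader handles its own sharding via
        # shard_iterable() (or manual logic) in _read().
        super().__init__(
            manual_distributed_sharding=True,
            manual_multiprocess_sharding=True,
            **kwargs,
        )
```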
### Keeping `Instance` objects small

In AllenNLP 2.0, data loaders are also used to create an iterator of `Instance` objects for building a vocabulary. This iterator is created using the `iter_instances()` method.

With the `MultiProcessDataLoader`, `iter_instances()` looks a lot like `__iter__()` under the hood, in that it spawns workers that each read instances with their own copy of the dataset reader. But instead of sending batched tensors back to the main process through the `Queue`, they just send the `Instance` objects.

For this to work efficiently, `Instance` objects need to be lightweight so that they can be quickly serialized by each worker and deserialized by the main process. For the most part, `Instance` fields just contain data, which is good. But there is one notable exception with text fields.

Just like in AllenNLP 1.x, text fields need token indexers before they can be indexed. And token indexers can be quite big, like the `PretrainedTransformerIndexer`. So obviously we want to avoid serializing and deserializing the token indexers for every `Instance` that we create.

We solved this in AllenNLP 2.0 by allowing text fields to be created without initially assigning any token indexers. The token indexers are then assigned later through the `apply_token_indexers()` method, which is only ever called by the main process.

So, for example, if your `text_to_instance()` method looked like this before:
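
Something like the following sketch, where the token indexers are handed straight to the `TextField` (the tokenizer attribute and the single `"source"` field are illustrative assumptions):

```python
from allennlp.data import Instance
from allennlp.data.fields import TextField

def text_to_instance(self, text: str) -> Instance:
    tokens = self._tokenizer.tokenize(text)
    # 1.x: the (possibly very large) indexers travel with every instance.
    source_field = TextField(tokens, self._token_indexers)
    return Instance({"source": source_field})
```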
Then in 2.0, you need to leave out the token indexers in the "source" text field and instead assign them in the `apply_token_indexers()` method:
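
A corresponding 2.0 sketch, under the same assumptions as above:

```python
from allennlp.data import Instance
from allennlp.data.fields import TextField

def text_to_instance(self, text: str) -> Instance:
    tokens = self._tokenizer.tokenize(text)
    # 2.0: no indexers here, so the Instance stays cheap to (de)serialize.
    source_field = TextField(tokens)
    return Instance({"source": source_field})

def apply_token_indexers(self, instance: Instance) -> None:
    # Called in the main process to attach the indexers after the Instance
    # has been sent back from a worker.
    instance["source"].token_indexers = self._token_indexers
```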
### Summary

- If you want to use `num_workers > 0`, wrap the `shard_iterable()` method around the raw data iterator in your `_read()` method, and then put `super().__init__(manual_distributed_sharding=True, manual_multiprocess_sharding=True, **kwargs)` in your `__init__()` implementation.
- If your `text_to_instance` method adds any `TextField` objects to the instances it creates, make sure to leave the `token_indexers` parameter of each `TextField` unspecified and implement `apply_token_indexers()` to assign the right `token_indexers` to each `TextField`.

For more information, see the `DatasetReader` API docs.

## Trainer callbacks
In AllenNLP 1.x, we had `EpochCallback` and `BatchCallback`. These have been consolidated into one `TrainerCallback` that receives calls like `on_start()`, `on_batch()`, `on_epoch()`, and `on_end()`. If you have an existing callback, it should be fairly easy to convert it to this new format. The parameters it takes are almost exactly the same. The exact definition of the `TrainerCallback` class is here: `allennlp/allennlp/training/trainer.py`, line 104 (at commit `67fa291`).
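
As a sketch of what a converted callback might look like (the registered name is a placeholder, and the method signatures below lean on `**kwargs` rather than spelling out every parameter, so double-check them against the `TrainerCallback` definition linked above):

```python
from allennlp.training.trainer import TrainerCallback


@TrainerCallback.register("print-progress")  # placeholder name
class PrintProgressCallback(TrainerCallback):
    def on_start(self, trainer, **kwargs) -> None:
        print("Training is starting")

    def on_epoch(self, trainer, metrics, epoch, **kwargs) -> None:
        print(f"Finished epoch {epoch}: {metrics}")

    def on_end(self, trainer, **kwargs) -> None:
        print("Training is done")
```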
For the same reason, the trainer no longer takes the `batch_callbacks`, `epoch_callbacks`, and `end_callbacks` parameters. These are all handled by the single `callbacks` parameter now. This change is reflected in the `GradientDescentTrainer`, and therefore in the configuration files as well.

## TensorBoard logging
Speaking of trainer callbacks, the TensorBoard functionality has been moved to a callback: `TensorBoardCallback`. So if the `"trainer"` part of your config looked like this for 1.x experiments:
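
The snippets below are a sketch rather than a copy of a real config; the surrounding trainer options and the particular TensorBoard settings (`summary_interval`, `should_log_learning_rate`) are illustrative assumptions, and the exact keys may differ slightly between AllenNLP versions.

```jsonnet
"trainer": {
    "num_epochs": 20,
    "optimizer": {"type": "adam"},
    // 1.x: TensorBoard options sat directly on the trainer.
    "tensorboard_writer": {
        "summary_interval": 100,
        "should_log_learning_rate": true
    }
}
```

It will now look like this:

```jsonnet
"trainer": {
    "num_epochs": 20,
    "optimizer": {"type": "adam"},
    // 2.0: TensorBoard is just one of the trainer's callbacks.
    "callbacks": [
        {
            "type": "tensorboard",
            "summary_interval": 100,
            "should_log_learning_rate": true
        }
    ]
}
```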
Are we missing anything? Please comment below if you have any questions!