Replies: 9 comments 21 replies
-
I found the
-
On the TrainerCallback changes, there is one more thing to consider: previously, an epoch callback was fired with the argument
-
@epwalsh Everything works well except logging: I found some strange output in the log when I tried to use
I know it was caused by the multiple workers, and I could set it to true to avoid this warning:
I wonder whether it would affect our data_loader?
Maybe a single log message would be better (or clearer)?
-
To anyone else upgrading and wondering what happened to
-
Hey guys. Thanks a lot for this upgrade. It looks extremely promising and way closer to PyTorch. One quick question related to the vision capabilities of AllenNLP: are you planning to release a specific guide for it? I'll start reading the code in the meantime :)
-
First of all, great stuff! I just started using AllenNLP for my research and I am hopeful of great productivity with this well-designed framework. Quick question: do we need to index our Instances with the DataLoader now instead of the DatasetReader? I notice the index_with method has been moved from the DataLoader to the DataReader.
-
Thanks a lot for the guide. It made porting 1.x --> 2.0 mostly painless. One thing that I found tricky was implementing `apply_token_indexers()`:

```python
def apply_token_indexers(self, instance: Instance) -> None:
    for text_field in instance["source"].field_list:
        text_field.token_indexers = self._token_indexers
```

Is this the correct approach (it appears to work in my setup)? It might be worth documenting this somewhere if so!
-
How do I use multiple GPUs?
-
Thanks for the great guide!
However, I still get this error:
-
# Upgrade Guide
For the most part, upgrading your projects from AllenNLP 1.x to AllenNLP 2.0 will be seamless. But we've had to make several breaking changes to the data pipeline for performance and to the trainer callbacks API for simplicity and flexibility.
## The new `MultiProcessDataLoader`

Most notably, we've replaced the `PyTorchDataLoader` with a shiny new high performance `MultiProcessDataLoader`. This `DataLoader` has a very similar API to the old `PyTorchDataLoader`, but is optimized for AllenNLP experiments. It can efficiently scale up with the number of workers and load large datasets lazily faster than ever. It's even able to utilize batch samplers on the fly while loading lazily.

Upgrading your configs to work with the new data loader should be very straightforward, and only require minor changes, if any. The main thing to keep in mind is that laziness is now controlled by the `max_instances_in_memory` parameter of the data loader, and so there is no longer a `lazy` parameter in the `DatasetReader` class.

By default, `max_instances_in_memory` is set to `None`, which means all of the data is loaded in memory up front. But when `max_instances_in_memory` is set to a positive integer, no data will be read until it's actually needed either for vocab creation or training / validating, and then each worker will only load this many instances in memory at a time.
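
As a rough sketch of what that change can look like in a config (the reader name, batch size, and other numbers here are made-up placeholders, not recommended values), a 1.x fragment like this:

```jsonnet
// 1.x: laziness was a property of the dataset reader.
"dataset_reader": {
    "type": "my_reader",   // placeholder reader name
    "lazy": true
},
"data_loader": {
    "batch_size": 32
}
```

becomes something like:

```jsonnet
// 2.0: laziness is controlled by the data loader instead.
"dataset_reader": {
    "type": "my_reader"
},
"data_loader": {
    "batch_size": 32,
    "num_workers": 2,
    "max_instances_in_memory": 1024
}
```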
MultiProcessDataLoader
API docs.Changes to the
DatasetReader
classImplementing a
DatasetReader
subclass still just requires you to implement the_read()
andtext_to_instance()
methods. But if you want to take full advantage of the newMultiProcessDataLoader
by using multiple workers (i.e. by settingnum_workers > 0
), you'll have to make a couple of small changes to yourDatasetReader
subclasses.Integrating with the
MultiProcessDataLoader
First, it's important to understand how the
MultiProcessDataLoader
uses itsDatasetReader
whennum_workers
is greater than 0. Consider this example below, which is a simplified version of what happens within theallennlp train
command:When
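
A rough sketch of such an example; the reader class, data path, and keyword argument values are illustrative placeholders rather than the exact snippet from the original post:

```python
from allennlp.data import Vocabulary
from allennlp.data.data_loaders import MultiProcessDataLoader

# `MyDatasetReader` stands in for your own DatasetReader subclass
# (a sketch of one appears later in this guide).
reader = MyDatasetReader()

loader = MultiProcessDataLoader(
    reader,
    "/path/to/train.jsonl",        # placeholder data path
    batch_size=32,
    num_workers=2,                 # each worker gets its own copy of `reader`
    max_instances_in_memory=1024,  # read lazily, 1024 instances at a time
)

# Building the vocabulary also goes through the data loader now.
vocab = Vocabulary.from_instances(loader.iter_instances())
loader.index_with(vocab)

# Iterating over the loader is what actually spawns the 2 worker processes.
batch_iterator = iter(loader)
for batch in batch_iterator:
    ...  # the trainer runs forward / backward on each tensorized batch
```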
When `batch_iterator` is created, the data loader spawns 2 workers, each of which has a copy of the dataset reader. Each of those workers then essentially just calls `reader.read()`, pulling from this iterator until it has collected `max_instances_in_memory` instances, at which point it batches the instances, turns them into tensors, and then sends them through a `Queue` back to the main process. The workers repeat this until their `reader.read()` iterator is exhausted.

For this to work efficiently when `num_workers` is greater than 1, the dataset readers within each worker need to somehow coordinate with the others to partition the data so that each worker only has to process a subset of the total data. We also have this same problem in distributed training, because different training nodes will each have their own data loader which may be reading from the same data source.

### Sharding data
To solve this, we've made the `DatasetReader` class "aware" of its worker rank and node rank (for distributed training). This info can be accessed via the `get_worker_info()` and `get_distributed_info()` methods, respectively.

You can use this information within the `_read()` method of your `DatasetReader` to manually shard the data across workers, or you can simply utilize the `shard_iterable()` helper method. This method takes any type of iterator and returns another iterator over the same objects, with a filter applied that takes into account the distributed and/or multiprocess data loading context, so that only a unique shard of the original iterator is returned.

For example, suppose the `_read()` method of your 1.x `DatasetReader` looked like this:
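
Something along these lines, say, where `_read()` iterates directly over an open file handle `f` (the file format and the `text_to_instance()` call are illustrative assumptions):

```python
from typing import Iterable

from allennlp.data import Instance

def _read(self, file_path: str) -> Iterable[Instance]:
    # Inside your DatasetReader subclass: one instance per line of the file.
    with open(file_path, "r") as f:
        for line in f:
            yield self.text_to_instance(line.strip())
```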
All you have to do to upgrade to 2.0 is wrap `f` in `self.shard_iterable()` like this:
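
Continuing the sketch above, the only change is the `shard_iterable()` wrapper:

```python
from typing import Iterable

from allennlp.data import Instance

def _read(self, file_path: str) -> Iterable[Instance]:
    with open(file_path, "r") as f:
        # shard_iterable() filters `f` so that each data loader worker (and each
        # distributed node) only sees its own unique slice of the lines.
        for line in self.shard_iterable(f):
            yield self.text_to_instance(line.strip())
```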
The idea and implementation of `shard_iterable()` is pretty simple. When it's invoked on an iterator, it just checks how many other workers are currently reading data and then skips over that many objects in the original iterator before including the next one.

So suppose you are running a distributed training job on 2 GPUs, and each GPU worker node is loading data with 3 workers. Then `shard_iterable()` will know to skip over `5 = 2 * 3 - 1` objects before yielding the next:
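
Here's a plain-Python sketch of that skipping pattern. Which reader ends up with which offset is an implementation detail, but the stride is the total number of readers:

```python
num_nodes, workers_per_node = 2, 3
total_readers = num_nodes * workers_per_node  # 6 readers in total

lines = [f"line {i}" for i in range(12)]
for reader_rank in range(total_readers):
    # Each reader keeps every 6th object, skipping the 5 in between.
    shard = lines[reader_rank::total_readers]
    print(reader_rank, shard)
# 0 ['line 0', 'line 6']
# 1 ['line 1', 'line 7']
# ...
```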
Now, whether you use `shard_iterable()` or manually implement the sharding logic yourself, you'll need to make sure AllenNLP is aware that your dataset reader is handling the sharding logic. Otherwise you'll get an error if you try to use `num_workers > 0`. So you have to make sure to call `super().__init__()` with `manual_distributed_sharding` and `manual_multiprocess_sharding` set to `True`:
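
For example (the registered name `"my_reader"` and the class itself are placeholders):

```python
from allennlp.data import DatasetReader


@DatasetReader.register("my_reader")  # placeholder name
class MyDatasetReader(DatasetReader):
    def __init__(self, **kwargs) -> None:
        # Tell AllenNLP that this reader handles its own sharding via
        # shard_iterable() (or manual logic) in _read().
        super().__init__(
            manual_distributed_sharding=True,
            manual_multiprocess_sharding=True,
            **kwargs,
        )
```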
### Keeping `Instance` objects small

In AllenNLP 2.0, data loaders are also used to create an iterator of `Instance` objects for building a vocabulary. This iterator is created using the `iter_instances()` method.

With the `MultiProcessDataLoader`, `iter_instances()` looks a lot like `__iter__()` under the hood, in that it spawns workers that each read instances with their own copy of the dataset reader. But instead of sending batched tensors back to the main process through the `Queue`, they just send the `Instance` objects.

For this to work efficiently, `Instance` objects need to be lightweight so that they can be quickly serialized by each worker and deserialized by the main process. For the most part, `Instance` fields just contain data, which is good. But there is one notable exception with text fields.

Just like in AllenNLP 1.x, text fields need token indexers before they can be indexed. And token indexers can be quite big, like the `PretrainedTransformerIndexer`. So obviously we want to avoid serializing and deserializing the token indexers for every `Instance` that we create.

We solved this in AllenNLP 2.0 by allowing text fields to be created without initially assigning any token indexers. The token indexers are then assigned later through the `apply_token_indexers()` method, which is only ever called by the main process.

So, for example, if your `text_to_instance()` method looked like this before:
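
Something like the following sketch, where the token indexers are handed straight to the `TextField` (the tokenizer attribute and the single `"source"` field are illustrative assumptions):

```python
from allennlp.data import Instance
from allennlp.data.fields import TextField

def text_to_instance(self, text: str) -> Instance:
    tokens = self._tokenizer.tokenize(text)
    # 1.x: the (possibly very large) indexers travel with every instance.
    source_field = TextField(tokens, self._token_indexers)
    return Instance({"source": source_field})
```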
Then in 2.0, you need to leave out the token indexers in the "source" text field and instead assign them in the `apply_token_indexers()` method:
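
A corresponding 2.0 sketch, under the same assumptions as above:

```python
from allennlp.data import Instance
from allennlp.data.fields import TextField

def text_to_instance(self, text: str) -> Instance:
    tokens = self._tokenizer.tokenize(text)
    # 2.0: no indexers here, so the Instance stays cheap to (de)serialize.
    source_field = TextField(tokens)
    return Instance({"source": source_field})

def apply_token_indexers(self, instance: Instance) -> None:
    # Called in the main process to attach the indexers after the Instance
    # has been sent back from a worker.
    instance["source"].token_indexers = self._token_indexers
```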
### Summary

- If you want to use `num_workers > 0`, wrap the `shard_iterable()` method around the raw data iterator in your `_read()` method, and then put `super().__init__(manual_distributed_sharding=True, manual_multiprocess_sharding=True, **kwargs)` in your `__init__()` implementation.
- If your `text_to_instance` method adds any `TextField` objects to the instances it creates, make sure to leave the `token_indexers` parameter of each `TextField` unspecified and implement `apply_token_indexers()` to assign the right `token_indexers` to each `TextField`.

For more information, see the `DatasetReader` API docs.

## Trainer callbacks
In AllenNLP 1.x, we had `EpochCallback` and `BatchCallback`. These have been consolidated into one `TrainerCallback` that receives calls like `on_start()`, `on_batch()`, `on_epoch()`, and `on_end()`. If you have an existing callback, it should be fairly easy to convert it to this new format. The parameters it takes are almost exactly the same. The exact definition of the `TrainerCallback` class is here: `allennlp/allennlp/training/trainer.py`, line 104 (at commit `67fa291`).
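
As a sketch of what a converted callback might look like (the registered name is a placeholder, and the method signatures below lean on `**kwargs` rather than spelling out every parameter, so double-check them against the `TrainerCallback` definition linked above):

```python
from allennlp.training.trainer import TrainerCallback


@TrainerCallback.register("print-progress")  # placeholder name
class PrintProgressCallback(TrainerCallback):
    def on_start(self, trainer, **kwargs) -> None:
        print("Training is starting")

    def on_epoch(self, trainer, metrics, epoch, **kwargs) -> None:
        print(f"Finished epoch {epoch}: {metrics}")

    def on_end(self, trainer, **kwargs) -> None:
        print("Training is done")
```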
For the same reason, the trainer no longer takes the `batch_callbacks`, `epoch_callbacks`, and `end_callbacks` parameters. These are all handled by the single `callbacks` parameter now. This change is reflected in the `GradientDescentTrainer`, and therefore in the configuration files as well.

## TensorBoard logging
Speaking of trainer callbacks, the TensorBoard functionality has been moved to a callback: `TensorBoardCallback`. So if the `"trainer"` part of your config looked like this for 1.x experiments:
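
The snippets below are a sketch rather than a copy of a real config; the surrounding trainer options and the particular TensorBoard settings (`summary_interval`, `should_log_learning_rate`) are illustrative assumptions, and the exact keys may differ slightly between AllenNLP versions.

```jsonnet
"trainer": {
    "num_epochs": 20,
    "optimizer": {"type": "adam"},
    // 1.x: TensorBoard options sat directly on the trainer.
    "tensorboard_writer": {
        "summary_interval": 100,
        "should_log_learning_rate": true
    }
}
```

It will now look like this:

```jsonnet
"trainer": {
    "num_epochs": 20,
    "optimizer": {"type": "adam"},
    // 2.0: TensorBoard is just one of the trainer's callbacks.
    "callbacks": [
        {
            "type": "tensorboard",
            "summary_interval": 100,
            "should_log_learning_rate": true
        }
    ]
}
```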
Are we missing anything? Please comment below if you have any questions!