-
Are there any native utilities for loading datasets / TensorBoard?
- Processing data:
- TensorBoard:
Would love to help out with porting over some of these utilities to JAX in a functional manner.
-
Conclusion: the costs of rolling your own data and logging solution outweigh the benefits; the most standard solution would probably be to use tools from the Google ecosystem (flax.metrics.tensorboard, tf.data), as evidenced by the Flax examples. It is a bit unfortunate: for instance, mapping tf.data.Dataset (https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) seems like a natural use case for dropping in JAX transformations instead of resorting to TF operations. Sounds familiar! EDIT: See @jheek's answer.
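As a rough sketch of that "Google ecosystem" route (the dataset contents below are invented for illustration): build the input pipeline with tf.data, keep the map stage in TF ops, and hand plain NumPy batches to JAX at the end.

```python
import numpy as np
import tensorflow as tf
import jax.numpy as jnp

# Toy data standing in for a real dataset.
xs = np.random.normal(size=(1024, 8)).astype(np.float32)
ys = np.random.normal(size=(1024, 1)).astype(np.float32)

ds = (
    tf.data.Dataset.from_tensor_slices((xs, ys))
    # The map stage must use TF ops, not JAX transformations --
    # exactly the limitation lamented above.
    .map(lambda x, y: (x * 2.0, y), num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(1024)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

for x_batch, y_batch in ds.as_numpy_iterator():
    # The pipeline yields NumPy arrays, which JAX consumes directly.
    x_batch = jnp.asarray(x_batch)
```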
-
There are some issues with a "JAX native" data loading pipeline. At its core tf.data is like a scheduler with buffers and tasks that run in parallel (map is not vectorizing like jax.vmap but instead parallelizing over CPU threads). Secondly, JAX doesn't support dynamic shapes, and it isn't trivial to handle things like JPEG, audio, and video formats. TF has ops that support all of these natively.

PyTorch has the same issue. It provides a thin wrapper around multiprocessing, which is just another library for scheduling tasks into a pool of threads/processes, but PyTorch itself doesn't know how to parse a JPEG. The big difference is that TF embeds preprocessing into the TF graph, so it's more seamless but less modular compared to PyTorch.

You can also use PyTorch data loaders with JAX. In the end all these pipelines produce NumPy buffers, which can be used with JAX without any overhead. PyTorch data loaders could also use JAX, similar to how you can use PyTorch: for example, after decoding and cropping an image you could call some JAX op to do further preprocessing on it.

For TensorBoard, again there isn't much JAX can do here. TensorBoard just writes NumPy buffers to a file in a ProtoBuf encoding. JAX doesn't have ops to do IO or to encode ProtoBufs. I don't see this as a big issue though: IO and encoding/decoding are very different from the computational ops JAX supports. Mixing them together would make it much harder to reason about JAX and would move away from the modular and functional approach that it uses now.
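A minimal sketch of the PyTorch-loader route described above. The `ToyDataset` and `numpy_collate` names are hypothetical, not from the thread: the loader only schedules and batches, the collate function hands back NumPy arrays instead of torch.Tensors, and a jitted JAX function does the further preprocessing step mentioned in the reply.

```python
import numpy as np
from torch.utils.data import DataLoader, Dataset
import jax
import jax.numpy as jnp

class ToyDataset(Dataset):
    # Hypothetical dataset; a real one would decode JPEGs etc. here.
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        # Return NumPy, not torch.Tensor, so no framework conversion is needed.
        image = np.full((8, 8, 3), idx % 256, dtype=np.float32)
        label = np.float32(idx % 10)
        return image, label

def numpy_collate(batch):
    # Stack samples into NumPy batches instead of torch.Tensors.
    images, labels = zip(*batch)
    return np.stack(images), np.asarray(labels)

# num_workers=0 keeps the sketch runnable anywhere; a real pipeline
# would raise it to parallelize over processes.
loader = DataLoader(ToyDataset(), batch_size=32, num_workers=0,
                    collate_fn=numpy_collate)

@jax.jit
def preprocess(images):
    # A JAX op applied after decoding/cropping, as suggested above.
    return jnp.clip(images / 255.0, 0.0, 1.0)

for images, labels in loader:
    images = preprocess(images)  # NumPy buffers flow into JAX with no overhead
```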
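On the logging side, a small sketch of the "TensorBoard just writes NumPy buffers to a file" point using flax.metrics.tensorboard (the wrapper named in the conclusion above), assuming its SummaryWriter takes a log_dir and exposes a scalar(tag, value, step) method; the log directory and synthetic loss values are made up for illustration.

```python
import numpy as np
from flax.metrics import tensorboard

# JAX does no IO here; the writer serializes NumPy values to event files.
writer = tensorboard.SummaryWriter(log_dir="/tmp/jax_logs")
for step in range(100):
    loss = np.exp(-step / 30.0)  # stand-in for a real training loss
    writer.scalar("train/loss", loss, step)
writer.close()
```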