How to do autoregressive decoding in JAX/Flax? #920
-
Autoregressive decoding is implemented in attention.py, and it is used in the WMT example. Some more explanation from @levskaya, who wrote most of the code:

You only need caching for the self-attention layers in a decoder (not for an encoder, and not for the encoder-decoder attention layers found in a full encoder-decoder transformer model).

Let's say you have a decoder stack (i.e. a language model). If you run a sampler (beam, top-k, etc.) iteratively on the decoder at inference time, you pass the first token through, embed it, run the layers, get logits for the next position, and sample those to choose the next token. Then you stick that token back into the decoder input and run the model again to get the token after that, and so on. If you do this naively, you re-calculate the past keys, values, and attention interactions for all the past tokens again and again, leading to an O(L^2) algorithm that is very wasteful for longer sequences.

So instead of feeding entire length-L arrays through each layer, we only feed a length-1 array through, and we store the previously calculated keys and values at each self-attention layer in stateful "cache" variables. The "query" at each new timestep can then attend to the past cached keys and values, and you avoid the L^2 blowup. In practice, at each layer you also store an integer 'index' encoding the position you're currently at, and you create a similar cached index for the absolute position-encoding layer so it can keep track of the current position in the decoder. So the 'cache' is just a set of zero-initialized arrays for the keys and values at each decoder self-attention layer, whose length dimension is set to the longest possible sequence you plan to generate. A minimal sketch of this layout is shown below.

NB: the cached attention implementation in the linen transformer layers was written to be as simple as possible for readability, but we'll probably update it soon to use a slightly more complicated layout and a "one-hot scatter" pattern that performs much faster on TPUs.
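To make the cache layout concrete, here is a minimal, self-contained sketch in plain JAX (not Flax's actual attention.py implementation): a per-layer cache of zero-initialized key/value arrays sized to the maximum decode length, plus an integer index, updated one step at a time. The shapes, the single-example layout, and the names `init_cache` / `cached_attention_step` are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def init_cache(max_len, num_heads, head_dim, dtype=jnp.float32):
    # Zero-initialized key/value arrays sized to the longest sequence we
    # plan to generate, plus an index tracking the current position.
    return {
        "key": jnp.zeros((max_len, num_heads, head_dim), dtype),
        "value": jnp.zeros((max_len, num_heads, head_dim), dtype),
        "index": jnp.zeros((), jnp.int32),
    }

def cached_attention_step(cache, query, key, value):
    """One decode step for a single self-attention layer.

    query/key/value: [num_heads, head_dim] for the current token only.
    Writes this step's key/value into the cache at the current index,
    then lets the single query attend to all cached positions so far.
    """
    i = cache["index"]
    # Insert the new key/value at position i; all shapes stay static,
    # so a jitted decode step compiles only once.
    keys = jax.lax.dynamic_update_slice(cache["key"], key[None], (i, 0, 0))
    values = jax.lax.dynamic_update_slice(cache["value"], value[None], (i, 0, 0))
    # Mask out cache slots that have not been written yet.
    valid = jnp.arange(keys.shape[0]) <= i                          # [max_len]
    logits = jnp.einsum("hd,lhd->hl", query, keys) / jnp.sqrt(query.shape[-1])
    logits = jnp.where(valid[None, :], logits, -1e9)
    weights = jax.nn.softmax(logits, axis=-1)
    out = jnp.einsum("hl,lhd->hd", weights, values)                 # [num_heads, head_dim]
    new_cache = {"key": keys, "value": values, "index": i + 1}
    return new_cache, out
```

In Flax itself, the equivalent arrays live in a mutable "cache" variable collection inside each attention module rather than being threaded around by hand as a dict.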
-
Hi, this is a question about the decoding in the lm1b example: each new token attends to all the positions up to the max decode length, with the future tokens masked out using a boolean mask. I'm guessing that the self-attention in Flax is more optimized for TPU, but what would be a good way to handle this on CPU/GPU?
-
Just to shed some more light on what @jheek and @levskaya are pointing out: the attention function will be jit-compiled for each input shape. So if you implement attention with a dynamically increasing index (and correspondingly growing arrays), the attention function will end up being jit-compiled once for every sequence length. In terms of computation cost, that means you spend a lot of time compiling attention functions, which is infeasible in practice.
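A tiny illustration of this recompilation issue (hypothetical code, not from the example): `jax.jit` specializes on array shapes, so a prefix that grows by one token per step triggers a fresh trace and XLA compile for every length.

```python
import jax
import jax.numpy as jnp

@jax.jit
def score_prefix(prefix):  # prefix: [length, d_model]
    # This print runs at trace time, i.e. once per distinct input shape.
    print("tracing for shape", prefix.shape)
    return prefix.sum()

for length in (1, 2, 3):
    score_prefix(jnp.ones((length, 8)))  # re-traces (and recompiles) 3 times
```

With a fixed-size cache plus an index, as in the sketch above, every decode step sees identical shapes, so the step function is compiled exactly once.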
-
Here's a notebook with some benchmarking. These are running times for an overly simplified encoder-decoder Transformer (3 layers) on CPU:
Here, the … Notably, the compile time is indeed inverse to …

Remaining work: …
-
One thing to be aware of, especially for XLA:TPU but also XLA:GPU, is that although the high-level IR of XLA (HLO) offers only a static-size API, when XLA lowers to machine code it does quite a few transformations that can internally do smart, dynamic things. For instance, you can write "one-hot" gathers and scatters using dots that look horrendous from a compute/memory perspective, yet are optimized into efficient code on TPUs. I'm not sure XLA will do the optimal thing in this particular case, but it might not be as bad as expected. In any case, it's useful to measure things directly on TPU: you can't reason about what's going on the way you might with low-level CPU programming, and the performance cliffs live in very different regimes and places on TPUs than on CPUs or even GPUs.
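As a rough sketch of what such a "one-hot scatter" can look like (an assumed illustration, not Flax's actual code): instead of an in-place update at position `i`, you build a one-hot vector over the length dimension and mix the new row in with dense element-wise math, which XLA:TPU can often lower to an efficient update even though it nominally touches the whole cache.

```python
import jax.numpy as jnp

def one_hot_scatter(cache, new_value, i):
    """Write new_value into row i of cache, expressed without a scatter op.

    cache: [max_len, d], new_value: [d], i: int32 scalar (possibly traced).
    """
    one_hot = (jnp.arange(cache.shape[0]) == i).astype(cache.dtype)  # [max_len]
    # Zero out row i, then add the new row; everything is dense arithmetic.
    return cache * (1.0 - one_hot)[:, None] + one_hot[:, None] * new_value[None, :]

cache = jnp.zeros((16, 4))
cache = one_hot_scatter(cache, jnp.ones(4), jnp.int32(2))
```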