[Misc] Add attention sinks #3515
Conversation
Hi @felixzhu555, it is https://arxiv.org/abs/2309.17453, right?
Yep, trying to implement the logic from that paper. Their repo is https://github.com/mit-han-lab/streaming-llm.
We need to ping @rlouf, the person in charge of Outlines, on this PR; it seems that your PR is failing on the guided decoding part.
@felixzhu555 Hi, is this PR still in progress, and should I expect it to be merged?
Hi @hustxiayang, sorry, this PR likely won't get merged; it remains an experimental prototype based on an older version of vLLM. After the ongoing engine refactor is complete, the memory manager in vLLM should become more extensible and attention sinks can be supported more easily, at which point we can probably open a new PR.
@felixzhu555 Thanks a lot for the clarification!
Overview
This PR adds experimental support for attention sinks (#1304), based on this paper and repo. Support is currently limited to RoPE and ALiBi models (e.g. Llama, Mistral/Mixtral, Falcon, Bloom, MPT). The attention sink is hard-coded as the first block of tokens in a sequence.
Usage
Set `use_attention_sinks=True` when instantiating `LLM` or `LLMEngine`, or pass the `--use-attention-sinks` CLI argument. Also set `enforce_eager=True` (attention sinks currently do not work with CUDA graphs), and ensure the attention backend in use is FlashAttention, XFormers, or FlashInfer (WIP).
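For example, a minimal usage sketch (assuming the flag names added in this PR; since it targets an older vLLM, exact arguments may differ):

```python
from vllm import LLM, SamplingParams

# use_attention_sinks is the flag added by this PR; enforce_eager is required
# because attention sinks do not work with CUDA graphs.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    use_attention_sinks=True,
    enforce_eager=True,
)

# Generate past the model's context length; with sinks enabled, generation is
# not forced to stop when the context length is reached.
params = SamplingParams(max_tokens=4096, ignore_eos=True)
outputs = llm.generate(["Once upon a time"], params)
print(outputs[0].outputs[0].text)
```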
Background
Experiments show that the attention mechanism heavily attends to the first few tokens of the sequence being completed, regardless of what those tokens are. Once the sequence length exceeds the context length of a model and we start evicting tokens from the beginning of the KV cache (in a sliding-window fashion), the model will generate garbage (high perplexity).
This is where attention sinks come in. By always preserving the KVs for the first few tokens of the sequence while using a sliding-window approach for the rest of the KV cache, the model can continue to generate sensible output (low perplexity). Theoretically, the model can stream indefinitely, as long as cache eviction is handled properly. Note that the sliding window length is the model's context length.
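As a toy illustration (plain Python, not vLLM code, with a hypothetical `num_sink` parameter; this PR hard-codes the sink as the first block), the set of token positions whose KVs remain visible to attention looks like this:

```python
def visible_positions(seq_len: int, context_len: int, num_sink: int) -> list[int]:
    """Token positions whose KVs are kept once seq_len exceeds context_len."""
    if seq_len <= context_len:
        return list(range(seq_len))
    # Always keep the first num_sink tokens (the attention sink), plus a
    # sliding window of the most recent tokens, so the total stays bounded.
    window = context_len - num_sink
    return list(range(num_sink)) + list(range(seq_len - window, seq_len))

# e.g. with a 2048-token context, a 16-token sink block, and 2100 tokens so far:
print(visible_positions(2100, 2048, 16)[:20])  # [0..15, 68, 69, 70, 71]
```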
Example
Suppose our model's context length is 2048, which equals 128 blocks of 16 tokens. Let's pass in a prompt of 2000 tokens. For the next 48 generated tokens, nothing changes; we end up filling 128 blocks so far.
Normally, vLLM forces generation to stop here since the model's context length has been reached. However, using attention sinks we bypass this stopping condition and keep generating.
At the next decode, we are writing the 2049th token to the cache and computing the 2050th token (1-based indexing). Here, we edit the block table to be `[block_table[0]] + block_table[2:]`, where we effectively ignore the 2nd block while retaining the 1st block, which is our attention sink. Notice how the block table is still length 128 because the 129th block was just allocated for token 2049. This modified block table is then used in the attention kernel. Every 16th decode that follows will ignore an additional block, but always retain the 1st block as the sink.
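In code, the block-table edit described above amounts to something like the following sketch (a simplification of what the PR does per sequence; `num_ignored` is a hypothetical counter for how many post-sink blocks have been dropped so far):

```python
def sink_block_table(block_table: list[int], num_ignored: int) -> list[int]:
    """Keep the first (sink) block and drop num_ignored blocks after it."""
    return [block_table[0]] + block_table[1 + num_ignored:]

# First decode past the context length: ignore the 2nd block.
# With 129 allocated blocks, the table passed to the kernel stays at 128 entries.
block_table = list(range(129))                # physical block numbers (toy values)
print(len(sink_block_table(block_table, 1)))  # 128
```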
Modifications
This PR adds a `StreamingAttentionSink` layer that computes attention using modified block tables, with the "sink" block concatenated with the remaining sliding-window blocks. In the RoPE case, we always store pre-RoPE keys in the cache, and extra work must be done at every decode to rotate all keys for a sequence based on their new positions in the cache (see the sketch below). Note: due to this extra work, using attention sinks incurs a significant drop in tokens/s for RoPE models (around 50-70% for Llama).

`use_attention_sinks` is now an argument to `LLMEngine`, which passes it to the model runner and injects attention sinks into the model's modules. On every forward call of the model's attention layer, the normal attention logic is replaced by `StreamingAttentionSink` logic.

The scheduler evicts (frees) a block (the "ignored" block) whenever a new block is allocated past the model's context length, such that the total number of used blocks is capped at `max_model_len // block_size`.
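The re-rotation step mentioned above might look roughly like this (an illustrative PyTorch sketch with assumed tensor shapes and a GPT-NeoX-style, non-interleaved RoPE; it is not the PR's exact code):

```python
import torch

def rerotate_keys(pre_rope_keys: torch.Tensor,   # [num_tokens, num_heads, head_dim]
                  cos: torch.Tensor,             # [max_pos, head_dim // 2]
                  sin: torch.Tensor,             # [max_pos, head_dim // 2]
                  new_positions: torch.Tensor,   # [num_tokens], long
                  ) -> torch.Tensor:
    """Apply rotary embeddings to cached pre-RoPE keys using the tokens'
    *new* positions in the cache after the block table has been edited."""
    c = cos[new_positions].unsqueeze(1)  # [num_tokens, 1, head_dim // 2]
    s = sin[new_positions].unsqueeze(1)
    x1, x2 = pre_rope_keys.chunk(2, dim=-1)
    return torch.cat([x1 * c - x2 * s, x2 * c + x1 * s], dim=-1)
```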
.Future Work
- `StreamingAttentionSink` assumes only 1 token is generated at every decode.
- `StreamingAttentionSink` directly edits the block table for every decode (past the context length), so the hash table for prefix caching cannot currently be used.