
Enable split rotary fusion for batched model #42

Merged

Conversation

masahi (Member) commented Oct 31, 2023

This is based on the recent work of @Lunderberg that enabled split-rotary fusion for single-sequence prefill. The same technique can be used to enable fusion for both prefill / decode in our batched model. This optimization can improve perf by reducing kernel launches and runtime memory allocations (thus more cache blocks can be allocated).
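To illustrate the idea (this is a hand-written NumPy sketch, not the actual TVM kernels produced by the fusion pass): the QKV projection emits one combined tensor, and without fusion the runtime must materialize the split Q/K buffers and launch separate rotary kernels for each. The fused form reads the combined tensor and writes the rotated Q/K plus V in a single pass, so the intermediates are never allocated. The `rope` and `split_rotary_fused` names below are hypothetical.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # Standard rotary position embedding (half-rotation form):
    # pair up feature dims and rotate each pair by a position-dependent angle.
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))
    angles = positions[:, None] * inv_freq[None, :]   # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

seq, d = 4, 8
qkv = np.random.randn(seq, 3 * d).astype(np.float32)  # combined QKV projection
pos = np.arange(seq, dtype=np.float32)

# Unfused: materialize the split, then run rotary on Q and K separately
# (extra intermediate buffers, extra kernel launches).
q, k, v = np.split(qkv, 3, axis=-1)
q_rot, k_rot = rope(q, pos), rope(k, pos)

# Fused: one pass over the combined tensor produces rotated Q/K and V
# directly, without materializing the split as a separate step.
def split_rotary_fused(qkv, pos):
    d = qkv.shape[-1] // 3
    return rope(qkv[:, :d], pos), rope(qkv[:, d:2 * d], pos), qkv[:, 2 * d:]

qf, kf, vf = split_rotary_fused(qkv, pos)
assert np.allclose(qf, q_rot) and np.allclose(kf, k_rot) and np.allclose(vf, v)
```

In NumPy both paths cost the same, of course; the point is only that the fused variant has a single consumer of `qkv`, which is the shape the compiler pass exploits to emit one kernel instead of several.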

The latest https://github.com/masahi/tvm/tree/contrib-vllm branch (in particular, commit masahi/tvm@a0cce50) is needed to use this feature.

Perf improvement is small but noticeable.

| Model | Before (req/s, tok/s) | After (req/s, tok/s) |
| --- | --- | --- |
| Vicuna 7B fp16 | 7.04, 3366.86 | 7.17, 3427.27 |
| Llama 2 13B fp16 | 3.79, 1814.50 | 3.84, 1834.09 |

@masahi masahi merged commit 2d30b96 into octoml:batch-serving Oct 31, 2023
10 checks passed
Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Jan 30, 2024