
Enable split rotary fusion for batched model #42

Merged

Conversation

masahi (Member) commented Oct 31, 2023

This is based on the recent work of @Lunderberg that enabled split-rotary fusion for single-sequence prefill. The same technique can be used to enable fusion for both prefill / decode in our batched model. This optimization can improve perf by reducing kernel launches and runtime memory allocations (thus more cache blocks can be allocated).
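To illustrate the idea (this is a hand-written NumPy sketch, not the actual TVM kernels produced by the fusion pass): the QKV projection emits one combined tensor, and without fusion the runtime must materialize the split Q/K buffers and launch separate rotary kernels for each. The fused form reads the combined tensor and writes the rotated Q/K plus V in a single pass, so the intermediates are never allocated. The `rope` and `split_rotary_fused` names below are hypothetical.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # Standard rotary position embedding (half-rotation form):
    # pair up feature dims and rotate each pair by a position-dependent angle.
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))
    angles = positions[:, None] * inv_freq[None, :]   # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

seq, d = 4, 8
qkv = np.random.randn(seq, 3 * d).astype(np.float32)  # combined QKV projection
pos = np.arange(seq, dtype=np.float32)

# Unfused: materialize the split, then run rotary on Q and K separately
# (extra intermediate buffers, extra kernel launches).
q, k, v = np.split(qkv, 3, axis=-1)
q_rot, k_rot = rope(q, pos), rope(k, pos)

# Fused: one pass over the combined tensor produces rotated Q/K and V
# directly, without materializing the split as a separate step.
def split_rotary_fused(qkv, pos):
    d = qkv.shape[-1] // 3
    return rope(qkv[:, :d], pos), rope(qkv[:, d:2 * d], pos), qkv[:, 2 * d:]

qf, kf, vf = split_rotary_fused(qkv, pos)
assert np.allclose(qf, q_rot) and np.allclose(kf, k_rot) and np.allclose(vf, v)
```

In NumPy both paths cost the same, of course; the point is only that the fused variant has a single consumer of `qkv`, which is the shape the compiler pass exploits to emit one kernel instead of several.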

The latest https://github.com/masahi/tvm/tree/contrib-vllm branch (in particular, commit masahi/tvm@a0cce50) is needed to use this feature.

Perf improvement is small but noticeable.

| Model | Before (req/s, tok/s) | After (req/s, tok/s) |
| --- | --- | --- |
| Vicuna 7B fp16 | 7.04, 3366.86 | 7.17, 3427.27 |
| Llama 2 13B fp16 | 3.79, 1814.50 | 3.84, 1834.09 |

@masahi masahi merged commit 2d30b96 into octoml:batch-serving Oct 31, 2023
10 checks passed
Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Jan 30, 2024