[Draft][Core] Refactor _prepare_model_input_tensors #5972
Conversation
I remember the goal is to write logic agnostic to prefill/decode (mainly because prefill is a special case of decode). At least that was the direction we wanted last time (and this PR seems to revert that direction). That's also why the existing prepare_inputs avoids distinguishing prefill/decode as much as possible. That will enable features such as https://github.com/vllm-project/vllm/pull/6052/files#diff-d3df23c3e3bcfe97ee8507061c6de54f0eff23a8c75d7f5999062c42245290f8
How difficult would it be to not distinguish prefill/decode, at least at the metadata level? Also, cc @zhuohan123
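To illustrate "prefill is a special case of decode": with a unified metadata layout, the two phases differ only in per-sequence query length. A minimal sketch, using hypothetical names rather than vLLM's actual metadata classes:

```python
# Hypothetical sketch; names and fields are illustrative assumptions,
# not vLLM's actual attention metadata.
from dataclasses import dataclass
from typing import List

@dataclass
class UnifiedAttentionMetadata:
    query_lens: List[int]  # >1 during prefill, exactly 1 during decode
    seq_lens: List[int]    # total context length per sequence

# Prefill: each sequence processes its whole prompt in one step.
prefill = UnifiedAttentionMetadata(query_lens=[7, 5], seq_lens=[7, 5])
# Decode: same structure, but every query length collapses to 1.
decode = UnifiedAttentionMetadata(query_lens=[1, 1], seq_lens=[8, 6])
```

Under this view, a decode-specific code path is just the `query_len == 1` case of the prefill path.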
The reason I separated prefill/decode is that I observed the following:
Meanwhile, this separation shouldn't affect #6052, which focuses on the forward logic and is orthogonal to prepare_input. Also, some attention backends (e.g., xformers) cannot be unified in this way anyway. However, if you feel it's still better not to separate them, I can revert that in this PR. Happy to discuss :)
Let me cc @zhuohan123 and @simon-mo on this one. We discussed this before, and I combined prepare_prefill/decode into a single API; that was the direction they wanted at the time. It is the second item in this proposal: https://docs.google.com/document/d/1rg8CoOnrtz1LT-hCK86ZsHuhoTDtqSEGs8KrN4wbITo/edit I agree the logic is complex, but I think that is not fundamental; it is mostly due to tech debt.
Moved to #6164
NOTE: This PR will be rebased after the following PRs are merged: #4628, #5942.
Meanwhile, reviews and comments are welcome.
This PR refactors `_prepare_model_input_tensors`. Specifically, we introduce `ModelRunnerInputBuilder`, mainly for logic isolation and modularization: it manages all processed input data, including token IDs, positions, sequence lengths, etc., in one place, and isolates the per-sequence-group processing logic (see the sketch after the lists below).

Note that the purpose of this PR is to enable follow-up refactoring and optimizations, so we don't expect an obvious performance improvement at this moment, although the following optimizations may be slightly helpful:

- Batch list updates with `.extend()` instead of appending elements one by one.

With this isolation, we could further have follow-up optimizations:

- Refactor `AttentionMetadata` to include only on-device tensors, and move all related logic to `ModelRunnerInputBuilder`.
- Eliminate the `for seq_id in seq_ids` loop in `ModelRunnerInputBuilder._add_decode_seq_group()` by leveraging tensor processing.
- Similarly, eliminate the `for seq_group_metadata in seq_group_metadata_list` loop.
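To make the builder shape concrete, here is a minimal sketch. The class layout, method names, and fields are assumptions for illustration, not the exact API in this PR:

```python
# Hedged sketch of the ModelRunnerInputBuilder concept; all names and
# fields here are illustrative assumptions, not this PR's actual code.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class InputBuilderSketch:
    """Accumulates per-sequence-group inputs, then builds flat outputs."""
    input_tokens: List[int] = field(default_factory=list)
    input_positions: List[int] = field(default_factory=list)
    seq_lens: List[int] = field(default_factory=list)

    def add_prefill_seq_group(self, prompt_token_ids: List[int]) -> None:
        # Prefill contributes the whole prompt at once, so .extend()
        # batches the list updates instead of per-token .append() calls.
        self.input_tokens.extend(prompt_token_ids)
        self.input_positions.extend(range(len(prompt_token_ids)))
        self.seq_lens.append(len(prompt_token_ids))

    def add_decode_seq_group(self, last_token_id: int, seq_len: int) -> None:
        # Decode contributes exactly one new token per sequence.
        self.input_tokens.append(last_token_id)
        self.input_positions.append(seq_len - 1)
        self.seq_lens.append(seq_len)

    def build(self) -> Tuple[List[int], List[int], List[int]]:
        # The real builder would also produce on-device tensors and
        # attention metadata; this sketch just returns the raw lists.
        return self.input_tokens, self.input_positions, self.seq_lens

builder = InputBuilderSketch()
builder.add_prefill_seq_group([11, 12, 13])  # a new prompt of length 3
builder.add_decode_seq_group(42, seq_len=8)  # an ongoing sequence
tokens, positions, seq_lens = builder.build()
```

The point of the isolation is that each `add_*` method owns one kind of per-group logic, and `build()` is the single place that turns accumulated Python lists into tensors, which is what would make the loop-elimination follow-ups above possible.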