
[Draft][Core] Refactor _prepare_model_input_tensors #5972

Closed
wants to merge 9 commits

Conversation

comaniac
Collaborator

NOTE: This PR will be rebased after the following PRs are merged: #4628 #5942.
Meanwhile, reviews and comments are welcome.

This PR refactors _prepare_model_input_tensors. Specifically, we introduce ModelRunnerInputBuilder for logic isolation and modularization: it manages all processed input data, including token IDs, positions, sequence lengths, etc., in one place, and isolates the following logic (see the sketch after this list):

  1. The logic of preparing prefill and decode inputs.
  2. The logic of inserting a new sequence group to input data, considering prefix caching, chunked prefill, sliding windows, etc.
  3. The logic of preparing attention inputs.
  4. The logic of preparing LoRA and multi-modal inputs.
  5. The logic of creating on-device tensors for model inputs.
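
A minimal sketch of the builder pattern described above (hypothetical and heavily simplified; field and method names are illustrative, not the actual vLLM API):

```python
# Hypothetical sketch of a ModelRunnerInputBuilder-style class: accumulate
# per-sequence-group inputs in Python lists, then build on-device tensors.
from dataclasses import dataclass, field
from typing import List

import torch


@dataclass
class ModelRunnerInputBuilderSketch:
    input_tokens: List[int] = field(default_factory=list)
    input_positions: List[int] = field(default_factory=list)
    seq_lens: List[int] = field(default_factory=list)

    def add_prefill_seq_group(self, token_ids: List[int]) -> None:
        # Prefill: every prompt token is an input at its own position.
        self.input_tokens.extend(token_ids)
        self.input_positions.extend(range(len(token_ids)))
        self.seq_lens.append(len(token_ids))

    def add_decode_seq_group(self, last_token_id: int, seq_len: int) -> None:
        # Decode: only the last generated token is fed to the model.
        self.input_tokens.append(last_token_id)
        self.input_positions.append(seq_len - 1)
        self.seq_lens.append(seq_len)

    def build(self, device: str = "cuda") -> dict:
        # Create on-device tensors for the model forward pass.
        return {
            "input_tokens": torch.tensor(self.input_tokens, device=device),
            "input_positions": torch.tensor(self.input_positions, device=device),
            "seq_lens": torch.tensor(self.seq_lens, device=device),
        }
```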

Note that the purpose of this PR is to enable follow-up refactoring and optimizations, so we don't expect an obvious performance improvement at this moment, although the following optimizations may be slightly helpful:

  • The unique logic/branches for prefill and decode are separated.
  • Some iterative list appends (e.g., CUDA graph padding, LoRA requests) are replaced with .extend() (see the example below).
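
An illustrative (not verbatim) example of that change:

```python
# Before: per-element appends inside loops.
lora_ids = [1, 2, 3]
num_padded = 4  # e.g., padding up to a CUDA graph batch size

requests = []
for lora_id in lora_ids:
    requests.append(lora_id)
for _ in range(num_padded):
    requests.append(0)  # padding entry

# After: the same result with a single extend() per batch of items.
requests = list(lora_ids)
requests.extend([0] * num_padded)
```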

With this isolation, we could further have follow-up optimizations:

  1. Refactor AttentionMetadata to only include on-device tensors, and move all related logic from ModelRunnerInputBuilder.
  2. Remove the for seq_id in seq_ids loop in ModelRunnerInputBuilder._add_decode_seq_group() by leveraging tensor processing (see the sketch after this list).
  3. Parallelize the loop for seq_group_metadata in seq_group_metadata_list.
  4. and more.
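
A hedged sketch of follow-up item 2: replacing the per-seq_id Python loop with tensor ops that compute decode inputs for all sequences at once. The names and shapes below are illustrative assumptions, not the actual vLLM implementation.

```python
import torch

block_size = 16
seq_lens = torch.tensor([17, 33, 9])                       # current length of each sequence
block_tables = torch.tensor([[3, 7, 0], [4, 9, 11], [5, 0, 0]])  # physical block IDs per sequence

# The decode position of each sequence is its last token index.
input_positions = seq_lens - 1

# Slot of each last token: physical_block * block_size + offset_in_block,
# computed for all sequences at once instead of in a Python loop.
block_idx = input_positions // block_size
block_offset = input_positions % block_size
physical_blocks = block_tables.gather(1, block_idx.unsqueeze(1)).squeeze(1)
slot_mapping = physical_blocks * block_size + block_offset
```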

@comaniac comaniac marked this pull request as draft June 28, 2024 21:07
Collaborator

@rkooo567 rkooo567 left a comment


I remember the goal was to write logic agnostic to prefill/decode (mainly because prefill is a special case of decode). At least that was the direction we wanted last time (and this PR seems to reverse that direction). That's also why the existing prepare_inputs avoids distinguishing prefill/decode as much as possible. That would enable features such as https://github.com/vllm-project/vllm/pull/6052/files#diff-d3df23c3e3bcfe97ee8507061c6de54f0eff23a8c75d7f5999062c42245290f8

How difficult is it to not distinguish prefill/decode, at least at the metadata level? Also, cc @zhuohan123

@comaniac
Collaborator Author

comaniac commented Jul 2, 2024

The reason I separated prefill/decode is that I observed the following:

  1. There are lots of branches on "is_prompt", which result in complex logic.
  2. Preparing inputs for prefill is more complex because of chunked prefill and prefix caching. Separating them may help with a future multi-step runner (because in-place updates must happen in the decode stage for multi-step execution).

Meanwhile, this separation shouldn't affect #6052, which focuses on the forward logic and is orthogonal to prepare_input. Also, some attention backends (e.g., xformers) cannot be unified in this way anyway.

However, if you feel it's still better to not separate them, I can revert that in this PR. Happy to discuss :)

@rkooo567
Collaborator

rkooo567 commented Jul 2, 2024

Let me cc @zhuohan123 and @simon-mo for this one. We discussed this before; I combined prepare_prefill/decode into a single API, and that was the direction they wanted at the time. It is the second item in this proposal.

https://docs.google.com/document/d/1rg8CoOnrtz1LT-hCK86ZsHuhoTDtqSEGs8KrN4wbITo/edit

I agree the logic is complex, but I think this is actually not fundamental; it's more due to tech debt.

@comaniac comaniac closed this Jul 7, 2024
@comaniac
Collaborator Author

comaniac commented Jul 7, 2024

Moved to #6164

@comaniac comaniac deleted the prepare_input branch January 3, 2025 21:51