Multi-step scheduling support for encoder-decoder models #12265
This PR removes the awkward line breaks in README_GAUDI.md and uses appropriate Markdown formatting instead of RST. The rendered document should look the same.
This PR removes the additional `multiprocessing.Process` object that was created as a workaround for the multi-card stall issue.
This PR raises the allowed relative tolerance in GSM8K to 0.06 and moves the Llama-70B test from 2xG2 to 4xG2 until its memory usage is investigated (successful run: vLLM-CI-Pipeline/206).
We were asked on the upstream PR to remove our changes from cache_engine.py. This PR does just that, creating an HPUCacheEngine that inherits from CacheEngine and overrides only the _allocate_kv_cache method.
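For illustration, here is a minimal sketch of the subclass-and-override pattern this describes; the import path, method signature, and attribute names are assumptions based on upstream vLLM, and the real HPU-specific allocation lives in this fork.

```python
# Illustrative sketch only: shows the subclass-and-override pattern described
# above. The import path, _allocate_kv_cache signature, and attribute names
# are assumptions based on upstream vLLM, not this fork's actual code.
from typing import List

import torch

from vllm.worker.cache_engine import CacheEngine


class HPUCacheEngine(CacheEngine):
    """CacheEngine variant that only customizes KV-cache allocation for HPU."""

    def _allocate_kv_cache(self, num_blocks: int,
                           device: str) -> List[torch.Tensor]:
        # Hypothetical HPU-specific allocation: same shape as the default
        # engine, but this is where a fork would change layout or placement.
        kv_cache_shape = self.attn_backend.get_kv_cache_shape(
            num_blocks, self.block_size, self.num_kv_heads, self.head_size)
        kv_cache: List[torch.Tensor] = []
        for _ in range(self.num_attention_layers):
            kv_cache.append(
                torch.zeros(kv_cache_shape, dtype=self.dtype, device=device))
        return kv_cache
```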
…limit prefill batch size (#394)

This PR adds the following functionality, which can be enabled via engine flags:

- `use_padding_aware_scheduling` - the vLLM scheduler will now calculate token cost considering the padded prefill shape (similar to #109).
- `max_num_prefill_seqs` - the padding-aware scheduler will perform an additional check on the prefill batch size and effectively cap it at `max_num_prefill_seqs`. If unset, the maximum prefill batch size will be `max_num_seqs`.

Both features are generic and do not require HPU, although they may be specialized for a particular vendor's usage. Padding-aware scheduling includes a padding-function selector that picks the HPU padding function (considering the currently used HPU buckets) when the current device is HPU; otherwise it takes the product batch_size x max_seq_len. A minimal sketch of this token-cost estimate is shown below.
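The sketch below is a self-contained illustration of the padded token-cost idea, using a hypothetical bucket list and helper names; it is not the fork's actual scheduler code.

```python
# Hypothetical illustration of padding-aware token cost: the scheduler charges
# each prefill batch for its padded shape rather than its raw token count.
from typing import List, Optional


def round_up_to_bucket(seq_len: int, buckets: List[int]) -> Optional[int]:
    """Return the smallest configured bucket that fits seq_len (HPU-style)."""
    for bucket in sorted(buckets):
        if seq_len <= bucket:
            return bucket
    return None  # No bucket fits; the caller must reject or fall back.


def padded_prefill_token_cost(batch_size: int,
                              max_seq_len_in_batch: int,
                              hpu_buckets: Optional[List[int]] = None) -> int:
    """Token cost of a prefill batch under padding-aware scheduling.

    On HPU the sequence length is rounded up to the active bucket; otherwise
    the cost is simply batch_size * max_seq_len_in_batch.
    """
    if hpu_buckets:
        padded_len = round_up_to_bucket(max_seq_len_in_batch, hpu_buckets)
        if padded_len is None:
            padded_len = max_seq_len_in_batch
    else:
        padded_len = max_seq_len_in_batch
    return batch_size * padded_len


# Example: 4 sequences of up to 900 tokens with HPU buckets [128, 512, 1024]
# cost 4 * 1024 = 4096 padded tokens, not 4 * 900 = 3600 raw tokens.
assert padded_prefill_token_cost(4, 900, [128, 512, 1024]) == 4096
```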
Repeating missing code
Modify `benchmark_throughput.py` to allow running with FP8 on HPU (KV cache dtype `fp8_inc`) and to use padding-aware scheduling.
This PR removes the usage of custom HPU RotaryEmbedding modules and adds a forward_hpu method to the existing RotaryEmbedding, so that the multiple derived implementations can be reused without adding them to the HPU extension. Mark_steps should not be needed within the test, but for whatever reason the PT bridge crashes if they are not there; to be investigated later on. It does not affect actual model execution in any way I could test or observe.
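As a rough illustration of the per-backend dispatch pattern described above, here is a self-contained sketch; the base-class machinery is a simplified stand-in for vLLM's op dispatch, and the `forward_native`/`forward_hpu` bodies are placeholders rather than real rotary-embedding math.

```python
import torch


def _is_hpu() -> bool:
    # Simplified device probe; the real code asks vLLM's platform layer.
    return hasattr(torch, "hpu") and torch.hpu.is_available()


class RotaryEmbeddingLike(torch.nn.Module):
    """Toy op with per-backend forward methods, selected once at init."""

    def __init__(self) -> None:
        super().__init__()
        # Pick the backend-specific implementation; derived classes that only
        # override forward_hpu/forward_native automatically reuse it.
        self._forward = self.forward_hpu if _is_hpu() else self.forward_native

    def forward(self, positions: torch.Tensor, query: torch.Tensor,
                key: torch.Tensor):
        return self._forward(positions, query, key)

    def forward_native(self, positions, query, key):
        # Reference (pure PyTorch) rotary embedding would go here.
        return query, key

    def forward_hpu(self, positions, query, key):
        # HPU-optimized path, e.g. a fused op from vllm-hpu-extension.
        return query, key
```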
With this check, when running with decode_block_bucket_min=128 and bs>128, buckets smaller than bs are skipped. Then, during the run, the skipped buckets can still be used by vLLM and get warmed up on the fly, which causes a performance drop, and they are not run as HPU graphs. This change removes said check.
Currently, before each Sampler call we have a CPU sync, which causes a host gap:

![profile showing the host gap before the change](https://github.com/user-attachments/assets/4509e69b-0f16-4ac9-812e-a2a9bc43a6ad)

This PR removes that sync, so the host gap is no longer visible:

![profile after the change, with the host gap gone](https://github.com/user-attachments/assets/66c19e4b-d832-4955-848d-8ae4acd8d264)

NOTE: the `ApplyToppTopkScalar` class still has some CPU syncs inside, so the biggest gain will be observed in scenarios without `top_p` or `top_k` parameters. It is worth investigating whether those syncs can be removed as well (see the sketch below for the general pattern).
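Below is a generic, hedged sketch of the kind of host/device synchronization being removed; it is not the fork's actual sampler code. Any host-side read of a device tensor (e.g. `.item()`) blocks the host until queued device work finishes, which is what shows up as the host gap in the profile.

```python
import torch

# Hypothetical device: on Gaudi this would be "hpu", on NVIDIA "cuda";
# the sketch falls back to CPU so it stays runnable anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"
logits = torch.randn(8, 32000, device=device)

# Pattern being removed: reading a device value on the host before sampling.
# .item() blocks until all queued device work has finished, creating the gap.
max_logit = logits.max().item()          # host <-> device sync point

# Preferred pattern: keep everything on the device and only sync when the
# result is actually needed on the host.
probs = torch.softmax(logits, dim=-1)
next_tokens = torch.multinomial(probs, num_samples=1)   # stays on device
```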
CUDA uses `capture` for warmup runs and `execute_model` for actual runs, and each phase calls `set_active_loras` only once. HPU uses `execute_model` for both warmup and actual runs. Since `execute_model` already takes care of `set_active_loras` internally, the redundant call can be removed. This special handling is redundant and incorrect, as it causes out-of-bounds slicing in the decode phase reported in #405. This PR removes the special handling of the `set_active_loras` call from warmup runs and resolves #405.
Changes the profile_run batches based on the max sequence length. This avoids padding during prepare_prompt, and thus avoids breaking the constraint batch_size * seq_len <= max_num_batch_tokens. In the current logic, profile_run's max_batch_size takes precedence. For example, with max_batch_size = 256, max_num_batch_tokens = 2048, block_size = 128, and max_seq_len = 1024, the current logic updates max_seq_len to 8; however, in **prepare_prompt** seq_len is padded to 128, so batch_size * seq_len becomes 256 * 128 > max_num_batch_tokens, violating the above constraint. With the updated logic we instead calculate max_batch_size as 2, which avoids the padding in **prepare_prompt** and keeps the constraint in place (see the sketch below). Fixes: #405
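The arithmetic above can be illustrated with a small self-contained sketch; the padding rule and function names are assumptions mirroring the description, not the fork's exact code.

```python
def old_profile_shape(max_batch_size: int, max_num_batch_tokens: int,
                      block_size: int):
    # Old logic: batch size takes precedence, sequence length is derived.
    seq_len = max_num_batch_tokens // max_batch_size          # 2048 // 256 = 8
    # But prepare_prompt pads seq_len up to a multiple of block_size.
    padded_seq_len = ((seq_len + block_size - 1) // block_size) * block_size
    return max_batch_size, padded_seq_len                     # (256, 128)


def new_profile_shape(max_num_batch_tokens: int, max_seq_len: int):
    # New logic: derive the batch size from the maximum sequence length,
    # so no padding is needed and the token budget is respected.
    batch_size = max(1, max_num_batch_tokens // max_seq_len)  # 2048 // 1024 = 2
    return batch_size, max_seq_len                            # (2, 1024)


bs, sl = old_profile_shape(256, 2048, 128)
assert bs * sl > 2048          # 256 * 128 = 32768 violates the token budget
bs, sl = new_profile_shape(2048, 1024)
assert bs * sl <= 2048         # 2 * 1024 = 2048 respects the token budget
```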
Adding calculation of the OpenSSF Scorecard. Note: the badge (visible on the repo main page) will be disabled for now.
Contiguous cache fetching to avoid using the costly gather operation. Requires changes in vllm-hpu-extension (HabanaAI/vllm-hpu-extension#17) to work. Introduces redundant calculations in the decoding phase, but improves performance over the entire run in all tested cases (5-12%). For even better performance, cache defragmentation is required. Only compatible with the v2 block manager.
Reverts #424
Move the Dynamic MoE implementation to habana_main. It was previously implemented for 1.18, but had to be modified as the ops have been moved to [github.com/HabanaAI/vllm-hpu-extension](https://github.com/HabanaAI/vllm-hpu-extension). Works with bf16; uses static (legacy) mode when running with quantization. Related PRs:

- #303
- HabanaAI/vllm-hpu-extension#13
This PR enables long-context support with LoRA.
Multimodality fix for llava after rebase. Fix for:

```
ERROR 12-16 12:31:11 engine.py:136] NotImplementedError: Unknown multi-modal data type: attention_mask
```
This PR updates `test/lora/utils.py` based on the latest rebase.
1. This PR updates the habana_main README_GAUDI to the Technical Writer-reviewed version as seen in v1.19.0 (the habana_main and v1.19.0 README_GAUDI had diverged).
2. It also fixes broken URLs caused by the recent restructuring of the upstream vLLM examples folder.
3. It adds notes in the examples folder for new users and redirects them to the Gaudi-specific examples in README_GAUDI.md.
Supporting PR for HabanaAI/vllm-hpu-extension#76
This PR enables multi-step scheduling for encoder-decoder models.
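As a hedged usage sketch of what this enables: the model choice and the `num_scheduler_steps` value below are assumptions based on upstream vLLM's encoder-decoder and multi-step scheduling support, and exact behavior on HPU may differ.

```python
from vllm import LLM, SamplingParams

# Assumed example: an encoder-decoder model (e.g. BART) combined with
# multi-step scheduling via num_scheduler_steps, which lets the engine run
# several decode steps per scheduler invocation to reduce host overhead.
llm = LLM(
    model="facebook/bart-large-cnn",   # hypothetical encoder-decoder choice
    num_scheduler_steps=8,             # multi-step scheduling window
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(
    ["vLLM is a fast and easy-to-use library for LLM inference."], params)
for out in outputs:
    print(out.outputs[0].text)
```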