Multi step scheduling support for encoder-decoder models #12265

Closed

Conversation

jkaniecki
Contributor

@jkaniecki commented Jan 21, 2025

This PR enables multi-step scheduling for encoder-decoder models.

kzawora-intel and others added 30 commits October 14, 2024 14:27
This PR removes the awkward line breaks in README_GAUDI.md and uses
appropriate markdown formatting instead of RST. Rendered document should
look the same.
This PR removes additional `multiprocessing.Process` object created as a
workaround for resolving multi-card stall issue.
This PR raises the allowed relative tolerance in GSM8K to 0.06, and
moves Llama-70B test to 4xG2 from 2xG2 until memory usage is
investigated (success run: vLLM-CI-Pipeline/206)
We were asked on the upstream PR to remove our changes from cache_engine.py.
This PR does just that, and creates HPUCacheEngine inheriting from
CacheEngine, overriding only the _allocate_kv_cache method.
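As a minimal sketch of that pattern, assuming only that HPUCacheEngine subclasses CacheEngine and overrides _allocate_kv_cache (everything else here is illustrative, not the actual vLLM code):

```python
# Illustrative sketch only -- not the actual vLLM implementation.
# It shows the pattern of leaving cache_engine.py untouched and moving
# HPU-specific allocation into a subclass that overrides one method.
from typing import List, Tuple


class CacheEngine:
    """Stand-in for the upstream CacheEngine."""

    def __init__(self, num_layers: int) -> None:
        self.num_layers = num_layers
        # The overridden method is picked up automatically in subclasses.
        self.kv_cache = self._allocate_kv_cache()

    def _allocate_kv_cache(self) -> List[Tuple[str, int]]:
        # Default (e.g. CUDA) allocation path.
        return [("cuda_kv_block", layer) for layer in range(self.num_layers)]


class HPUCacheEngine(CacheEngine):
    """HPU variant: only the allocation method differs."""

    def _allocate_kv_cache(self) -> List[Tuple[str, int]]:
        # HPU-specific layout/allocation would go here.
        return [("hpu_kv_block", layer) for layer in range(self.num_layers)]
```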
…imit prefill batch size (#394)

This PR adds the following functionality that can be enabled via engine
flags:
- use_padding_aware_scheduling - the vLLM scheduler will now calculate token
cost considering the padded prefill shape (similar to
#109).
- max_num_prefill_seqs - the padding-aware scheduler will perform an
additional check on prefill batch size, effectively capping it at
`max_num_prefill_seqs`. If unset, the maximum prefill batch size will be
`max_num_seqs`.
Both features are generic and do not require HPU, although they may be
specialized for a particular vendor's usage. Padding-aware scheduling
includes a padding function selector which picks the HPU padding function
(considering the currently used HPU buckets) if the current device is HPU.
Otherwise, it takes the product batch_size x max_seq_len.
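A rough sketch of the padding-aware token-cost idea, assuming a simple bucket rounding function (all names here are illustrative; the real bucket logic lives in the HPU extension):

```python
# Illustrative sketch of padding-aware token accounting; names are made up.
import math
from typing import Callable, List


def next_bucket(value: int, step: int = 128) -> int:
    """Round up to the nearest bucket boundary (stand-in for HPU buckets)."""
    return int(math.ceil(value / step) * step)


def padded_prefill_tokens(seq_lens: List[int],
                          pad_fn: Callable[[int], int] = next_bucket) -> int:
    """Token cost of a prefill batch after padding.

    On HPU the padding function would be selected from the currently used
    buckets; on other devices the cost is simply batch_size * max_seq_len.
    """
    if not seq_lens:
        return 0
    return len(seq_lens) * pad_fn(max(seq_lens))


# Example: 3 prompts of 100, 130 and 200 tokens pad up to the 256 bucket,
# so the scheduler accounts for 3 * 256 = 768 tokens rather than 430.
assert padded_prefill_tokens([100, 130, 200]) == 768
```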
Modify `benchmark_throughput.py` to allow running with FP8 on HPU (KV
cache dtype `fp8_inc`) and to use padding-aware scheduling.
This PR removes the usage of custom HPU RotaryEmbedding modules and adds a
forward_hpu method to the existing RotaryEmbedding, allowing multiple derived
implementations to be reused without adding them to the HPU extension.
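A hedged sketch of that approach, assuming a simplified device-based dispatch (the actual RotaryEmbedding and vLLM's platform dispatch are more involved):

```python
# Simplified illustration; not the actual vLLM RotaryEmbedding code.
import torch
import torch.nn as nn


class RotaryEmbedding(nn.Module):

    def forward_native(self, positions: torch.Tensor, query: torch.Tensor,
                       key: torch.Tensor):
        # Reference PyTorch implementation (details omitted here).
        return query, key

    def forward_hpu(self, positions: torch.Tensor, query: torch.Tensor,
                    key: torch.Tensor):
        # HPU-optimized path, e.g. calling an op from vllm-hpu-extension.
        return query, key

    def forward(self, positions: torch.Tensor, query: torch.Tensor,
                key: torch.Tensor):
        # Derived rotary-embedding variants inherit this dispatch, so they
        # get the HPU path without a separate custom HPU module.
        if query.device.type == "hpu":
            return self.forward_hpu(positions, query, key)
        return self.forward_native(positions, query, key)
```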
mark_step calls should not be needed within the test, but for whatever
reason the PT bridge crashes without them. To be investigated later on; it
does not affect actual model execution in any way I could test or observe.
With this check, when running with decode_block_bucket_min=128 and bs>128,
buckets smaller than bs are skipped. During the run, the skipped buckets can
still be used by vLLM and get warmed up on the fly, which causes a perf drop,
and they are not executed as HPU graphs.

This change removes said check.
Currently, before each Sampler call there is a CPU sync, which causes a
host gap:
<img width="226" alt="image"
src="https://github.com/user-attachments/assets/4509e69b-0f16-4ac9-812e-a2a9bc43a6ad">

This PR is removing that sync, so the host gap is no longer visible:
<img width="133" alt="image"
src="https://github.com/user-attachments/assets/66c19e4b-d832-4955-848d-8ae4acd8d264">

NOTE: the `ApplyToppTopkScalar` class still has some CPU syncs inside, which
means the biggest gain will be observed in scenarios without `top_p` or
`top_k` parameters. It is worth investigating whether we can remove the
syncs from this function too.
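As a generic illustration of the kind of host/device sync being removed (a sketch of the pattern, not the actual sampler code):

```python
# Generic illustration of a hidden CPU sync; not the actual sampler code.
import torch


def pick_tokens_with_sync(logits: torch.Tensor) -> list:
    # .tolist() copies the result to the host and blocks until the device
    # finishes, creating the host gap visible in the profile above.
    return torch.argmax(logits, dim=-1).tolist()


def pick_tokens_deferred(logits: torch.Tensor) -> torch.Tensor:
    # Keeping the result on the device defers the sync until the values
    # are actually needed (e.g. during detokenization).
    return torch.argmax(logits, dim=-1)
```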
CUDA uses `capture` for warmup runs and `execute_model` for actual runs.
During each phase they call `set_active_loras` only once. HPU uses
`execute_model` for both warmup and actual runs. Since `execute_model`
already takes care of `set_active_loras` internally, the redundant call
can be removed.

This special handling is redundant and incorrect, as it causes
out-of-bound slicing in the decode phase reported in
#405.

This PR removes the special handling of the `set_active_loras` function call
from warmup runs and resolves the issue in
#405.
Changes the profile_run batches based on the max sequence length. This
avoids padding during prepare_prompt and thus avoids breaking the constraint
batch_size * seq_len <= max_num_batch_tokens.

In the current logic for profile_run, max_batch_size takes precedence.
E.g., with max_batch_size = 256, max_num_batch_tokens = 2048, block_size =
128, max_seq_len = 1024:
with the current logic, max_seq_len is updated to 8; however, in
**prepare_prompt** seq_len is padded to 128, so batch_size * seq_len becomes
256 * 128 > max_num_batch_tokens, violating the above constraint.
With the updated logic, we calculate max_batch_size as 2, which avoids the
padding at **prepare_prompt** and keeps the constraint in place.

Fixes: #405
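The arithmetic behind the example above, as a small sketch (the helper and its exact rounding are illustrative, not the actual profile_run code):

```python
# Sketch of the updated profile_run sizing; names and rounding are illustrative.
def profile_run_shape(max_batch_size: int, max_num_batch_tokens: int,
                      block_size: int, max_seq_len: int):
    """Pick (batch_size, seq_len) so that, even after prepare_prompt pads
    seq_len up to a multiple of block_size, batch_size * seq_len stays
    within max_num_batch_tokens."""
    # Old logic: keep max_batch_size and shrink seq_len.
    old_seq_len = max_num_batch_tokens // max_batch_size     # 2048 // 256 = 8
    padded = -(-old_seq_len // block_size) * block_size      # padded to 128
    assert max_batch_size * padded > max_num_batch_tokens    # 256 * 128 > 2048

    # Updated logic: profile at the full max_seq_len and derive batch size.
    seq_len = min(max_seq_len, max_num_batch_tokens)
    batch_size = max(1, max_num_batch_tokens // seq_len)     # 2048 // 1024 = 2
    assert batch_size * seq_len <= max_num_batch_tokens
    return batch_size, seq_len


# Values from the commit message: 256, 2048, 128, 1024 -> (2, 1024).
assert profile_run_shape(256, 2048, 128, 1024) == (2, 1024)
```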
Adding calculation of the OpenSSF Scorecard. Note: the badge (visible at the repo main page) will be disabled for now.
Contiguous cache fetching to avoid using the costly gather operation.
Requires changes in vllm-hpu-extension
(HabanaAI/vllm-hpu-extension#17) to work.

Introduces redundant calculations in the decoding phase, but in all tested
cases it improves performance over the entire run (5-12%). For even better
performance, cache defragmentation is required. Only compatible with the
v2-block-manager.
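A rough illustration of the gather-versus-contiguous-slice trade-off described above (shapes and names are made up; the real change is in vllm-hpu-extension):

```python
# Illustration only; the real logic lives in vllm-hpu-extension.
import torch

num_blocks, block_size, head_dim = 1024, 128, 64
kv_cache = torch.randn(num_blocks, block_size, head_dim)


def fetch_gather(block_ids: torch.Tensor) -> torch.Tensor:
    # Generic path: gather arbitrary blocks (the costly operation on HPU).
    return kv_cache.index_select(0, block_ids)


def fetch_contiguous(first_block: int, num: int) -> torch.Tensor:
    # Contiguous path: a plain slice, at the price of possibly reading a few
    # extra (redundant) blocks when the range is not fully used.
    return kv_cache[first_block:first_block + num]


blocks = torch.tensor([3, 4, 5, 6])
assert torch.equal(fetch_gather(blocks), fetch_contiguous(3, 4))
```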
Move the Dynamic MoE implementation to habana_main. It was previously
implemented for 1.18, but had to be modified as ops have been moved to
[github.com/HabanaAI/vllm-hpu-extension](https://github.com/HabanaAI/vllm-hpu-extension).
Works with bf16; uses static (legacy) mode when running with
quantization.

Related PRs:
- #303
- HabanaAI/vllm-hpu-extension#13

---

<details>
<!-- inside this <details> section, markdown rendering does not work, so
we use raw html here. -->
<summary><b> PR Checklist (Click to Expand) </b></summary>

<p>Thank you for your contribution to vLLM! Before submitting the pull
request, please ensure the PR meets the following criteria. This helps
vLLM maintain the code quality and improve the efficiency of the review
process.</p>

<h3>PR Title and Classification</h3>
<p>Only specific types of PRs will be reviewed. The PR title is prefixed
appropriately to indicate the type of change. Please use one of the
following:</p>
<ul>
    <li><code>[Bugfix]</code> for bug fixes.</li>
<li><code>[CI/Build]</code> for build or continuous integration
improvements.</li>
<li><code>[Doc]</code> for documentation fixes and improvements.</li>
<li><code>[Model]</code> for adding a new model or improving an existing
model. Model name should appear in the title.</li>
<li><code>[Frontend]</code> For changes on the vLLM frontend (e.g.,
OpenAI API server, <code>LLM</code> class, etc.) </li>
<li><code>[Kernel]</code> for changes affecting CUDA kernels or other
compute kernels.</li>
<li><code>[Core]</code> for changes in the core vLLM logic (e.g.,
<code>LLMEngine</code>, <code>AsyncLLMEngine</code>,
<code>Scheduler</code>, etc.)</li>
<li><code>[Hardware][Vendor]</code> for hardware-specific changes.
Vendor name should appear in the prefix (e.g.,
<code>[Hardware][AMD]</code>).</li>
<li><code>[Misc]</code> for PRs that do not fit the above categories.
Please use this sparingly.</li>
</ul>
<p><strong>Note:</strong> If the PR spans more than one category, please
include all relevant prefixes.</p>

<h3>Code Quality</h3>

<p>The PR needs to meet the following code quality standards:</p>

<ul>
<li>We adhere to <a
href="https://google.github.io/styleguide/pyguide.html">Google Python
style guide</a> and <a
href="https://google.github.io/styleguide/cppguide.html">Google C++
style guide</a>.</li>
<li>Pass all linter checks. Please use <a
href="https://github.com/vllm-project/vllm/blob/main/format.sh"><code>format.sh</code></a>
to format your code.</li>
<li>The code needs to be well-documented to ensure future contributors
can easily understand the code.</li>
<li>Include sufficient tests to ensure the project stays correct and
robust. This includes both unit tests and integration tests.</li>
<li>Please add documentation to <code>docs/source/</code> if the PR
modifies the user-facing behaviors of vLLM. It helps vLLM users
understand and utilize the new features or changes.</li>
</ul>

<h3>Adding or changing kernels</h3>
<p>Each custom kernel needs a schema and one or more implementations to
be registered with PyTorch.</p>
<ul>
<li>Make sure custom ops are registered following PyTorch guidelines: <a
href="https://pytorch.org/tutorials/advanced/cpp_custom_ops.html#cpp-custom-ops-tutorial">Custom
C++ and CUDA Operators</a> and <a
href="https://docs.google.com/document/d/1_W62p8WJOQQUzPsJYa7s701JXt0qf2OfLub2sbkHOaU">The
Custom Operators Manual</a></li>
<li>Custom operations that return <code>Tensors</code> require
meta-functions. Meta-functions should be implemented and registered in
python so that dynamic dims can be handled automatically. See above
documents for a description of meta-functions.</li>
<li>Use <a
href="https://pytorch.org/docs/stable/library.html#torch.library.opcheck"><code>torch.libary.opcheck()</code></a>
to test the function registration and meta-function for any registered
ops. See <code>tests/kernels</code> for examples.</li>
<li>When changing the C++ signature of an existing op, the schema must
be updated to reflect the changes.</li>
<li>If a new custom type is needed, see the following document: <a
href="https://docs.google.com/document/d/18fBMPuOJ0fY5ZQ6YyrHUppw9FA332CpNtgB6SOIgyuA">Custom
Class Support in PT2</a>.</li>
</ul>

<h3>Notes for Large Changes</h3>
<p>Please keep the changes as concise as possible. For major
architectural changes (>500 LOC excluding kernel/data/config/test), we
would expect a GitHub issue (RFC) discussing the technical design and
justification. Otherwise, we will tag it with <code>rfc-required</code>
and might not proceed with reviewing the PR.</p>

<h3>What to Expect for the Reviews</h3>

<p>The goal of the vLLM team is to be a <i>transparent reviewing
machine</i>. We would like to make the review process transparent and
efficient and make sure no contributor feels confused or frustrated.
However, the vLLM team is small, so we need to prioritize some PRs over
others. Here is what you can expect from the review process:</p>

<ul>
<li> After the PR is submitted, the PR will be assigned to a reviewer.
Every reviewer will pick up the PRs based on their expertise and
availability.</li>
<li> After the PR is assigned, the reviewer will provide a status update
every 2-3 days. If the PR is not reviewed within 7 days, please feel
free to ping the reviewer or the vLLM team.</li>
<li> After the review, the reviewer will put an <code>
action-required</code> label on the PR if there are changes required.
The contributor should address the comments and ping the reviewer to
re-review the PR.</li>
<li> Please respond to all comments within a reasonable time frame. If a
comment isn't clear or you disagree with a suggestion, feel free to ask
for clarification or discuss the suggestion.
 </li>
</ul>

<h3>Thank You</h3>

<p> Finally, thank you for taking the time to read these guidelines and
for your interest in contributing to vLLM. Your contributions make vLLM
a great tool for everyone! </p>


</details>
This PR enables long-context support with LoRA.
kzawora-intel and others added 14 commits January 17, 2025 15:46
Multimodality fix for llava after rebase

Fix for:
```
ERROR 12-16 12:31:11 engine.py:136] NotImplementedError: Unknown multi-modal data type: attention_mask
```
This PR updates `test/lora/utils.py` based on latest rebase.
1. This PR updates the habana_main README_GAUDI to the Technical Writer
reviewed version as seen in v1.19.0.
(habana_main README_GAUDI and v1.19.0 README_GAUDI had diverged.)
2. It also fixes broken URLs due to recent restructuring in the upstream
vllm examples folder.
3. Adds notes in the examples folder for new users and redirects them to
the Gaudi-specific examples in README_GAUDI.md.
Change vllm-hpu-extension revision to ae726d4

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify bot added the documentation and ci/build labels on Jan 21, 2025