[core] further polish memory profiling #12126
Signed-off-by: youkaichao <[email protected]>
# load weights
weights = torch.randn(128, 1024, 1024, device='cuda', dtype=torch.float32)
weights_memory_in_bytes = 128 * 1024 * 1024 * 4  # 512 MiB
Refactor in this PR: remove the _in_bytes suffix from the variable name to make it shorter.
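For reference, the 512 MiB figure in the test comes from the element count times the element size (4 bytes for float32). A minimal check in plain Python, where `tensor_nbytes` is a hypothetical helper and not part of the PR:

```python
from math import prod


def tensor_nbytes(shape, itemsize):
    """Bytes needed for a dense tensor: element count times element size."""
    return prod(shape) * itemsize


# Same shape as the test tensor; float32 takes 4 bytes per element.
weights_memory = tensor_nbytes((128, 1024, 1024), 4)
print(weights_memory / 2**20)  # size in MiB
```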
assert abs(non_torch_ratio - 1) <= 0.05
assert abs(torch_peak_ratio - 1) <= 0.05
now this becomes accurate.
# we measure the torch peak memory usage via allocated_bytes,
# rather than `torch.cuda.memory_reserved()`.
# After `torch.cuda.reset_peak_memory_stats()`,
# `torch.cuda.memory_reserved()` will keep growing, and only shrink
# when we call `torch.cuda.empty_cache()` or OOM happens.
self.torch_peak = torch.cuda.memory_stats().get(
    "allocated_bytes.all.peak", 0)
this is the key change, reported by @gshtras
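To illustrate why the allocated-bytes peak is the better signal, here is a toy model of a caching allocator in plain Python. It is only a sketch under simplifying assumptions (no fragmentation, no block sizes), `ToyCachingAllocator` is a hypothetical class, and it is not how PyTorch is implemented; it only mimics the behavior described in the code comment above.

```python
class ToyCachingAllocator:
    """Toy model: freed blocks stay cached, so `reserved` never shrinks on free."""

    def __init__(self):
        self.allocated = 0       # bytes currently handed out to tensors
        self.reserved = 0        # bytes held by the allocator (live + cached)
        self.peak_allocated = 0  # high-water mark of `allocated`

    def malloc(self, nbytes):
        self.allocated += nbytes
        # The pool grows only when the cached space is not enough.
        self.reserved = max(self.reserved, self.allocated)
        self.peak_allocated = max(self.peak_allocated, self.allocated)

    def free(self, nbytes):
        # The block returns to the cache; `reserved` is unchanged.
        self.allocated -= nbytes

    def reset_peak(self):
        # Resetting the peak restarts the high-water mark at the current value.
        self.peak_allocated = self.allocated

    def empty_cache(self):
        # Only an explicit cache flush hands cached space back.
        self.reserved = self.allocated


alloc = ToyCachingAllocator()
alloc.malloc(100)   # transient allocation before profiling
alloc.free(100)
alloc.reset_peak()  # start of the profiled region
alloc.malloc(30)    # what the profiled region actually needs
alloc.free(30)
print(alloc.peak_allocated, alloc.reserved)  # 30 100
```

In this model the reserved figure still reflects the earlier transient allocation, while the allocated peak after the reset reflects only the profiled region.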
torch.cuda.reset_peak_memory_stats()
self.baseline_snapshot = MemorySnapshot()
another key change: we also measure the non-torch memory before creating the vllm instance.
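A rough sketch of the baseline idea, in plain Python. The `Snapshot` class and `non_torch_increase` function are hypothetical stand-ins, not vLLM's `MemorySnapshot`, and the sketch assumes that non-torch memory can be estimated as total device memory in use minus what PyTorch has reserved.

```python
from dataclasses import dataclass

GiB = 1 << 30


@dataclass
class Snapshot:
    """Hypothetical stand-in for a memory snapshot (not vLLM's MemorySnapshot)."""
    device_used: int     # bytes in use on the GPU, from any source
    torch_reserved: int  # bytes held by the PyTorch caching allocator

    @property
    def non_torch(self) -> int:
        # Assumption: whatever PyTorch does not hold counts as non-torch memory.
        return self.device_used - self.torch_reserved


def non_torch_increase(baseline: Snapshot, current: Snapshot) -> int:
    """Growth in non-torch memory relative to the baseline snapshot."""
    return current.non_torch - baseline.non_torch


# Baseline taken before the instance is created: 2 GiB already in use.
baseline = Snapshot(device_used=2 * GiB, torch_reserved=0)
# Later: 12 GiB in use on the device, 9 GiB of it held by PyTorch.
after = Snapshot(device_used=12 * GiB, torch_reserved=9 * GiB)
print(non_torch_increase(baseline, after) // GiB)  # 1
```

Without the baseline, the 2 GiB that was already in use would be counted as non-torch memory consumed by the instance.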
Do we need to do this in v1/worker/gpu_worker.py as well?
v1 does not use this memory_profiling utility yet. You are welcome to port it to the v1 code path!
Ran some tests and can confirm that this fixes the original issue we had with the Llama3.2 90B model, where peak memory jumped from 98GB to 160GB.
Thanks for wrestling this into a better state!
Failed tests look unrelated, merging.
Improve upon #11809
On the current main branch,
vllm serve meta-llama/Llama-3.2-11B-Vision --load-format dummy --max-model-len 65536 --max-num-seqs 128 --enforce-eager
will fail. After this PR, it works on an H100-80G: