[RFC]: Disaggregated prefilling and KV cache transfer roadmap #10818
Comments
Will the logic for model upgrades and instance service discovery be introduced in XpYd? |
Model upgrades --- not in the scope of disaggregated prefill roadmap for now. But this IS important, RLHF-style training also needs this, so I can add that if this is a common need. Please ❤️ this message if you need this feature. |
When using XpYd in production, model upgrades are frequent. During the upgrade period, two versions of the model are live at once. I think the vLLM gateway needs to pair prefill and decode instances so that both come from the same model version. |
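A minimal sketch of the version-aware pairing suggested above, assuming each instance reports its role and model version to the gateway. The `Instance` record, its fields, and `pick_pair` are hypothetical names used for illustration only; they are not part of vLLM.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instance:
    # Hypothetical record a gateway might keep for each serving instance.
    name: str
    role: str            # "prefill" or "decode"
    model_version: str


def pick_pair(
    instances: List[Instance], version: str
) -> Optional[Tuple[Instance, Instance]]:
    """Return a (prefill, decode) pair that both serve `version`, or None."""
    prefills = [i for i in instances
                if i.role == "prefill" and i.model_version == version]
    decodes = [i for i in instances
               if i.role == "decode" and i.model_version == version]
    if not prefills or not decodes:
        # During a rollout one side may not have the new version yet.
        return None
    return prefills[0], decodes[0]


# Mid-upgrade fleet: v1 prefill is still up, v2 has both roles.
fleet = [
    Instance("p0", "prefill", "v1"),
    Instance("p1", "prefill", "v2"),
    Instance("d0", "decode", "v2"),
]
pair_v2 = pick_pair(fleet, "v2")
pair_v1 = pick_pair(fleet, "v1")  # no v1 decode instance, so no pair
```

Returning `None` rather than a mixed-version pair leaves the fallback decision (queue, reject, or route to a colocated instance) to the caller.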
Glad to see progress on supporting the P/D disaggregation feature.
To better support the PD disaggregated architecture, we are actively developing a dual-tiered scheduler, implemented in Go, to optimize XpYd and request management. This upgrade has been built upon our PD disaggregated feature within vllm and is now live in our production environment, showing improved performance with good stability. The core design of our scheduler is outlined below:
● Observability: To reduce reliance on any single inference engine, we have implemented a Go-based reverse proxy that directly collects and computes instance-level performance metrics in real time, such as TTFT, TPOT, instance load, and cache status.
● Hierarchical Scheduling System: Our system features a Cluster Level Scheduler (CLS) and an Instance Level Scheduler (ILS), aiming to maximize goodput per GPU while meeting latency SLOs. The CLS leverages a workload-aware performance-cost model to refine request routing, determining whether to use a disaggregated or colocated serving mode and pinpointing the most cost-effective GPU types. Subsequently, the ILS assigns the most suitable P/D instance pairs for incoming requests, optimizing load balancing and cache reuse.
● Dynamic P/D Adjustment: By leveraging instance-level metrics, we've developed a role shift module that periodically evaluates instance load stats and decides when to add, remove, or switch P/D instances as needed.
We are looking forward to releasing the code for our global scheduler to OSS shortly. Additional features are currently in development. We welcome any discussions and opportunities for collaboration. |
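As a rough illustration of what an instance-level scheduler of this kind might do, the sketch below picks the P/D pair with the lowest combined load, discounted by how much of the prompt the prefill instance already has cached. The scoring formula, the `InstanceStats` fields, and the `cache_weight` parameter are all assumptions made for this example; the comment above does not describe the actual cost model, and the real scheduler is written in Go.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass
class InstanceStats:
    # Hypothetical per-instance metrics, as a proxy might report them.
    name: str
    load: float              # assumed normalized: 0.0 idle .. 1.0 saturated
    cache_hit_ratio: float   # assumed fraction of the prompt already cached


def pair_cost(p: InstanceStats, d: InstanceStats, cache_weight: float) -> float:
    """Lower is better: total load minus a bonus for prefix-cache reuse."""
    return p.load + d.load - cache_weight * p.cache_hit_ratio


def select_pair(
    prefills: List[InstanceStats],
    decodes: List[InstanceStats],
    cache_weight: float = 0.5,
) -> Tuple[InstanceStats, InstanceStats]:
    """Exhaustively score every (prefill, decode) pair and return the cheapest."""
    return min(
        product(prefills, decodes),
        key=lambda pd: pair_cost(pd[0], pd[1], cache_weight),
    )


prefills = [InstanceStats("p0", 0.6, 0.8), InstanceStats("p1", 0.4, 0.0)]
decodes = [InstanceStats("d0", 0.7, 0.0), InstanceStats("d1", 0.2, 0.0)]
best_p, best_d = select_pair(prefills, decodes)
```

Here the busier prefill instance `p0` still wins because its cache hit outweighs the extra load; setting `cache_weight=0.0` reduces the policy to plain load balancing.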
@yuleil Hello, I was wondering how to use nsys to profile such a distributed system. I have a lot of experience using nsys to profile vLLM, but for PD disaggregation I have to run the prefill and decode instances separately, and I want a single nsys session to profile both instances. After checking the help docs, I still cannot find a solution. |
Let's also add some orchestration support to the roadmap. It seems that how to orchestrate such a stateful application is not covered yet. Let's create a sub-task to track it. |
Hi @KuntaiDu |
Chunked prefill is useful for controlling peak GPU memory usage when prefilling a very long context. So for long-context use cases, it makes sense to use both. |
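To make the memory argument concrete, here is a small sketch of how a long prompt can be split into fixed-size prefill chunks, so that each forward pass processes at most `chunk_size` new tokens instead of the whole prompt. The function name and the `chunk_size` parameter are hypothetical and chosen for illustration; this is not vLLM's scheduler code.

```python
from typing import List, Tuple


def prefill_chunks(num_prompt_tokens: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split a prompt into [start, end) token ranges of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_prompt_tokens))
        for start in range(0, num_prompt_tokens, chunk_size)
    ]


# A 10,000-token prompt prefilled in chunks of at most 4,096 tokens.
chunks = prefill_chunks(10_000, 4_096)
largest_chunk = max(end - start for start, end in chunks)
```

The per-step activation footprint then scales with `largest_chunk` rather than with the full prompt length, while the KV cache for earlier chunks is still retained across steps.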
Fully asynchronous KV Cache transfer is a great feature. It can reduce latency. I hope that this feature can be merged to the main branch soon. Will it be merged to the main branch? If so, when will it be merged? @yuleil |
@KuntaiDu. I am trying to implement XpYd (taking 1P3D as an example). Here are my method and problem.
Method
Problem
But the problem I encountered is: when I send an instance of the request
Note: Even if I use TCPStore to transfer the KV cache, the same problem occurs: the system gets stuck after the next request changes pd_pair. From the log, I found that it gets stuck after the D instance sends the signal. So it is not a problem with NCCL at all; it is stuck somewhere else. Maybe I need to check my system implementation again.
Note
My send func is as follows:
recv func:
I solved the problem of the system hanging when changing
Problem description
I only passed in parameters when creating the
Solution
I use a queue to pass parameters to the
New Bug
But I encountered a new problem:
What is going on here? How can I solve this problem? |
Motivation.
Here is the roadmap for disaggregated prefill (and general-purpose kv cache transfer). Feel free to contribute 😁.
Proposed Change.
num_head dimension and layer dimension (currently the roi tensor only contains the tokens dimension)
vllm connect ([Frontend] Disaggregate prefill decode with zmq #11791)
Engine instead of talking to the API server
Feedback Period.
No response
CC List.
@youkaichao @zeroorhero @comaniac @rkooo567 @WoosukKwon @liweiqing1997 @ShangmingCai @Leaf996 @coolkp @sjnaj @K-Mistele @ApostaC @YaoJiayi @njhill
Any Other Things.
No response