
[Frontend] Disaggregate prefill decode with zmq #11791

Open · wants to merge 18 commits into base: main

Conversation


@panf2333 panf2333 commented Jan 7, 2025

Added vLLM Connect, which starts a proxy service and connects to the vLLM server via ZMQ. This improves the performance of prefill-decode disaggregation by 10-30% (TTFT) and 3x-15x (ITL) on average.

The key changes in this PR are replacing HTTP with ZMQ for communication between the proxy and the vLLM server, and using socket pools to maintain persistent ZMQ connections, which reduces reconnection overhead.

We have attached the benchmark results and the detailed configuration needed to reproduce them.

Benchmark

Parameters

  • GPU device: 2 × H100 80 GB
  • Model: meta-llama/Meta-Llama-3.1-8B-Instruct
  • Parameters: gpu-memory-utilization 0.6, kv_buffer_size 5e9
  • Dataset: input 1024 tokens, output 6 tokens
  • CUDA_LAUNCH_BLOCKING=1
  • QPS: 1, 12, 24, 48, 96
  • Total requests: 96

Evaluation Steps

  1. Start the disagg HTTP proxy and 2 vLLM server instances (1 prefill and 1 decode).
  2. Run the script to test each QPS in [1, 12, 24, 48, 96]; repeat each QPS 3 times and take the average metrics.
  3. Start the disagg ZMQ proxy and 2 vLLM server instances, then repeat the previous process.

[image: benchmark results]

Design of ZMQ-based Client-Server Communication

High-level Overview

[image: high-level overview diagram]

Design of ZMQ-based Communication

[image: ZMQ communication design diagram]


github-actions bot commented Jan 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, a small and essential subset of CI tests that quickly catches errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@robertgshaw2-redhat (Collaborator)

Ping when ready. NOTE for reviewers: do not merge until @russellb and I have had a chance to review.

clients.bind(url_client)
logger.info(f"ZMQ Server ROUTER started at {url_client}")
# Socket to talk to workers
workers = context.socket(zmq.DEALER)
Collaborator:

@simon-mo I am not familiar with ZMQ --- is dealer the right technical choice?

Author:

https://zguide.zeromq.org/docs/chapter3/#The-DEALER-to-DEALER-Combination

We need to proactively send messages to workers in this scenario.

  1. ROUTER is not suitable for initiating messages because it doesn't know the identities of other receivers until it receives the first message. Only then can it establish routes for interaction.

  2. REQ requires acknowledging each message before sending the next one, which doesn't meet our requirements.

  3. DEALER allows us to actively send messages and supports asynchronous multi-send and multi-receive, making it the more suitable pattern. It's important to note that we need to maintain the DEALER's ID.
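The reasoning above can be sketched with pyzmq. This is a minimal, self-contained illustration (the endpoint and identity names are illustrative, not taken from the PR): a DEALER with an explicit identity can proactively multi-send, while a ROUTER only learns that identity, and hence the reply route, from the first frame it receives.

```python
import zmq

ctx = zmq.Context()

router = ctx.socket(zmq.ROUTER)
router.bind("inproc://demo")

dealer = ctx.socket(zmq.DEALER)
dealer.setsockopt(zmq.IDENTITY, b"proxy-1")  # we must maintain the DEALER's ID
dealer.connect("inproc://demo")

# DEALER proactively multi-sends without per-message acknowledgement (unlike REQ).
dealer.send(b"req-1")
dealer.send(b"req-2")

# ROUTER receives [identity, payload] and can now route replies back.
ident, msg = router.recv_multipart()
router.send_multipart([ident, b"reply to " + msg])

print(dealer.recv())  # b'reply to req-1'
ctx.destroy(linger=0)
```

Note that a bare DEALER does not add the empty delimiter frame that REQ would, so the ROUTER sees exactly two frames per message here.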

prefill_request['max_tokens'] = 1
route = "/v1/completions"
# finish prefill
async for x in execute_task_async(route, header, prefill_request, app.state.sockets_prefill):
Collaborator:

A potential optimization (you don't need to implement it in this PR): return the first token generated by the prefill instance in this async for, instead of re-sending the request to the decode instance and waiting for the first token from there.

@KuntaiDu KuntaiDu requested a review from youkaichao January 8, 2025 01:06
print("Worker DEALER started at", url_worker)

tasks = [asyncio.create_task(worker_routine(url_worker, context, i)) for i in range(5)]
proxy_task = asyncio.to_thread(zmq.proxy, clients, workers)
Collaborator:
zmq sockets are not threadsafe. This cannot run in a background thread; it must be in an asyncio task.

@robertgshaw2-redhat (Collaborator) commented Jan 8, 2025:

This means you cannot use the built-in proxy, since it does not use async sockets. As in prior versions of vLLM, you will have to write your own proxy (it's about 10 LOC).
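The hand-rolled async proxy alluded to here can indeed be tiny. A hedged sketch using zmq.asyncio (the function name is illustrative; this is not code from the PR): both directions are awaited on the same event loop, so the sockets are only ever touched from one thread.

```python
import asyncio

import zmq
import zmq.asyncio


async def async_proxy(frontend: zmq.asyncio.Socket,
                      backend: zmq.asyncio.Socket) -> None:
    """Forward multipart messages in both directions on the event loop,
    instead of calling the blocking zmq.proxy() in a background thread."""
    async def forward(src: zmq.asyncio.Socket,
                      dst: zmq.asyncio.Socket) -> None:
        while True:
            await dst.send_multipart(await src.recv_multipart())

    await asyncio.gather(forward(frontend, backend),
                         forward(backend, frontend))
```

Running this as an asyncio task sidesteps the thread-safety issue entirely, at the cost of forwarding in Python instead of in libzmq's C loop.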

Author:

The test scripts test_connect_server1.py and test_connect_server2.py were used to simulate model responses. I've since removed them.

yield
## close zmq context
logger.info("term zmqctx")
await app.state.zmqctx.term()
Collaborator:

use destroy(linger=0)

Author:

Great point! To ensure immediate termination and avoid potential blocking, I'll switch to destroy(linger=0) instead of term(). I'll also replace it in vllm/entrypoints/launcher.py.

https://pyzmq.readthedocs.io/en/latest/api/zmq.html#context
After interrupting all blocking calls, term shall block until the following conditions are satisfied:

  1. All sockets open within context have been closed.
  2. For each socket within context, all messages sent on the socket have either been physically transferred to a network peer, or the socket’s linger period set with the zmq.LINGER socket option has expired.
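A small sketch of the difference in practice (the endpoint is illustrative and has no peer listening): destroy(linger=0) sets LINGER to 0 on every socket before closing it, so queued-but-undelivered messages are dropped and termination cannot block on condition 2 above.

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.connect("tcp://127.0.0.1:5555")  # illustrative endpoint; no peer listening
sock.send(b"pending")  # queued; plain term() would wait on the linger period

# destroy() sets LINGER=0 on every socket, closes them, then terminates,
# so the undelivered message above is dropped instead of blocking shutdown.
ctx.destroy(linger=0)
assert ctx.closed and sock.closed
```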

logger.info(f"ZMQ Worker DEALER started at {url_worker}")

tasks = [asyncio.create_task(worker_routine(url_worker, app, context, i)) for i in range(5)]
proxy_task = asyncio.to_thread(zmq.proxy, clients, workers)
Collaborator:

zmq sockets are not threadsafe. You cannot run this in a background thread.

Author:

I appreciate you pointing out the potential thread-safety issues with zmq sockets. You are completely correct; by default, they are not thread safe. I will prioritize finding a more thread-safe alternative to ensure robust operation in multi-threaded environments.

As zmq.proxy() is a synchronous function, executing it directly in the main thread would block the server.

Currently, these two sockets are used exclusively within this thread. While I believe there are no immediate thread-safety concerns, it's prudent to consider future scalability and maintainability. Can we address potential thread-safety issues in a subsequent PR?

https://zguide.zeromq.org/docs/chapter2/#ZeroMQ-s-Built-In-Proxy-Function
It’s exactly like starting the main loop of rrbroker.

https://github.com/booksbyus/zguide/blob/master/examples/Python/rrbroker.py

Author:

@robertgshaw2-neuralmagic Hi Robert, I have resolved this issue by using ThreadProxy. It creates the sockets inside the proxy thread, where no other code can access them, so there won't be any thread-safety issues.
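For reference, pyzmq's zmq.devices.ThreadProxy works as described: the device instantiates its sockets inside its own thread, so no application thread ever shares them. A hedged sketch (the ports and socket types are illustrative, not the PR's actual configuration):

```python
import zmq
from zmq.devices import ThreadProxy

# ThreadProxy creates its ROUTER/DEALER sockets inside its own (daemon)
# thread, so no application thread ever touches them -- avoiding the zmq
# thread-safety issue without blocking the main event loop.
proxy = ThreadProxy(zmq.ROUTER, zmq.DEALER)
proxy.bind_in("tcp://127.0.0.1:5601")   # illustrative port: clients connect here
proxy.bind_out("tcp://127.0.0.1:5602")  # illustrative port: workers connect here
proxy.start()
```

Since the device thread is a daemon, it exits together with the main thread; its monitor socket can be left unbound if proxy traffic does not need to be captured.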

@robertgshaw2-redhat (Collaborator) commented Jan 8, 2025

@panf2333 - Thanks for the PR! Disaggregated serving is a hugely important initiative for VLLM in 2025

I am responsible for the multiprocessing + asyncio + zmq architecture of VLLM, so I am going to review this in detail. I am having some trouble following the design here. Can you make a simple diagram that charts out what these objects are to ease in review?

Thanks!

@panf2333 (Author) commented Jan 8, 2025

@panf2333 - Thanks for the PR! Disaggregated serving is a hugely important initiative for VLLM in 2025

I am responsible for the multiprocessing + asyncio + zmq architecture of VLLM, so I am going to review this in detail. I am having some trouble following the design here. Can you make a simple diagram that charts out what these objects are to ease in review?

Thanks!

@robertgshaw2-neuralmagic It's my pleasure. I'll put together a diagram and send it over shortly.

@panf2333 (Author) commented Jan 8, 2025

@panf2333 - Thanks for the PR! Disaggregated serving is a hugely important initiative for VLLM in 2025

I am responsible for the multiprocessing + asyncio + zmq architecture of VLLM, so I am going to review this in detail. I am having some trouble following the design here. Can you make a simple diagram that charts out what these objects are to ease in review?

Thanks!

@robertgshaw2-neuralmagic Here are some simple diagrams that I hope help you better understand this PR. I've also updated the PR description.

The relationship between the client, connector, and vLLM server:

[image: relationship diagram]

The ZMQ detail between the connector and the vLLM server:

[image: ZMQ detail diagram]

@panf2333 panf2333 marked this pull request as ready for review January 8, 2025 10:20
@panf2333 panf2333 changed the title Disaggregate prefill decode with zmq [Frontend] Disaggregate prefill decode with zmq Jan 8, 2025
Commits (Signed-off-by: clark <[email protected]>): rename connect.py to disagg_connector.py to more accurately reflect its purpose; use destroy(linger=0) for immediate termination.
@panf2333 panf2333 force-pushed the disaggregate_prefill_decode_with_zmq branch from 1bc97ec to 0728a42 Compare January 8, 2025 16:39
@russellb (Collaborator) left a comment:

I don't understand the whole design yet, but I have one early comment: is all zmq communication local? If so, can you please use ipc:// sockets instead of tcp://? That will avoid some security concerns.

@panf2333 (Author) commented Jan 9, 2025

I don't understand the whole design yet, but I have one early comment: is all zmq communication local? If so, can you please use ipc:// sockets instead of tcp://? That will avoid some security concerns.

@russellb I completely agree that security is a paramount concern.

Given that the disaggregated serving feature can dispatch requests to other nodes, it's crucial to establish a secure communication channel between the connector proxy, the prefill node, and the decode node.

To connect the connector proxy, prefill nodes, and decode nodes, we should use 'tcp://'.

in vllm/entrypoints/disagg_connector.py
async def run_disagg_connector(args, **uvicorn_kwargs) -> None:

in vllm/entrypoints/launcher.py

async def serve_zmq(arg, zmq_server_port: int, app: FastAPI) -> None:
    """Server routine"""
    logger.info("zmq Server start arg: %s, zmq_server_port: %d", arg,
                zmq_server_port)
    url_worker = "inproc://workers"
    url_client = f"tcp://0.0.0.0:{zmq_server_port}"

On the server side, we use "inproc://workers" to handle the messages between the ROUTER frontend and the worker coroutines.

@russellb (Collaborator) commented Jan 9, 2025

This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation?

I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications.

@panf2333 (Author)

This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation?

I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications.

@russellb I appreciate you raising this concern.
I will integrate the pyzmq.auth module to enhance security in a follow-up PR. For now, I will change to ipc://.

https://pyzmq.readthedocs.io/en/latest/api/zmq.auth.html
It is based on ZAP authentication and CURVE authentication.
The design documents are here; the Lark doc is recommended.

lark doc: https://qus2es1bg99i.larksuite.com/wiki/Pbi1wFUTaiBZneksfytuQxrSsTe?from=from_copylink

google doc: https://docs.google.com/document/d/1ZwFij2OEx_K1xBx2EBx5FKfXQ9EJEGU6shYh-9MJdPs/edit?usp=sharing
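For reference, pyzmq's CURVE support can secure a tcp:// link with little code. A minimal hedged sketch (the port is illustrative; without a ZAP authenticator this encrypts the link but does not yet restrict which client keys may connect):

```python
import zmq

# Minimal CURVE sketch: encrypts the link; without a ZAP handler it does
# not yet limit which client keys are accepted.
server_public, server_secret = zmq.curve_keypair()
client_public, client_secret = zmq.curve_keypair()

ctx = zmq.Context()

server = ctx.socket(zmq.ROUTER)
server.curve_secretkey = server_secret
server.curve_publickey = server_public
server.curve_server = True  # marks this end as the CURVE server
server.RCVTIMEO = 5000      # avoid hanging if the handshake fails
server.bind("tcp://127.0.0.1:5701")  # illustrative port

client = ctx.socket(zmq.DEALER)
client.curve_secretkey = client_secret
client.curve_publickey = client_public
client.curve_serverkey = server_public  # client must know the server's public key
client.connect("tcp://127.0.0.1:5701")

client.send(b"hello")
ident, msg = server.recv_multipart()  # payload arrives decrypted: b"hello"
```

Adding a zmq.auth ZAP handler on top of this would then whitelist specific client public keys, giving authentication as well as encryption.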

@panf2333 (Author)

This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation?

I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications.

@russellb Hi Russell, for now, I've used 'ipc://' to address immediate security concerns. However, I'll be addressing network security comprehensively in a future PR. I plan to leverage pyzmq.auth to implement robust authentication and authorization mechanisms.

Signed-off-by: clark <[email protected]>
@russellb (Collaborator)

This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation?
I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications.

@russellb Hi Russell, for now, I've used 'ipc://' to address immediate security concerns. However, I'll be addressing network security comprehensively in a future PR. I plan to leverage pyzmq.auth to implement robust authentication and authorization mechanisms.

I don't think that's sufficient. We also need a viable option for encryption, ideally with TLS.

@panf2333 (Author)

This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation?
I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications.

@russellb Hi Russell, for now, I've used 'ipc://' to address immediate security concerns. However, I'll be addressing network security comprehensively in a future PR. I plan to leverage pyzmq.auth to implement robust authentication and authorization mechanisms.

I don't think that's sufficient. We also need a viable option for encryption, ideally with TLS.

@russellb I believe the disaggregation feature might benefit from optional TLS encryption. While encryption enhances security, it may introduce a slight performance overhead. Do you mean we could provide a configuration option to enable TLS encryption? That would allow users to choose the security level they need. I think users prefer to deploy clusters within secure environments such as intranets, so they want to maximize performance.

I will conduct in-depth research on auth and encryption before deciding on a selection. Until then, zmq will only be allowed to run locally. How about this?




@russellb (Collaborator)

That's fine. I'm completely OK with using it local-only.

@KuntaiDu (Collaborator) left a comment:

Great work! Can you update the disaggregated prefill example file under the examples folder? Let's provide a handle for newcomers to run the disaggregated prefill example without having to figure out how to correctly set all the CLI args.


mergify bot commented Jan 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @panf2333.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 20, 2025
# Conflicts:
#	examples/online_serving/disaggregated_prefill.sh

Signed-off-by: clark <[email protected]>
@mergify mergify bot removed the needs-rebase label Jan 20, 2025
@panf2333 (Author) commented Jan 20, 2025

Great work! Can you change the disaggregated prefill example file under the examples folder? Let's provide some handle for newcomers to run disaggregated prefill example without figuring out how to correctly set all the CLI args.
@KuntaiDu
done
[image: updated example file]

@KuntaiDu (Collaborator) left a comment:

LGTM!

@KuntaiDu (Collaborator)

@robertgshaw2-redhat would be great if you can take a look, if it also looks good to you I'll enable automerge.
