
[Frontend] Disaggregate prefill decode with zmq #11791

Open · wants to merge 18 commits into base: main

Conversation


@panf2333 panf2333 commented Jan 7, 2025

Added vLLM Connect, which starts a proxy service and connects to the vLLM server via ZMQ. This improves the performance of prefill-decode disaggregation by 10-30% (TTFT) and 3x-15x (ITL) on average.

The key changes in this PR are replacing HTTP with ZMQ for communication between the proxy and the vLLM server, and using socket pools to maintain persistent ZMQ connections, which reduces reconnection overhead.

We have attached the benchmark results and the detailed configuration needed to reproduce them.

Benchmark

Parameters

  • GPU device: 2 × H100 80 GB
  • Model: meta-llama/Meta-Llama-3.1-8B-Instruct
  • Parameters: gpu-memory-utilization 0.6, kv_buffer_size 5e9
  • Dataset: input 1024 tokens, output 6 tokens
  • CUDA_LAUNCH_BLOCKING=1
  • QPS: 1, 12, 24, 48, 96
  • Total requests: 96

Evaluation Steps

  1. Start the disagg HTTP proxy and 2 vLLM server instances (1 prefill and 1 decode).
  2. Run the script to test each QPS in [1, 12, 24, 48, 96]; repeat each QPS 3 times and take the average metrics.
  3. Start the disagg ZMQ proxy and 2 vLLM server instances, then repeat the previous process.

[image: benchmark results]

Design of ZMQ-based Client-Server Communication

High-level Overview

[image: high-level overview diagram]

Design of ZMQ-based Communication

[image: ZMQ communication design diagram]


github-actions bot commented Jan 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, a small and essential subset of CI tests that quickly catches errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@robertgshaw2-redhat (Collaborator)

Ping when ready. NOTE for reviewers: do not merge until @russellb and I have had a chance to review.

clients.bind(url_client)
logger.info(f"ZMQ Server ROUTER started at {url_client}")
# Socket to talk to workers
workers = context.socket(zmq.DEALER)
Collaborator:

@simon-mo I am not familiar with ZMQ --- is dealer the right technical choice?

Author:

https://zguide.zeromq.org/docs/chapter3/#The-DEALER-to-DEALER-Combination

We need to proactively send messages to workers in this scenario.

  1. ROUTER is not suitable for initiating messages because it doesn't know the identities of other receivers until it receives the first message. Only then can it establish routes for interaction.

  2. REQ requires acknowledging each message before sending the next one, which doesn't meet our requirements.

  3. DEALER allows us to actively send messages and supports asynchronous multi-send and multi-receive, making it the more suitable pattern. It's important to note that we need to maintain the DEALER's ID.
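The reasoning above can be sketched with pyzmq. This is a minimal, self-contained illustration (the endpoint and identity names are illustrative, not taken from the PR): a DEALER with an explicit identity can proactively multi-send, while a ROUTER only learns that identity, and hence the reply route, from the first frame it receives.

```python
import zmq

ctx = zmq.Context()

router = ctx.socket(zmq.ROUTER)
router.bind("inproc://demo")

dealer = ctx.socket(zmq.DEALER)
dealer.setsockopt(zmq.IDENTITY, b"proxy-1")  # we must maintain the DEALER's ID
dealer.connect("inproc://demo")

# DEALER proactively multi-sends without per-message acknowledgement (unlike REQ).
dealer.send(b"req-1")
dealer.send(b"req-2")

# ROUTER receives [identity, payload] and can now route replies back.
ident, msg = router.recv_multipart()
router.send_multipart([ident, b"reply to " + msg])

print(dealer.recv())  # b'reply to req-1'
ctx.destroy(linger=0)
```

Note that a bare DEALER does not add the empty delimiter frame that REQ would, so the ROUTER sees exactly two frames per message here.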

prefill_request['max_tokens'] = 1
route = "/v1/completions"
# finish prefill
async for x in execute_task_async(route, header, prefill_request, app.state.sockets_prefill):
Collaborator:

A potential optimization (you don't need to implement it in this PR): return the first token generated by the prefill instance in this async for, instead of re-sending the request to the decode instance and waiting for the first token from there.

@KuntaiDu KuntaiDu requested a review from youkaichao January 8, 2025 01:06
print("Worker DEALER started at", url_worker)

tasks = [asyncio.create_task(worker_routine(url_worker, context, i)) for i in range(5)]
proxy_task = asyncio.to_thread(zmq.proxy, clients, workers)
Collaborator:
zmq sockets are not threadsafe. This cannot run in a background thread; it must be in an asyncio task.

@robertgshaw2-redhat (Collaborator) commented Jan 8, 2025:

This means you cannot use the built-in proxy, since it does not use async sockets. As in prior versions of vLLM, you will have to write your own proxy (it's about 10 LOC).
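The hand-rolled async proxy alluded to here can indeed be tiny. A hedged sketch using zmq.asyncio (the function name is illustrative; this is not code from the PR): both directions are awaited on the same event loop, so the sockets are only ever touched from one thread.

```python
import asyncio

import zmq
import zmq.asyncio


async def async_proxy(frontend: zmq.asyncio.Socket,
                      backend: zmq.asyncio.Socket) -> None:
    """Forward multipart messages in both directions on the event loop,
    instead of calling the blocking zmq.proxy() in a background thread."""
    async def forward(src: zmq.asyncio.Socket,
                      dst: zmq.asyncio.Socket) -> None:
        while True:
            await dst.send_multipart(await src.recv_multipart())

    await asyncio.gather(forward(frontend, backend),
                         forward(backend, frontend))
```

Running this as an asyncio task sidesteps the thread-safety issue entirely, at the cost of forwarding in Python instead of in libzmq's C loop.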

Author:

The test scripts test_connect_server1.py and test_connect_server2.py were used to simulate model responses. I've since removed them.

yield
## close zmq context
logger.info("term zmqctx")
await app.state.zmqctx.term()
Collaborator:

use destroy(linger=0)

Author:

Great point! To ensure immediate termination and avoid potential blocking, I'll switch to destroy(linger=0) instead of term(). I'll also replace it in vllm/entrypoints/launcher.py.

https://pyzmq.readthedocs.io/en/latest/api/zmq.html#context
After interrupting all blocking calls, term shall block until the following conditions are satisfied:

  1. All sockets open within context have been closed.
  2. For each socket within context, all messages sent on the socket have either been physically transferred to a network peer, or the socket’s linger period set with the zmq.LINGER socket option has expired.
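A small sketch of the difference in practice (the endpoint is illustrative and has no peer listening): destroy(linger=0) sets LINGER to 0 on every socket before closing it, so queued-but-undelivered messages are dropped and termination cannot block on condition 2 above.

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.connect("tcp://127.0.0.1:5555")  # illustrative endpoint; no peer listening
sock.send(b"pending")  # queued; plain term() would wait on the linger period

# destroy() sets LINGER=0 on every socket, closes them, then terminates,
# so the undelivered message above is dropped instead of blocking shutdown.
ctx.destroy(linger=0)
assert ctx.closed and sock.closed
```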

logger.info(f"ZMQ Worker DEALER started at {url_worker}")

tasks = [asyncio.create_task(worker_routine(url_worker, app, context, i)) for i in range(5)]
proxy_task = asyncio.to_thread(zmq.proxy, clients, workers)
Collaborator:

zmq sockets are not threadsafe. You cannot run this in a background thread.

Author:

I appreciate you pointing out the potential thread-safety issues with zmq sockets. You are completely correct; by default, they are not thread safe. I will prioritize finding a more thread-safe alternative to ensure robust operation in multi-threaded environments.

As zmq.proxy() is a synchronous function, executing it directly in the main thread would block the server.

Currently, these two sockets are used exclusively within this thread. While I believe there are no immediate thread-safety concerns, it's prudent to consider future scalability and maintainability. Can we address potential thread-safety issues in a subsequent PR?

https://zguide.zeromq.org/docs/chapter2/#ZeroMQ-s-Built-In-Proxy-Function
It’s exactly like starting the main loop of rrbroker.

https://github.com/booksbyus/zguide/blob/master/examples/Python/rrbroker.py

Author:

@robertgshaw2-neuralmagic Hi Robert, I have resolved this issue by using ThreadProxy. It creates the sockets inside the proxy thread, where no other code can access them, so there won't be any thread-safety issues.
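For reference, pyzmq's zmq.devices.ThreadProxy works as described: the device instantiates its sockets inside its own thread, so no application thread ever shares them. A hedged sketch (the ports and socket types are illustrative, not the PR's actual configuration):

```python
import zmq
from zmq.devices import ThreadProxy

# ThreadProxy creates its ROUTER/DEALER sockets inside its own (daemon)
# thread, so no application thread ever touches them -- avoiding the zmq
# thread-safety issue without blocking the main event loop.
proxy = ThreadProxy(zmq.ROUTER, zmq.DEALER)
proxy.bind_in("tcp://127.0.0.1:5601")   # illustrative port: clients connect here
proxy.bind_out("tcp://127.0.0.1:5602")  # illustrative port: workers connect here
proxy.start()
```

Since the device thread is a daemon, it exits together with the main thread; its monitor socket can be left unbound if proxy traffic does not need to be captured.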

@robertgshaw2-redhat (Collaborator) commented Jan 8, 2025

@panf2333 - Thanks for the PR! Disaggregated serving is a hugely important initiative for VLLM in 2025

I am responsible for the multiprocessing + asyncio + zmq architecture of VLLM, so I am going to review this in detail. I am having some trouble following the design here. Can you make a simple diagram that charts out what these objects are to ease in review?

Thanks!

@panf2333 (Author) commented Jan 8, 2025

@panf2333 - Thanks for the PR! Disaggregated serving is a hugely important initiative for VLLM in 2025

I am responsible for the multiprocessing + asyncio + zmq architecture of VLLM, so I am going to review this in detail. I am having some trouble following the design here. Can you make a simple diagram that charts out what these objects are to ease in review?

Thanks!

@robertgshaw2-neuralmagic It's my pleasure. I'll put together a diagram and send it over shortly.

@panf2333 (Author) commented Jan 8, 2025

@panf2333 - Thanks for the PR! Disaggregated serving is a hugely important initiative for VLLM in 2025

I am responsible for the multiprocessing + asyncio + zmq architecture of VLLM, so I am going to review this in detail. I am having some trouble following the design here. Can you make a simple diagram that charts out what these objects are to ease in review?

Thanks!

@robertgshaw2-neuralmagic Here are some simple diagrams that I hope help you better understand this PR. I've also updated the PR description.

The relationship between the client, connector, and vLLM server:

[image: relationship diagram]

The ZMQ detail between the connector and the vLLM server:

[image: ZMQ detail diagram]

@panf2333 panf2333 marked this pull request as ready for review January 8, 2025 10:20
@panf2333 panf2333 changed the title Disaggregate prefill decode with zmq [Frontend] Disaggregate prefill decode with zmq Jan 8, 2025
Commits (Signed-off-by: clark <[email protected]>): rename connect.py to disagg_connector.py to more accurately reflect its purpose; use destroy(linger=0) for immediate termination.
@panf2333 panf2333 force-pushed the disaggregate_prefill_decode_with_zmq branch from 1bc97ec to 0728a42 Compare January 8, 2025 16:39
@russellb (Collaborator) left a comment:

I don't understand the whole design yet, but I have one early comment: is all zmq communication local? If so, can you please use ipc:// sockets instead of tcp://? That will avoid some security concerns.

@panf2333 (Author) commented Jan 9, 2025

I don't understand the whole design yet, but I have one early comment: is all zmq communication local? If so, can you please use ipc:// sockets instead of tcp://? That will avoid some security concerns.

@russellb I completely agree that security is a paramount concern.

Given that the disaggregated serving feature can dispatch requests to other nodes, it's crucial to establish a secure communication channel between the connector proxy, the prefill node, and the decode node.

To connect the connector proxy, prefill nodes, and decode nodes, we should use 'tcp://'.

in vllm/entrypoints/disagg_connector.py
async def run_disagg_connector(args, **uvicorn_kwargs) -> None:

in vllm/entrypoints/launcher.py

async def serve_zmq(arg, zmq_server_port: int, app: FastAPI) -> None:
    """Server routine"""
    logger.info("zmq Server start arg: %s, zmq_server_port: %d", arg,
                zmq_server_port)
    url_worker = "inproc://workers"
    url_client = f"tcp://0.0.0.0:{zmq_server_port}"

On the server side, we use "inproc://workers" to handle the messages between the ROUTER frontend and the worker coroutines.

@russellb (Collaborator) commented Jan 9, 2025

This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation?

I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications.

@panf2333 (Author)

This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation?

I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications.

@russellb I appreciate you raising this concern.
I will integrate the pyzmq.auth module to enhance security in a follow-up PR. For now, I will change to ipc://.

https://pyzmq.readthedocs.io/en/latest/api/zmq.auth.html
It is based on ZAP authentication and CURVE authentication.
The design documents are here; the Lark doc is recommended.

lark doc: https://qus2es1bg99i.larksuite.com/wiki/Pbi1wFUTaiBZneksfytuQxrSsTe?from=from_copylink

google doc: https://docs.google.com/document/d/1ZwFij2OEx_K1xBx2EBx5FKfXQ9EJEGU6shYh-9MJdPs/edit?usp=sharing
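For reference, pyzmq's CURVE support can secure a tcp:// link with little code. A minimal hedged sketch (the port is illustrative; without a ZAP authenticator this encrypts the link but does not yet restrict which client keys may connect):

```python
import zmq

# Minimal CURVE sketch: encrypts the link; without a ZAP handler it does
# not yet limit which client keys are accepted.
server_public, server_secret = zmq.curve_keypair()
client_public, client_secret = zmq.curve_keypair()

ctx = zmq.Context()

server = ctx.socket(zmq.ROUTER)
server.curve_secretkey = server_secret
server.curve_publickey = server_public
server.curve_server = True  # marks this end as the CURVE server
server.RCVTIMEO = 5000      # avoid hanging if the handshake fails
server.bind("tcp://127.0.0.1:5701")  # illustrative port

client = ctx.socket(zmq.DEALER)
client.curve_secretkey = client_secret
client.curve_publickey = client_public
client.curve_serverkey = server_public  # client must know the server's public key
client.connect("tcp://127.0.0.1:5701")

client.send(b"hello")
ident, msg = server.recv_multipart()  # payload arrives decrypted: b"hello"
```

Adding a zmq.auth ZAP handler on top of this would then whitelist specific client public keys, giving authentication as well as encryption.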

@panf2333 (Author)

This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation?

I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications.

@russellb Hi Russell, for now, I've used 'ipc://' to address immediate security concerns. However, I'll be addressing network security comprehensively in a future PR. I plan to leverage pyzmq.auth to implement robust authentication and authorization mechanisms.

Signed-off-by: clark <[email protected]>
@russellb (Collaborator)

This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation?
I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications.

@russellb Hi Russell, for now, I've used 'ipc://' to address immediate security concerns. However, I'll be addressing network security comprehensively in a future PR. I plan to leverage pyzmq.auth to implement robust authentication and authorization mechanisms.

I don't think that's sufficient. We also need a viable option for encryption, ideally with TLS.

@panf2333 (Author)

This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation?
I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications.

@russellb Hi Russell, for now, I've used 'ipc://' to address immediate security concerns. However, I'll be addressing network security comprehensively in a future PR. I plan to leverage pyzmq.auth to implement robust authentication and authorization mechanisms.

I don't think that's sufficient. We also need a viable option for encryption, ideally with TLS.

@russellb I believe the disaggregation feature might benefit from optional TLS encryption. While encryption enhances security, it may introduce a slight performance overhead. Do you mean we could provide a configuration option to enable TLS encryption? That would allow users to choose the security level they need. I think users prefer to deploy clusters within secure environments such as intranets, so they want to maximize performance.

I will conduct in-depth research on auth and encryption before deciding on a selection. Until then, zmq will only be allowed to run locally. How about this?




@russellb (Collaborator)

That's fine. I'm completely OK with using it local-only.

@KuntaiDu (Collaborator) left a comment:

Great work! Can you update the disaggregated prefill example file under the examples folder? Let's provide a handle for newcomers to run the disaggregated prefill example without having to figure out how to correctly set all the CLI args.


mergify bot commented Jan 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @panf2333.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 20, 2025
# Conflicts:
#	examples/online_serving/disaggregated_prefill.sh

Signed-off-by: clark <[email protected]>
@mergify mergify bot removed the needs-rebase label Jan 20, 2025
@panf2333 (Author) commented Jan 20, 2025

Great work! Can you change the disaggregated prefill example file under the examples folder? Let's provide some handle for newcomers to run disaggregated prefill example without figuring out how to correctly set all the CLI args.
@KuntaiDu
done
[image: updated example file]

@KuntaiDu (Collaborator) left a comment:

LGTM!

@KuntaiDu (Collaborator)

@robertgshaw2-redhat would be great if you can take a look, if it also looks good to you I'll enable automerge.
