Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'main' into multinode-test
Showing 78 changed files with 3,326 additions and 995 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,6 @@ | ||
# vLLM benchmark suite | ||
|
||
|
||
## Introduction | ||
|
||
This directory contains the performance benchmarking CI for vllm. | ||
|
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
|
||
# Nightly benchmark | ||
|
||
This benchmark has two goals: | ||
- Performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy or tgi) performs best on which workload. | ||
- Reproducibility: anyone can run the exact same set of benchmarking commands inside the exact same docker image by following the instructions in [reproduce.md](). | ||
|
||
|
||
## Docker images | ||
|
||
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images: | ||
- vllm/vllm-openai:v0.5.0.post1 | ||
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 | ||
- openmmlab/lmdeploy:v0.5.0 | ||
- ghcr.io/huggingface/text-generation-inference:2.1 | ||
|
||
<!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. --> | ||
|
||
|
||
## Hardware | ||
|
||
One AWS node with 8x NVIDIA A100 GPUs. | ||
|
||
|
||
## Workload description | ||
|
||
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload: | ||
|
||
- Input length: 500 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed). | ||
- Output length: the output lengths corresponding to these 500 prompts. | ||
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B. | ||
- Average QPS (queries per second): 4 for the small model (llama-3 8B) and 2 for the other two models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed). | ||
- Evaluation metrics: throughput (higher is better), TTFT (time to first token, lower is better), ITL (inter-token latency, lower is better). | ||
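
The Poisson arrival schedule described in the workload list can be sketched as follows. This is only an illustration, not the benchmark's own implementation: the function name `poisson_arrivals`, its argument order, and the use of `awk` for sampling are all assumptions. It draws exponential inter-arrival gaps with mean 1/QPS and prints cumulative arrival times, seeded so that repeated runs give the same schedule.

```shell
#!/bin/bash
# Hypothetical helper (not part of the benchmark suite):
# print cumulative arrival times, in seconds, for a Poisson process.
# usage: poisson_arrivals <qps> <num_requests> <seed>
poisson_arrivals() {
  local qps=$1 n=$2 seed=$3
  awk -v qps="$qps" -v n="$n" -v seed="$seed" 'BEGIN {
    srand(seed)                        # fixed seed makes the schedule repeatable
    t = 0
    for (i = 0; i < n; i++) {
      # inter-arrival gaps are exponential with mean 1/qps;
      # rand() is in [0, 1), so 1 - rand() is never zero
      t += -log(1 - rand()) / qps
      printf "%.4f\n", t
    }
  }'
}

# Example: first 5 arrival times at an average of 4 queries per second.
poisson_arrivals 4 5 0
```

The exact numbers depend on the `awk` implementation's random number generator, so a schedule is reproducible only within the same environment, which is one reason to pin the docker image.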
|
||
<!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. --> | ||
|
||
## Plots | ||
|
||
In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. A value of 0 means that the corresponding benchmark crashed. | ||
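
For reference, the two plotted quantities can be computed as in the sketch below. It assumes the standard error is the sample standard deviation (with the n-1 denominator) divided by the square root of n; the plotting script may use a different convention. The helper name `mean_sem` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read one number per line on stdin,
# print "<mean> <standard error of the mean>".
mean_sem() {
  awk '{ sum += $1; sumsq += $1 * $1; n++ }
       END {
         if (n < 2) { print "need at least 2 samples" > "/dev/stderr"; exit 1 }
         mean = sum / n
         var = (sumsq - n * mean * mean) / (n - 1)   # sample variance (assumed n-1)
         if (var < 0) var = 0                        # guard against rounding error
         printf "%.4f %.4f\n", mean, sqrt(var / n)
       }'
}

# Example with five latency samples.
printf '%s\n' 1 2 3 4 5 | mean_sem   # prints "3.0000 0.7071"
```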
|
||
<img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 > | ||
|
||
## Results | ||
|
||
{nightly_results_benchmarking_table} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,120 @@ | ||
common_pod_spec: &common_pod_spec | ||
priorityClassName: perf-benchmark | ||
nodeSelector: | ||
nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB | ||
volumes: | ||
- name: devshm | ||
emptyDir: | ||
medium: Memory | ||
- name: hf-cache | ||
hostPath: | ||
path: /root/.cache/huggingface | ||
type: Directory | ||
|
||
common_container_settings: &common_container_settings | ||
command: | ||
- bash .buildkite/nightly-benchmarks/run-nightly-suite.sh | ||
resources: | ||
limits: | ||
nvidia.com/gpu: 8 | ||
volumeMounts: | ||
- name: devshm | ||
mountPath: /dev/shm | ||
- name: hf-cache | ||
mountPath: /root/.cache/huggingface | ||
env: | ||
- name: VLLM_USAGE_SOURCE | ||
value: ci-test | ||
- name: HF_HOME | ||
value: /root/.cache/huggingface | ||
- name: VLLM_SOURCE_CODE_LOC | ||
value: /workspace/build/buildkite/vllm/performance-benchmark | ||
- name: HF_TOKEN | ||
valueFrom: | ||
secretKeyRef: | ||
name: hf-token-secret | ||
key: token | ||
|
||
steps: | ||
- block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours." | ||
- label: "A100 trt benchmark" | ||
priority: 100 | ||
agents: | ||
queue: A100 | ||
plugins: | ||
- kubernetes: | ||
podSpec: | ||
<<: *common_pod_spec | ||
containers: | ||
- image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 | ||
<<: *common_container_settings | ||
|
||
- label: "A100 lmdeploy benchmark" | ||
priority: 100 | ||
agents: | ||
queue: A100 | ||
plugins: | ||
- kubernetes: | ||
podSpec: | ||
<<: *common_pod_spec | ||
containers: | ||
- image: openmmlab/lmdeploy:v0.5.0 | ||
<<: *common_container_settings | ||
|
||
|
||
- label: "A100 vllm benchmark" | ||
priority: 100 | ||
agents: | ||
queue: A100 | ||
plugins: | ||
- kubernetes: | ||
podSpec: | ||
<<: *common_pod_spec | ||
containers: | ||
- image: vllm/vllm-openai:latest | ||
<<: *common_container_settings | ||
|
||
- label: "A100 tgi benchmark" | ||
priority: 100 | ||
agents: | ||
queue: A100 | ||
plugins: | ||
- kubernetes: | ||
podSpec: | ||
<<: *common_pod_spec | ||
containers: | ||
- image: ghcr.io/huggingface/text-generation-inference:2.1 | ||
<<: *common_container_settings | ||
|
||
- wait | ||
|
||
- label: "Plot" | ||
priority: 100 | ||
agents: | ||
queue: A100 | ||
plugins: | ||
- kubernetes: | ||
podSpec: | ||
<<: *common_pod_spec | ||
containers: | ||
- image: vllm/vllm-openai:v0.5.0.post1 | ||
command: | ||
- bash .buildkite/nightly-benchmarks/scripts/nightly-annotate.sh | ||
resources: | ||
limits: | ||
nvidia.com/gpu: 8 | ||
volumeMounts: | ||
- name: devshm | ||
mountPath: /dev/shm | ||
env: | ||
- name: VLLM_USAGE_SOURCE | ||
value: ci-test | ||
- name: VLLM_SOURCE_CODE_LOC | ||
value: /workspace/build/buildkite/vllm/performance-benchmark | ||
- name: HF_TOKEN | ||
valueFrom: | ||
secretKeyRef: | ||
name: hf-token-secret | ||
key: token | ||
|
||
- wait |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
#!/bin/bash | ||
|
||
set -o pipefail | ||
set -x | ||
|
||
check_gpus() { | ||
# check the number of GPUs and GPU type. | ||
declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l) | ||
if [[ $gpu_count -gt 0 ]]; then | ||
echo "GPU found." | ||
else | ||
echo "Need at least 1 GPU to run benchmarking." | ||
exit 1 | ||
fi | ||
declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}') | ||
echo "GPU type is $gpu_type" | ||
} | ||
|
||
check_hf_token() { | ||
# check if HF_TOKEN is available and valid | ||
if [[ -z "$HF_TOKEN" ]]; then | ||
echo "Error: HF_TOKEN is not set." | ||
exit 1 | ||
elif [[ ! "$HF_TOKEN" =~ ^hf_ ]]; then | ||
echo "Error: HF_TOKEN does not start with 'hf_'." | ||
exit 1 | ||
else | ||
echo "HF_TOKEN is set and valid." | ||
fi | ||
} | ||
|
||
main() { | ||
|
||
check_gpus | ||
check_hf_token | ||
|
||
df -h | ||
|
||
(which wget && which curl) || (apt-get update && apt-get install -y wget curl) | ||
(which jq) || (apt-get update && apt-get -y install jq) | ||
|
||
cd "$VLLM_SOURCE_CODE_LOC/benchmarks" | ||
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json | ||
|
||
|
||
# run lmdeploy | ||
if which lmdeploy >/dev/null; then | ||
echo "lmdeploy is available, redirect to run-lmdeploy-nightly.sh" | ||
bash ../.buildkite/nightly-benchmarks/scripts/run-lmdeploy-nightly.sh | ||
exit 0 | ||
fi | ||
|
||
# run tgi | ||
if [ -e /tgi-entrypoint.sh ]; then | ||
echo "tgi is available, redirect to run-tgi-nightly.sh" | ||
bash ../.buildkite/nightly-benchmarks/scripts/run-tgi-nightly.sh | ||
exit 0 | ||
fi | ||
|
||
# run trt | ||
if which trtllm-build >/dev/null; then | ||
echo "trtllm is available, redirect to run-trt-nightly.sh" | ||
bash ../.buildkite/nightly-benchmarks/scripts/run-trt-nightly.sh | ||
exit 0 | ||
fi | ||
|
||
# run vllm | ||
if [ -e /vllm-workspace ]; then | ||
echo "vllm is available, redirect to run-vllm-nightly.sh" | ||
bash ../.buildkite/nightly-benchmarks/scripts/run-vllm-nightly.sh | ||
exit 0 | ||
fi | ||
|
||
} | ||
|
||
main "$@" |
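
Note that `main` above returns silently with status 0 when none of the four probes match. A possible variant, sketched below under assumptions, separates detection from dispatch and fails loudly in that case. The helper names `has_cmd`, `has_path` and `detect_engine` are hypothetical; the probes themselves (the `lmdeploy` and `trtllm-build` commands, `/tgi-entrypoint.sh`, `/vllm-workspace`) and their order are taken from the script.

```shell
#!/bin/bash
# Hypothetical probes, split out so they can be stubbed in tests.
has_cmd()  { command -v "$1" >/dev/null 2>&1; }
has_path() { [ -e "$1" ]; }

# Print the engine name for the current container, in the same
# probe order as the script above; return 1 if nothing matches.
detect_engine() {
  if has_cmd lmdeploy; then echo lmdeploy
  elif has_path /tgi-entrypoint.sh; then echo tgi
  elif has_cmd trtllm-build; then echo trt
  elif has_path /vllm-workspace; then echo vllm
  else
    echo "no supported engine found in this container" >&2
    return 1
  fi
}

# Dry run: report which per-engine script would be launched.
if engine=$(detect_engine); then
  echo "would run scripts/run-${engine}-nightly.sh"
else
  echo "would exit with status 1"
fi
```

Using `command -v` rather than `which` avoids depending on an external binary that minimal images may not ship.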