Update deploying_with_k8s.md with AMD ROCm GPU example (#11465)
Signed-off-by: Alex He <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
AlexHe99 and DarkLight1337 authored Dec 27, 2024
1 parent 6c6f7fe commit d003f3e
Showing 1 changed file with 78 additions and 1 deletion.
docs/source/serving/deploying_with_k8s.md
@@ -47,7 +47,11 @@ data:
token: "REPLACE_WITH_TOKEN"
```
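For reference, the complete `hf-token-secret` manifest consumed by the deployments below can be written as the minimal sketch that follows. The full manifest is truncated in this diff, so the field layout here is an assumption; note that values under `data:` (as in the snippet above) must be base64-encoded, whereas `stringData:` accepts plain text:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: default
type: Opaque
stringData:
  # Plain-text token; Kubernetes stores it base64-encoded
  token: "REPLACE_WITH_TOKEN"
```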
Next, create a deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.

Here are two examples, one for NVIDIA GPUs and one for AMD GPUs.

- NVIDIA GPU

```yaml
apiVersion: apps/v1
@@ -119,6 +123,79 @@ spec:
periodSeconds: 5
```

- AMD GPU

You can refer to the `deployment.yaml` below if you are using an AMD ROCm GPU such as the MI300X.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mistral-7b
namespace: default
labels:
app: mistral-7b
spec:
replicas: 1
selector:
matchLabels:
app: mistral-7b
template:
metadata:
labels:
app: mistral-7b
spec:
volumes:
# PVC
- name: cache-volume
persistentVolumeClaim:
claimName: mistral-7b
# vLLM needs to access the host's shared memory for tensor parallel inference.
- name: shm
emptyDir:
medium: Memory
sizeLimit: "8Gi"
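      # Run the pod in the host's network and IPC namespaces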
hostNetwork: true
hostIPC: true
containers:
- name: mistral-7b
image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
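        # Unconfined seccomp, SYS_PTRACE, and membership in group 44 (typically
        # the host's "video" group) are commonly needed for ROCm device access and tooling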
securityContext:
seccompProfile:
type: Unconfined
runAsGroup: 44
capabilities:
add:
- SYS_PTRACE
command: ["/bin/sh", "-c"]
args: [
"vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
ports:
- containerPort: 8000
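        # "amd.com/gpu" is the extended resource exposed by the AMD GPU device plugin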
resources:
limits:
cpu: "10"
memory: 20G
amd.com/gpu: "1"
requests:
cpu: "6"
memory: 6G
amd.com/gpu: "1"
volumeMounts:
- name: cache-volume
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
```
You can find the full example, including steps and sample YAML files, at <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
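The deployment above mounts a PersistentVolumeClaim named `mistral-7b` for the Hugging Face cache. A minimal sketch of such a PVC is shown below; the access mode, storage class, and requested size are assumptions to adjust for your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-7b
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      # Assumed size; the Mistral-7B weights plus cache need tens of GiB
      storage: 50Gi
```

Apply each manifest with `kubectl apply -f <file>.yaml` and verify that the pod starts with `kubectl get pods`.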

2. **Create a Kubernetes Service for vLLM**

Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
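The full Service manifest is truncated in this diff; a minimal sketch that matches the deployment's `app: mistral-7b` label and container port would look like the following (the exposed port is an assumption):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: mistral-7b
  ports:
  # Forward cluster traffic on port 80 to vLLM's port 8000 in the pod
  - protocol: TCP
    port: 80
    targetPort: 8000
```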
