Overview
Thevllmstack subchart deploys vLLM as an OpenAI-compatible inference server. The backend communicates with it through the dashboard-connect service using the standard /v1/chat/completions API.
The vLLM pod requires a GPU node and is disabled by default. Enable it when you have GPU capacity available.
GPU Node Prerequisite
Before enabling vLLM, ensure:- A GPU node exists in the cluster with NVIDIA drivers installed.
- The NVIDIA device plugin DaemonSet is running so
nvidia.com/gpuis a schedulable resource. - GPU nodes carry the taint
nvidia.com/gpu:NoSchedule(the vLLM pod tolerates this automatically when configured as shown below).
Recommended Model — Qwen3.5-9B-AWQ
For a single GPU with 16–24 Gi VRAM (e.g., A10G, A100-40G, RTX 4090), the recommended model isQuantTrio/Qwen3.5-9B-AWQ — an AWQ-quantized 9B parameter model that fits comfortably in 24 Gi VRAM with a 8 192 token context window.
| Property | Value |
|---|---|
| Model | QuantTrio/Qwen3.5-9B-AWQ |
| Quantization | AWQ (4-bit weights) |
| VRAM usage | ~9–12 Gi (vs. 18 Gi for BF16) |
| Context window | 8 192 tokens |
| GPU | 1× A10G 24 Gi (or equivalent) |
| Disk (weights + cache) | 80 Gi PVC |
Values — Qwen3.5-9B-AWQ (recommended)
Alternative Models
Qwen3.5-9B (BF16, no quantization)
Requires 1× A10G 24 Gi. Slightly lower throughput than AWQ but avoids quantization artifacts.Qwen3.5-27B (4× GPU, high quality)
Requires 4× A10G 24 Gi (e.g.,g5.12xlarge or 4-GPU bare-metal node).
Air-Gapped Deployment
In environments without internet access, pre-download the model weights to a shared volume or container registry mirror:-
Download weights on a machine with internet access:
- Serve the weights from a local HTTP server or a pre-populated PVC.
-
In the values file, set
--download-dirto the local path and remove thehf_token.
Verify the Inference Endpoint
Once the vLLM pod isRunning:
<release-name>-vllmstack-engine.
Configure dashboard-connect
SetLLM_BASE_URL to point at the vLLM ClusterIP service: