vLLM Inference Setup - Cobi Documentation

Overview

The vllmstack subchart deploys vLLM as an OpenAI-compatible inference server. The backend communicates with it through the dashboard-connect service using the standard /v1/chat/completions API. The vLLM pod requires a GPU node and is disabled by default. Enable it when you have GPU capacity available.

GPU Node Prerequisite

Before enabling vLLM, ensure:

A GPU node exists in the cluster with NVIDIA drivers installed.
The NVIDIA device plugin DaemonSet is running so nvidia.com/gpu is a schedulable resource.
GPU nodes carry the taint nvidia.com/gpu:NoSchedule. The tolerations in the values below let the vLLM pod schedule onto nodes with this taint.

# Verify GPUs are visible
kubectl get nodes -l karpenter.sh/nodepool=gpu-spot
kubectl describe node <gpu-node> | grep "nvidia.com/gpu"
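
The same check can be scripted when there are many nodes to inspect. The sketch below is a minimal example, assuming the input has the shape of `kubectl get nodes -o json` output. The function name `gpu_nodes` and the sample node names are hypothetical.

```python
def gpu_nodes(node_list):
    """Map node name -> allocatable nvidia.com/gpu count, skipping nodes with none."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports extended resources as strings, e.g. "1".
        count = int(allocatable.get("nvidia.com/gpu", "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "4"}}},
    ]
}
print(gpu_nodes(sample))
```

Against a live cluster, the function could be fed `json.load(sys.stdin)` with `kubectl get nodes -o json` piped in. An empty result suggests the device plugin is not yet advertising GPUs.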

For bare-metal or on-prem clusters, apply the device plugin manually:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

Recommended Model — Qwen3.5-9B-AWQ

For a single GPU with 16–24 Gi VRAM (e.g., A10G, RTX 4090), or a larger card such as the A100-40G, the recommended model is QuantTrio/Qwen3.5-9B-AWQ, an AWQ-quantized 9B-parameter model that fits comfortably in 24 Gi VRAM with an 8 192-token context window.

Property	Value
Model	`QuantTrio/Qwen3.5-9B-AWQ`
Quantization	AWQ (4-bit weights)
VRAM usage	~9–12 Gi (vs. 18 Gi for BF16)
Context window	8 192 tokens
GPU	1× A10G 24 Gi (or equivalent)
Disk (weights + cache)	80 Gi PVC
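
The BF16 figure in the table follows from simple arithmetic: parameter count times bytes per parameter. The sketch below is a rough weight-only estimate in decimal GB. The helper name `weight_gb` is hypothetical. The estimate is a lower bound: it ignores KV cache, activations, AWQ scales and zero-points, and any layers a checkpoint keeps unquantized, which is why observed AWQ usage is higher than the pure 4-bit figure.

```python
def weight_gb(params_billion, bits_per_weight):
    """Weight-only footprint in GB (10^9 bytes): parameters x bytes per parameter."""
    return params_billion * bits_per_weight / 8


print(weight_gb(9, 16))  # 9B parameters at 16 bits each (BF16)
print(weight_gb(9, 4))   # 9B parameters at 4 bits each (lower bound for AWQ)
```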

Values — Qwen3.5-9B-AWQ (recommended)

vllmstack:
  enabled: true

  servingEngineSpec:
    enableEngine: true
    runtimeClassName: "nvidia"   # required for GPU workloads

    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

    # Target GPU nodes by node pool label (adjust to your cluster labels)
    # nodeSelector:
    #   karpenter.sh/nodepool: gpu-spot   # Karpenter-managed
    #   # OR for on-prem / bare-metal:
    #   node-role/gpu: "true"

    modelSpec:
      - name: "qwen35-9b"
        repository: "vllm/vllm-openai"
        tag: "latest"           # consider pinning a specific release for reproducible deployments
        modelURL: "QuantTrio/Qwen3.5-9B-AWQ"
        replicaCount: 1

        # Resource requests — size for a g5.2xlarge or equivalent on-prem GPU node
        requestCPU: 4
        requestMemory: "28Gi"
        requestGPU: 1
        limitCPU: "4"
        limitMemory: "28Gi"

        shmSize: "2Gi"          # shared memory for single-GPU; no NCCL needed

        pvcStorage: "80Gi"      # model weights (~10 Gi AWQ) + HF cache
        pvcAccessMode:
          - ReadWriteOnce

        # Cache directories on the PVC
        env:
          - name: HF_HUB_CACHE
            value: /data/huggingface/hub
          - name: HUGGINGFACE_HUB_CACHE
            value: /data/huggingface/hub
          - name: TRANSFORMERS_CACHE
            value: /data/huggingface/transformers
          - name: HF_HUB_DISABLE_XET
            value: "1"
          - name: XDG_CACHE_HOME
            value: /data/.cache
          - name: TMPDIR
            value: /data/tmp

        # Init container creates cache dirs on the PVC before vLLM starts
        initContainer:
          name: init-cache-dirs
          image: busybox:1.36
          command: ["sh"]
          args:
            - -c
            - "mkdir -p /data/tmp /data/huggingface/hub /data/huggingface/transformers /data/.cache && chmod -R 777 /data"
          mountPvcStorage: true

        vllmConfig:
          dtype: "float16"
          tensorParallelSize: 1        # single GPU
          maxModelLen: 8192
          gpuMemoryUtilization: 0.90
          enableChunkedPrefill: true
          enablePrefixCaching: false   # disable with AWQ to avoid cache fragmentation
          maxNumSeqs: 16
          extraArgs:
            - "--trust-remote-code"
            - "--language-model-only"
            - "--download-dir"
            - "/data/huggingface"
            - "--quantization"
            - "awq"
            - "--enforce-eager"
            - "--attention-backend"
            - "flashinfer"
            - "--max-num-batched-tokens"
            - "1024"
            - "--disable-log-stats"

        hf_token: "<your-huggingface-token>"

  routerSpec:
    enableRouter: false

Alternative Models

Qwen3.5-9B (BF16, no quantization)

Requires 1× A10G 24 Gi. Slightly lower throughput than AWQ but avoids quantization artifacts.

modelSpec:
  - name: "qwen35-9b-bf16"
    modelURL: "Qwen/Qwen3.5-9B"
    requestGPU: 1
    requestMemory: "20Gi"
    pvcStorage: "30Gi"

    vllmConfig:
      dtype: "bfloat16"
      tensorParallelSize: 1
      maxModelLen: 4096          # reduced — BF16 leaves ~4 Gi for KV cache
      gpuMemoryUtilization: 0.92
      enableChunkedPrefill: true
      enablePrefixCaching: true
      maxNumSeqs: 16
      extraArgs:
        - "--trust-remote-code"

Qwen3.5-27B (4× GPU, high quality)

Requires 4× A10G 24 Gi (e.g., g5.12xlarge or 4-GPU bare-metal node).

modelSpec:
  - name: "qwen35-27b"
    modelURL: "Qwen/Qwen3.5-27B"
    requestCPU: 16
    requestMemory: "60Gi"
    requestGPU: 4
    limitCPU: "16"
    limitMemory: "60Gi"
    shmSize: "20Gi"              # NCCL needs large shared memory for 4-GPU comms
    pvcStorage: "80Gi"           # 27B BF16 weights ≈ 54 Gi

    vllmConfig:
      dtype: "bfloat16"
      tensorParallelSize: 4      # one shard per A10G
      maxModelLen: 8192
      gpuMemoryUtilization: 0.90
      enableChunkedPrefill: true
      enablePrefixCaching: true
      maxNumSeqs: 32
      extraArgs:
        - "--trust-remote-code"
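
The sizing comments in the two alternative configurations can be sanity-checked with a back-of-the-envelope calculation. This is a simplified sketch, assuming weights are split evenly across tensor-parallel ranks and ignoring activation and runtime overhead. The function name `kv_headroom_gb` is hypothetical.

```python
def kv_headroom_gb(vram_gb, gpu_memory_utilization, weights_gb, tensor_parallel_size=1):
    """Approximate per-GPU memory left for KV cache after the weight shard is loaded."""
    budget = vram_gb * gpu_memory_utilization   # memory vLLM is allowed to claim per GPU
    shard = weights_gb / tensor_parallel_size   # assumes an even split across ranks
    return budget - shard


# 9B BF16 (18 GB of weights) on one 24 Gi A10G at 0.92 utilization
print(round(kv_headroom_gb(24, 0.92, 18), 2))
# 27B BF16 (54 GB of weights) across four 24 Gi A10Gs at 0.90 utilization
print(round(kv_headroom_gb(24, 0.90, 54, tensor_parallel_size=4), 2))
```

The first result is roughly the 4 Gi mentioned in the BF16 configuration, which is why that configuration lowers maxModelLen to 4096.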

Air-Gapped Deployment

In environments without internet access, pre-download the model weights to a shared volume or container registry mirror:

Download weights on a machine with internet access:

pip install huggingface_hub
huggingface-cli download QuantTrio/Qwen3.5-9B-AWQ --local-dir ./model-weights

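Serve the weights from a local HTTP server or a pre-populated PVC.
In the values file, set --download-dir to the local path where the weights are mounted and remove hf_token. Note that huggingface-cli download --local-dir writes a plain directory rather than the Hub cache layout, so modelURL may also need to point at that mounted directory instead of the repository ID.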
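
Before copying the weights into the air-gapped environment, it can help to confirm the download looks complete. The sketch below is a minimal check, assuming the repository ships a config.json and one or more *.safetensors files. That is common for Hugging Face model repositories but is an assumption about this one. The function name `missing_model_files` is hypothetical.

```python
import tempfile
from pathlib import Path


def missing_model_files(model_dir):
    """Return the expected entries that are absent from a downloaded model directory."""
    root = Path(model_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


# Demonstration on an empty temporary directory: both entries are reported missing.
with tempfile.TemporaryDirectory() as empty_dir:
    print(missing_model_files(empty_dir))
```

Running `missing_model_files("./model-weights")` after the download step should return an empty list if the assumed files are present.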

Verify the Inference Endpoint

Once the vLLM pod is Running:

# Port-forward the vLLM service (it is a ClusterIP service)
kubectl port-forward -n cobi svc/cobi-dashboard-vllmstack-engine 8000:8000

# Test with a chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen35-9b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'

The service name follows the pattern: <release-name>-vllmstack-engine.
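
The same request can be issued from Python using only the standard library. This is a sketch of a client for the OpenAI-compatible endpoint, assuming the port-forward above is active. The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=64):
    """Build a POST request for <base_url>/chat/completions with a single user message."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, max_tokens=64, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the port-forward running, `chat("http://localhost:8000/v1", "qwen35-9b", "Hello!")` should return the model's reply as a string.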

Configure dashboard-connect

Set LLM_BASE_URL to point at the vLLM ClusterIP service:

kubectl create secret generic cobi-connect-secrets \
  --namespace cobi \
  --from-literal=LLM_BASE_URL="http://cobi-dashboard-vllmstack-engine:8000/v1" \
  --from-literal=LLM_MODEL="qwen35-9b"

Then reference it in your values:

connect:
  envFrom:
    - secretRef:
        name: cobi-connect-secrets
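
To illustrate how the two secret keys fit together, the sketch below shows how a service consuming them might derive the chat completions endpoint. This is not dashboard-connect's actual code. The function name `llm_settings` and the returned dictionary shape are hypothetical.

```python
import os


def llm_settings(env=None):
    """Resolve the chat completions endpoint and model name from environment variables."""
    env = os.environ if env is None else env
    base_url = env.get("LLM_BASE_URL", "").rstrip("/")
    model = env.get("LLM_MODEL", "")
    if not base_url or not model:
        raise ValueError("LLM_BASE_URL and LLM_MODEL must both be set")
    # LLM_BASE_URL already ends in /v1, so only the route is appended.
    return {"endpoint": base_url + "/chat/completions", "model": model}


print(llm_settings({
    "LLM_BASE_URL": "http://cobi-dashboard-vllmstack-engine:8000/v1",
    "LLM_MODEL": "qwen35-9b",
}))
```

The short service hostname resolves only from pods in the same namespace as the vLLM service.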
