Overview

By default, vLLM downloads model weights from Hugging Face Hub at pod startup. In air-gapped or regulated environments you may prefer to host weights in your own infrastructure and have the cluster pull from there instead. Two approaches are supported:
| Approach | Best for |
| --- | --- |
| Git LFS | Model versioning alongside code, existing Git infrastructure (GitHub, GitLab, Gitea) |
| MinIO | Simplest path if MinIO is already running in the cluster |
Both approaches use the same pattern:
  1. Upload model weights to your storage once, from a machine with internet access.
  2. An init container on the vLLM pod copies weights to the PVC before vLLM starts.
  3. vLLM loads from the local PVC path, so no Hugging Face token is required.
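
The pattern above amounts to a "fetch only if missing" check inside the init container. The following is a minimal sketch, assuming a POSIX shell; `pull_if_missing` is a hypothetical helper name (not part of vLLM or the Helm chart), and the marker file and fetch command are whatever your setup uses.

```shell
# pull_if_missing MARKER_FILE FETCH_COMMAND [ARGS...]
# Hypothetical helper: runs the fetch command only when MARKER_FILE is absent.
pull_if_missing() {
  marker="$1"
  shift
  if [ -f "$marker" ]; then
    echo "Model weights already present, skipping download."
    return 0
  fi
  # Run the fetch step (for example `git clone` or `mc cp`); stop on failure.
  "$@" || return 1
  echo "Model weights downloaded successfully."
}

# Example invocation (commented out because it needs a configured mc alias):
# pull_if_missing /data/model/config.json \
#   mc cp --recursive storage/models/qwen35-9b-awq/ /data/model/
```

Checking for a file rather than a directory means an empty directory is not mistaken for a finished download.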

Option A — Git LFS

1. Set up Git LFS in your repository

Git LFS stores large binary files (model weights) in a separate blob store while keeping pointer files in the repo. It is supported by GitHub, GitLab, Gitea, and Bitbucket.
# Install Git LFS (one-time, on your workstation)
git lfs install

# In your model repository
git lfs track "*.safetensors" "*.bin" "*.pt" "*.gguf"
git add .gitattributes
git commit -m "track model weight files with LFS"

2. Download and push model weights

On a machine with internet access:
# Download from Hugging Face
pip install huggingface_hub
huggingface-cli download QuantTrio/Qwen3.5-9B-AWQ \
  --local-dir ./qwen35-9b-awq

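# Copy the weights into your Git repository (run from the download directory)
mkdir -p your-model-repo/models
cp -r ./qwen35-9b-awq your-model-repo/models/qwen35-9b-awq

# Commit and push
cd your-model-repo
git add models/qwen35-9b-awq
git commit -m "add Qwen3.5-9B-AWQ weights"
git push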
LFS objects (the large weight files) are uploaded to your Git provider’s LFS storage automatically.

3. Create a pull secret in the cluster

Create a read-only access token in your Git provider, then store it as a Kubernetes Secret:
kubectl create secret generic model-repo-credentials \
  --namespace cobi \
  --from-literal=GIT_REPO_URL="https://github.com/your-org/your-model-repo.git" \
  --from-literal=GIT_USERNAME="<username-or-oauth2>" \
  --from-literal=GIT_TOKEN="<read-only-pat-or-deploy-token>"
For GitHub, create a fine-grained PAT with read-only access to the model repository. For GitLab, use a project deploy token with read_repository scope. For Gitea, create an application token.
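
The init container in step 4 combines these three Secret values into one authenticated clone URL. The sketch below shows that combination in isolation so it can be tried locally; `build_clone_url` is a hypothetical helper name, and it assumes the username and token contain no characters that need percent-encoding in a URL (such as `@` or `/`).

```shell
# build_clone_url REPO_URL USERNAME TOKEN
# Hypothetical helper: prints REPO_URL with "USERNAME:TOKEN@" inserted
# after the https:// scheme. Only https:// URLs are accepted.
build_clone_url() {
  repo_url="$1"
  case "$repo_url" in
    https://*) ;;
    *) echo "expected an https:// URL" >&2; return 1 ;;
  esac
  # ${var#prefix} strips the scheme without spawning sed.
  printf 'https://%s:%s@%s\n' "$2" "$3" "${repo_url#https://}"
}

# Example (placeholder values):
# build_clone_url "https://github.com/your-org/your-model-repo.git" oauth2 "$GIT_TOKEN"
```

Because the resulting URL contains the token, avoid echoing it in init container logs.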

4. Configure the vLLM init container

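Replace the busybox init container with an alpine/git image that clones the repo and pulls LFS files. The repository is cloned to /data/model on the PVC before vLLM starts, so the weights end up under /data/model/models/qwen35-9b-awq. Note that the script installs git-lfs with apk at startup, which requires the pod to reach an Alpine package mirror; in a fully air-gapped cluster, use an image that already contains git-lfs.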
vllmstack:
  enabled: true

  servingEngineSpec:
    enableEngine: true
    runtimeClassName: "nvidia"

    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

    modelSpec:
      - name: "qwen35-9b"
        repository: "vllm/vllm-openai"
        tag: "latest"
        modelURL: "/data/model/models/qwen35-9b-awq"   # local path — no HF download
        replicaCount: 1

        requestCPU: 4
        requestMemory: "28Gi"
        requestGPU: 1
        limitCPU: "4"
        limitMemory: "28Gi"
        shmSize: "2Gi"
        pvcStorage: "80Gi"
        pvcAccessMode:
          - ReadWriteOnce

        env:
          - name: TMPDIR
            value: /data/tmp
          - name: GIT_REPO_URL
            valueFrom:
              secretKeyRef:
                name: model-repo-credentials
                key: GIT_REPO_URL
          - name: GIT_USERNAME
            valueFrom:
              secretKeyRef:
                name: model-repo-credentials
                key: GIT_USERNAME
          - name: GIT_TOKEN
            valueFrom:
              secretKeyRef:
                name: model-repo-credentials
                key: GIT_TOKEN

        initContainer:
          name: pull-model-weights
          image: alpine/git:2.43.0
          command: ["sh"]
          args:
            - -c
            - |
              set -e
              mkdir -p /data/tmp /data/model /data/.cache
              chmod -R 777 /data
              # Skip clone if weights already exist (pod restart)
              if [ -d /data/model/models/qwen35-9b-awq ]; then
                echo "Model weights already present, skipping clone."
                exit 0
              fi
              # Install git-lfs
              apk add --no-cache git-lfs
              git lfs install
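              # Clone with embedded credentials (shallow; git-lfs fetches the weight files on checkout)
              REPO_URL="https://${GIT_USERNAME}:${GIT_TOKEN}@${GIT_REPO_URL#https://}"
              git clone --depth=1 "$REPO_URL" /data/model
              # Drop the token from the stored remote URL so it does not persist on the PVC
              git -C /data/model remote set-url origin "$GIT_REPO_URL"
              echo "Model weights pulled successfully."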
          mountPvcStorage: true

        vllmConfig:
          dtype: "float16"
          tensorParallelSize: 1
          maxModelLen: 8192
          gpuMemoryUtilization: 0.90
          enableChunkedPrefill: true
          enablePrefixCaching: false
          maxNumSeqs: 16
          extraArgs:
            - "--trust-remote-code"
            - "--quantization"
            - "awq"
            - "--enforce-eager"
            - "--attention-backend"
            - "flashinfer"
            - "--max-num-batched-tokens"
            - "1024"
            - "--disable-log-stats"

  routerSpec:
    enableRouter: false
The init container skips cloning if the weights directory already exists. This avoids re-downloading on pod restarts. If you push new weights to the repo, delete the PVC and let the pod reprovision.
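
If git-lfs is missing or the LFS download does not run, the clone can leave small text pointer files where the weights should be, and the directory check above would still pass. A minimal sketch of a check for that case follows; the function names are hypothetical, and it relies only on the fact that a Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`.

```shell
# is_lfs_pointer FILE
# Succeeds if FILE starts with the Git LFS pointer header, i.e. it is a
# pointer stub rather than real weight data.
is_lfs_pointer() {
  lfs_magic="version https://git-lfs.github.com/spec/v1"
  # Read only as many bytes as the header is long; strip NUL bytes so
  # binary weight files do not trigger shell warnings.
  head_bytes=$(head -c "${#lfs_magic}" "$1" | tr -d '\000')
  [ "$head_bytes" = "$lfs_magic" ]
}

# find_lfs_pointers DIR
# Prints every weight file under DIR that is still an LFS pointer.
find_lfs_pointers() {
  find "$1" -type f \( -name '*.safetensors' -o -name '*.bin' \
    -o -name '*.pt' -o -name '*.gguf' \) |
  while IFS= read -r f; do
    if is_lfs_pointer "$f"; then echo "$f"; fi
  done
}

# Example (commented out): fail the init container if pointers remain.
# [ -z "$(find_lfs_pointers /data/model/models/qwen35-9b-awq)" ] || exit 1
```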

Option B — MinIO

Since MinIO is already running in the cluster, you can use it as a model registry. Upload weights once; the init container fetches them on first pod start.

1. Upload model weights to MinIO

On a machine with internet access, download the weights and push to MinIO:
# Download from Hugging Face
huggingface-cli download QuantTrio/Qwen3.5-9B-AWQ \
  --local-dir ./qwen35-9b-awq

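# Configure MinIO client (port-forward first if uploading remotely).
# port-forward blocks until interrupted, so run it in a separate terminal.
kubectl port-forward -n cobi svc/cobi-dashboard-minio 9000:9000

mc alias set cobi http://localhost:9000 admin '<minio-root-password>'

# Create a models bucket and upload.
# The trailing slash on the source copies the directory's contents, so the files
# land directly under models/qwen35-9b-awq/, where the init container expects them.
mc mb cobi/models
mc cp --recursive ./qwen35-9b-awq/ cobi/models/qwen35-9b-awq/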
Verify:
mc ls cobi/models/qwen35-9b-awq/
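
Listing the bucket confirms that the files exist but not that they arrived intact. As an optional extra, a checksum manifest can be written before the upload and re-checked after the download. This is a sketch rather than part of the documented setup: it assumes `sha256sum` is available wherever it runs (on macOS, `shasum -a 256` is the usual substitute), and the function names and the `SHA256SUMS` file name are hypothetical.

```shell
# write_manifest DIR
# Records a SHA-256 checksum for every file in DIR (except the manifest
# itself) in DIR/SHA256SUMS, using paths relative to DIR.
write_manifest() {
  ( cd "$1" && find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS )
}

# verify_manifest DIR
# Re-computes the checksums listed in DIR/SHA256SUMS; fails on any mismatch.
verify_manifest() {
  ( cd "$1" && sha256sum -c SHA256SUMS > /dev/null )
}

# Example (commented out):
# write_manifest ./qwen35-9b-awq      # on the workstation, before mc cp
# verify_manifest /data/model         # in the init container, after mc cp
```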

2. Configure the vLLM init container

Replace the busybox init container with minio/mc to pull weights from the models bucket to the PVC:
vllmstack:
  enabled: true

  servingEngineSpec:
    enableEngine: true
    runtimeClassName: "nvidia"

    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

    modelSpec:
      - name: "qwen35-9b"
        repository: "vllm/vllm-openai"
        tag: "latest"
        modelURL: "/data/model"   # local path on PVC
        replicaCount: 1

        requestCPU: 4
        requestMemory: "28Gi"
        requestGPU: 1
        limitCPU: "4"
        limitMemory: "28Gi"
        shmSize: "2Gi"
        pvcStorage: "80Gi"
        pvcAccessMode:
          - ReadWriteOnce

        env:
          - name: TMPDIR
            value: /data/tmp
          - name: MINIO_ENDPOINT
            value: "http://cobi-dashboard-minio:9000"
          - name: MINIO_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: minio-credentials
                key: root-user
          - name: MINIO_SECRET_KEY
            valueFrom:
              secretKeyRef:
                name: minio-credentials
                key: root-password

        initContainer:
          name: pull-model-weights
          image: minio/mc:latest
          command: ["sh"]
          args:
            - -c
            - |
              set -e
              mkdir -p /data/tmp /data/.cache
              chmod -R 777 /data
              # Skip download if weights already present
              if [ -f /data/model/config.json ]; then
                echo "Model weights already present, skipping download."
                exit 0
              fi
              mc alias set storage "$MINIO_ENDPOINT" "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY"
              mc cp --recursive storage/models/qwen35-9b-awq/ /data/model/
              echo "Model weights downloaded successfully."
          mountPvcStorage: true

        vllmConfig:
          dtype: "float16"
          tensorParallelSize: 1
          maxModelLen: 8192
          gpuMemoryUtilization: 0.90
          enableChunkedPrefill: true
          enablePrefixCaching: false
          maxNumSeqs: 16
          extraArgs:
            - "--trust-remote-code"
            - "--quantization"
            - "awq"
            - "--enforce-eager"
            - "--attention-backend"
            - "flashinfer"
            - "--max-num-batched-tokens"
            - "1024"
            - "--disable-log-stats"

  routerSpec:
    enableRouter: false
minio-credentials is the same Secret used by the MinIO subchart itself. See MinIO Setup for how to create it. The secret keys are root-user and root-password.

Monitoring the Init Container

# Watch init container logs while the pod starts
kubectl logs -n cobi -l app.kubernetes.io/name=vllmstack \
  -c pull-model-weights --follow

# Check pod events if the init container fails
kubectl describe pod -n cobi -l app.kubernetes.io/name=vllmstack
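A successful pull ends with one of the following lines, depending on the option used:
Model weights pulled successfully.       (Option A, Git LFS)
Model weights downloaded successfully.   (Option B, MinIO)
After the init container exits, vLLM starts and loads from the configured modelURL: /data/model/models/qwen35-9b-awq for Option A, /data/model for Option B.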

Updating the Model

To deploy a new version of the weights:
# Delete the PVC so the init container re-downloads on next pod start.
# A PVC that is still mounted by a running pod stays in Terminating until that pod is gone.
kubectl delete pvc -n cobi -l app.kubernetes.io/name=vllmstack

# The next helm upgrade or pod restart triggers a fresh pull
helm upgrade cobi-dashboard oci://registry-1.docker.io/hellocobi/cobi-dashboard \
  --namespace cobi \
  --version 0.1.0 \
  --values my-values.yaml \
  --wait
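
Deleting the PVC is the approach described here. One possible alternative, shown only as a sketch, is to record a version marker next to the weights and let the init container re-pull when the marker differs from the wanted version. The function names, the `.model-version` file, and the `MODEL_VERSION` variable are all hypothetical and not part of the chart.

```shell
# needs_refresh MODEL_DIR WANTED_VERSION
# Succeeds when MODEL_DIR has no .model-version file or it records a
# different version than WANTED_VERSION.
needs_refresh() {
  marker="$1/.model-version"
  [ ! -f "$marker" ] || [ "$(cat "$marker")" != "$2" ]
}

# mark_version MODEL_DIR VERSION
# Records VERSION after a successful download.
mark_version() {
  mkdir -p "$1" && printf '%s\n' "$2" > "$1/.model-version"
}

# Example (commented out): MODEL_VERSION would be set in the pod's env.
# if needs_refresh /data/model "$MODEL_VERSION"; then
#   rm -rf /data/model
#   mc cp --recursive storage/models/qwen35-9b-awq/ /data/model/
#   mark_version /data/model "$MODEL_VERSION"
# fi
```

Writing the marker only after the download finishes means an interrupted pull is retried on the next start.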