Overview
By default, vLLM downloads model weights from Hugging Face Hub at pod startup. In air-gapped or regulated environments you may prefer to host weights in your own infrastructure and have the cluster pull from there instead.
Two approaches are supported:
| Approach | Best for |
|---|---|
| Git LFS | Model versioning alongside code, existing Git infrastructure (GitHub, GitLab, Gitea) |
| MinIO | Simplest path — you already have MinIO running in the cluster |
Both approaches use the same pattern:
- Upload model weights to your storage once, from a machine with internet access.
- An init container on the vLLM pod copies weights to the PVC before vLLM starts.
- vLLM loads from the local PVC path, so no Hugging Face token is required.
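The "copy once, skip on restart" pattern above can be sketched as a small shell function. This is a minimal illustration, not the chart's own logic: `fetch_weights` is a hypothetical stand-in for the real transfer step, and the `.pull-complete` marker file is an assumed name. The init containers later on this page check for a directory or for `config.json` instead of a marker.

```shell
# fetch_weights DIR: hypothetical stand-in for the real transfer step
# (`git clone` in Option A, `mc cp` in Option B).
fetch_weights() {
  target_dir="$1"
  mkdir -p "$target_dir"
  printf '{}\n' > "$target_dir/config.json"
}

# pull_once DIR: fetch weights into DIR unless a previous run completed.
pull_once() {
  model_dir="$1"
  marker="$model_dir/.pull-complete"   # assumed marker file name
  if [ -f "$marker" ]; then
    echo "Model weights already present, skipping download."
    return 0
  fi
  fetch_weights "$model_dir" || return 1
  # The marker is written only after a successful fetch, so an
  # interrupted pull is retried on the next pod start.
  touch "$marker"
  echo "Model weights downloaded successfully."
}
```

Calling `pull_once /data/model` twice fetches on the first call and skips on the second. Writing the marker last is one way to avoid treating a half-finished transfer as complete.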
Option A — Git LFS
1. Set up Git LFS in your repository
Git LFS stores large binary files (model weights) in a separate blob store while keeping pointer files in the repo. It is supported by GitHub, GitLab, Gitea, and Bitbucket.
# Enable Git LFS for your user account (one-time, on your workstation;
# the git-lfs package itself must already be installed)
git lfs install
# In your model repository
git lfs track "*.safetensors" "*.bin" "*.pt" "*.gguf"
git add .gitattributes
git commit -m "track model weight files with LFS"
2. Download and push model weights
On a machine with internet access:
# Download from Hugging Face
pip install huggingface_hub
huggingface-cli download QuantTrio/Qwen3.5-9B-AWQ \
--local-dir ./qwen35-9b-awq
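# Copy the weights into your Git repository and push
# (assumes your-model-repo sits next to the download directory)
cd your-model-repo
mkdir -p models
cp -r ../qwen35-9b-awq ./models/qwen35-9b-awq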
git add models/qwen35-9b-awq
git commit -m "add Qwen3.5-9B-AWQ weights"
git push
LFS objects (the large weight files) are uploaded to your Git provider’s LFS storage automatically.
3. Create a pull secret in the cluster
Create a read-only access token in your Git provider, then store it as a Kubernetes Secret:
kubectl create secret generic model-repo-credentials \
--namespace cobi \
--from-literal=GIT_REPO_URL="https://github.com/your-org/your-model-repo.git" \
--from-literal=GIT_USERNAME="<username-or-oauth2>" \
--from-literal=GIT_TOKEN="<read-only-pat-or-deploy-token>"
For GitHub, create a fine-grained PAT with read-only access to the model repository. For GitLab, use a project deploy token with read_repository scope. For Gitea, create an application token.
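The init container later in this section combines the three Secret values into a single authenticated clone URL. The sketch below shows one way to do that with POSIX parameter expansion; `build_clone_url` is a hypothetical helper name, not part of the chart or of Git.

```shell
# build_clone_url REPO_URL USERNAME TOKEN
# Prints REPO_URL with "USERNAME:TOKEN@" inserted after the https:// scheme.
build_clone_url() {
  repo_url="$1"
  username="$2"
  token="$3"
  case "$repo_url" in
    https://*) ;;
    *)
      echo "GIT_REPO_URL must start with https://" >&2
      return 1
      ;;
  esac
  # ${var#prefix} strips the scheme without spawning echo or sed.
  printf 'https://%s:%s@%s\n' "$username" "$token" "${repo_url#https://}"
}
```

For example, `build_clone_url "$GIT_REPO_URL" "$GIT_USERNAME" "$GIT_TOKEN"` would produce the value passed to `git clone`. The helper does not URL-encode its inputs, so a username or token containing characters such as `@`, `:`, or `/` would need to be percent-encoded first.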
Replace the busybox init container with an alpine/git image that clones the repo and pulls LFS files. The model weights land on the PVC at /data/model before vLLM starts.
vllmstack:
  enabled: true
  servingEngineSpec:
    enableEngine: true
    runtimeClassName: "nvidia"
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
    modelSpec:
      - name: "qwen35-9b"
        repository: "vllm/vllm-openai"
        tag: "latest"
        modelURL: "/data/model/models/qwen35-9b-awq" # local path, no HF download
        replicaCount: 1
        requestCPU: 4
        requestMemory: "28Gi"
        requestGPU: 1
        limitCPU: "4"
        limitMemory: "28Gi"
        shmSize: "2Gi"
        pvcStorage: "80Gi"
        pvcAccessMode:
          - ReadWriteOnce
        env:
          - name: TMPDIR
            value: /data/tmp
          - name: GIT_REPO_URL
            valueFrom:
              secretKeyRef:
                name: model-repo-credentials
                key: GIT_REPO_URL
          - name: GIT_USERNAME
            valueFrom:
              secretKeyRef:
                name: model-repo-credentials
                key: GIT_USERNAME
          - name: GIT_TOKEN
            valueFrom:
              secretKeyRef:
                name: model-repo-credentials
                key: GIT_TOKEN
        initContainer:
          name: pull-model-weights
          image: alpine/git:2.43.0
          command: ["sh"]
          args:
            - -c
            - |
              set -e
              mkdir -p /data/tmp /data/model /data/.cache
              chmod -R 777 /data
              # Skip clone if weights already exist (pod restart)
              if [ -d /data/model/models/qwen35-9b-awq ]; then
                echo "Model weights already present, skipping clone."
                exit 0
              fi
              # Install git-lfs (needs a reachable Alpine package mirror)
              apk add --no-cache git-lfs
              git lfs install
              # Clone with embedded credentials (shallow, LFS enabled)
              REPO_URL="https://${GIT_USERNAME}:${GIT_TOKEN}@$(echo "$GIT_REPO_URL" | sed 's|https://||')"
              git clone --depth=1 "$REPO_URL" /data/model
              echo "Model weights pulled successfully."
          mountPvcStorage: true
        vllmConfig:
          dtype: "float16"
          tensorParallelSize: 1
          maxModelLen: 8192
          gpuMemoryUtilization: 0.90
          enableChunkedPrefill: true
          enablePrefixCaching: false
          maxNumSeqs: 16
          extraArgs:
            - "--trust-remote-code"
            - "--quantization"
            - "awq"
            - "--enforce-eager"
            - "--attention-backend"
            - "flashinfer"
            - "--max-num-batched-tokens"
            - "1024"
            - "--disable-log-stats"
  routerSpec:
    enableRouter: false
The init container skips cloning if the weights directory already exists. This avoids re-downloading on pod restarts. If you push new weights to the repo, delete the PVC and let the pod reprovision.
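If Git LFS is not active during the clone, the weight files arrive as small text pointer files rather than real weights, and vLLM fails later when it tries to load them. The following sketch detects that case by looking for the first line of the Git LFS pointer format. The function names are hypothetical, and the file extensions mirror the `git lfs track` patterns used earlier.

```shell
# is_lfs_pointer FILE: succeeds if FILE is a Git LFS pointer, not real content.
is_lfs_pointer() {
  # Pointer files are small text files whose first line names the LFS spec.
  first_line="$(head -c 200 "$1" | tr -d '\000' | head -n 1)"
  [ "$first_line" = "version https://git-lfs.github.com/spec/v1" ]
}

# check_weights DIR: fails if any weight file directly under DIR is a pointer.
check_weights() {
  rc=0
  for f in "$1"/*.safetensors "$1"/*.bin "$1"/*.pt "$1"/*.gguf; do
    [ -f "$f" ] || continue          # skip patterns that matched nothing
    if is_lfs_pointer "$f"; then
      echo "LFS pointer, not weights: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

One possible use is to call `check_weights /data/model/models/qwen35-9b-awq` as the last step of the init container, so that a misconfigured clone fails there instead of inside vLLM.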
Option B — MinIO
Since MinIO is already running in the cluster, you can use it as a model registry. Upload weights once; the init container fetches them on first pod start.
1. Upload model weights to MinIO
On a machine with internet access, download the weights and push to MinIO:
# Download from Hugging Face
huggingface-cli download QuantTrio/Qwen3.5-9B-AWQ \
--local-dir ./qwen35-9b-awq
# Configure the MinIO client. If you are uploading from outside the cluster,
# run the port-forward in a separate terminal first; it blocks until interrupted.
kubectl port-forward -n cobi svc/cobi-dashboard-minio 9000:9000
mc alias set cobi http://localhost:9000 admin <minio-root-password>
# Create a models bucket and upload
mc mb cobi/models
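# The trailing slash on the source copies the directory's contents,
# so config.json lands directly under models/qwen35-9b-awq/
mc cp --recursive ./qwen35-9b-awq/ cobi/models/qwen35-9b-awq/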
Verify:
mc ls cobi/models/qwen35-9b-awq/
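Listing the bucket confirms that files exist but not that they arrived intact. A checksum manifest is one way to close that gap. The sketch below assumes a Linux machine with `sha256sum`, `find`, `sort -z`, and `xargs -0` available; the `SHA256SUMS` file name and both function names are illustrative choices, not something MinIO or the chart requires.

```shell
# write_manifest DIR: record a SHA-256 for every file under DIR.
write_manifest() {
  (
    cd "$1" || exit 1
    find . -type f ! -name SHA256SUMS -print0 \
      | sort -z \
      | xargs -0 sha256sum > SHA256SUMS
  )
}

# verify_manifest DIR: recompute hashes and compare with the manifest.
# Returns non-zero if any file is missing or differs.
verify_manifest() {
  (
    cd "$1" || exit 1
    sha256sum -c SHA256SUMS > /dev/null
  )
}
```

Running `write_manifest ./qwen35-9b-awq` before the upload places the manifest next to the weights, so it is uploaded with them. Running `verify_manifest` on the downloaded copy, wherever `sha256sum` is available, then checks every file against it.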
Replace the busybox init container with minio/mc to pull weights from the models bucket to the PVC:
vllmstack:
  enabled: true
  servingEngineSpec:
    enableEngine: true
    runtimeClassName: "nvidia"
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
    modelSpec:
      - name: "qwen35-9b"
        repository: "vllm/vllm-openai"
        tag: "latest"
        modelURL: "/data/model" # local path on PVC
        replicaCount: 1
        requestCPU: 4
        requestMemory: "28Gi"
        requestGPU: 1
        limitCPU: "4"
        limitMemory: "28Gi"
        shmSize: "2Gi"
        pvcStorage: "80Gi"
        pvcAccessMode:
          - ReadWriteOnce
        env:
          - name: TMPDIR
            value: /data/tmp
          - name: MINIO_ENDPOINT
            value: "http://cobi-dashboard-minio:9000"
          - name: MINIO_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: minio-credentials
                key: root-user
          - name: MINIO_SECRET_KEY
            valueFrom:
              secretKeyRef:
                name: minio-credentials
                key: root-password
        initContainer:
          name: pull-model-weights
          image: minio/mc:latest
          command: ["sh"]
          args:
            - -c
            - |
              set -e
              mkdir -p /data/tmp /data/.cache
              chmod -R 777 /data
              # Skip download if weights already present
              if [ -f /data/model/config.json ]; then
                echo "Model weights already present, skipping download."
                exit 0
              fi
              mc alias set storage "$MINIO_ENDPOINT" "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY"
              mc cp --recursive storage/models/qwen35-9b-awq/ /data/model/
              echo "Model weights downloaded successfully."
          mountPvcStorage: true
        vllmConfig:
          dtype: "float16"
          tensorParallelSize: 1
          maxModelLen: 8192
          gpuMemoryUtilization: 0.90
          enableChunkedPrefill: true
          enablePrefixCaching: false
          maxNumSeqs: 16
          extraArgs:
            - "--trust-remote-code"
            - "--quantization"
            - "awq"
            - "--enforce-eager"
            - "--attention-backend"
            - "flashinfer"
            - "--max-num-batched-tokens"
            - "1024"
            - "--disable-log-stats"
  routerSpec:
    enableRouter: false
minio-credentials is the same Secret used by the MinIO subchart itself. See MinIO Setup for how to create it. The secret keys are root-user and root-password.
Monitoring the Init Container
# Watch init container logs while the pod starts
kubectl logs -n cobi -l app.kubernetes.io/name=vllmstack \
-c pull-model-weights --follow
# Check pod events if the init container fails
kubectl describe pod -n cobi -l app.kubernetes.io/name=vllmstack
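A successful pull ends with one of these lines (the first for Git LFS, the second for MinIO):
Model weights pulled successfully.
Model weights downloaded successfully.
After the init container exits, vLLM starts and loads the weights from the modelURL path on the PVC: /data/model/models/qwen35-9b-awq for Git LFS, /data/model for MinIO.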
Updating the Model
To deploy a new version of the weights:
# Delete the PVC so the init container re-downloads on the next pod start.
# Kubernetes completes the deletion only once no pod is still using the PVC.
kubectl delete pvc -n cobi -l app.kubernetes.io/name=vllmstack
# The next helm upgrade or pod restart triggers a fresh pull
helm upgrade cobi-dashboard oci://registry-1.docker.io/hellocobi/cobi-dashboard \
--namespace cobi \
--version 0.1.0 \
--values my-values.yaml \
--wait