NVIDIA GPU/Driver Compatibility for vLLM + OpenLLM: Qwen 3.5, GLM, DeepSeek, Gemma
A practical compatibility and troubleshooting guide for serving modern open models with vLLM and OpenLLM, including NVIDIA GPU/driver baselines, CUDA alignment, and real-world failure patterns.
If you are deploying modern open models, the biggest production risk is usually not model quality.
It is version mismatch across:
- NVIDIA GPU architecture (Turing / Ampere / Ada / Hopper / Blackwell),
- NVIDIA driver branch,
- CUDA runtime/toolkit,
- PyTorch + vLLM version,
- OpenLLM backend integration and model template assumptions.
This guide is a practical field manual for teams serving Qwen 3.5, GLM, DeepSeek, and Gemma family models.
Scope: inference/serving with vLLM and OpenLLM in Linux environments (bare metal or containers), with emphasis on real incident patterns.
1) The Compatibility Stack You Must Keep Aligned
Think in this order:
- GPU capability (VRAM, tensor core generation, BF16/FP8 practicality)
- Driver branch (must support your CUDA userspace)
- Container CUDA userspace (e.g., cu121, cu124)
- PyTorch binary (compiled for same CUDA family)
- vLLM build (linked against matching torch/cuda ecosystem)
- OpenLLM runtime wiring (backend flags, tokenizer chat template, model config)
If one layer is behind, you can get:
- startup crashes,
- CUDA error: invalid device function,
- NCCL hangs,
- silent throughput collapse,
- incorrect output format from chat template mismatch.
2) Practical NVIDIA Driver/CUDA Baselines
Exact minimums evolve by release, but these operational baselines reduce incidents for current vLLM stacks:
| CUDA userspace target | Practical driver branch baseline (Linux) | Notes |
|---|---|---|
| CUDA 12.1 class | R530+ (prefer R535+) | Stable floor for many older PyTorch/vLLM combos |
| CUDA 12.2/12.3 class | R535+/R545+ | Good bridge period for mixed clusters |
| CUDA 12.4 class | R550+ | Common baseline for newer high-throughput images |
| CUDA 12.5/12.6 class | R555+/R560+ | Use when stack explicitly requires it |
Why this matters
- Newer containers can start even when the host driver is too old for their CUDA userspace, then fail at runtime.
- Driver mismatch frequently appears as cryptic kernel launch errors rather than a clear compatibility message.
Minimal host checks
nvidia-smi
cat /proc/driver/nvidia/version
Container checks
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import vllm; print(vllm.__version__)"
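The one-liners above can be folded into a single container-side sanity script. The sketch below is illustrative and assumes you already know which CUDA lane the image is supposed to target; the EXPECTED_CUDA value is a placeholder, not a recommendation.

```python
# check_stack.py -- minimal sanity check run inside the serving container.
# EXPECTED_CUDA is an assumption for this sketch; set it to your image's CUDA lane.
import sys

import torch
import vllm

EXPECTED_CUDA = "12.4"  # hypothetical target lane for this example


def main() -> int:
    print(f"torch {torch.__version__}, built for CUDA {torch.version.cuda}")
    print(f"vllm  {vllm.__version__}")

    if not torch.cuda.is_available():
        print("FAIL: torch cannot see a CUDA device (driver/userspace mismatch?)")
        return 1

    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0: {name} (compute capability {major}.{minor})")

    if not (torch.version.cuda or "").startswith(EXPECTED_CUDA):
        print(f"FAIL: torch built for CUDA {torch.version.cuda}, expected {EXPECTED_CUDA}.x")
        return 1

    print("OK: container userspace looks consistent")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```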
3) Known-Good Versioning Strategy (vLLM + OpenLLM)
Instead of chasing the latest release of every package, use tested bundles:
- Pin torch + vllm + transformers + tokenizers together.
- Upgrade one axis at a time (driver first, then container base, then runtime).
- Keep OpenLLM backend config pinned per model family.
A practical policy:
- Maintain two lanes:
  - stable lane for production,
  - canary lane for latest models/features.
- Promote only after load + regression + prompt-format tests pass.
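One way to enforce the pinned bundle is a small drift check run at container start or in CI. The sketch below assumes you record the tested versions yourself; the version numbers shown are placeholders, not recommendations.

```python
# verify_pins.py -- fail fast if the runtime drifts from the tested bundle.
# All version numbers below are placeholders; pin whatever bundle passed your
# own load/regression tests.
from importlib.metadata import PackageNotFoundError, version

PINNED = {
    "torch": "2.4.0",          # placeholder
    "vllm": "0.6.3",           # placeholder
    "transformers": "4.45.2",  # placeholder
    "tokenizers": "0.20.1",    # placeholder
    "openllm": "0.6.10",       # placeholder
}


def check_pins() -> list[str]:
    problems = []
    for pkg, expected in PINNED.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            problems.append(f"{pkg}: not installed (expected {expected})")
            continue
        if installed != expected:
            problems.append(f"{pkg}: installed {installed}, pinned {expected}")
    return problems


if __name__ == "__main__":
    issues = check_pins()
    if issues:
        raise SystemExit("Pin drift detected:\n  " + "\n  ".join(issues))
    print("All packages match the pinned bundle.")
```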
4) Model Family Notes (Qwen 3.5, GLM, DeepSeek, Gemma)
The table below is a deployment-oriented summary (not a benchmark ranking).
| Family | Typical serving mode in vLLM | Common gotcha | What to validate first |
|---|---|---|---|
| Qwen 3.5 | BF16/FP16 or quantized; long-context workloads common | Chat template / tokenizer mismatch causes quality drop | tokenizer_config.json, generation defaults, context tests |
| GLM (latest open checkpoints) | Standard causal LM serving pattern (family-specific prompt style) | Special tokens and prompt wrappers differ from Llama-style defaults | End-to-end prompt template golden tests |
| DeepSeek (latest open instruct/coder variants) | High-throughput decoding; reasoning/coder variants can be memory-heavy | OOM due to optimistic max model len + high concurrency | KV cache sizing and max tokens under real traffic |
| Gemma (latest open variants) | Strong small-to-mid model serving economics | Tokenizer/chat formatting differences between checkpoints | Prompt canonicalization + stop sequence validation |
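The template and tokenizer gotchas in the table above are cheap to catch with a golden test before cutover. A minimal sketch follows; the checkpoints and expected role markers are illustrative assumptions, so substitute the exact Qwen/GLM/DeepSeek/Gemma checkpoints you deploy and derive the markers from each family's tokenizer_config.json.

```python
# template_golden_test.py -- confirm each family's chat template renders the
# role tokens you expect before routing traffic to it.
from transformers import AutoTokenizer

# Checkpoint -> substring expected in the rendered prompt.
# Both entries are illustrative examples, not deployment recommendations.
GOLDEN = {
    "Qwen/Qwen2.5-7B-Instruct": "<|im_start|>",
    "google/gemma-2-9b-it": "<start_of_turn>",
}

MESSAGES = [{"role": "user", "content": "Reply with the single word: pong"}]


def check(model_id: str, marker: str) -> bool:
    tok = AutoTokenizer.from_pretrained(model_id)
    rendered = tok.apply_chat_template(
        MESSAGES, tokenize=False, add_generation_prompt=True
    )
    ok = marker in rendered
    print(f"{model_id}: {'OK' if ok else 'MISSING ' + marker}")
    return ok


if __name__ == "__main__":
    if not all(check(model_id, marker) for model_id, marker in GOLDEN.items()):
        raise SystemExit("Chat template golden test failed")
```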
5) Qwen 3.5 on vLLM: GPU/Driver Planning
For Qwen 3.5-class deployments, plan around three axes:
- Model size (effective weight memory)
- Context length (KV cache growth)
- Concurrency target
Rule-of-thumb hardware tiers
| Deployment target | Practical GPU class | Driver/CUDA recommendation |
|---|---|---|
| Local/prototype, low concurrency | 24GB class (L4/A10/4090) | Driver branch aligned to chosen CUDA 12.x image |
| Small production API | 40–48GB class or 2x24GB | Prefer newer R550+ path for fewer CUDA edge issues |
| Long-context + higher concurrency | 80GB class or multi-GPU | Keep NCCL + driver + CUDA tightly pinned across nodes |
Memory planning reminder
Total memory pressure is not just weights:
Total ≈ Weights + KV Cache + Runtime Overhead + Fragmentation Buffer
Teams often fit the weights on paper, then fail in production because KV cache growth under realistic conversation lengths was never budgeted.
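A back-of-envelope estimate makes this concrete. The sketch below uses the standard approximation that each token stores one key and one value vector per layer; the model shape numbers are illustrative placeholders, so read the real values from the checkpoint's config.json.

```python
# kv_estimate.py -- back-of-envelope KV cache sizing for capacity planning.
# Per token, the cache stores a key and a value vector per layer, so
#   bytes/token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_len: int, concurrent_seqs: int,
                 dtype_bytes: int = 2) -> float:
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    total = bytes_per_token * context_len * concurrent_seqs
    return total / 1024**3


if __name__ == "__main__":
    # Hypothetical 7B-class model with grouped-query attention; numbers are placeholders.
    est = kv_cache_gib(num_layers=28, num_kv_heads=4, head_dim=128,
                       context_len=32_768, concurrent_seqs=16)
    print(f"Estimated KV cache: {est:.1f} GiB "
          f"(before runtime overhead and fragmentation buffer)")
```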
6) What Breaks Most Often (and How to Fix It)
A) CUDA error: invalid device function
Typical cause
- A binary compiled for a different CUDA version or SM architecture than the host driver and device actually provide.
Fix path
- Verify host driver branch is new enough.
- Use official/prebuilt wheels matching your CUDA lane.
- Avoid mixing random nightly wheels across torch/vllm.
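A quick way to spot this class of mismatch before serving is to compare the GPU's compute capability against the architectures your torch build ships kernels for. A minimal sketch, assuming a visible CUDA device; the exact-match check is a simplification, since PTX JIT can sometimes cover newer GPUs.

```python
# arch_check.py -- detect the classic "invalid device function" setup early:
# a torch build whose compiled kernel architectures do not include this GPU's SM.
import torch


def main() -> None:
    major, minor = torch.cuda.get_device_capability(0)
    device_sm = f"sm_{major}{minor}"
    built_for = torch.cuda.get_arch_list()  # e.g. ['sm_80', 'sm_86', 'sm_90']
    print(f"Device: {torch.cuda.get_device_name(0)} ({device_sm})")
    print(f"torch kernels built for: {built_for}")
    # Exact match is a simplification: newer GPUs can often run PTX JIT-compiled
    # from an older arch, but a missing or older-only list is a strong warning sign.
    if device_sm not in built_for:
        print(f"WARNING: {device_sm} not in torch's compiled arch list; "
              "expect 'invalid device function' or slow JIT fallback.")


if __name__ == "__main__":
    main()
```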
B) NCCL timeout / multi-GPU hang
Typical cause
- Inconsistent driver/CUDA/NCCL environment across nodes, or bad network interface selection.
Fix path
- Ensure all nodes have identical driver major branch.
- Pin container image digest (not floating tags).
- Set/verify NCCL network env explicitly for your fabric.
- Run a small all-reduce smoke test before model serving.
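A minimal version of that smoke test, assuming a torchrun launch on each node (for example, torchrun --nproc_per_node=<gpus> nccl_smoke.py):

```python
# nccl_smoke.py -- tiny all-reduce across all visible GPUs before serving.
import os

import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor of ones; the all-reduced sum must equal world size.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    expected = float(dist.get_world_size())
    ok = abs(x.item() - expected) < 1e-6
    print(f"rank {rank}: all_reduce={x.item()} expected={expected} -> "
          f"{'OK' if ok else 'FAIL'}")

    dist.destroy_process_group()
    if not ok:
        raise SystemExit(1)


if __name__ == "__main__":
    main()
```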
C) OOM despite “model should fit”
Typical cause
- KV cache + concurrency + long context underestimated.
Fix path
- Lower max context / max num seqs initially.
- Use quantization path validated for your workload.
- Increase tensor parallel or move to larger VRAM tier.
- Keep headroom (10–20%) for fragmentation spikes.
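In vLLM terms, this usually means starting from deliberately conservative engine limits and raising them only after load tests. A sketch with a placeholder checkpoint and numbers; the argument names follow the LLM/EngineArgs interface of recent vLLM releases, so confirm them against the version you have pinned.

```python
# conservative_engine.py -- start with modest limits, then raise them only
# after measuring real traffic. Model ID and numbers are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # illustrative checkpoint
    max_model_len=8192,                 # well below the checkpoint maximum to start
    gpu_memory_utilization=0.85,        # leave headroom for overhead/fragmentation
    max_num_seqs=32,                    # cap concurrent sequences until load-tested
    tensor_parallel_size=1,
)

params = SamplingParams(max_tokens=256, temperature=0.7)
out = llm.generate(["Summarize why KV cache sizing matters in one sentence."], params)
print(out[0].outputs[0].text)
```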
D) Bad answer format / role confusion
Typical cause
- Chat template mismatch between model family and serving wrapper defaults.
Fix path
- Lock model-specific template in config.
- Add golden prompt-output tests per family.
- Block deploy if template checksum changes unexpectedly.
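A simple checksum gate is enough for the last point. The sketch below hashes the tokenizer's chat_template and compares it against a digest recorded when the model was last validated; the model ID and pinned digest are placeholders.

```python
# template_checksum.py -- block deploys when a model's chat template changes silently.
import hashlib

from transformers import AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"      # illustrative checkpoint
PINNED_SHA256 = "replace-with-recorded-digest"  # record this at validation time


def template_digest(model_id: str) -> str:
    tok = AutoTokenizer.from_pretrained(model_id)
    template = tok.chat_template or ""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    digest = template_digest(MODEL_ID)
    print(f"{MODEL_ID} chat_template sha256: {digest}")
    if digest != PINNED_SHA256:
        raise SystemExit("Chat template changed since last validated deploy; blocking.")
```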
E) Throughput collapse after “upgrade”
Typical cause
- Changed kernel paths, fallback to eager execution, or new scheduler defaults.
Fix path
- Compare tokens/sec, p95 latency, and GPU utilization before/after.
- Keep synthetic + replayed production prompts for A/B.
- Roll back quickly if regression > agreed threshold.
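A rough before/after probe can be scripted against the serving endpoint. The sketch below assumes an OpenAI-compatible completions endpoint (vLLM's server and OpenLLM both expose one); the URL, served-model name, baseline, and regression threshold are placeholders.

```python
# throughput_probe.py -- rough tokens/sec probe against an OpenAI-compatible endpoint.
import time

import requests

URL = "http://localhost:8000/v1/completions"   # adjust to your deployment
MODEL = "my-served-model"                      # placeholder served-model name
BASELINE_TOKENS_PER_SEC = 400.0                # recorded before the upgrade
MAX_REGRESSION = 0.15                          # agreed rollback threshold (15%)


def measure(prompt: str, runs: int = 5) -> float:
    total_tokens, total_time = 0, 0.0
    for _ in range(runs):
        t0 = time.time()
        r = requests.post(URL, json={"model": MODEL, "prompt": prompt, "max_tokens": 256})
        r.raise_for_status()
        total_time += time.time() - t0
        total_tokens += r.json()["usage"]["completion_tokens"]
    return total_tokens / total_time


if __name__ == "__main__":
    tps = measure("Explain KV cache sizing for LLM serving.")
    print(f"measured {tps:.1f} tok/s vs baseline {BASELINE_TOKENS_PER_SEC:.1f}")
    if tps < BASELINE_TOKENS_PER_SEC * (1 - MAX_REGRESSION):
        raise SystemExit("Throughput regression beyond threshold; roll back.")
```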
7) OpenLLM + vLLM Integration Checklist
Before production cutover:
- Pin OpenLLM version and backend adapter config.
- Pin vLLM + torch + transformers bundle.
- Freeze tokenizer/chat template per model family.
- Record host driver branch in deployment metadata.
- Run canary with real prompt distributions.
- Validate stream format, stop behavior, and tool-call schema.
- Validate autoscaling warmup and cold-start latency.
Recommended CI gates:
- Boot test: model loads and serves a trivial prompt.
- Template test: prompt wrappers produce expected role tokens.
- Load test: fixed QPS/concurrency pass criteria.
- Long-context test: no OOM/hang at planned context percentile.
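The boot and template gates translate directly into a few pytest cases against a canary endpoint. A minimal sketch of the boot gate, with a placeholder URL and served-model name:

```python
# test_boot.py -- minimal CI boot gate against a running endpoint (pytest style).
# Endpoint URL and model name are assumptions; point them at your canary deployment.
import requests

BASE_URL = "http://localhost:8000"      # canary endpoint for this sketch
MODEL = "my-served-model"               # placeholder served-model name


def test_model_serves_trivial_prompt():
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Reply with exactly: pong"}],
            "max_tokens": 8,
        },
        timeout=60,
    )
    assert resp.status_code == 200
    body = resp.json()
    text = body["choices"][0]["message"]["content"]
    assert len(text.strip()) > 0
    # finish_reason should be a normal stop or length cap, not a server-side error
    assert body["choices"][0]["finish_reason"] in ("stop", "length")
```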
8) What You May Need to Change During Upgrades
When moving to newer model releases (including latest Qwen/GLM/DeepSeek/Gemma), expect to update:
- tokenizer/chat template logic,
- max_model_len and batching strategy,
- quantization choice,
- tensor/pipeline parallel settings,
- driver branch on hosts,
- container base image (CUDA lane).
Things that often do not work well
- Keeping old drivers while upgrading to newest CUDA-tagged runtime images.
- Mixing heterogeneous driver branches in one inference cluster.
- Assuming Llama-family prompt wrapper works for every non-Llama model.
- Raising max context without redoing KV cache/concurrency capacity tests.
9) Reference Rollout Plan (Low-Risk)
- Stage 0: Inventory
  - Capture GPU SKU, VRAM, driver branch, CUDA lane, runtime versions.
- Stage 1: Single-node validation
  - One model family at a time, fixed prompt harness.
- Stage 2: Multi-GPU/node smoke
  - NCCL and network path validation.
- Stage 3: Canary traffic
  - Real traffic shadow or small percentage live.
- Stage 4: Full rollout
  - Keep rollback image and prior driver-compatible lane ready.
10) Final Takeaway
For modern OpenLLM serving, success comes from compatibility discipline, not just bigger GPUs.
- Treat driver/CUDA/runtime/model-template as one deployable unit.
- Pin known-good bundles for vLLM + OpenLLM.
- For Qwen 3.5 and other latest model families, always re-validate tokenizer/template + long-context memory behavior before full rollout.
If you do this, most “mysterious” production incidents become predictable and preventable.