Stabilize local vLLM DeepGEMM warmup startup by jioffe502 · Pull Request #2292 · NVIDIA/NeMo-Retriever

jioffe502 · 2026-07-02T14:04:25Z

Summary

Default NRL local vLLM startup to VLLM_DEEP_GEMM_WARMUP=skip via os.environ.setdefault(...).
Apply that default before every local NRL vllm.LLM constructor: embedding, VL rerank, captioning, and Nemotron Parse.
Do not set VLLM_USE_DEEP_GEMM=0 and do not hard-code CUDA_HOME; users can still opt into another vLLM warmup mode explicitly.
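
A minimal sketch of the default described above, for reviewers who want to see the opt-in behavior concretely. This is an illustration, not the code in this PR: the helper name `apply_deep_gemm_warmup_default` is hypothetical, and the only detail taken from the description is the use of `os.environ.setdefault(...)` with `VLLM_DEEP_GEMM_WARMUP=skip`.

```python
import os


# Hypothetical helper name; the PR description only states that
# os.environ.setdefault(...) is applied before each local vllm.LLM constructor.
def apply_deep_gemm_warmup_default() -> str:
    """Default VLLM_DEEP_GEMM_WARMUP to "skip" unless a value is already set.

    Returns the value that is in effect after the call.
    """
    # setdefault leaves an existing value untouched, so an explicit user
    # choice (for example VLLM_DEEP_GEMM_WARMUP=full) still wins.
    return os.environ.setdefault("VLLM_DEEP_GEMM_WARMUP", "skip")


# Intended call order (assumption): invoke the helper first, then construct
# the engine, e.g. apply_deep_gemm_warmup_default(); llm = LLM(**kwargs)
```

Because `setdefault` only writes when the key is absent, exporting `VLLM_DEEP_GEMM_WARMUP` in the shell before launching the process is enough to override the default. Nothing in the sketch touches `VLLM_USE_DEEP_GEMM` or `CUDA_HOME`.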

Why

JP20 local harness runs have been failing during ingest before any rows are written. The failing artifact points to local embedding/vLLM startup, not caption or rerank:

results.json: exit_code: 10, failed_phase: ingest
ingest_plan.json: caption: null, local nvidia/llama-nemotron-embed-1b-v2, local_ingest_embed_backend: "vllm"
query_plan.json: rerank: false

Original trace:

RuntimeError: DeepGEMM backend is not available or outdated. Please install or update the `deep_gemm` to a newer version to enable FP8 kernels.
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

File ".../llama_nemotron_embed_1b_v2_embedder.py", line 69, in _ensure_loaded
    self._llm = create_vllm_llm(
File ".../models/inference/vllm.py", line 87, in create_vllm_llm
    return LLM(**kwargs)

Downstream symptom:

lancedb_write summary: total=3147 accepted=0 dropped_bad_length=3147 expected_dim=2048
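
To make the summary line easier to read, here is a hedged sketch of how a length check could produce counts of that shape. The actual `lancedb_write` logic is not shown in this PR, so the function name and the handling of missing embeddings are assumptions; only the field names come from the log line above.

```python
from typing import Dict, Optional, Sequence


# Hypothetical function; it mirrors the field names in the log line only.
def summarize_embedding_lengths(
    embeddings: Sequence[Optional[Sequence[float]]],
    expected_dim: int,
) -> Dict[str, int]:
    """Count rows whose embedding has exactly expected_dim entries."""
    # A missing (None) embedding is treated as a bad length (assumption).
    accepted = sum(
        1 for emb in embeddings if emb is not None and len(emb) == expected_dim
    )
    total = len(embeddings)
    return {
        "total": total,
        "accepted": accepted,
        "dropped_bad_length": total - accepted,
        "expected_dim": expected_dim,
    }
```

Under this reading, `accepted=0` with `dropped_bad_length` equal to `total` means no row carried a 2048-dimensional vector, which is consistent with the embedder never finishing startup.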

Review question

Is skipping optional DeepGEMM warmup the right default for NRL local startup reliability, while letting Hopper/Blackwell performance owners opt into VLLM_DEEP_GEMM_WARMUP=full or another upstream-supported mode?

Validation

uv run --project nemo_retriever pytest nemo_retriever/tests/test_vllm_embed.py nemo_retriever/tests/test_nemotron_rerank_vl_v2.py nemo_retriever/tests/test_caption_model_profiles.py -q
- 97 passed, 2 warnings
python -m compileall on changed Python files
git diff --check
Live local embedding smoke with CUDA/DeepGEMM env vars unset:

warmup skip
shape (1, 2048)
dtype torch.float32

Stabilize vLLM DeepGEMM warmup startup

f471a67

jioffe502 wants to merge 1 commit into NVIDIA:main from jioffe502:codex/vllm-deepgemm-warmup-skip
