Stabilize local vLLM DeepGEMM warmup startup #2292

Draft
jioffe502 wants to merge 1 commit into NVIDIA:main from jioffe502:codex/vllm-deepgemm-warmup-skip

@jioffe502 jioffe502 commented Jul 2, 2026

Collaborator

Summary

  • Default NRL local vLLM startup to VLLM_DEEP_GEMM_WARMUP=skip via os.environ.setdefault(...).
  • Apply that default before every local NRL vllm.LLM constructor: embedding, VL rerank, captioning, and Nemotron Parse.
  • Do not set VLLM_USE_DEEP_GEMM=0 and do not hard-code CUDA_HOME; users can still opt into another vLLM warmup mode explicitly.
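The default-then-construct pattern described in the bullets above can be sketched as follows. This is a minimal illustration, not the code in this PR: `apply_deepgemm_warmup_default` and `create_local_llm` are hypothetical names, and `llm_factory` stands in for the real `vllm.LLM` constructor so the sketch runs without vLLM installed.

```python
import os

WARMUP_ENV = "VLLM_DEEP_GEMM_WARMUP"


def apply_deepgemm_warmup_default(environ=None):
    """Set the warmup mode to "skip" only if the user has not set it.

    Hypothetical helper. Returns the effective value after the call.
    """
    env = os.environ if environ is None else environ
    # setdefault leaves an existing value (for example "full") untouched.
    return env.setdefault(WARMUP_ENV, "skip")


def create_local_llm(llm_factory, **kwargs):
    """Apply the default, then build the engine.

    `llm_factory` is an assumption standing in for vllm.LLM; the env var
    is set before the constructor runs so engine startup can see it.
    """
    apply_deepgemm_warmup_default()
    return llm_factory(**kwargs)
```

Because `setdefault` never overwrites, a user who exports `VLLM_DEEP_GEMM_WARMUP=full` before launching keeps that setting.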

Why

JP20 local harness runs have been failing during ingest before any rows are written. The failing artifact points to local embedding/vLLM startup, not caption or rerank:

  • results.json: exit_code: 10, failed_phase: ingest
  • ingest_plan.json: caption: null, local nvidia/llama-nemotron-embed-1b-v2, local_ingest_embed_backend: "vllm"
  • query_plan.json: rerank: false

Original trace:

RuntimeError: DeepGEMM backend is not available or outdated. Please install or update the `deep_gemm` to a newer version to enable FP8 kernels.
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

File ".../llama_nemotron_embed_1b_v2_embedder.py", line 69, in _ensure_loaded
    self._llm = create_vllm_llm(
File ".../models/inference/vllm.py", line 87, in create_vllm_llm
    return LLM(**kwargs)

Downstream symptom:

lancedb_write summary: total=3147 accepted=0 dropped_bad_length=3147 expected_dim=2048
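One way to read that summary line: every row was rejected because its vector length did not match `expected_dim`, which is what you would see if the embedder never produced valid vectors. The sketch below shows a length check of that kind. It is an assumption about the shape of the check, not the actual `lancedb_write` code, and `partition_by_dim` is a hypothetical name.

```python
def partition_by_dim(vectors, expected_dim):
    """Split vectors into accepted ones and a count of bad-length drops.

    Hypothetical helper: a vector is accepted only if it is present and
    has exactly `expected_dim` elements.
    """
    accepted = []
    dropped_bad_length = 0
    for vec in vectors:
        if vec is not None and len(vec) == expected_dim:
            accepted.append(vec)
        else:
            dropped_bad_length += 1
    return accepted, dropped_bad_length


# If embedding failed and every row carries an empty vector, nothing passes.
rows = [[] for _ in range(5)]
accepted, dropped = partition_by_dim(rows, expected_dim=2048)
print(f"accepted={len(accepted)} dropped_bad_length={dropped}")
```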

Review question

Is skipping optional DeepGEMM warmup the right default for NRL local startup reliability, while letting Hopper/Blackwell performance owners opt into VLLM_DEEP_GEMM_WARMUP=full or another upstream-supported mode?

Validation

  • uv run --project nemo_retriever pytest nemo_retriever/tests/test_vllm_embed.py nemo_retriever/tests/test_nemotron_rerank_vl_v2.py nemo_retriever/tests/test_caption_model_profiles.py -q
    • 97 passed, 2 warnings
  • python -m compileall on changed Python files
  • git diff --check
  • Live local embedding smoke with CUDA/DeepGEMM env vars unset:
    warmup skip
    shape (1, 2048)
    dtype torch.float32
