Skip to content

Move warp scoped stream to a new model that uses warp blocking streams.#1771

Open
coreyjadams wants to merge 5 commits into
NVIDIA:mainfrom
coreyjadams:make-warp-streams-safe
Open

Move warp scoped stream to a new model that uses warp blocking streams.#1771
coreyjadams wants to merge 5 commits into
NVIDIA:mainfrom
coreyjadams:make-warp-streams-safe

Conversation

@coreyjadams

Copy link
Copy Markdown
Collaborator

This prevents premature garbage collection and illegal memory access.

The root cause of our issues:

  • Warp streams are blocking cuda streams.  Pytorch streams are non-blocking cuda streams.
  • Our data loader in physicsnemo (and all our code) uses pytorch stream infrastructure, so non-blocking streams.
  • We interop with Warp on the cuda side: we create streams in pytorch, pass the objects to warp, and the cuda objects are not changed.  So when we created a pytorch stream, we pass warp a non-blocking stream when warp expected a blocking stream.  Warp's behavior is 100% explained by it's design with blocking streams in mind.

The issues we see are things like the BVH structures, which create temporary objects on the GPU, would have components go out of scope in python and be garbage collected before the GPU kernel completes because it's async execution.  In pytorch, this is handled by event syncs: you can keep the GPU objects alive until a certain event has occurred, but warp will automatically handle this with the blocking streams.

The solution from Lukasz on the warp team is so simple we're all kicking ourselves for not thinking of it earlier

nonblocking_stream = wp.stream_from_torch(torch.cuda.Stream(device))  # <--- non-blocking stream
warp_stream = wp.Stream(device)  # <--- any Warp stream on the correct device

# do stuff on non-blocking stream
with wp.ScopedStream(nonblocking_stream):
    a = wp.zeros(100_000_000, dtype=float)
    for _ in range(1000):
        wp.launch(inc_kernel, dim=a.shape, inputs=[a])

warp_stream.wait_stream(nonblocking_stream)  # <--- magic

This PR codifies this as a new context scope that wraps Warp's stream scope. Using this, the SDF now works again in the GeoTransolver Datapipe!

I moved ALL our functionals over to this scope in this PR, which is why many files have been touch, since it should be exclusively safer without a performance impact. (Blocking stream != host blocking. It's a GPU stream block, not a full GPU sync).

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

This prevents premature garbage collection and illegal memory access.
@copy-pr-bot

copy-pr-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces FunctionSpec.warp_stream_scope, a context manager that replaces bare wp.ScopedStream calls across all Warp-backed functionals. The fix addresses premature garbage collection of temporary Warp objects (BVH structures, hash grids) caused by Warp's blocking-stream assumptions conflicting with PyTorch's non-blocking streams: the scope runs work on the borrowed torch stream, then has a fresh blocking Warp guard stream wait on it, forcing Warp's cleanup to sequence after compute.

  • Core fix in function_spec.py: warp_stream_scope creates a guard = wp.Stream(device) and calls guard.wait_stream(wp_launch_stream) in a finally block, matching the pattern recommended by the Warp team and correctly handling the CPU / None-stream path as a no-op.
  • 28 call sites updated: All uses of wp.ScopedStream in the repo's functionals are replaced with the new scope; point_cloud_ops.py also fixes a pre-existing bug where wp.stream_from_torch(tensor.device) was passing a device object instead of a CUDA stream.
  • _wp_launch placement: The scope is inserted at the innermost helper level in uniform_grid_gradient/utils.py rather than at the operation level, creating one guard stream per kernel call rather than one per operation — inconsistent with the pattern used elsewhere and adding unnecessary CUDA overhead.

Important Files Changed

Filename Overview
physicsnemo/core/function_spec.py Central change: adds warp_stream_scope context manager that wraps work in wp.ScopedStream and installs a blocking guard stream to sequence Warp cleanup after non-blocking torch stream work. Also adds top-level import warp as wp (acceptable since warp-lang is a hard dependency). The implementation correctly follows the Warp team's recommended solution.
physicsnemo/nn/functional/derivatives/uniform_grid_gradient/_warp_impl/utils.py Moves warp_stream_scope inside the low-level _wp_launch helper, creating a new guard wp.Stream for every individual kernel call. Other files place the scope at the operation level (around all kernel launches for one operation), which is both more efficient and more idiomatic.
physicsnemo/domain_parallel/shard_utils/point_cloud_ops.py Switches from wp.stream_from_torch(current_indices.device) (passing a device instead of a stream — a likely pre-existing bug) to FunctionSpec.warp_launch_context(current_indices), and wraps all launches in warp_stream_scope. Correct fix.
physicsnemo/nn/functional/neighbors/radius_search/_warp_impl.py Two wp.ScopedStream usages replaced with warp_stream_scope, one for forward (includes hash grid construction) and one for the gradient backward pass. Both correctly scope the entire block of Warp work under one guard stream.
physicsnemo/nn/functional/geometry/ray_mesh_intersect/_warp_impl/op.py BVH mesh build (_build_warp_mesh) and kernel launch are each wrapped in their own warp_stream_scope. The early return inside the scope is valid — Python still runs the finally block, so guard.wait_stream fires correctly.
physicsnemo/experimental/nn/symmetry/fused_norm_kernels.py Six wp.ScopedStream usages (forward/backward for rmsnorm, layernormsh, layernorm) replaced with warp_stream_scope. Scope placement is at the operation level, which is the correct pattern.
physicsnemo/nn/functional/geometry/sdf.py Single wp.ScopedStream replaced with warp_stream_scope; this is the function mentioned in the PR description as the original regression trigger (SDF in GeoTransolver Datapipe).
physicsnemo/nn/functional/derivatives/rectilinear_grid_gradient/_warp_impl/launch_backward.py Four wp.ScopedStream usages in forward/backward launch helpers replaced with warp_stream_scope. These are one-shot kernel launches without temporary Warp objects, so the guard stream overhead is low.

Comments Outside Diff (1)

  1. physicsnemo/nn/functional/derivatives/uniform_grid_gradient/_warp_impl/utils.py, line 103-129 (link)

    P2 The warp_stream_scope is placed inside _wp_launch, so a new guard wp.Stream is allocated and a CUDA event wait is enqueued on every individual kernel call. Every other functional in this PR places the scope once at the operation level (wrapping all kernel launches for a single forward/backward pass). The uniform-grid kernels don't create temporary Warp objects like BVH structures or hash grids, so the per-call guard-stream overhead provides no additional safety. The recommended fix is to remove the scope from _wp_launch and hoist it into op.py (which already holds FunctionSpec.warp_launch_context) around the _launch_forward / _launch_backward calls.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "Merge branch 'main' into make-warp-strea..." | Re-trigger Greptile

@peterdsharpe peterdsharpe left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find, and nice design! We'll have to be vigilant that we remember to use this on any new Warp-backend functionals. Maybe worth adding a sentence to FUNCTIONAL_APIS.md to this effect, beyond the example-change you did? This is also fine as-is though.

Comment thread physicsnemo/core/function_spec.py

@loliverhennigh loliverhennigh left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ktangsali ktangsali left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for the explainer session and walkthrough

@coreyjadams

Copy link
Copy Markdown
Collaborator Author

/ok to test 98d23f1

@coreyjadams coreyjadams enabled auto-merge July 2, 2026 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants