Move warp scoped stream to a new model that uses warp blocking streams. by coreyjadams · Pull Request #1771 · NVIDIA/physicsnemo

coreyjadams · 2026-07-01T19:21:39Z

This prevents premature garbage collection and illegal memory access.

The root cause of our issues:

Warp streams are blocking cuda streams. Pytorch streams are non-blocking cuda streams.
Our data loader in physicsnemo (and all our code) uses pytorch stream infrastructure, so non-blocking streams.
We interop with Warp on the cuda side: we create streams in pytorch, pass the objects to warp, and the cuda objects are not changed. So when we created a pytorch stream, we pass warp a non-blocking stream when warp expected a blocking stream. Warp's behavior is 100% explained by it's design with blocking streams in mind.

The issues we see are things like the BVH structures, which create temporary objects on the GPU, would have components go out of scope in python and be garbage collected before the GPU kernel completes because it's async execution. In pytorch, this is handled by event syncs: you can keep the GPU objects alive until a certain event has occurred, but warp will automatically handle this with the blocking streams.

The solution from Lukasz on the warp team is so simple we're all kicking ourselves for not thinking of it earlier

nonblocking_stream = wp.stream_from_torch(torch.cuda.Stream(device))  # <--- non-blocking stream
warp_stream = wp.Stream(device)  # <--- any Warp stream on the correct device

# do stuff on non-blocking stream
with wp.ScopedStream(nonblocking_stream):
    a = wp.zeros(100_000_000, dtype=float)
    for _ in range(1000):
        wp.launch(inc_kernel, dim=a.shape, inputs=[a])

warp_stream.wait_stream(nonblocking_stream)  # <--- magic

This PR codifies this as a new context scope that wraps Warp's stream scope. Using this, the SDF now works again in the GeoTransolver Datapipe!

I moved ALL our functionals over to this scope in this PR, which is why many files have been touch, since it should be exclusively safer without a performance impact. (Blocking stream != host blocking. It's a GPU stream block, not a full GPU sync).

PhysicsNeMo Pull Request

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
The CHANGELOG.md is up to date with these changes.
An issue is linked to this pull request.
If I am implementing a new model or modifying any existing model, I have followed the Models Implementation Coding Standards.

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

This prevents premature garbage collection and illegal memory access.

copy-pr-bot · 2026-07-01T19:21:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-07-01T19:27:45Z

Greptile Summary

This PR introduces FunctionSpec.warp_stream_scope, a context manager that replaces bare wp.ScopedStream calls across all Warp-backed functionals. The fix addresses premature garbage collection of temporary Warp objects (BVH structures, hash grids) caused by Warp's blocking-stream assumptions conflicting with PyTorch's non-blocking streams: the scope runs work on the borrowed torch stream, then has a fresh blocking Warp guard stream wait on it, forcing Warp's cleanup to sequence after compute.

Core fix in function_spec.py: warp_stream_scope creates a guard = wp.Stream(device) and calls guard.wait_stream(wp_launch_stream) in a finally block, matching the pattern recommended by the Warp team and correctly handling the CPU / None-stream path as a no-op.
28 call sites updated: All uses of wp.ScopedStream in the repo's functionals are replaced with the new scope; point_cloud_ops.py also fixes a pre-existing bug where wp.stream_from_torch(tensor.device) was passing a device object instead of a CUDA stream.
_wp_launch placement: The scope is inserted at the innermost helper level in uniform_grid_gradient/utils.py rather than at the operation level, creating one guard stream per kernel call rather than one per operation — inconsistent with the pattern used elsewhere and adding unnecessary CUDA overhead.

Important Files Changed

Filename	Overview
physicsnemo/core/function_spec.py	Central change: adds `warp_stream_scope` context manager that wraps work in `wp.ScopedStream` and installs a blocking guard stream to sequence Warp cleanup after non-blocking torch stream work. Also adds top-level `import warp as wp` (acceptable since warp-lang is a hard dependency). The implementation correctly follows the Warp team's recommended solution.
physicsnemo/nn/functional/derivatives/uniform_grid_gradient/_warp_impl/utils.py	Moves `warp_stream_scope` inside the low-level `_wp_launch` helper, creating a new guard `wp.Stream` for every individual kernel call. Other files place the scope at the operation level (around all kernel launches for one operation), which is both more efficient and more idiomatic.
physicsnemo/domain_parallel/shard_utils/point_cloud_ops.py	Switches from `wp.stream_from_torch(current_indices.device)` (passing a device instead of a stream — a likely pre-existing bug) to `FunctionSpec.warp_launch_context(current_indices)`, and wraps all launches in `warp_stream_scope`. Correct fix.
physicsnemo/nn/functional/neighbors/radius_search/_warp_impl.py	Two `wp.ScopedStream` usages replaced with `warp_stream_scope`, one for forward (includes hash grid construction) and one for the gradient backward pass. Both correctly scope the entire block of Warp work under one guard stream.
physicsnemo/nn/functional/geometry/ray_mesh_intersect/_warp_impl/op.py	BVH mesh build (`_build_warp_mesh`) and kernel launch are each wrapped in their own `warp_stream_scope`. The early `return` inside the scope is valid — Python still runs the `finally` block, so `guard.wait_stream` fires correctly.
physicsnemo/experimental/nn/symmetry/fused_norm_kernels.py	Six `wp.ScopedStream` usages (forward/backward for rmsnorm, layernormsh, layernorm) replaced with `warp_stream_scope`. Scope placement is at the operation level, which is the correct pattern.
physicsnemo/nn/functional/geometry/sdf.py	Single `wp.ScopedStream` replaced with `warp_stream_scope`; this is the function mentioned in the PR description as the original regression trigger (SDF in GeoTransolver Datapipe).
physicsnemo/nn/functional/derivatives/rectilinear_grid_gradient/_warp_impl/launch_backward.py	Four `wp.ScopedStream` usages in forward/backward launch helpers replaced with `warp_stream_scope`. These are one-shot kernel launches without temporary Warp objects, so the guard stream overhead is low.

Comments Outside Diff (1)

physicsnemo/nn/functional/derivatives/uniform_grid_gradient/_warp_impl/utils.py, line 103-129 (link)

The warp_stream_scope is placed inside _wp_launch, so a new guard wp.Stream is allocated and a CUDA event wait is enqueued on every individual kernel call. Every other functional in this PR places the scope once at the operation level (wrapping all kernel launches for a single forward/backward pass). The uniform-grid kernels don't create temporary Warp objects like BVH structures or hash grids, so the per-call guard-stream overhead provides no additional safety. The recommended fix is to remove the scope from _wp_launch and hoist it into op.py (which already holds FunctionSpec.warp_launch_context) around the _launch_forward / _launch_backward calls.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

_{Reviews (1): Last reviewed commit: "Merge branch 'main' into make-warp-strea..." | Re-trigger Greptile}

peterdsharpe

Good find, and nice design! We'll have to be vigilant that we remember to use this on any new Warp-backend functionals. Maybe worth adding a sentence to FUNCTIONAL_APIS.md to this effect, beyond the example-change you did? This is also fine as-is though.

loliverhennigh

LGTM

ktangsali

Looks good to me! Thanks for the explainer session and walkthrough

coreyjadams · 2026-07-02T18:45:23Z

/ok to test 98d23f1

Move warp scoped stream to a new model that uses warp blocking streams.

b9cb604

This prevents premature garbage collection and illegal memory access.

coreyjadams requested review from CharlelieLrt, RishikeshRanade, ktangsali, loliverhennigh and peterdsharpe as code owners July 1, 2026 19:21

Merge branch 'main' into make-warp-streams-safe

7e44fc6

peterdsharpe approved these changes Jul 1, 2026

View reviewed changes

Merge branch 'main' into make-warp-streams-safe

fe27425

loliverhennigh reviewed Jul 1, 2026

View reviewed changes

Comment thread physicsnemo/core/function_spec.py

loliverhennigh approved these changes Jul 1, 2026

View reviewed changes

Merge branch 'main' into make-warp-streams-safe

3e9fbb8

ktangsali approved these changes Jul 2, 2026

View reviewed changes

Merge branch 'main' into make-warp-streams-safe

98d23f1

coreyjadams enabled auto-merge July 2, 2026 18:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Move warp scoped stream to a new model that uses warp blocking streams.#1771

Move warp scoped stream to a new model that uses warp blocking streams.#1771
coreyjadams wants to merge 5 commits into
NVIDIA:mainfrom
coreyjadams:make-warp-streams-safe

coreyjadams commented Jul 1, 2026

Uh oh!

copy-pr-bot Bot commented Jul 1, 2026

Uh oh!

greptile-apps Bot commented Jul 1, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

peterdsharpe left a comment

Uh oh!

Uh oh!

loliverhennigh left a comment

Uh oh!

ktangsali left a comment

Uh oh!

coreyjadams commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

coreyjadams commented Jul 1, 2026

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

Uh oh!

copy-pr-bot Bot commented Jul 1, 2026

Uh oh!

greptile-apps Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Important Files Changed

Comments Outside Diff (1)

Uh oh!

peterdsharpe left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

loliverhennigh left a comment

Choose a reason for hiding this comment

Uh oh!

ktangsali left a comment

Choose a reason for hiding this comment

Uh oh!

coreyjadams commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

greptile-apps Bot commented Jul 1, 2026 •

edited

Loading