Skip to content

fix(onboard): confirm dashboard forward by live local port when openshell can't track it#6116

Open
tedy-y wants to merge 1 commit into
NVIDIA:mainfrom
tedy-y:fix-6099
Open

fix(onboard): confirm dashboard forward by live local port when openshell can't track it#6116
tedy-y wants to merge 1 commit into
NVIDIA:mainfrom
tedy-y:fix-6099

Conversation

@tedy-y

@tedy-y tedy-y commented Jul 1, 2026

Copy link
Copy Markdown

Summary

Fixes the macOS/Colima onboarding failure where nemoclaw onboard rolls back an otherwise-healthy sandbox at the dashboard port-forward step.

openshell forward start --background establishes the SSH ControlMaster tunnel, then does a one-shot discovery probe to register it. On Colima-backed hosts the remote bind completes after that probe, so openshell prints Could not discover backgrounded SSH process; forward may be running but is not tracked and the forward runs untracked — it never appears in openshell forward list even though the local port is already accepting connections and the dashboard is serving HTTP 200. onboard polls that empty list, times out after 180s, and deletes the sandbox.

Fix

When openshell reports the untracked-forward notice and the local forward port is actually live, treat the forward as confirmed (reason: "ok-port-live") instead of timing out.

  • Portable, dependency-free liveness check: a synchronous Node loopback TCP connect to 127.0.0.1:<port> (no lsof/ss/netstat). Works on macOS/Linux/WSL2/Windows.
  • Precisely gated on openshell's own untracked-forward notice (looksLikeUntrackedForward), so a plain empty list never falsely confirms.
  • Conflict-safe: the existing EADDRINUSE check runs first, so a port held by a different process is never mistaken for our forward.
  • Injectable (isPortListening option) so unit tests don't open real sockets.
  • Also covers the agent-fixed (:8642) forward, which uses the same helper.

Call sites only read .ok/.diagnostic, so the added "ok-port-live" reason is non-breaking.

Tests

src/lib/onboard/forward-start.test.ts:

  • untracked notice + live port → ok / ok-port-live
  • untracked notice + dead port → keeps waiting → timeout
  • no untracked notice → port probe is not consulted (guards the gate)
  • looksLikeUntrackedForward matcher unit tests

Reproduced and verified on NemoClaw v0.0.68 → v0.0.69 → v0.0.70 (the bug persists across all three); the equivalent runtime patch completed real Hermes onboards on v0.0.69 and v0.0.70.

Fixes #6099

Signed-off-by: Tedy Yu tedyy@nvidia.com

@copy-pr-bot

copy-pr-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds untracked-forward detection and a local port-liveness fallback for detached forward start. When the diagnostic pattern matches and the port is listening, the helper now returns ok-port-live instead of timing out. Tests cover the matcher and fallback paths.

Changes

Untracked forward detection and port-live fallback

Layer / File(s) Summary
Diagnostic matcher and port probe implementation
src/lib/onboard/forward-start.ts
Adds "ok-port-live" to the outcome reason union, an optional isPortListening hook, looksLikeUntrackedForward(diagnostic), and probeLocalPortListening(port) for loopback TCP liveness checks.
Matcher and fallback tests
src/lib/onboard/forward-start.test.ts
Adds coverage for the untracked-forward success path, timeout behavior, probe gating, and the matcher's true/false cases.

Estimated code review effort: 2 (Simple) | ~15 minutes

Suggested labels

area: cli, area: sandbox, bug-fix, platform: macos

Suggested reviewers

cv, ericksoa

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main fix: confirming a forward via a live local port when openshell does not track it.
Linked Issues check ✅ Passed The change matches #6099 by treating an untracked dashboard forward as established when the local port is listening.
Out of Scope Changes check ✅ Passed The PR stays focused on the forward-confirmation fix and related tests, with no obvious unrelated code changes.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
src/lib/onboard/forward-start.ts (1)

96-113: 🔒 Security & Privacy | 🔵 Trivial | 💤 Low value

Consider passing the port as an argv value instead of interpolating it into the script text.

Static analysis flags the require("node:child_process") call for command-injection risk (CWE-78). The actual risk here is negligible since port is typed number and further coerced via Number(port) before interpolation, so this is effectively a false positive. Still, for defense-in-depth and to avoid manually building a JS string, pass the port as a separate process argument rather than interpolating it into the script text.

🔒 Proposed defense-in-depth tweak
   const script =
     'const net=require("net");' +
-    `const s=net.connect(${Number(port)},"127.0.0.1");` +
+    'const s=net.connect(Number(process.argv[2]),"127.0.0.1");' +
     "let settled=false;" +
     "const finish=(code)=>{if(settled)return;settled=true;try{s.destroy();}catch{}process.exit(code);};" +
     `s.setTimeout(${probeTimeoutMs});` +
     's.once("connect",()=>finish(0));' +
     's.once("timeout",()=>finish(1));' +
     's.once("error",()=>finish(1));';
   const { spawnSync } = require("node:child_process");
-  const res = spawnSync(process.execPath, ["-e", script], {
+  const res = spawnSync(process.execPath, ["-e", script, String(Number(port) || 0)], {
     stdio: "ignore",
     timeout: probeTimeoutMs + 2_000,
   });

Separately, please confirm whether the rest of this file uses ES import statements — if so, consider hoisting child_process's spawnSync to a top-level import for consistency rather than a function-local require.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/forward-start.ts` around lines 96 - 113, The
probeLocalPortListening helper currently embeds the port directly into the -e
script string, which static analysis flags as injection-prone. Update
probeLocalPortListening to accept the port as a separate argv/process argument
and read it inside the spawned script instead of interpolating it into the
script text, keeping the net.connect call behavior unchanged. Also, if this file
already uses ES module imports elsewhere, replace the function-local
require("node:child_process") with a top-level spawnSync import for consistency.

Source: Linters/SAST tools

src/lib/onboard/forward-start.test.ts (1)

302-375: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Consider a test for conflict-vs-untracked-forward precedence.

The PR explicitly calls out that the EADDRINUSE conflict check runs before the untracked-forward fallback "to avoid false positives," but no test here exercises a diagnostic that matches both looksLikeForwardPortConflict and looksLikeUntrackedForward simultaneously to prove spawn-conflict wins over ok-port-live. This is the exact safety invariant the fix depends on.

🧪 Suggested additional test
+  it("prefers the port-conflict outcome over the untracked-forward fallback", () => {
+    const fetchList = vi.fn().mockReturnValue(forwardListWith([]));
+    const spawn = vi.fn().mockImplementation(({ stderr }: { stderr: number }) => {
+      fs.writeSync(
+        stderr,
+        "EADDRINUSE: address already in use; forward may be running but is not tracked\n",
+      );
+      return { pid: 999 };
+    });
+    const sleep = vi.fn();
+    const isPortListening = vi.fn().mockReturnValue(true);
+
+    const result = runDetachedForwardStartWithDiagnostics(
+      spawn,
+      fetchList,
+      { port: 18789, sandboxName: "my-sandbox" },
+      { overallTimeoutMs: 10_000, pollIntervalMs: 10, sleepMs: sleep, isPortListening },
+    );
+
+    expect(result.ok).toBe(false);
+    expect(result.reason).toBe("spawn-conflict");
+    expect(isPortListening).not.toHaveBeenCalled();
+  });

Note: verify the actual conflict-diagnostic text expected by looksLikeForwardPortConflict before using it verbatim.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/forward-start.test.ts` around lines 302 - 375, Add a
regression test around runDetachedForwardStartWithDiagnostics to cover
diagnostics that match both looksLikeForwardPortConflict and
looksLikeUntrackedForward at the same time. Mock spawn so stderr contains a
message that would satisfy both checks, then assert the result is spawn-conflict
and that isPortListening is not consulted, proving the EADDRINUSE conflict path
takes precedence over the live-port fallback.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/onboard/forward-start.test.ts`:
- Around line 302-375: Add a regression test around
runDetachedForwardStartWithDiagnostics to cover diagnostics that match both
looksLikeForwardPortConflict and looksLikeUntrackedForward at the same time.
Mock spawn so stderr contains a message that would satisfy both checks, then
assert the result is spawn-conflict and that isPortListening is not consulted,
proving the EADDRINUSE conflict path takes precedence over the live-port
fallback.

In `@src/lib/onboard/forward-start.ts`:
- Around line 96-113: The probeLocalPortListening helper currently embeds the
port directly into the -e script string, which static analysis flags as
injection-prone. Update probeLocalPortListening to accept the port as a separate
argv/process argument and read it inside the spawned script instead of
interpolating it into the script text, keeping the net.connect call behavior
unchanged. Also, if this file already uses ES module imports elsewhere, replace
the function-local require("node:child_process") with a top-level spawnSync
import for consistency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c5ce884b-0bd4-49c8-ab75-95556de2a8ab

📥 Commits

Reviewing files that changed from the base of the PR and between acacf93 and 3b94407.

📒 Files selected for processing (2)
  • src/lib/onboard/forward-start.test.ts
  • src/lib/onboard/forward-start.ts

@tedy-y tedy-y self-assigned this Jul 1, 2026
…hell can't track it

On macOS hosts backed by Colima, `openshell forward start --background`
establishes the SSH tunnel but its one-shot discovery probe runs before the
remote bind completes, so the forward is never registered in
`openshell forward list`. onboard polls that empty list, times out after 180s,
and rolls back an otherwise-healthy sandbox even though the dashboard port is
already listening and serving HTTP 200.

When openshell reports "Could not discover backgrounded SSH process; forward
may be running but is not tracked", confirm the forward by the real observable —
the local forward port is actually accepting connections — via a portable,
dependency-free Node loopback TCP probe. The EADDRINUSE conflict check runs
first, so a port held by a different process is never mistaken for our forward,
and the fallback is gated on openshell's own untracked-forward notice so a plain
empty list never falsely confirms.

Also benefits the agent-fixed (:8642) forward, which uses the same helper.

Adds `looksLikeUntrackedForward`, an injectable `isPortListening` option
(default: synchronous Node TCP probe), and unit tests.

Fixes NVIDIA#6099

Signed-off-by: Tedy Yu <tedyy@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/onboard/forward-start.ts (1)

315-325: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Repeated subprocess spawns while untracked-forward diagnostic persists.

Once looksLikeUntrackedForward(diagSoFar) is true but the port isn't yet listening, this branch re-invokes probeLocalPortListening (spawning a fresh Node child) on every poll iteration (every pollIntervalMs, default 500ms) until either the port comes up or the 180s timeout elapses — potentially hundreds of subprocess spawns per onboard attempt. Each spawn also has its own spawnSync timeout window (up to 3.5s) that can stack on top of pollIntervalMs, silently stretching the effective poll cadence well past what pollIntervalMs implies.

Consider backing off the probe frequency (e.g., only probe every few poll cycles, or with its own interval) once the diagnostic has already matched, to reduce spawn overhead during the wait window.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/forward-start.ts` around lines 315 - 325, The polling path in
forward-start’s untracked-forward fallback is spawning a new port-probe process
on every loop once looksLikeUntrackedForward(diagSoFar) matches, which can
create excessive overhead while waiting for the port to become live. Update the
forward confirmation flow in forward-start to back off
probeLocalPortListening/probe frequency after the diagnostic first matches, such
as by throttling checks to every few poll cycles or using a separate
interval/cache, while keeping the existing isPortListening confirmation and
timeout behavior intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/onboard/forward-start.ts`:
- Around line 315-325: The polling path in forward-start’s untracked-forward
fallback is spawning a new port-probe process on every loop once
looksLikeUntrackedForward(diagSoFar) matches, which can create excessive
overhead while waiting for the port to become live. Update the forward
confirmation flow in forward-start to back off probeLocalPortListening/probe
frequency after the diagnostic first matches, such as by throttling checks to
every few poll cycles or using a separate interval/cache, while keeping the
existing isPortListening confirmation and timeout behavior intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aec21292-de9e-453b-a260-1b4741316876

📥 Commits

Reviewing files that changed from the base of the PR and between 515d62e and cb114ac.

📒 Files selected for processing (2)
  • src/lib/onboard/forward-start.test.ts
  • src/lib/onboard/forward-start.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/forward-start.test.ts

@tedy-y tedy-y requested a review from cv July 1, 2026 08:28
@tedy-y

tedy-y commented Jul 1, 2026

Copy link
Copy Markdown
Author

@cv Hi, is this fix looking good?

@wscurran wscurran added area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression platform: macos Affects macOS, including Apple Silicon labels Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression platform: macos Affects macOS, including Apple Silicon

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[macOS][Onboard] nemoclaw onboard deletes the created sandbox after dashboard port-forward "did not appear in list within 180000ms"

2 participants