fix(shields): make relock recovery hint driver-neutral (#6126)#6127
Conversation
The three 'manual intervention required' branches in shields printed `Re-lock manually via kubectl exec, then run: nemoclaw <name> shields up` regardless of the sandbox's OpenShell driver. The docker and vm drivers run per-sandbox direct containers with no k8s control plane, so there is no kubectl to exec with — and docker is the default driver on v0.0.71. Users hitting a failed shields-up rollback were pointed at a command that cannot exist on their install. Replace the hardcoded kubectl instruction with a single manualRelockRecoveryHint() helper that emits recoveries valid on every driver: re-run `shields up`, or rebuild a known-good baseline with `nemoclaw <name> rebuild --yes` (matching the existing driver-neutral hint already used elsewhere in this file). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jason Ma <jama@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughCentralized shields recovery messaging now avoids kubectl-specific instructions, adds readiness-based rollback classification, updates related docs, and expands test coverage for staged recovery and severity handling. ChangesDriver-neutral recovery hint and staged severity
Estimated code review effort: 3 (Moderate) | ~25 minutes Suggested labels: Suggested reviewers: 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Code Coverage OverviewLanguages: TypeScript TypeScript / code-coverage/pluginThe overall coverage in the branch is 96%. Coverage data for the branch is not yet available. Show a code coverage summary of the most covered files.
TypeScript / code-coverage/cliThe overall coverage in the branch is 68%. Coverage data for the branch is not yet available. Show a code coverage summary of the most covered files.
Updated |
Move the issue reference to a trailing '(#6126)' suffix and lead with behavior instead of a metadata label, satisfying the test-title-style repository check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jason Ma <jama@nvidia.com>
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Target RecommendationRequired E2E targets: Dispatch required E2E targets:
Full E2E target advisor summaryE2E Target AdvisorBase: Required E2E targets
Optional E2E targets
Relevant changed files
|
PR Review Advisor (Nemotron Ultra) — Changes requestedMerge posture: Do not merge yet Action checklist
Findings index
🚨 Required before mergeAddress these before merging unless a maintainer explicitly overrides the advisor with rationale.
|
PR Review Advisor — No blocking findingsMerge posture: No blocking advisor findings Action checklist
Test follow-ups to resolve or justifyIf these cover changed behavior, prefer adding them in this PR; otherwise state why existing coverage is enough or link the follow-up.
This is an automated, non-binding review; it still expects maintainers and agents to respond to each required or warning item. Treat suggestions as current-PR improvements when they touch changed code; defer only with maintainer rationale or a linked follow-up. A human maintainer must make the final merge decision. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/shields/index.ts (1)
2857-2862: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winFourth "manual intervention" site still lacks recovery guidance.
The drift-remediation failure path prints
"Config remains drifted — manual intervention required."with no follow-up recovery instructions, unlike the three sites now usingmanualRelockRecoveryHint. For consistency with the PR's goal of driver-neutral, actionable recovery messaging, consider also emitting the hint here.♻️ Proposed fix
const message = relock.error ?? "Config re-lock did not re-confirm after settle window"; console.error(` ERROR: ${message}`); console.error(" Config remains drifted — manual intervention required."); + console.error(manualRelockRecoveryHint(sandboxName)); return failShieldsCommand(message, opts.throwOnError);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/shields/index.ts` around lines 2857 - 2862, The drift-remediation failure path in the relock check still prints a manual-intervention message without any recovery guidance. Update the failure branch around the relock confirmation logic to also emit the same manual recovery hint used at the other manual intervention sites, likely via manualRelockRecoveryHint alongside the existing console.error calls before failShieldsCommand is returned.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/shields/index.test.ts`:
- Around line 1091-1109: The current tests only validate
manualRelockRecoveryHint in isolation and do not prove the real print sites use
it. Add or keep a test that drives the actual failure paths in
rollbackShieldsDown and the two shieldsUpWithoutHostLock branches by mocking
run/lockAgentConfig to fail, then assert the emitted console.error output is
driver-neutral and contains the sandbox-specific recovery hint. This ensures the
call sites no longer rely on a hardcoded kubectl message and that the migrated
path is exercised end-to-end.
---
Nitpick comments:
In `@src/lib/shields/index.ts`:
- Around line 2857-2862: The drift-remediation failure path in the relock check
still prints a manual-intervention message without any recovery guidance. Update
the failure branch around the relock confirmation logic to also emit the same
manual recovery hint used at the other manual intervention sites, likely via
manualRelockRecoveryHint alongside the existing console.error calls before
failShieldsCommand is returned.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4bdcc6c3-bb12-42f7-b0a9-7f44368a7a5d
📒 Files selected for processing (2)
src/lib/shields/index.test.tssrc/lib/shields/index.ts
prekshivyas
left a comment
There was a problem hiding this comment.
Reviewed the diff. Pure recovery-hint text change — manualRelockRecoveryHint is driver-neutral and routed through all three "manual intervention required" sites; no lock/seal/exec logic touched. Tests assert no kubectl and both valid recoveries. CodeRabbit + both PR advisors + E2E + CI all green. LGTM.
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
🌿 Preview your docs: https://nvidia-preview-pr-6127.docs.buildwithfern.com/nemoclaw |
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/shields/flow.test.ts (1)
654-678: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winBroad
runmock stubs everyrunner.runcall, not just policy restore.
run: () => ({ status: 1 })makes all invocations ofrunner.runfail during this test, not just the snapshot-restore command. The assertion on the thrown message ("policy restore exited with status 1") confirms the intended failure point is reached, but the mock doesn't distinguish which command produced the failure — if an earlierruncall inshieldsUpwere reordered or another step is added, this test would still pass for the wrong reason.Consider scoping the mock to the specific restore command (matching on
cmdcontents) so the test only fails the target call and better documents/asserts the claim it's testing.As per path instructions, tests should flag "broad mocks that bypass the behavior under test... conditionals that make a test pass without exercising its claim."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/shields/flow.test.ts` around lines 654 - 678, The test uses an overly broad runner.run stub that fails every command, which can hide whether shieldsUp() is actually exercising the snapshot-restore path. Update the harness setup in flow.test.ts so the mock only returns status 1 for the specific policy restore command invoked during shieldsUp("openclaw", ...), and let other runner.run calls succeed. Use the existing shieldsUp, harness, and runner.run command matching to scope the failure to the restore step being asserted.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/shields/flow.test.ts`:
- Around line 654-678: The test uses an overly broad runner.run stub that fails
every command, which can hide whether shieldsUp() is actually exercising the
snapshot-restore path. Update the harness setup in flow.test.ts so the mock only
returns status 1 for the specific policy restore command invoked during
shieldsUp("openclaw", ...), and let other runner.run calls succeed. Use the
existing shieldsUp, harness, and runner.run command matching to scope the
failure to the restore step being asserted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c7d856d6-e3be-475d-a100-ee360eae34fd
📒 Files selected for processing (4)
docs/reference/commands-nemohermes.mdxdocs/reference/commands.mdxsrc/lib/shields/flow.test.tssrc/lib/shields/index.ts
✅ Files skipped from review due to trivial changes (2)
- docs/reference/commands.mdx
- docs/reference/commands-nemohermes.mdx
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
|
@cv heads-up — CI went red after
Leaving these to you since it's your active refactor — flagging so it's not a surprise. The PR was otherwise green + approved before this commit. Ping me if you want a hand once the refactor settles. |
|
@coderabbitai review |
✅ Action performedReview finished.
|
Vitest E2E Target Results —
|
| Job | Result |
|---|---|
| agent-turn-latency | |
| bedrock-runtime-compatible-anthropic | |
| brave-search | |
| channels-add-remove | |
| channels-stop-start | |
| cloud-inference | |
| cloud-onboard | |
| common-egress-agent | |
| concurrent-gateway-ports | |
| credential-migration | |
| credential-sanitization | |
| cron-preflight-inference-local | |
| device-auth-health | |
| diagnostics | |
| docs-validation | |
| double-onboard | |
| full-e2e | |
| gateway-drift-preflight | |
| gateway-guard-recovery | |
| gateway-health-honest | |
| gpu-double-onboard | |
| gpu-e2e | |
| hermes-dashboard | |
| hermes-discord | |
| hermes-e2e | |
| hermes-inference-switch | |
| hermes-root-entrypoint-smoke | |
| hermes-sandbox-secret-boundary | |
| hermes-slack | |
| inference-routing | |
| issue-2478-crash-loop-recovery | |
| issue-4434-tui-unreachable-inference | |
| issue-4462-scope-upgrade-approval | |
| jetson-nvmap-gpu | |
| kimi-inference-compat | |
| launchable-smoke | |
| live | |
| messaging-compatible-endpoint | |
| messaging-providers | |
| model-router-provider-routed-inference | |
| network-policy | |
| ollama-auth-proxy | |
| onboard-negative-paths | |
| onboard-repair | |
| onboard-resume | |
| openclaw-discord-pairing | |
| openclaw-inference-switch | |
| openclaw-skill-cli | |
| openclaw-slack-pairing | |
| openclaw-tui-chat-correlation | |
| openshell-gateway-auth-contract | |
| openshell-gateway-upgrade | |
| openshell-version-pin | |
| overlayfs-autofix | |
| rebuild-hermes | |
| rebuild-hermes-stale-base | |
| rebuild-openclaw | |
| runtime-overrides | |
| sandbox-operations | |
| sandbox-rebuild | |
| sandbox-rlimits-connect | |
| sandbox-survival | |
| security-posture | |
| sessions-agents-cli | |
| shields-config | |
| skill-agent | |
| snapshot-commands | |
| spark-install | |
| state-backup-restore | |
| telegram-injection | |
| token-rotation | |
| tunnel-lifecycle | |
| upgrade-stale-sandbox |
|
Maintainer disposition for exact head
CodeRabbit reports no actionable comments, and the exact-head GPT advisor reports |
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
|
@coderabbitai review\n\nExact head |
|
✅ Action performedReview finished.
|
|
Follow-up maintainer disposition for exact head The earlier disposition at #6127 (comment) still applies. The only code change since then is a test-only 15-second timeout for the integration-style flow test that exceeded Vitest's 5-second default in two hosted coverage runs. The exact-head GPT advisor remains
PRA-3/PRA-4 and the remaining repeated items continue to be explicitly overridden for the safety, scope, coverage, and cohesion reasons in the earlier disposition. No further code change is warranted from this advisor output. |
Vitest E2E Target Results — ✅ All requested jobs passedRun: 28544153726
|
|
CodeRabbit finding at |
Summary
Replace the invalid
kubectlrelock hint with staged, driver-neutral recovery for every shields branch that can leave OpenClaw config unlocked or drifted. The guidance now asks operators to confirm the sandbox is running and ready, retryshields up, and rebuild only if that retry fails, while preserving CRITICAL severity for structural, unsafe, or mixed rollback failures.Related Issue
Fixes #6126
Changes
nemoclawornemohermesCLI name instead of hardcoding a Kubernetes-only path.ifstatements.rebuild --yesis destructive, so a failed readiness probe must never trigger it automatically.Type of Change
Quality Gates
Verification
Verifiedin GitHubnpx prek run --from-ref main --to-ref HEADpassesnpm testpasses (broad runtime changes only)npm run docsbuilds without warnings (doc changes only)Evidence
umask 022, including repository checks and full CLI/integration coverage; normal push hooks passed CLI type-checking and version synchronization.git diff --checkpassed.npm run docscompleted with 0 errors and 2 pre-existing warnings; generated agent variants are current.rebuild --yes, so recovery remains an explicit readiness check, bounded retry, then operator-approved rebuild. This staged workaround can be removed when the runtime exposes a trusted primitive that proves or restores readiness before a safe retry without recreating the sandbox.shields-configlive job supplies runtime regression coverage but intentionally does not inject destructive failure states.flow.test.tsremains one cohesive lifecycle/rollback harness at 982 lines, below the repository's 1500-line test budget. Splitting these seven recovery cases would duplicate the same timer, policy, registry, privileged-exec, branding, and audit setup rather than isolate a distinct subsystem.Signed-off-by: Jason Ma jama@nvidia.com
Signed-off-by: Carlos Villela cvillela@nvidia.com
Summary by CodeRabbit
rebuild --yes.shields uptroubleshooting guidance across command references to match the new confirm-and-retry flow and therebuild --yesfallback.