Summary
On startup the daemon replays persisted subclusters and restarts their vats, fetching each vat's code from the bundleSpec recorded in the kernel store. If a single referenced bundle no longer exists on disk (moved, deleted, or otherwise unresolvable), the fetch throws ENOENT, which propagates all the way to the daemon's top-level catch and aborts the entire daemon startup.
Because the daemon is spawned detached with stdio: 'ignore', the CLI only ever sees ensureDaemon's generic "Daemon did not start within 30s" timeout, and the actual fatal is written to a discarded stderr as the useless string Daemon fatal: [object Object].
Net effect: one orphaned bundle reference makes the whole kernel unbootable, with no actionable error surfaced.
Severity / impact
Fragility. Any persisted vat whose bundle is moved or deleted takes down the daemon. Persisted kernel state that outlives its bundle files is common (bundles rebuilt to a new path, artifacts pruned, absolute paths that don't survive relocation), so this is a realistic failure mode, not a corner case.
Poor observability. The real cause is masked twice over — first stringified to [object Object], then hidden behind a 30s timeout — so an affected user has no way to diagnose it.
Wedged process. A failed restore leaves a process holding the kernel.sqlite lock and a stale daemon.sock, because the daemon writes its pidfile only after a successful init. The orphan-interlock therefore can't detect a daemon that hung/failed during init, and subsequent start attempts collide on the sqlite lock.
Failure chain (file:line)
Kernel boot restores persisted subclusters/vats. packages/ocap-kernel/src/Kernel.ts:285 (restore persisted subclusters, added in fix(ocap-kernel): restore IO channels for persisted subclusters #963), the vat-restart path packages/ocap-kernel/src/vats/VatManager.ts:208 (restartVat), and packages/ocap-kernel/src/vats/SubclusterManager.ts:372 (restorePersistedIOChannels).
The bundle is fetched from disk and throws ENOENT. packages/kernel-node-runtime/src/vat/fetch-blob.ts:16 runs return new Response(await fs.readFile(url.fileURLToPath(parsedURL))). The thrown value is a Node errno object with no useful .stack.
The error aborts startup and is masked. packages/kernel-cli/src/commands/daemon-entry.ts:17-20 has main().catch((error) => process.stderr.write(`Daemon fatal: ${String(error)}\n`)). String(errnoObject) yields [object Object], and stderr is discarded because the spawn uses stdio: 'ignore' (packages/kernel-cli/src/commands/daemon-spawn.ts).
The CLI reports only the timeout. packages/kernel-cli/src/commands/daemon-spawn.ts polls with MAX_POLLS = 300 and POLL_INTERVAL_MS = 100 (300 × 100 ms = 30s), then reports "Daemon did not start within 30s".
Observed fatal (path genericized):
ENOENT: no such file or directory, open '/…/some-vat.bundle'
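The two masking layers in the chain above can be seen in isolation. This is a minimal sketch, assuming the rejected value reaches the top-level catch as a plain errno-shaped object rather than an Error instance, as the chain describes. The object literal, its path, and the variable names are illustrative; only the two constants come from daemon-spawn.ts as cited.

```typescript
// Illustrative stand-in for the rejected value: a plain errno-shaped object.
// (Assumption: field names mirror Node's errno fields; the path is made up.)
const errnoLike = {
  errno: -2,
  code: 'ENOENT',
  syscall: 'open',
  path: '/tmp/some-vat.bundle',
};

// Layer 1: String() on a plain object falls back to Object.prototype.toString,
// so every useful field is lost.
const rendered = `Daemon fatal: ${String(errnoLike)}`;
console.log(rendered); // Daemon fatal: [object Object]

// Layer 2: the CLI's startup budget, computed from the cited constants.
const MAX_POLLS = 300;
const POLL_INTERVAL_MS = 100;
const budgetSeconds = (MAX_POLLS * POLL_INTERVAL_MS) / 1000;
console.log(`Daemon did not start within ${budgetSeconds}s`); // ...within 30s
```

Neither message mentions the bundle path, which is the one detail a user needs to recover.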
Reproduction (deterministic)
Use a throwaway --home so nothing real is touched.
Start a daemon against a scratch home and launch a subcluster whose vat bundleSpec is a file:// path to a bundle you control.
Stop the daemon. Confirm the subcluster is persisted in <home>/kernel.sqlite.
Delete (or rename) the referenced .bundle file.
Start the daemon again against the same <home>.
Observed: the daemon never becomes responsive; pingDaemon stays false; the CLI prints "Daemon did not start within 30s"; <home>/daemon.log (or the discarded stderr) shows Daemon fatal: [object Object]. A node process is left holding the sqlite lock and a stale daemon.sock.
Minimal isolation harness that reproduces just the restore (run on a copy of the store; place inside the repo so workspace resolution and the lockdown shim work):
```typescript
import '@metamask/kernel-shims/endoify-node';
import { makeKernel } from '@metamask/kernel-node-runtime';
import { Logger } from '@metamask/logger';

// copy kernel.sqlite into <dir>, then:
const { kernel, kernelDatabase } = await makeKernel({
  resetStorage: false,
  dbFilename: `${dir}/kernel.sqlite`,
  logger: new Logger({ tags: ['rt'] }),
});
await kernel.initIdentity(); // throws ENOENT here when a referenced bundle is missing
```
Intended tests (TDD — write first)
Extend the existing #963 suite and add coverage at each masking layer:
Restore tolerates a missing bundle (core). packages/ocap-kernel/src/vats/SubclusterManager.test.ts (and/or VatManager.test.ts): given a persisted subcluster whose vat bundleSpec fetch rejects with ENOENT, kernel boot completes; the offending subcluster/vat is quarantined (skipped, not launched) with a structured warning; other subclusters still restore.
fetch-blob surfaces a typed, path-bearing error. packages/kernel-node-runtime/src/vat/fetch-blob.test.ts: a missing file:// URL rejects with an Error whose message includes the resolved path (not a bare errno object).
Daemon fatal handler is legible. packages/kernel-cli/src/commands/daemon-entry.test.ts (add if absent): the top-level catch renders error.stack ?? error.message ?? inspect(error), never [object Object], and writes the fatal to daemon.log (the file transport), since stderr is discarded under stdio: 'ignore'.
Interlock detects a daemon that died during init. Regression: the pidfile is written (or a lock taken) early enough that a crashed/hung-during-init daemon is detected on the next ensureDaemon instead of silently colliding on the sqlite lock.
Integration. Boot a daemon against a persisted store containing exactly one missing bundle; assert pingDaemon succeeds and the daemon is usable.
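For the fetch-blob test in the list above, one possible shape for a "typed, path-bearing error" is sketched below. BundleFetchError and readCode are hypothetical names introduced here for illustration; nothing in the current fetch-blob module is assumed to look like this.

```typescript
// Hypothetical helper: pull a string `code` (e.g. 'ENOENT') off an unknown throwable.
function readCode(value: unknown): string | undefined {
  if (typeof value === 'object' && value !== null && 'code' in value) {
    const { code } = value as { code: unknown };
    return typeof code === 'string' ? code : undefined;
  }
  return undefined;
}

// Hypothetical error type: a real Error whose message names the resolved path.
class BundleFetchError extends Error {
  readonly bundlePath: string;
  readonly code: string | undefined;
  readonly underlying: unknown;

  constructor(bundlePath: string, underlying: unknown) {
    const code = readCode(underlying);
    const suffix = code === undefined ? '' : ` (${code})`;
    super(`Failed to read vat bundle at ${bundlePath}${suffix}`);
    this.name = 'BundleFetchError';
    this.bundlePath = bundlePath;
    this.code = code;
    this.underlying = underlying; // keep the original value for debugging
  }
}

// Example: wrapping a bare errno-shaped rejection (path is made up).
const wrapped = new BundleFetchError('/tmp/some-vat.bundle', { code: 'ENOENT' });
console.log(wrapped.message);
// Failed to read vat bundle at /tmp/some-vat.bundle (ENOENT)
```

In fetch-blob this would presumably sit in a try/catch around the fs.readFile call, so the test can assert on the message and on bundlePath instead of on a bare errno object.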
Intended fix
Two independent defects; fix both. The core fix is (A).
A. Make restore fault-tolerant (correctness). On kernel boot, a per-subcluster/per-vat restore failure must not abort the whole kernel. Catch failures around the vat-restart / restorePersistedIOChannels loop (SubclusterManager.ts:372, Kernel.ts:285) and quarantine the unrestorable subcluster: skip launching it, log a structured warning naming the subcluster id and the failing bundleSpec, and leave the rest of the kernel bootable. Decide quarantine semantics:
keep the persisted record but mark it unrestorable (repairable if the bundle returns), or
provide an explicit prune/GC path (e.g. ocap subcluster prune) rather than silently deleting persisted state.
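The restore loop in (A) could be sketched roughly as follows, under the first quarantine option (keep the persisted record, skip the launch). Every type and function name here is hypothetical and chosen for illustration; none of it mirrors the real SubclusterManager or Kernel API.

```typescript
// Hypothetical shapes, for illustration only.
type PersistedSubcluster = { id: string; bundleSpecs: string[] };
type Quarantined = { subclusterId: string; bundleSpecs: string[]; reason: string };
type RestoreReport = { restored: string[]; quarantined: Quarantined[] };

// Render any throwable as a non-empty, human-readable reason.
function describeFailure(error: unknown): string {
  if (error instanceof Error) {
    return error.message;
  }
  return JSON.stringify(error) ?? String(error);
}

// Restore each subcluster independently: one failure is recorded and skipped
// instead of aborting the whole boot.
async function restoreAllSubclusters(
  subclusters: PersistedSubcluster[],
  restoreOne: (subcluster: PersistedSubcluster) => Promise<void>,
  warn: (message: string, details: Quarantined) => void,
): Promise<RestoreReport> {
  const report: RestoreReport = { restored: [], quarantined: [] };
  for (const subcluster of subclusters) {
    try {
      await restoreOne(subcluster);
      report.restored.push(subcluster.id);
    } catch (error) {
      // The persisted record is left untouched; only the launch is skipped.
      const entry: Quarantined = {
        subclusterId: subcluster.id,
        bundleSpecs: subcluster.bundleSpecs,
        reason: describeFailure(error),
      };
      report.quarantined.push(entry);
      warn('subcluster quarantined: restore failed', entry);
    }
  }
  return report;
}

// Usage: the second subcluster's bundle is "missing", the first still restores.
const demo = restoreAllSubclusters(
  [
    { id: 's1', bundleSpecs: ['file:///tmp/ok.bundle'] },
    { id: 's2', bundleSpecs: ['file:///tmp/missing.bundle'] },
  ],
  async (subcluster) => {
    if (subcluster.id === 's2') {
      throw new Error("ENOENT: no such file or directory, open '/tmp/missing.bundle'");
    }
  },
  (message, details) => console.warn(message, details),
);
```

Returning a report rather than only logging leaves room for either quarantine semantic: the caller can mark the listed records unrestorable, or a later prune command can act on the same list.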
B. Make the fatal observable (observability).
daemon-entry.ts:17-20: render the real error (error.stack, else error.message, else util.inspect(error)), and log it via the file transport to daemon.log before exiting — stderr is discarded for a detached daemon.
Ensure a fatal during init releases the sqlite handle and does not leave a wedged process. Consider writing the pidfile / taking the lock early so the ensureDaemon orphan-interlock can report it.
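The rendering rule in (B), stack first, then message, then util.inspect, might look like the sketch below. formatFatal is a hypothetical helper name; writing the result to daemon.log through the file transport is left out because that logger's API is not shown here.

```typescript
import { inspect } from 'node:util';

// Hypothetical helper: render any thrown value without ever producing
// "[object Object]".
function formatFatal(error: unknown): string {
  if (typeof error === 'object' && error !== null) {
    const { stack, message } = error as { stack?: unknown; message?: unknown };
    if (typeof stack === 'string' && stack.length > 0) {
      return stack;
    }
    if (typeof message === 'string' && message.length > 0) {
      return message;
    }
  }
  // Plain objects, strings, undefined, etc.: inspect() shows the structure,
  // including fields such as `code` and `path` on an errno-shaped object.
  return inspect(error, { depth: 4 });
}

// Example with an errno-shaped plain object (path is made up).
const fatalLine = `Daemon fatal: ${formatFatal({
  code: 'ENOENT',
  syscall: 'open',
  path: '/tmp/some-vat.bundle',
})}`;
console.log(fatalLine);
```

Checking for string-valued stack and message properties, rather than instanceof Error, is a deliberate choice in this sketch: a value that crossed a realm or serialization boundary may carry those fields without being an Error instance.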
Acceptance criteria
A persisted store with a missing vat bundle boots the daemon successfully; the affected subcluster is quarantined and logged; pingDaemon succeeds.
No code path renders [object Object]; the real error (with path) appears in daemon.log.
A daemon that fails/hangs during init does not leave the sqlite lock / socket wedged undetectably; ensureDaemon gives an actionable message.
SubclusterManager tests stay green.
yarn build, yarn lint:fix, and the touched packages' test:dev:quiet pass. Changelog entries added for any consumer-observable change (kernel-node-runtime, ocap-kernel, kernel-cli).
Key references
packages/kernel-cli/src/commands/daemon-entry.ts:17-20 (masked fatal)
packages/kernel-cli/src/commands/daemon-spawn.ts (30s startup budget, stdio: 'ignore')
packages/kernel-node-runtime/src/vat/fetch-blob.ts:16 (ENOENT origin)
packages/ocap-kernel/src/Kernel.ts:285 (restore persisted subclusters; fix(ocap-kernel): restore IO channels for persisted subclusters #963)
packages/ocap-kernel/src/vats/SubclusterManager.ts:372 (restorePersistedIOChannels)
packages/ocap-kernel/src/vats/VatManager.ts:208 (restartVat)