Daemon startup aborts entirely when a persisted subcluster references a missing vat bundle #964

Description

@grypez

Summary

On startup the daemon replays persisted subclusters and restarts their vats, fetching each vat's code from the bundleSpec recorded in the kernel store. If a single referenced bundle no longer exists on disk (moved, deleted, or otherwise unresolvable), the fetch throws ENOENT, which propagates all the way to the daemon's top-level catch and aborts the entire daemon startup.

Because the daemon is spawned detached with stdio: 'ignore', the CLI only ever sees ensureDaemon's generic "Daemon did not start within 30s" timeout, and the actual fatal is written to a discarded stderr as the useless string Daemon fatal: [object Object].

Net effect: one orphaned bundle reference makes the whole kernel unbootable, with no actionable error surfaced.

Severity / impact

  • Fragility. Any persisted vat whose bundle is moved or deleted takes down the daemon. Persisted kernel state that outlives its bundle files is common (bundles rebuilt to a new path, artifacts pruned, absolute paths that don't survive relocation), so this is a realistic failure mode, not a corner case.
  • Poor observability. The real cause is masked twice over — first stringified to [object Object], then hidden behind a 30s timeout — so an affected user has no way to diagnose it.
  • Wedged process. A failed restore leaves a process holding the kernel.sqlite lock and a stale daemon.sock, because the daemon writes its pidfile only after a successful init. The orphan-interlock therefore can't detect a daemon that hung/failed during init, and subsequent start attempts collide on the sqlite lock.

Failure chain (file:line)

  1. Kernel boot restores persisted subclusters/vats. packages/ocap-kernel/src/Kernel.ts:285 (restore persisted subclusters, added in fix(ocap-kernel): restore IO channels for persisted subclusters #963), the vat-restart path packages/ocap-kernel/src/vats/VatManager.ts:208 (restartVat), and packages/ocap-kernel/src/vats/SubclusterManager.ts:372 (restorePersistedIOChannels).
  2. The bundle is fetched from disk and throws ENOENT. packages/kernel-node-runtime/src/vat/fetch-blob.ts:16 → return new Response(await fs.readFile(url.fileURLToPath(parsedURL))). The thrown value is a Node errno object with no useful .stack.
  3. The error aborts startup and is masked. packages/kernel-cli/src/commands/daemon-entry.ts:17-20 → main().catch((error) => process.stderr.write(`Daemon fatal: ${String(error)}\n`)). String(errnoObject) yields [object Object], and stderr is discarded because the spawn uses stdio: 'ignore' (packages/kernel-cli/src/commands/daemon-spawn.ts).
  4. The CLI reports only the timeout. packages/kernel-cli/src/commands/daemon-spawn.ts (MAX_POLLS = 300, POLL_INTERVAL_MS = 100 → 30s) → "Daemon did not start within 30s".
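
The 30s figure in step 4 is simply 300 polls at 100 ms each. A minimal sketch of such a polling loop is given here to show why the caller can only learn that the budget ran out, never why. The names waitForDaemon and ping are hypothetical and are not taken from daemon-spawn.ts; only the two constants come from the issue.

```typescript
// Sketch only: `waitForDaemon` and `ping` are hypothetical names, not the
// real daemon-spawn.ts implementation. The constants mirror those cited above.
const MAX_POLLS = 300;
const POLL_INTERVAL_MS = 100;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function waitForDaemon(
  ping: () => Promise<boolean>,
  maxPolls: number = MAX_POLLS,
  intervalMs: number = POLL_INTERVAL_MS,
): Promise<void> {
  for (let attempt = 0; attempt < maxPolls; attempt += 1) {
    if (await ping()) {
      return;
    }
    await sleep(intervalMs);
  }
  // The loop has no channel back from the detached child, so the only thing
  // it can report is the elapsed budget (300 * 100 ms = 30 s by default).
  const seconds = (maxPolls * intervalMs) / 1000;
  throw new Error(`Daemon did not start within ${seconds}s`);
}
```

With a child whose stderr is discarded, a loop of this shape cannot distinguish a slow start from an immediate fatal, which is why the fixes below route the real error to daemon.log.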

Observed fatal (path genericized):

ENOENT: no such file or directory, open '/…/some-vat.bundle'

Reproduction (deterministic)

Use a throwaway --home so nothing real is touched.

  1. Start a daemon against a scratch home and launch a subcluster whose vat bundleSpec is a file:// path to a bundle you control.
  2. Stop the daemon. Confirm the subcluster is persisted in <home>/kernel.sqlite.
  3. Delete (or rename) the referenced .bundle file.
  4. Start the daemon again against the same <home>.

Observed: the daemon never becomes responsive; pingDaemon stays false; the CLI prints "Daemon did not start within 30s"; <home>/daemon.log (or the discarded stderr) shows Daemon fatal: [object Object]. A node process is left holding the sqlite lock and a stale daemon.sock.

Minimal isolation harness that reproduces just the restore (run on a copy of the store; place inside the repo so workspace resolution and the lockdown shim work):

import '@metamask/kernel-shims/endoify-node';
import { makeKernel } from '@metamask/kernel-node-runtime';
import { Logger } from '@metamask/logger';
// `dir` is a placeholder for a scratch directory; copy kernel.sqlite into it first.
const { kernel, kernelDatabase } = await makeKernel({
  resetStorage: false,
  dbFilename: `${dir}/kernel.sqlite`,
  logger: new Logger({ tags: ['rt'] }),
});
// Throws ENOENT here when a referenced bundle is missing.
await kernel.initIdentity();

Intended tests (TDD — write first)

Extend the existing #963 suite and add coverage at each masking layer:

  1. Restore tolerates a missing bundle (core). packages/ocap-kernel/src/vats/SubclusterManager.test.ts (and/or VatManager.test.ts): given a persisted subcluster whose vat bundleSpec fetch rejects with ENOENT, kernel boot completes; the offending subcluster/vat is quarantined (skipped, not launched) with a structured warning; other subclusters still restore.
  2. fetch-blob surfaces a typed, path-bearing error. packages/kernel-node-runtime/src/vat/fetch-blob.test.ts: a missing file:// URL rejects with an Error whose message includes the resolved path (not a bare errno object).
  3. Daemon fatal handler is legible. packages/kernel-cli/src/commands/daemon-entry.test.ts (add if absent): the top-level catch renders error.stack ?? error.message ?? inspect(error) (never [object Object]) and writes the fatal to daemon.log (the file transport), since stderr is discarded under stdio: 'ignore'.
  4. Interlock detects a daemon that died during init. Regression: the pidfile is written (or a lock taken) early enough that a crashed/hung-during-init daemon is detected on the next ensureDaemon instead of silently colliding on the sqlite lock.
  5. Integration. Boot a daemon against a persisted store containing exactly one missing bundle; assert pingDaemon succeeds and the daemon is usable.
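
For test 2, one possible shape of a typed, path-bearing error is sketched below. BundleNotFoundError and readBundleBytes are hypothetical names introduced for illustration; the real fetch-blob.ts may wrap the failure differently, and the cited line also wraps the bytes in a Response, which this sketch omits.

```typescript
import { readFile } from 'node:fs/promises';
import { fileURLToPath } from 'node:url';

// Hypothetical error type; the real fetch-blob.ts may choose a different shape.
class BundleNotFoundError extends Error {
  readonly bundlePath: string;

  readonly originalError: unknown;

  constructor(bundlePath: string, originalError: unknown) {
    super(`Vat bundle not found: ${bundlePath}`);
    this.name = 'BundleNotFoundError';
    this.bundlePath = bundlePath;
    this.originalError = originalError;
  }
}

// Hypothetical helper: reads the bytes behind a file:// URL and converts a
// bare ENOENT into an Error whose message carries the resolved path.
async function readBundleBytes(fileUrl: string): Promise<Uint8Array> {
  const bundlePath = fileURLToPath(new URL(fileUrl));
  try {
    return await readFile(bundlePath);
  } catch (error) {
    if ((error as { code?: string }).code === 'ENOENT') {
      throw new BundleNotFoundError(bundlePath, error);
    }
    // Anything other than a missing file is rethrown unchanged.
    throw error;
  }
}
```

A test along the lines of item 2 would then assert on the error name and on the path appearing in the message, rather than on the raw errno object.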

Intended fix

Two independent defects; fix both. The core fix is (A).

A. Make restore fault-tolerant (correctness). On kernel boot, a per-subcluster/per-vat restore failure must not abort the whole kernel. Catch failures around the vat-restart / restorePersistedIOChannels loop (SubclusterManager.ts:372, Kernel.ts:285) and quarantine the unrestorable subcluster: skip launching it, log a structured warning naming the subcluster id and the failing bundleSpec, and leave the rest of the kernel bootable. Decide quarantine semantics:

  • keep the persisted record but mark it unrestorable (repairable if the bundle returns), or
  • provide an explicit prune/GC path (e.g. ocap subcluster prune) rather than silently deleting persisted state.
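
The quarantine behaviour in (A) could look roughly like the sketch below. All types and names (PersistedSubcluster, restoreSubclusters, the restartVat and warn callbacks) are hypothetical stand-ins, not the real SubclusterManager or VatManager APIs, and the sketch takes the first option above: it leaves the persisted record alone and only reports what it skipped. Cleaning up vats of a subcluster that started before a sibling vat failed is out of scope here.

```typescript
// Sketch only: every type and function name here is hypothetical and does not
// reflect the real SubclusterManager or VatManager APIs.
type PersistedVat = { name: string; bundleSpec: string };
type PersistedSubcluster = { id: string; vats: PersistedVat[] };
type QuarantineRecord = {
  subclusterId: string;
  bundleSpec: string;
  reason: string;
};
type RestoreReport = { restored: string[]; quarantined: QuarantineRecord[] };

async function restoreSubclusters(
  subclusters: PersistedSubcluster[],
  restartVat: (vat: PersistedVat) => Promise<void>,
  warn: (message: string, record: QuarantineRecord) => void,
): Promise<RestoreReport> {
  const report: RestoreReport = { restored: [], quarantined: [] };
  for (const subcluster of subclusters) {
    let failure: QuarantineRecord | undefined;
    for (const vat of subcluster.vats) {
      try {
        await restartVat(vat);
      } catch (error) {
        // Record which bundleSpec failed, then stop restoring this subcluster.
        failure = {
          subclusterId: subcluster.id,
          bundleSpec: vat.bundleSpec,
          reason: error instanceof Error ? error.message : String(error),
        };
        break;
      }
    }
    if (failure === undefined) {
      report.restored.push(subcluster.id);
    } else {
      // One bad subcluster is skipped; the loop continues with the rest.
      report.quarantined.push(failure);
      warn('subcluster quarantined: vat restore failed', failure);
    }
  }
  return report;
}
```

Returning a report instead of throwing keeps the boot path linear: the caller decides whether a non-empty quarantined list is merely logged or also surfaced through a CLI command.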

B. Make the fatal observable (observability).

  • daemon-entry.ts:17-20: render the real error (error.stack, else error.message, else util.inspect(error)), and log it via the file transport to daemon.log before exiting — stderr is discarded for a detached daemon.
  • Ensure a fatal during init releases the sqlite handle and does not leave a wedged process. Consider writing the pidfile / taking the lock early so the ensureDaemon orphan-interlock can report it.
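
The rendering rule in (B) can be sketched as a small helper. renderFatal is a hypothetical name, not the current daemon-entry.ts code; util.inspect is the standard Node API. Writing the result to daemon.log is left out because the file transport's interface is not described in this issue.

```typescript
import { inspect } from 'node:util';

// Hypothetical helper, not the current daemon-entry.ts code.
function renderFatal(error: unknown): string {
  if (typeof error === 'string') {
    return error;
  }
  if (typeof error === 'object' && error !== null) {
    const { stack, message } = error as { stack?: unknown; message?: unknown };
    if (typeof stack === 'string' && stack.length > 0) {
      return stack;
    }
    if (typeof message === 'string' && message.length > 0) {
      return message;
    }
  }
  // Plain objects, errno-like records, and other primitives: inspect() shows
  // their fields instead of collapsing to "[object Object]".
  return inspect(error);
}

// Contrast with the masked form described above, using an illustrative
// errno-like object (the path is the genericized one from this issue).
const errnoLike = { code: 'ENOENT', syscall: 'open', path: '/…/some-vat.bundle' };
const masked = `Daemon fatal: ${String(errnoLike)}`; // "Daemon fatal: [object Object]"
const legible = `Daemon fatal: ${renderFatal(errnoLike)}`;
```

The explicit typeof checks, rather than a bare error.stack ?? error.message chain, guard against thrown values that are not Error instances, which is the case this issue describes.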

Acceptance criteria

  • A persisted store with a missing vat bundle boots the daemon successfully; the affected subcluster is quarantined and logged; pingDaemon succeeds.
  • No code path renders [object Object]; the real error (with path) appears in daemon.log.
  • A daemon that fails/hangs during init does not leave the sqlite lock / socket wedged undetectably; ensureDaemon gives an actionable message.
  • Tests 1–5 pass; the existing SubclusterManager tests from #963 (fix(ocap-kernel): restore IO channels for persisted subclusters) stay green.
  • yarn build, yarn lint:fix, and the touched packages' test:dev:quiet pass. Changelog entries added for any consumer-observable change (kernel-node-runtime, ocap-kernel, kernel-cli).
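
For the interlock criterion, one way to make an init-time failure detectable is to write the pidfile before init and have the next start check whether that pid is still alive. The sketch below assumes a pidfile containing a single decimal pid; writePidfileEarly, checkDaemonStatus, and the DaemonStatus values are hypothetical, and the real ensureDaemon interlock may work differently. process.kill(pid, 0) is the standard Node existence check.

```typescript
import { readFileSync, writeFileSync } from 'node:fs';

type DaemonStatus = 'no-pidfile' | 'running' | 'stale';

// Signal 0 delivers nothing; it only tests whether the process exists.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (error) {
    // EPERM means the process exists but belongs to another user.
    return (error as { code?: string }).code === 'EPERM';
  }
}

// Hypothetical: called at the very start of the daemon, before kernel init.
function writePidfileEarly(pidfilePath: string): void {
  writeFileSync(pidfilePath, `${process.pid}\n`);
}

// Hypothetical: called by the CLI before it spawns a new daemon.
function checkDaemonStatus(pidfilePath: string): DaemonStatus {
  let contents: string;
  try {
    contents = readFileSync(pidfilePath, 'utf8');
  } catch {
    return 'no-pidfile';
  }
  const pid = Number.parseInt(contents.trim(), 10);
  if (!Number.isInteger(pid) || pid <= 0) {
    return 'stale';
  }
  return isProcessAlive(pid) ? 'running' : 'stale';
}
```

A 'running' result combined with a failed ping would let the CLI report that a daemon process exists but is unresponsive, instead of spawning a second process that collides on the sqlite lock.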

Key references

  • packages/kernel-cli/src/commands/daemon-entry.ts:17-20 — masked fatal
  • packages/kernel-cli/src/commands/daemon-spawn.ts — 30s startup budget, stdio: 'ignore'
  • packages/kernel-node-runtime/src/vat/fetch-blob.ts:16 — ENOENT origin
  • packages/ocap-kernel/src/Kernel.ts:285 — restore persisted subclusters (fix(ocap-kernel): restore IO channels for persisted subclusters #963)
  • packages/ocap-kernel/src/vats/SubclusterManager.ts:372 (restorePersistedIOChannels)
  • packages/ocap-kernel/src/vats/VatManager.ts:208 (restartVat)
