Skip to content

fix(wsclient): retry repo-state probe on transient host backpressure#56

Merged
sanity merged 2 commits into
mainfrom
fix/mirror-probe-transient-retry
Jun 16, 2026
Merged

fix(wsclient): retry repo-state probe on transient host backpressure#56
sanity merged 2 commits into
mainfrom
fix/mirror-probe-transient-retry

Conversation

@sanity

@sanity sanity commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Problem

The Mirror to Freenet workflow (and freenet-git rescue) failed with exit 128 on a single transient gateway error during the repo-state probe. From the 2026-06-16 run that paged the dev Matrix room:

error: every probe failed with transport/other error (1 probe(s)); first failure:
current key 7Zazi...: recv: client error: error while executing operation in the
network: contract queue full, try again later

contract queue full, try again later is explicit backpressure the gateway invites us to retry, yet get_state_with_legacy_fallback ran the probe exactly once and bailed. put_pack already retried this same class of error on the write side, so the read-side probe was the lone gap — making the daily freenet-core mirror one transient blip away from a false-alarm Matrix page on every push. Multiple failures of this and the related put timed out mode were observed across 2026-06-15/16.

Solution

Retry the whole probe sequence up to 4 times with exponential backoff (2s/4s/8s, shared with put_pack via a new retry_backoff helper) when — and only when — every probe outcome is transient (a timeout, or a host error matching is_transient_host_error: queue full / try again later / timeouts).

A single authoritative NotFound/Empty (or a dead-connection transport error) short-circuits the loop, so genuine "no state on the network" still fails fast with the precise per-probe message and never masks data loss behind a delay.

The probe body is extracted into probe_all_keys, which returns the structured per-probe outcome list on failure, so the retry decision is made on real outcomes rather than by re-parsing the formatted error string. Both probe callers (the git-remote-freenet list/push path and the rescue path) get the resilience for free; the public signature is unchanged.

Scope note

The separate put timed out mirror failures go through put_pack, which already retries. Tuning that budget is tracked in #53 and deliberately left out here: bumping per-chunk PUT attempts under sustained backpressure risks the mirror job's 30-minute runtime budget, so it deserves its own change. This PR is the single, low-risk fix for the reported exit-128 incident.

Testing

8 new unit tests pin:

  • the transient/permanent error classifier (verbatim incident string + put timed out; rejects NotFound / connection-reset / decode errors),
  • the outcomes_all_transient retry gate (single backpressure probe retries; all-timeout retries; any authoritative outcome or dead-connection error short-circuits; empty input is not retryable),
  • the shared exponential backoff schedule.

cargo fmt --all --check, cargo clippy --workspace --all-targets -D warnings, and cargo test --workspace all pass locally.

Related

[AI-assisted - Claude]

sanity added 2 commits June 16, 2026 10:36
## Problem

The `Mirror to Freenet` workflow (and `freenet-git rescue`) failed with
exit 128 on a single transient gateway error during the repo-state
probe:

    error: every probe failed with transport/other error (1 probe(s));
    first failure: current key 7Zazi...: recv: client error: error while
    executing operation in the network: contract queue full, try again later

`contract queue full, try again later` is explicit backpressure the
gateway invites us to retry, yet `get_state_with_legacy_fallback` ran
the probe exactly once and bailed. `put_pack` already retried this same
class of error on the write side, so the read-side probe was the lone
gap — making the daily freenet-core mirror one transient blip away from
a false-alarm Matrix page on every push (multiple failures observed
2026-06-15/16).

## Solution

Retry the whole probe sequence up to 4 times with exponential backoff
(2s/4s/8s, shared with `put_pack` via `retry_backoff`) when — and only
when — *every* probe outcome is transient (a timeout, or a host error
matching `is_transient_host_error`: "queue full" / "try again later" /
timeouts). A single authoritative `NotFound`/`Empty` short-circuits the
loop so genuine "no state on the network" still fails fast with the
precise per-probe message and never masks data loss behind a delay.

The probe body is extracted into `probe_all_keys` (returns the per-probe
outcome list on failure) so the retry decision is made on structured
outcomes, not by re-parsing the formatted error string.

Scope note: the separate `put timed out` mirror failures go through
`put_pack`, which already retries; tuning that budget is tracked in #53
and deliberately left out to keep this a single, low-risk change that
does not risk the mirror job's 30-minute runtime budget.

## Testing

8 new unit tests pin the transient/permanent classifier, the
`outcomes_all_transient` retry gate (single backpressure probe and
all-timeout retry; authoritative-outcome and dead-connection
short-circuit; empty-input guard), and the shared backoff schedule.
`cargo fmt`, `cargo clippy -D warnings`, and `cargo test --workspace`
all pass.

Related: #36 (per-chunk fetch retry), #53 (put_pack rescue timeouts).

[AI-assisted - Claude]
…rrent NotFound

Review findings from PR #56 (Codex + adversarial Claude reviewers):

- Codex P2 (correctness): the "all outcomes transient" gate failed to
  retry a transient legacy-key probe when the current key was
  authoritatively NotFound — the normal migration/fallback case. The
  current key's NotFound says nothing about whether a legacy key holds
  the state, so a transient blip on that legacy probe was wrongly
  treated as fatal. Replaced `outcomes_all_transient` with
  `outcomes_worth_retrying` (3-way `RetryDisposition`): retry when ANY
  outcome is transient and NONE is a hard transport error; NotFound/
  Empty are per-key authoritative and no longer block a sibling retry.
  A dead-connection HardError still aborts fast. New regression test
  `outcomes_worth_retrying_retries_transient_legacy_after_current_not_found`.

- Doc accuracy (MAJOR): the PROBE_MAX_ATTEMPTS comment claimed "at most
  14s". Timeout outcomes are also retryable and each burns a full per-op
  timeout, so the real worst case is ~`attempts × (1+legacy) × timeout`.
  Comment now states the true budget (~12 min for freenet-core, L=0) and
  flags the legacy-key / chunked-put_pack scaling.

- warn! message no longer hardcodes "backpressure" (also fires on plain
  timeouts): now "transient: host backpressure or timeout".

- retry_backoff uses saturating_pow + a MAX_BACKOFF_SECS=60 cap so a
  future attempt-ceiling bump can't overflow/panic; values 2/4/8 for the
  current callers are unchanged. Pinned by an extended backoff test.

- Documented that is_transient_host_error's substring match is
  intentionally broad, and corrected put_pack's stale "retries on
  transient host errors" doc (it retries on any error, safe because
  re-PUTs are idempotent).

[AI-assisted - Claude]
@sanity

sanity commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Multi-model review (consolidated)

Reviewed by an external non-Claude model (Codex, gpt-5.5) plus three adversarial Claude reviewers (skeptical bug-hunt, code-first, retry/backoff failure-mode lens). All findings below are addressed in 1993e15; Codex re-review of the updated diff: "no discrete, actionable regressions… preserves existing behavior for non-transient failures while adding bounded retries for transient probe failures."

Findings & resolutions

# Severity Finding Resolution
1 P2 (correctness, Codex) The original "every outcome transient" gate failed to retry a transient legacy-key probe when the current key was authoritatively NotFound — the normal migration/fallback case. The current key's NotFound doesn't prove a legacy key is absent. Replaced outcomes_all_transient with outcomes_worth_retrying backed by a 3-way RetryDisposition: retry when any outcome is transient and none is a hard transport error; NotFound/Empty are per-key and no longer block a sibling retry; a dead-connection HardError still aborts fast. New regression test outcomes_worth_retrying_retries_transient_legacy_after_current_not_found.
2 MAJOR (doc accuracy) PROBE_MAX_ATTEMPTS comment claimed "at most 14s". Timeout outcomes are also retryable and each burns a full per-op timeout, so the real worst case is ~attempts × (1+legacy) × timeout. Comment now states the true budget (4 × 180 + 14 ≈ 12 min for freenet-core, 0 legacy keys) and flags the legacy-key / chunked-put_pack scaling to revisit before raising the constant.
3 MINOR warn! hardcoded "transient host backpressure" but also fires on plain timeouts. Reworded to "transient: host backpressure or timeout".
4 MINOR is_transient_host_error substring match (esp. "timeout") is broad and could rot as gateway wording drifts. Kept broad (a false-transient costs one backoff cycle; missing a real transient variant is worse) and documented the trade-off explicitly.
5 MINOR retry_backoff 2u64.pow(attempt) is an overflow/absurd-sleep footgun if an attempt ceiling is ever raised. saturating_pow + MAX_BACKOFF_SECS = 60 cap; current callers' 2/4/8s values unchanged. Pinned by an extended backoff test.
6 NIT Pre-existing put_pack doc said "retries on transient host errors" but it retries on any error. Corrected the doc (retry is unconditional and safe because re-PUTs are content-addressed/idempotent).

Test-coverage note

The async retry loop itself (get_state_with_legacy_fallback) isn't unit-tested — WebApi has no trait seam and the file's convention is to test pure helpers/dispatch, not the WS round-trip. The substantive retry decision (outcomes_worth_retrying, is_transient_host_error, retry_backoff) is fully unit-tested (9 tests incl. the Codex regression and hard-error abort); loop termination/short-circuit are trivial and verified by inspection. I did not add a brittle source-scrape pin since this crate has no such precedent.

CI green on 1993e15 (fmt, clippy, test, fixture-tamper-check). Full cargo test --workspace green locally.

[AI-assisted - Claude]

@sanity sanity merged commit f641138 into main Jun 16, 2026
5 checks passed
@sanity sanity deleted the fix/mirror-probe-transient-retry branch June 16, 2026 16:16
sanity added a commit that referenced this pull request Jun 16, 2026
Ships the repo-state probe transient-backpressure retry fixed in #56.
The Mirror to Freenet workflow installs freenet-git from crates.io, so
the fix only takes effect once published. Only the freenet-git crate
changed; the sibling crates are unchanged and stay at their published
versions (their `^0.1.21` requirement resolves against them).

[AI-assisted - Claude]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant