fix(wsclient): retry repo-state probe on transient host backpressure by sanity · Pull Request #56 · freenet/freenet-git

sanity · 2026-06-16T15:37:17Z

Problem

The Mirror to Freenet workflow (and freenet-git rescue) failed with exit 128 on a single transient gateway error during the repo-state probe. From the 2026-06-16 run that paged the dev Matrix room:

error: every probe failed with transport/other error (1 probe(s)); first failure:
current key 7Zazi...: recv: client error: error while executing operation in the
network: contract queue full, try again later

contract queue full, try again later is explicit backpressure the gateway invites us to retry, yet get_state_with_legacy_fallback ran the probe exactly once and bailed. put_pack already retried this same class of error on the write side, so the read-side probe was the lone gap — making the daily freenet-core mirror one transient blip away from a false-alarm Matrix page on every push. Multiple failures of this and the related put timed out mode were observed across 2026-06-15/16.

Solution

Retry the whole probe sequence up to 4 times with exponential backoff (2s/4s/8s, shared with put_pack via a new retry_backoff helper) when — and only when — every probe outcome is transient (a timeout, or a host error matching is_transient_host_error: queue full / try again later / timeouts).

A single authoritative NotFound/Empty (or a dead-connection transport error) short-circuits the loop, so genuine "no state on the network" still fails fast with the precise per-probe message and never masks data loss behind a delay.

The probe body is extracted into probe_all_keys, which returns the structured per-probe outcome list on failure, so the retry decision is made on real outcomes rather than by re-parsing the formatted error string. Both probe callers (the git-remote-freenet list/push path and the rescue path) get the resilience for free; the public signature is unchanged.

Scope note

The separate put timed out mirror failures go through put_pack, which already retries. Tuning that budget is tracked in #53 and deliberately left out here: bumping per-chunk PUT attempts under sustained backpressure risks the mirror job's 30-minute runtime budget, so it deserves its own change. This PR is the single, low-risk fix for the reported exit-128 incident.

Testing

8 new unit tests pin:

the transient/permanent error classifier (verbatim incident string + put timed out; rejects NotFound / connection-reset / decode errors),
the outcomes_all_transient retry gate (single backpressure probe retries; all-timeout retries; any authoritative outcome or dead-connection error short-circuits; empty input is not retryable),
the shared exponential backoff schedule.

cargo fmt --all --check, cargo clippy --workspace --all-targets -D warnings, and cargo test --workspace all pass locally.

## Problem The `Mirror to Freenet` workflow (and `freenet-git rescue`) failed with exit 128 on a single transient gateway error during the repo-state probe: error: every probe failed with transport/other error (1 probe(s)); first failure: current key 7Zazi...: recv: client error: error while executing operation in the network: contract queue full, try again later `contract queue full, try again later` is explicit backpressure the gateway invites us to retry, yet `get_state_with_legacy_fallback` ran the probe exactly once and bailed. `put_pack` already retried this same class of error on the write side, so the read-side probe was the lone gap — making the daily freenet-core mirror one transient blip away from a false-alarm Matrix page on every push (multiple failures observed 2026-06-15/16). ## Solution Retry the whole probe sequence up to 4 times with exponential backoff (2s/4s/8s, shared with `put_pack` via `retry_backoff`) when — and only when — *every* probe outcome is transient (a timeout, or a host error matching `is_transient_host_error`: "queue full" / "try again later" / timeouts). A single authoritative `NotFound`/`Empty` short-circuits the loop so genuine "no state on the network" still fails fast with the precise per-probe message and never masks data loss behind a delay. The probe body is extracted into `probe_all_keys` (returns the per-probe outcome list on failure) so the retry decision is made on structured outcomes, not by re-parsing the formatted error string. Scope note: the separate `put timed out` mirror failures go through `put_pack`, which already retries; tuning that budget is tracked in #53 and deliberately left out to keep this a single, low-risk change that does not risk the mirror job's 30-minute runtime budget. ## Testing 8 new unit tests pin the transient/permanent classifier, the `outcomes_all_transient` retry gate (single backpressure probe and all-timeout retry; authoritative-outcome and dead-connection short-circuit; empty-input guard), and the shared backoff schedule. `cargo fmt`, `cargo clippy -D warnings`, and `cargo test --workspace` all pass. Related: #36 (per-chunk fetch retry), #53 (put_pack rescue timeouts). [AI-assisted - Claude]

…rrent NotFound Review findings from PR #56 (Codex + adversarial Claude reviewers): - Codex P2 (correctness): the "all outcomes transient" gate failed to retry a transient legacy-key probe when the current key was authoritatively NotFound — the normal migration/fallback case. The current key's NotFound says nothing about whether a legacy key holds the state, so a transient blip on that legacy probe was wrongly treated as fatal. Replaced `outcomes_all_transient` with `outcomes_worth_retrying` (3-way `RetryDisposition`): retry when ANY outcome is transient and NONE is a hard transport error; NotFound/ Empty are per-key authoritative and no longer block a sibling retry. A dead-connection HardError still aborts fast. New regression test `outcomes_worth_retrying_retries_transient_legacy_after_current_not_found`. - Doc accuracy (MAJOR): the PROBE_MAX_ATTEMPTS comment claimed "at most 14s". Timeout outcomes are also retryable and each burns a full per-op timeout, so the real worst case is ~`attempts × (1+legacy) × timeout`. Comment now states the true budget (~12 min for freenet-core, L=0) and flags the legacy-key / chunked-put_pack scaling. - warn! message no longer hardcodes "backpressure" (also fires on plain timeouts): now "transient: host backpressure or timeout". - retry_backoff uses saturating_pow + a MAX_BACKOFF_SECS=60 cap so a future attempt-ceiling bump can't overflow/panic; values 2/4/8 for the current callers are unchanged. Pinned by an extended backoff test. - Documented that is_transient_host_error's substring match is intentionally broad, and corrected put_pack's stale "retries on transient host errors" doc (it retries on any error, safe because re-PUTs are idempotent). [AI-assisted - Claude]

sanity · 2026-06-16T16:15:12Z

Multi-model review (consolidated)

Reviewed by an external non-Claude model (Codex, gpt-5.5) plus three adversarial Claude reviewers (skeptical bug-hunt, code-first, retry/backoff failure-mode lens). All findings below are addressed in 1993e15; Codex re-review of the updated diff: "no discrete, actionable regressions… preserves existing behavior for non-transient failures while adding bounded retries for transient probe failures."

Findings & resolutions

#	Severity	Finding	Resolution
1	P2 (correctness, Codex)	The original "every outcome transient" gate failed to retry a transient legacy-key probe when the current key was authoritatively `NotFound` — the normal migration/fallback case. The current key's NotFound doesn't prove a legacy key is absent.	Replaced `outcomes_all_transient` with `outcomes_worth_retrying` backed by a 3-way `RetryDisposition`: retry when any outcome is transient and none is a hard transport error; `NotFound`/`Empty` are per-key and no longer block a sibling retry; a dead-connection `HardError` still aborts fast. New regression test `outcomes_worth_retrying_retries_transient_legacy_after_current_not_found`.
2	MAJOR (doc accuracy)	`PROBE_MAX_ATTEMPTS` comment claimed "at most 14s". `Timeout` outcomes are also retryable and each burns a full per-op timeout, so the real worst case is `~attempts × (1+legacy) × timeout`.	Comment now states the true budget (`4 × 180 + 14 ≈ 12 min` for freenet-core, 0 legacy keys) and flags the legacy-key / chunked-`put_pack` scaling to revisit before raising the constant.
3	MINOR	`warn!` hardcoded "transient host backpressure" but also fires on plain timeouts.	Reworded to "transient: host backpressure or timeout".
4	MINOR	`is_transient_host_error` substring match (esp. `"timeout"`) is broad and could rot as gateway wording drifts.	Kept broad (a false-transient costs one backoff cycle; missing a real transient variant is worse) and documented the trade-off explicitly.
5	MINOR	`retry_backoff` `2u64.pow(attempt)` is an overflow/absurd-sleep footgun if an attempt ceiling is ever raised.	`saturating_pow` + `MAX_BACKOFF_SECS = 60` cap; current callers' 2/4/8s values unchanged. Pinned by an extended backoff test.
6	NIT	Pre-existing `put_pack` doc said "retries on transient host errors" but it retries on any error.	Corrected the doc (retry is unconditional and safe because re-PUTs are content-addressed/idempotent).

Test-coverage note

The async retry loop itself (get_state_with_legacy_fallback) isn't unit-tested — WebApi has no trait seam and the file's convention is to test pure helpers/dispatch, not the WS round-trip. The substantive retry decision (outcomes_worth_retrying, is_transient_host_error, retry_backoff) is fully unit-tested (9 tests incl. the Codex regression and hard-error abort); loop termination/short-circuit are trivial and verified by inspection. I did not add a brittle source-scrape pin since this crate has no such precedent.

CI green on 1993e15 (fmt, clippy, test, fixture-tamper-check). Full cargo test --workspace green locally.

[AI-assisted - Claude]

Ships the repo-state probe transient-backpressure retry fixed in #56. The Mirror to Freenet workflow installs freenet-git from crates.io, so the fix only takes effect once published. Only the freenet-git crate changed; the sibling crates are unchanged and stay at their published versions (their `^0.1.21` requirement resolves against them). [AI-assisted - Claude]

sanity added 2 commits June 16, 2026 10:36

sanity merged commit f641138 into main Jun 16, 2026
5 checks passed

sanity deleted the fix/mirror-probe-transient-retry branch June 16, 2026 16:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(wsclient): retry repo-state probe on transient host backpressure#56

fix(wsclient): retry repo-state probe on transient host backpressure#56
sanity merged 2 commits into
mainfrom
fix/mirror-probe-transient-retry

sanity commented Jun 16, 2026

Uh oh!

sanity commented Jun 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

sanity commented Jun 16, 2026

Problem

Solution

Scope note

Testing

Related

Uh oh!

sanity commented Jun 16, 2026

Multi-model review (consolidated)

Findings & resolutions

Test-coverage note

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant