fix(wsclient): retry repo-state probe on transient host backpressure#56
Conversation
## Problem
The `Mirror to Freenet` workflow (and `freenet-git rescue`) failed with
exit 128 on a single transient gateway error during the repo-state
probe:
error: every probe failed with transport/other error (1 probe(s));
first failure: current key 7Zazi...: recv: client error: error while
executing operation in the network: contract queue full, try again later
`contract queue full, try again later` is explicit backpressure the
gateway invites us to retry, yet `get_state_with_legacy_fallback` ran
the probe exactly once and bailed. `put_pack` already retried this same
class of error on the write side, so the read-side probe was the lone
gap — making the daily freenet-core mirror one transient blip away from
a false-alarm Matrix page on every push (multiple failures observed
2026-06-15/16).
## Solution
Retry the whole probe sequence up to 4 times with exponential backoff
(2s/4s/8s, shared with `put_pack` via `retry_backoff`) when — and only
when — *every* probe outcome is transient (a timeout, or a host error
matching `is_transient_host_error`: "queue full" / "try again later" /
timeouts). A single authoritative `NotFound`/`Empty` short-circuits the
loop so genuine "no state on the network" still fails fast with the
precise per-probe message and never masks data loss behind a delay.
The probe body is extracted into `probe_all_keys` (returns the per-probe
outcome list on failure) so the retry decision is made on structured
outcomes, not by re-parsing the formatted error string.
Scope note: the separate `put timed out` mirror failures go through
`put_pack`, which already retries; tuning that budget is tracked in #53
and deliberately left out to keep this a single, low-risk change that
does not risk the mirror job's 30-minute runtime budget.
## Testing
8 new unit tests pin the transient/permanent classifier, the
`outcomes_all_transient` retry gate (single backpressure probe and
all-timeout retry; authoritative-outcome and dead-connection
short-circuit; empty-input guard), and the shared backoff schedule.
`cargo fmt`, `cargo clippy -D warnings`, and `cargo test --workspace`
all pass.
Related: #36 (per-chunk fetch retry), #53 (put_pack rescue timeouts).
[AI-assisted - Claude]
…rrent NotFound Review findings from PR #56 (Codex + adversarial Claude reviewers): - Codex P2 (correctness): the "all outcomes transient" gate failed to retry a transient legacy-key probe when the current key was authoritatively NotFound — the normal migration/fallback case. The current key's NotFound says nothing about whether a legacy key holds the state, so a transient blip on that legacy probe was wrongly treated as fatal. Replaced `outcomes_all_transient` with `outcomes_worth_retrying` (3-way `RetryDisposition`): retry when ANY outcome is transient and NONE is a hard transport error; NotFound/ Empty are per-key authoritative and no longer block a sibling retry. A dead-connection HardError still aborts fast. New regression test `outcomes_worth_retrying_retries_transient_legacy_after_current_not_found`. - Doc accuracy (MAJOR): the PROBE_MAX_ATTEMPTS comment claimed "at most 14s". Timeout outcomes are also retryable and each burns a full per-op timeout, so the real worst case is ~`attempts × (1+legacy) × timeout`. Comment now states the true budget (~12 min for freenet-core, L=0) and flags the legacy-key / chunked-put_pack scaling. - warn! message no longer hardcodes "backpressure" (also fires on plain timeouts): now "transient: host backpressure or timeout". - retry_backoff uses saturating_pow + a MAX_BACKOFF_SECS=60 cap so a future attempt-ceiling bump can't overflow/panic; values 2/4/8 for the current callers are unchanged. Pinned by an extended backoff test. - Documented that is_transient_host_error's substring match is intentionally broad, and corrected put_pack's stale "retries on transient host errors" doc (it retries on any error, safe because re-PUTs are idempotent). [AI-assisted - Claude]
Multi-model review (consolidated)Reviewed by an external non-Claude model (Codex, Findings & resolutions
Test-coverage noteThe async retry loop itself ( CI green on [AI-assisted - Claude] |
Ships the repo-state probe transient-backpressure retry fixed in #56. The Mirror to Freenet workflow installs freenet-git from crates.io, so the fix only takes effect once published. Only the freenet-git crate changed; the sibling crates are unchanged and stay at their published versions (their `^0.1.21` requirement resolves against them). [AI-assisted - Claude]
Problem
The
Mirror to Freenetworkflow (andfreenet-git rescue) failed with exit 128 on a single transient gateway error during the repo-state probe. From the 2026-06-16 run that paged the dev Matrix room:contract queue full, try again lateris explicit backpressure the gateway invites us to retry, yetget_state_with_legacy_fallbackran the probe exactly once and bailed.put_packalready retried this same class of error on the write side, so the read-side probe was the lone gap — making the daily freenet-core mirror one transient blip away from a false-alarm Matrix page on every push. Multiple failures of this and the relatedput timed outmode were observed across 2026-06-15/16.Solution
Retry the whole probe sequence up to 4 times with exponential backoff (2s/4s/8s, shared with
put_packvia a newretry_backoffhelper) when — and only when — every probe outcome is transient (a timeout, or a host error matchingis_transient_host_error:queue full/try again later/ timeouts).A single authoritative
NotFound/Empty(or a dead-connection transport error) short-circuits the loop, so genuine "no state on the network" still fails fast with the precise per-probe message and never masks data loss behind a delay.The probe body is extracted into
probe_all_keys, which returns the structured per-probe outcome list on failure, so the retry decision is made on real outcomes rather than by re-parsing the formatted error string. Both probe callers (thegit-remote-freenetlist/push path and therescuepath) get the resilience for free; the public signature is unchanged.Scope note
The separate
put timed outmirror failures go throughput_pack, which already retries. Tuning that budget is tracked in #53 and deliberately left out here: bumping per-chunk PUT attempts under sustained backpressure risks the mirror job's 30-minute runtime budget, so it deserves its own change. This PR is the single, low-risk fix for the reported exit-128 incident.Testing
8 new unit tests pin:
put timed out; rejects NotFound / connection-reset / decode errors),outcomes_all_transientretry gate (single backpressure probe retries; all-timeout retries; any authoritative outcome or dead-connection error short-circuits; empty input is not retryable),cargo fmt --all --check,cargo clippy --workspace --all-targets -D warnings, andcargo test --workspaceall pass locally.Related
put_packtimeouts on rescue (theput timed outmode above)[AI-assisted - Claude]