Add CalDAV intermittent-connection probe + log correlation #232

Open
aatchison wants to merge 1 commit into main from diagnostics/caldav-intermittent-probe
@aatchison

Contributor

Why

Thunderbird desktop users have been intermittently seeing "failed to connect to server mail.thundermail.com" for the calendar (Stalwart CalDAV), roughly once every 2–3 hours. Monitoring is green and the nightly integration test passes — but a once-a-day connect test cannot catch a failure that happens intermittently. Multiple people have observed it.

This adds diagnostic tooling to answer one question: is the root cause Stalwart, the network/edge, or the Thunderbird client? The cause may well be unrelated to Stalwart — this is how we find out for sure.

What

Under research/caldav-probe/:

  • probe.py — stdlib-only poller (runs locally, no install step). Polls https://<host>/dav/cal/ on an interval and times each connection phase separately — DNS → TCP → TLS → CalDAV PROPFIND — classifying failures granularly (DNS_FAIL, TCP_FAIL, TLS_FAIL, AUTH_FAIL, HTTP_5XX, TIMEOUT, DAV_ERROR, OK). Appends one JSON line per attempt. The caldav lib was intentionally avoided because it collapses every failure into one opaque exception; raw sockets give true per-phase attribution.
  • correlate.py — reads the JSONL, finds failures, and queries /tb/prod/stalwart CloudWatch logs in a window around each failure, filtered to this machine's public IP, so we see the server's view of our connections (or that it logged nothing).
  • README.md, .env.example, .gitignore (.env + *.jsonl never committed).
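The per-phase timing idea described for probe.py can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `run_phases`, `network_phases`, `classify`, `append_jsonl`, and the record fields (`ts`, `timings_ms`, `result`, `error`) are hypothetical names, and the PROPFIND phase is left out for brevity. Only the failure labels come from the description above.

```python
import json
import socket
import ssl
import time

# Label used when a given phase fails for a reason other than a timeout.
PHASE_LABELS = {"dns": "DNS_FAIL", "tcp": "TCP_FAIL", "tls": "TLS_FAIL"}


def classify(phase, exc):
    """Map a failed phase and its exception to one of the PR's failure labels."""
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "TIMEOUT"
    return PHASE_LABELS[phase]


def run_phases(phases):
    """Run (name, fn) phases in order, feeding each result to the next phase.

    Stops at the first failing phase. Returns (record, last_good_value) so the
    caller can close a socket left over from a partially completed attempt.
    """
    record = {"ts": time.time(), "timings_ms": {}, "result": "OK"}
    value = None
    for name, fn in phases:
        start = time.monotonic()
        try:
            value = fn(value)
        except Exception as exc:
            record["result"] = classify(name, exc)
            record["error"] = repr(exc)
            break  # the finally clause below still records this phase's timing
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000.0
            record["timings_ms"][name] = round(elapsed_ms, 1)
    return record, value


def network_phases(host, port=443, timeout=10.0):
    """Build the real DNS -> TCP -> TLS phases for one host (needs network)."""
    def dns(_):
        # First resolved address only; a fuller probe might try every address.
        return socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0][4][0]

    def tcp(ip):
        return socket.create_connection((ip, port), timeout=timeout)

    def tls(sock):
        ctx = ssl.create_default_context()
        return ctx.wrap_socket(sock, server_hostname=host)

    return [("dns", dns), ("tcp", tcp), ("tls", tls)]


def append_jsonl(path, record):
    """Append one JSON object per line, matching the one-line-per-attempt log."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

A single attempt would then look like `record, conn = run_phases(network_phases("mail.thundermail.com"))`, followed by `append_jsonl("probe.jsonl", record)` and closing `conn` if it is a socket. Keeping the phases as separate callables is what makes the failure attributable to one phase rather than to the request as a whole.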

Interpreting results

| Probe result | Stalwart logs | Verdict |
| --- | --- | --- |
| Fails | error at our IP/time | Server-side (Stalwart) |
| TLS_FAIL (cert) | | Cert / edge |
| TCP_FAIL / TIMEOUT | silent | Network / edge / load balancer |
| Always OK | | Client-side (Thunderbird) or unprobed layer |
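As a rough illustration of how the table could be applied mechanically, the sketch below turns each failed probe record into a time window for the log query and maps the probe result plus the log finding to a verdict. The function names, the `ts` and `result` record fields, the default padding, and the precedence between rows are all assumptions made for this example; combinations the table does not cover are reported as inconclusive rather than guessed.

```python
def failure_windows(records, pad_seconds=120):
    """Yield (start_ms, end_ms, record) for every non-OK probe record.

    Times are epoch milliseconds, the unit CloudWatch Logs queries expect.
    Assumes each record carries 'ts' (epoch seconds) and 'result'.
    """
    pad_ms = pad_seconds * 1000
    for rec in records:
        if rec.get("result") == "OK":
            continue
        at_ms = int(rec["ts"] * 1000)
        yield at_ms - pad_ms, at_ms + pad_ms, rec


def verdict_for(result, server_logged_error):
    """Apply the table to one failed attempt.

    server_logged_error: True if Stalwart logged an error for our IP inside
    the window, False if the server logs were silent.
    """
    if server_logged_error:
        return "Server-side (Stalwart)"
    if result == "TLS_FAIL":
        # The table's row is specifically about certificate failures.
        return "Cert / edge"
    if result in ("TCP_FAIL", "TIMEOUT"):
        return "Network / edge / load balancer"
    return "Inconclusive (combination not covered by the table)"


def run_verdict(records):
    """Return the 'Always OK' verdict for a whole run, or None if any attempt failed."""
    if records and all(rec.get("result") == "OK" for rec in records):
        return "Client-side (Thunderbird) or unprobed layer"
    return None
```

The per-attempt and whole-run verdicts are kept separate because "Always OK" is a statement about the entire observation period, not about any single probe.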

Validation

Verified end-to-end against prod with dummy creds: DNS/TCP/TLS phases all time correctly and the request classifies as AUTH_FAIL (401). Resolved host 3.78.21.144.
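For context on the AUTH_FAIL (401) result, a PROPFIND check and its status classification might look like the sketch below. It is an assumption-laden example, not the probe's actual code: `propfind_status` and `classify_status` are hypothetical names, HTTP Basic auth is assumed, and only the 401 to AUTH_FAIL mapping is confirmed by the validation run above. Treating 403 as AUTH_FAIL and every other non-207 status as DAV_ERROR are guesses. Unlike probe.py, this uses `http.client`, which hides the DNS/TCP/TLS phases, so it only illustrates the final step.

```python
import base64
import http.client

# Minimal PROPFIND body asking for a single property.
PROPFIND_BODY = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<d:propfind xmlns:d="DAV:"><d:prop><d:displayname/></d:prop></d:propfind>'
)


def classify_status(status):
    """Map an HTTP status from PROPFIND to one of the PR's result labels."""
    if status in (401, 403):  # 401 is confirmed by the PR; 403 is an assumption
        return "AUTH_FAIL"
    if 500 <= status <= 599:
        return "HTTP_5XX"
    if status == 207:  # Multi-Status, the normal PROPFIND success response
        return "OK"
    return "DAV_ERROR"  # assumption: anything else counts as a DAV-level error


def propfind_status(host, user, password, path="/dav/cal/", timeout=10.0):
    """Send one Depth: 0 PROPFIND over HTTPS and return the status code."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Depth": "0",
        "Content-Type": "application/xml; charset=utf-8",
        "Authorization": f"Basic {token}",
    }
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        conn.request("PROPFIND", path, body=PROPFIND_BODY, headers=headers)
        return conn.getresponse().status
    finally:
        conn.close()
```

With dummy credentials, `classify_status(propfind_status(host, "dummy", "dummy"))` would be expected to yield AUTH_FAIL, consistent with the validation result reported above.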

🤖 Generated with Claude Code

Diagnostic tooling for the intermittent 'failed to connect to server
mail.thundermail.com' calendar errors seen in Thunderbird desktop. A once-a-day
integration test cannot catch a failure that occurs ~once every 2-3 hours, so
this polls the CalDAV endpoint frequently and records which connection phase
fails, then correlates failures against Stalwart CloudWatch logs to determine
whether the root cause is Stalwart, the network/edge, or the client.

- probe.py: stdlib-only poller timing DNS/TCP/TLS/PROPFIND phases with granular
  failure classification; appends JSONL results.
- correlate.py: matches probe failures against /tb/prod/stalwart logs by public
  IP and timestamp window.
- .env/.jsonl gitignored; README documents usage and result interpretation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>