fix: prevent long scans from hanging or dying on fd exhaustion #601
0xfelixli wants to merge 1 commit into
Two independent failure modes surfaced on long multi-agent scans:

1. Too many open files (OSError 24). A long scan accumulates file descriptors (httpx pools, docker socket polling, per-subagent resources). The process never raised its soft RLIMIT_NOFILE, so on macOS (default soft limit 256) a long scan blew past the ceiling and crashed at whatever fd-opening call lost the race. Raise the soft limit toward the hard limit at startup.

2. Agents stuck at "Starting agent...". LLM_TIMEOUT was only applied to the startup warm-up call, never to agent-loop calls. Models on the openai/ prefix go through the SDK's OpenAI provider, whose default timeout is 600s; as an httpx read timeout, that parks an agent for ten minutes on a stalled stream. Install a default AsyncOpenAI client whose read timeout is bounded by LLM_TIMEOUT so a stalled stream raises and the existing retry policy recovers the agent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Greptile Summary

This PR addresses two failure modes in long-running multi-agent scans: fd exhaustion on macOS (default soft limit of 256) and agents stuck waiting up to 10 minutes due to the OpenAI SDK's 600s default timeout never being bounded by
Confidence Score: 3/5

The OpenAI timeout fix is safe to merge, but the fd-limit fix is likely a no-op on macOS and should be revisited before relying on it.

The strix/interface/main.py — the

Important Files Changed
Reviews (1): Last reviewed commit: "fix: prevent long scans from hanging or ..."
    target = hard if hard != resource.RLIM_INFINITY else 1_048_576
    if soft >= target:
        return
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        logger.debug("Could not raise RLIMIT_NOFILE from %s to %s", soft, target, exc_info=True)
**macOS: `setrlimit` always fails, leaving the limit at 256** (strix/interface/main.py, lines 757-763)

When the hard limit is reported as `RLIM_INFINITY` (the common macOS case: `getrlimit(RLIMIT_NOFILE)` typically returns `(256, 9223372036854775807)` on Ventura/Sonoma), `target` becomes `1_048_576`. However, macOS caps `RLIMIT_NOFILE` at `kern.maxfilesperproc` (typically 10240), so `setrlimit(RLIMIT_NOFILE, (1_048_576, RLIM_INFINITY))` raises `ValueError`. That error is swallowed with only a debug log, and the limit stays at 256, so the fix does nothing on the primary platform it is meant to address. A progressive-fallback loop that tries e.g. `65536`, then `10240`, then `8192` would get at least one value accepted by the kernel.
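One way the progressive-fallback idea from this comment could look is sketched below. It is not the PR's code: the function name `raise_open_files_limit`, the `FALLBACK_TARGETS` tuple, and the return value are assumptions for illustration, and the ladder values are simply the ones suggested in the comment. The sketch is POSIX-only, since the `resource` module is not available on Windows.

```python
import logging
import resource

logger = logging.getLogger(__name__)

# Hypothetical fallback ladder, highest first; values come from the review comment.
FALLBACK_TARGETS = (65536, 10240, 8192)


def raise_open_files_limit(candidates=FALLBACK_TARGETS):
    """Best-effort raise of the soft RLIMIT_NOFILE; returns the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        # Nothing to raise; also avoids accidentally lowering an unlimited soft limit.
        return soft
    if hard != resource.RLIM_INFINITY:
        # With a finite hard limit, try it first and never ask for more than it.
        candidates = (hard, *(c for c in candidates if c < hard))
    for target in candidates:
        if soft >= target:
            # Already at least this high; the remaining (lower) candidates cannot help.
            break
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            logger.debug("RLIMIT_NOFILE=%s rejected, trying a lower value", target, exc_info=True)
            continue
        break
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Returning the effective soft limit lets the caller log what was actually applied, which makes a silent no-op visible.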
    The client carries the key/base_url because ``set_default_openai_client``
    takes precedence over ``set_default_openai_key``.
    """
    if not llm.api_key or llm.timeout <= 0:
**Timeout fix skipped when `api_key` is absent** (strix/config/models.py, line 100)

The guard `if not llm.api_key` exits early, so the bounded-timeout client is never installed for configurations that rely solely on `OPENAI_API_KEY` being present in the process environment but not piped through pydantic-settings (e.g. set before the process starts, not in a `.env` file read by the settings layer). In practice `LlmSettings` does read `OPENAI_API_KEY` via `AliasChoices`, so the field will be populated for standard setups. It is still worth confirming that the env-var aliasing is always evaluated before this guard runs: if settings are constructed before the env var is exported (e.g. in certain test fixtures), the guard silently prevents the timeout fix from taking effect.
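A minimal sketch of one way to make the guard robust to that ordering is to fall back to the environment at the moment the guard runs. Everything here is hypothetical: `LlmSettingsSketch` is a stand-in for the PR's settings object (only the `api_key` and `timeout` fields are taken from the diff), and `resolve_api_key` and `should_install_bounded_client` are illustrative helper names, not functions from the PR.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class LlmSettingsSketch:
    """Hypothetical stand-in for the settings object seen in the diff."""
    api_key: Optional[str] = None
    timeout: float = 300.0


def resolve_api_key(llm, environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer the configured key; otherwise look up OPENAI_API_KEY at call time."""
    return llm.api_key or environ.get("OPENAI_API_KEY") or None


def should_install_bounded_client(llm, environ: Mapping[str, str] = os.environ) -> bool:
    """Mirror of the guard: install only with a positive timeout and some usable key."""
    return llm.timeout > 0 and resolve_api_key(llm, environ) is not None
```

Because `os.environ` is a live mapping, the lookup reflects variables exported after the settings object was built, which is the case the comment worries about.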
Problem

Two independent failure modes show up on long-running multi-agent scans:

1. `OSError: [Errno 24] Too many open files`

   A long scan steadily accumulates file descriptors (httpx connection pools, docker socket polling via `container.reload()`, per-subagent sessions). The process never raises its soft `RLIMIT_NOFILE`, so on macOS, where the default soft limit is 256, a long scan eventually blows past the ceiling and crashes at whatever fd-opening call happens to lose the race. The traceback location is effectively random, which makes it look like a mysterious hang.

2. Agents stuck at "Starting agent…"

   `LLM_TIMEOUT` was only ever applied to the startup warm-up call, never to the actual agent-loop LLM calls. Models on the `openai/` prefix go through the SDK's OpenAI provider, whose `AsyncOpenAI` client defaults to a 600s timeout. Because httpx treats that as a read timeout, a stalled stream parks the agent for up to 10 minutes, surfaced in the TUI as an agent stuck at "Starting agent…".

Fix

- `_raise_open_files_limit()` at startup raises the soft `RLIMIT_NOFILE` toward the hard limit (best-effort; no-op on Windows / when already high).
- `_configure_openai_client_timeout()` installs a default `AsyncOpenAI` client whose timeout is bounded by `LLM_TIMEOUT`. As an httpx read (idle) timeout it resets on every received chunk, so healthy long generations are never cut off, but a truly stalled stream raises after `LLM_TIMEOUT` and the existing retry policy recovers the agent.

Both changes are conservative and default-safe (`LLM_TIMEOUT` already defaults to 300s). The OpenAI-client fix is verified end-to-end against a custom `api_base`.

🤖 Generated with Claude Code
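The distinction between an idle (read) timeout and a total timeout described in the Fix section can be illustrated with a small standard-library sketch. It does not use httpx or the OpenAI client; `iter_with_idle_timeout`, `fake_stream`, and `collect` are hypothetical names, and `fake_stream` merely imitates a streamed response by sleeping between chunks.

```python
import asyncio


async def iter_with_idle_timeout(stream, idle_timeout):
    """Yield chunks from an async iterator; raise if one gap between chunks exceeds idle_timeout."""
    iterator = stream.__aiter__()
    while True:
        try:
            # The timer restarts for every chunk, so the total duration is unbounded.
            chunk = await asyncio.wait_for(iterator.__anext__(), idle_timeout)
        except StopAsyncIteration:
            return
        yield chunk


async def fake_stream(delays):
    """Hypothetical stand-in for a streamed LLM response: one chunk after each delay."""
    for index, delay in enumerate(delays):
        await asyncio.sleep(delay)
        yield index


async def collect(delays, idle_timeout):
    """Drain a fake stream under the idle timeout and return the chunks received."""
    return [c async for c in iter_with_idle_timeout(fake_stream(delays), idle_timeout)]
```

With an idle timeout of 0.5s, `asyncio.run(collect([0.1] * 8, 0.5))` completes even though the stream takes about 0.8s in total, while `asyncio.run(collect([0.1, 2.0], 0.5))` raises `asyncio.TimeoutError` once the second chunk stalls.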