fix: prevent long scans from hanging or dying on fd exhaustion #601

Open
0xfelixli wants to merge 1 commit into usestrix:main from 0xfelixli:fix/long-scan-stability
@0xfelixli

Problem

Two independent failure modes show up on long-running multi-agent scans:

1. OSError: [Errno 24] Too many open files

A long scan steadily accumulates file descriptors (httpx connection pools, docker socket polling via container.reload(), per-subagent sessions). The process never raises its soft RLIMIT_NOFILE, so on macOS — where the default soft limit is 256 — a long scan eventually blows past the ceiling and crashes at whatever fd-opening call happens to lose the race. The traceback location is effectively random, which makes it look like a mysterious hang.

2. Agents stuck at "Starting agent…"

LLM_TIMEOUT was only ever applied to the startup warm-up call, never to the actual agent-loop LLM calls. Models on the openai/ prefix go through the SDK's OpenAI provider, whose AsyncOpenAI client defaults to a 600s timeout. Because httpx treats that as a read timeout, a stalled stream parks the agent for up to 10 minutes — surfaced in the TUI as an agent stuck at "Starting agent…".

Fix

  1. _raise_open_files_limit() at startup raises the soft RLIMIT_NOFILE toward the hard limit (best-effort; no-op on Windows / when already high).
  2. _configure_openai_client_timeout() installs a default AsyncOpenAI client whose timeout is bounded by LLM_TIMEOUT. As an httpx read (idle) timeout it resets on every received chunk, so healthy long generations are never cut off, but a truly stalled stream raises after LLM_TIMEOUT and the existing retry policy recovers the agent.

Both changes are conservative and default-safe (LLM_TIMEOUT already defaults to 300s). The OpenAI-client fix is verified end-to-end against a custom api_base.
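
To illustrate the idle-timeout semantics described in item 2, here is a minimal sketch that uses only asyncio. It is not the code in this PR and does not touch httpx or the OpenAI SDK; `iter_with_idle_timeout`, `fake_stream`, and `collect` are hypothetical names introduced for illustration. The deadline is applied to one chunk at a time, so it restarts whenever data arrives.

```python
import asyncio
from collections.abc import AsyncIterator


async def iter_with_idle_timeout(
    stream: AsyncIterator[str], idle_timeout: float
) -> AsyncIterator[str]:
    """Yield chunks from `stream`; raise a timeout if one gap exceeds `idle_timeout`."""
    iterator = stream.__aiter__()
    while True:
        try:
            # The timeout covers a single chunk, so it resets after every chunk.
            chunk = await asyncio.wait_for(iterator.__anext__(), idle_timeout)
        except StopAsyncIteration:
            return
        yield chunk


async def fake_stream(gaps: list[float]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streamed response: waits `gap` seconds per chunk."""
    for index, gap in enumerate(gaps):
        await asyncio.sleep(gap)
        yield f"chunk-{index}"


async def collect(gaps: list[float], idle_timeout: float) -> list[str]:
    """Drain the fake stream through the idle-timeout wrapper."""
    return [c async for c in iter_with_idle_timeout(fake_stream(gaps), idle_timeout)]
```

With ten chunks spaced 0.02s apart and a 0.15s idle timeout, `collect` returns all ten chunks even though the stream as a whole lasts longer than the timeout. A stream that goes quiet for 1s under a 0.1s idle timeout raises `asyncio.TimeoutError` instead, which is the condition a retry policy can act on.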

🤖 Generated with Claude Code

Two independent failure modes surfaced on long multi-agent scans:

1. Too many open files (OSError 24). A long scan accumulates file
   descriptors (httpx pools, docker socket polling, per-subagent
   resources). The process never raised its soft RLIMIT_NOFILE, so on
   macOS (default soft limit 256) a long scan blew past the ceiling and
   crashed at whatever fd-opening call lost the race. Raise the soft
   limit toward the hard limit at startup.

2. Agents stuck at "Starting agent...". LLM_TIMEOUT was only applied to
   the startup warm-up call, never to agent-loop calls. Models on the
   openai/ prefix go through the SDK's OpenAI provider whose default
   timeout is 600s; as an httpx read timeout that parks an agent for ten
   minutes on a stalled stream. Install a default AsyncOpenAI client
   whose read timeout is bounded by LLM_TIMEOUT so a stalled stream
   raises and the existing retry policy recovers the agent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented Jun 29, 2026

Greptile Summary

This PR addresses two failure modes in long-running multi-agent scans: fd exhaustion on macOS (default soft limit of 256) and agents stuck waiting up to 10 minutes due to the OpenAI SDK's 600s default timeout never being bounded by LLM_TIMEOUT.

  • _raise_open_files_limit() in main.py attempts to raise the soft RLIMIT_NOFILE toward the hard limit at startup, but the 1_048_576 fallback used when hard == RLIM_INFINITY exceeds macOS's per-process kernel cap (kern.maxfilesperproc ≈ 10240), causing setrlimit to fail silently — the fd exhaustion fix is effectively a no-op on the platform it targets most.
  • _configure_openai_client_timeout() in models.py installs a custom AsyncOpenAI client with a timeout bounded by LLM_TIMEOUT, correctly applying an idle read timeout to OpenAI-prefix model streams without cutting off healthy long generations.

Confidence Score: 3/5

The OpenAI timeout fix is safe to merge, but the fd-limit fix is likely a no-op on macOS and should be revisited before relying on it.

The _raise_open_files_limit function silently does nothing on macOS because the 1_048_576 target exceeds the kernel's per-process cap, causing setrlimit to fail and the limit to remain at 256. This is the exact platform the PR targets for fd exhaustion, so half the stated fix does not take effect.

strix/interface/main.py — the _raise_open_files_limit fallback target and error-handling strategy deserve a closer look for macOS behavior.

Important Files Changed

| Filename | Overview |
| --- | --- |
| strix/interface/main.py | Adds _raise_open_files_limit() called at startup; the 1_048_576 fallback target exceeds macOS's per-process fd cap (kern.maxfilesperproc ≈ 10240), causing setrlimit to fail silently and leaving the limit unchanged on macOS, the primary failure platform. |
| strix/config/models.py | Adds _configure_openai_client_timeout() to install a bounded-timeout AsyncOpenAI client at startup; logic is sound for the OpenAI-prefix path, with a minor guard that skips the fix when api_key is absent. |
Reviews (1): Last reviewed commit: "fix: prevent long scans from hanging or ..."

Comment thread strix/interface/main.py
Comment on lines +757 to +763
    target = hard if hard != resource.RLIM_INFINITY else 1_048_576
    if soft >= target:
        return
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        logger.debug("Could not raise RLIMIT_NOFILE from %s to %s", soft, target, exc_info=True)

P1 macOS: setrlimit always fails, leaving the limit at 256

When the hard limit is reported as RLIM_INFINITY (the common macOS case — getrlimit(RLIMIT_NOFILE) typically returns (256, 9223372036854775807) on Ventura/Sonoma), target becomes 1_048_576. macOS however caps RLIMIT_NOFILE at kern.maxfilesperproc (typically 10240), so setrlimit(RLIMIT_NOFILE, (1_048_576, RLIM_INFINITY)) raises ValueError, which is silently swallowed with a debug log, and the limit stays at 256. The fix does nothing on the primary platform it is meant to address. A progressive-fallback loop — trying e.g. 65536, then 10240, then 8192 — would get at least one value accepted by the kernel.
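
A minimal sketch of the progressive-fallback loop suggested above, assuming a POSIX platform where the `resource` module is available. The function name `raise_open_files_limit` and its return value are hypothetical, and the candidate values are the ones named in this comment; which of them a given kernel accepts is platform-dependent.

```python
import resource  # POSIX only; this module does not exist on Windows


def raise_open_files_limit(candidates=(65536, 10240, 8192)):
    """Try progressively smaller soft limits; return the soft limit now in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    for target in candidates:
        if hard != resource.RLIM_INFINITY:
            target = min(target, hard)  # never ask for more than the hard limit
        if target <= soft:
            break  # candidates are descending, so smaller ones cannot help
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
            return target
        except (ValueError, OSError):
            continue  # rejected by the kernel; try the next smaller candidate
    return soft
```

Returning the effective limit lets the caller log what was actually applied rather than what was requested, which makes a silent no-op visible.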

Comment thread strix/config/models.py
    The client carries the key/base_url because ``set_default_openai_client``
    takes precedence over ``set_default_openai_key``.
    """
    if not llm.api_key or llm.timeout <= 0:

P2 Timeout fix skipped when api_key is absent (strix/config/models.py, line 100)

The guard if not llm.api_key exits early, so the bounded-timeout client is never installed for configurations that rely solely on OPENAI_API_KEY being present in the process environment but not piped through pydantic-settings (e.g. set before the process starts, not in a .env file read by the settings layer). In practice LlmSettings does read OPENAI_API_KEY via AliasChoices, so the field will be populated for standard setups. But it's worth confirming that the env-var aliasing is always evaluated before this guard runs — if settings are constructed before the env var is exported (e.g. in certain test fixtures), the guard silently prevents the timeout fix from taking effect.
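
One way to make the guard less sensitive to construction order is to fall back to the process environment at the moment the guard runs. This is a sketch under that assumption, not the PR's code: `resolve_api_key` is a hypothetical helper, and only the `OPENAI_API_KEY` variable name comes from the comment above.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    configured: Optional[str], env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Prefer the configured key; otherwise read OPENAI_API_KEY at call time."""
    env = os.environ if env is None else env
    # Empty strings count as "not set" so the caller's guard stays a simple truth test.
    return configured or env.get("OPENAI_API_KEY") or None
```

The guard could then test the resolved value instead of `llm.api_key` directly, so a key exported after the settings object was built is still picked up.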
