fix: prevent long scans from hanging or dying on fd exhaustion by 0xfelixli · Pull Request #601 · usestrix/strix

0xfelixli · 2026-06-29T10:16:17Z

Problem

Two independent failure modes show up on long-running multi-agent scans:

1. `OSError: [Errno 24] Too many open files`

A long scan steadily accumulates file descriptors (httpx connection pools, docker socket polling via container.reload(), per-subagent sessions). The process never raises its soft RLIMIT_NOFILE, so on macOS — where the default soft limit is 256 — a long scan eventually blows past the ceiling and crashes at whatever fd-opening call happens to lose the race. The traceback location is effectively random, which makes it look like a mysterious hang.

2. Agents stuck at "Starting agent…"

LLM_TIMEOUT was only ever applied to the startup warm-up call, never to the actual agent-loop LLM calls. Models on the openai/ prefix go through the SDK's OpenAI provider, whose AsyncOpenAI client defaults to a 600s timeout. Because httpx treats that as a read timeout, a stalled stream parks the agent for up to 10 minutes — surfaced in the TUI as an agent stuck at "Starting agent…".

Fix

_raise_open_files_limit() at startup raises the soft RLIMIT_NOFILE toward the hard limit (best-effort; no-op on Windows / when already high).
_configure_openai_client_timeout() installs a default AsyncOpenAI client whose timeout is bounded by LLM_TIMEOUT. As an httpx read (idle) timeout it resets on every received chunk, so healthy long generations are never cut off, but a truly stalled stream raises after LLM_TIMEOUT and the existing retry policy recovers the agent.

Both changes are conservative and default-safe (LLM_TIMEOUT already defaults to 300s). The OpenAI-client fix is verified end-to-end against a custom api_base.
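The idle-timeout behaviour described above can be illustrated without any HTTP client. The following is a minimal sketch using only `asyncio`; `fake_stream` and `read_with_idle_timeout` are hypothetical names invented for this illustration and are not part of strix, httpx, or the OpenAI SDK. It only shows the general idea that the timer restarts on every chunk, so total duration is unbounded while a stall is not.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(delays: List[float]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming response: one chunk per delay."""
    for index, delay in enumerate(delays):
        await asyncio.sleep(delay)
        yield f"chunk-{index}"


async def read_with_idle_timeout(
    stream: AsyncIterator[str], idle_timeout: float
) -> List[str]:
    """Collect chunks, allowing at most idle_timeout seconds between chunks.

    Raises asyncio.TimeoutError if the gap before any single chunk is too long.
    """
    chunks: List[str] = []
    iterator = stream.__aiter__()
    while True:
        try:
            # A fresh timer is started for each chunk, so a long but healthy
            # stream is never cut off by its total duration.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=idle_timeout)
        except StopAsyncIteration:
            return chunks
        chunks.append(chunk)
```

A stream of thirty quick chunks completes even though its total duration exceeds the idle timeout, while a stream that goes silent for longer than the timeout raises `asyncio.TimeoutError`, which a caller's retry policy could then handle.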

🤖 Generated with Claude Code

Two independent failure modes surfaced on long multi-agent scans:

1. Too many open files (OSError 24). A long scan accumulates file descriptors (httpx pools, docker socket polling, per-subagent resources). The process never raised its soft RLIMIT_NOFILE, so on macOS (default soft limit 256) a long scan blew past the ceiling and crashed at whatever fd-opening call lost the race. Raise the soft limit toward the hard limit at startup.

2. Agents stuck at "Starting agent...". LLM_TIMEOUT was only applied to the startup warm-up call, never to agent-loop calls. Models on the openai/ prefix go through the SDK's OpenAI provider whose default timeout is 600s; as an httpx read timeout that parks an agent for ten minutes on a stalled stream. Install a default AsyncOpenAI client whose read timeout is bounded by LLM_TIMEOUT so a stalled stream raises and the existing retry policy recovers the agent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-06-29T10:19:39Z

Greptile Summary

This PR addresses two failure modes in long-running multi-agent scans: fd exhaustion on macOS (default soft limit of 256) and agents stuck waiting up to 10 minutes due to the OpenAI SDK's 600s default timeout never being bounded by LLM_TIMEOUT.

_raise_open_files_limit() in main.py attempts to raise the soft RLIMIT_NOFILE toward the hard limit at startup, but the 1_048_576 fallback used when hard == RLIM_INFINITY exceeds macOS's per-process kernel cap (kern.maxfilesperproc ≈ 10240), causing setrlimit to fail silently — the fd exhaustion fix is effectively a no-op on the platform it targets most.
_configure_openai_client_timeout() in models.py installs a custom AsyncOpenAI client with a timeout bounded by LLM_TIMEOUT, correctly applying an idle read timeout to OpenAI-prefix model streams without cutting off healthy long generations.

Confidence Score: 3/5

The OpenAI timeout fix is safe to merge, but the fd-limit fix is likely a no-op on macOS and should be revisited before relying on it.

The _raise_open_files_limit function silently does nothing on macOS because the 1_048_576 target exceeds the kernel's per-process cap, causing setrlimit to fail and the limit to remain at 256. This is the exact platform the PR targets for fd exhaustion, so half the stated fix does not take effect.

strix/interface/main.py — the _raise_open_files_limit fallback target and error-handling strategy deserve a closer look for macOS behavior.
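One way to check on a given machine whether the limit was actually raised is to print the limits and the current descriptor count before and after startup. This is a small diagnostic sketch, not code from the PR; the helper names are hypothetical, and the descriptor count is approximate because listing the directory briefly opens a descriptor itself.

```python
import os
import resource  # POSIX only; not available on Windows
from typing import Optional


def _format_limit(value: int) -> str:
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def describe_nofile_limit() -> str:
    """Return a one-line summary of the soft and hard RLIMIT_NOFILE values."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return f"RLIMIT_NOFILE soft={_format_limit(soft)} hard={_format_limit(hard)}"


def count_open_fds() -> Optional[int]:
    """Approximate the number of open descriptors, if the OS exposes them."""
    # /proc/self/fd exists on Linux; /dev/fd is the usual equivalent on macOS.
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        try:
            return len(os.listdir(fd_dir))
        except OSError:
            continue
    return None


if __name__ == "__main__":
    print(describe_nofile_limit())
    print("open fds:", count_open_fds())
```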

Important Files Changed

| Filename | Overview |
| --- | --- |
| strix/interface/main.py | Adds `_raise_open_files_limit()`, called at startup; the `1_048_576` fallback target exceeds macOS's per-process fd cap (`kern.maxfilesperproc` ≈ 10240), causing `setrlimit` to fail silently and leaving the limit unchanged on macOS, the primary failure platform. |
| strix/config/models.py | Adds `_configure_openai_client_timeout()` to install a bounded-timeout `AsyncOpenAI` client at startup; logic is sound for the OpenAI-prefix path, with a minor guard that skips the fix when `api_key` is absent. |

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
strix/interface/main.py:757-763
**macOS: `setrlimit` always fails, leaving the limit at 256**

When the hard limit is reported as `RLIM_INFINITY` (the common macOS case — `getrlimit(RLIMIT_NOFILE)` typically returns `(256, 9223372036854775807)` on Ventura/Sonoma), `target` becomes `1_048_576`. macOS however caps `RLIMIT_NOFILE` at `kern.maxfilesperproc` (typically 10240), so `setrlimit(RLIMIT_NOFILE, (1_048_576, RLIM_INFINITY))` raises `ValueError`, which is silently swallowed with a debug log, and the limit stays at 256. The fix does nothing on the primary platform it is meant to address. A progressive-fallback loop — trying e.g. `65536`, then `10240`, then `8192` — would get at least one value accepted by the kernel.

### Issue 2 of 2
strix/config/models.py:100
**Timeout fix skipped when `api_key` is absent**

The guard `if not llm.api_key` exits early, so the bounded-timeout client is never installed for configurations that rely solely on `OPENAI_API_KEY` being present in the process environment but not piped through pydantic-settings (e.g. set before the process starts, not in a `.env` file read by the settings layer). In practice `LlmSettings` does read `OPENAI_API_KEY` via `AliasChoices`, so the field will be populated for standard setups. But it's worth confirming that the env-var aliasing is always evaluated before this guard runs — if settings are constructed before the env var is exported (e.g. in certain test fixtures), the guard silently prevents the timeout fix from taking effect.

Reviews (1): Last reviewed commit: "fix: prevent long scans from hanging or ..."

greptile-apps · 2026-06-29T10:19:43Z

+    target = hard if hard != resource.RLIM_INFINITY else 1_048_576
+    if soft >= target:
+        return
+    try:
+        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
+    except (ValueError, OSError):
+        logger.debug("Could not raise RLIMIT_NOFILE from %s to %s", soft, target, exc_info=True)


macOS: setrlimit always fails, leaving the limit at 256

When the hard limit is reported as `RLIM_INFINITY` (the common macOS case: `getrlimit(RLIMIT_NOFILE)` typically returns `(256, 9223372036854775807)` on Ventura/Sonoma), `target` becomes `1_048_576`. macOS, however, caps `RLIMIT_NOFILE` at `kern.maxfilesperproc` (typically 10240), so `setrlimit(RLIMIT_NOFILE, (1_048_576, RLIM_INFINITY))` raises `ValueError`, which is silently swallowed with a debug log, and the limit stays at 256. The fix does nothing on the primary platform it is meant to address. A progressive-fallback loop that tries, e.g., `65536`, then `10240`, then `8192` would get at least one value accepted by the kernel.
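The progressive-fallback idea suggested in this comment could look roughly like the sketch below. It is an illustration under stated assumptions rather than the change the PR will adopt: the function name, the return value, and the candidate list (taken from the values mentioned in the comment) are all hypothetical, and the candidates are assumed to be ordered from highest to lowest.

```python
import logging
from typing import Optional, Sequence

try:
    import resource  # POSIX only
except ImportError:  # e.g. Windows, where the function becomes a no-op
    resource = None

logger = logging.getLogger(__name__)

# Illustrative candidates, highest first, as mentioned in the review comment.
FALLBACK_TARGETS = (65536, 10240, 8192)


def raise_open_files_limit(
    candidates: Sequence[int] = FALLBACK_TARGETS,
) -> Optional[int]:
    """Raise the soft RLIMIT_NOFILE to the highest candidate the kernel accepts.

    Returns the resulting soft limit, or None where limits are unavailable.
    """
    if resource is None:
        return None
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard != resource.RLIM_INFINITY:
        # A finite hard limit is the best candidate; never try to exceed it.
        candidates = (hard, *(c for c in candidates if c < hard))
    for target in candidates:
        if target <= soft:
            break  # remaining candidates would lower the limit
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            logger.debug("RLIMIT_NOFILE target %s rejected", target, exc_info=True)
            continue
        return target
    return soft
```

Trying candidates from highest to lowest keeps the behaviour best-effort: a rejected value is logged at debug level and the next one is attempted, so the soft limit ends up at the first value the kernel accepts or stays where it was.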

greptile-apps · 2026-06-29T10:19:44Z

+    The client carries the key/base_url because ``set_default_openai_client``
+    takes precedence over ``set_default_openai_key``.
+    """
+    if not llm.api_key or llm.timeout <= 0:


Timeout fix skipped when api_key is absent

The guard `if not llm.api_key` exits early, so the bounded-timeout client is never installed for configurations that rely solely on `OPENAI_API_KEY` being present in the process environment but not piped through pydantic-settings (e.g. set before the process starts, not in a `.env` file read by the settings layer). In practice `LlmSettings` does read `OPENAI_API_KEY` via `AliasChoices`, so the field will be populated for standard setups. But it's worth confirming that the env-var aliasing is always evaluated before this guard runs: if settings are constructed before the env var is exported (e.g. in certain test fixtures), the guard silently prevents the timeout fix from taking effect.
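One possible way to make the guard robust to that ordering is to fall back to the process environment at the moment the guard runs. The sketch below only shows that decision logic; `resolve_api_key` and `should_install_bounded_client` are hypothetical helpers, not functions in strix, and how the resolved key is passed to the client is left out.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    configured_key: Optional[str],
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer the settings value; otherwise read OPENAI_API_KEY at call time."""
    return configured_key or environ.get("OPENAI_API_KEY") or None


def should_install_bounded_client(
    configured_key: Optional[str],
    timeout: float,
    environ: Mapping[str, str] = os.environ,
) -> bool:
    """Mirror the guard: install only with a usable key and a positive timeout."""
    return resolve_api_key(configured_key, environ) is not None and timeout > 0
```

Because `os.environ` is a live mapping, a key exported after the settings object was built is still seen when the helper is called.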

0xfelixli wants to merge 1 commit into usestrix:main from 0xfelixli:fix/long-scan-stability
