Skip to content

feat: implement retry logic for restartable stream errors in agent execution#613

Open
TheArtificialQ wants to merge 1 commit into
usestrix:mainfrom
TheArtificialQ:fix/retry-logic
Open

feat: implement retry logic for restartable stream errors in agent execution#613
TheArtificialQ wants to merge 1 commit into
usestrix:mainfrom
TheArtificialQ:fix/retry-logic

Conversation

@TheArtificialQ

Copy link
Copy Markdown

This PR addresses issue #580 by adding retry logic to the agent execution loop when specific types of exceptions are thrown in the openai-agents package or any of its upstream packages.

When one of these predefined exceptions is caught, it is logged and the loop is restarted. The maximum number of retries is capped at 3.

This is how it looks in the strix.log file:

2026-06-26 14:40:03.126 WARNING 192-168-50-241_a517 - strix.core.execution: Restartable stream exception for e6b3c08b
Traceback (most recent call last):
  File "strix/core/execution.py", line 373, in _run_cycle
  File "agents/result.py", line 773, in stream_events
  File "agents/result.py", line 850, in _await_task_safely
  File "agents/result.py", line 608, in _await_run_and_cleanup
  File "agents/run_internal/run_loop.py", line 1014, in start_streaming
  File "agents/run_internal/run_loop.py", line 1635, in run_single_turn_streamed
  File "agents/run_internal/turn_resolution.py", line 1879, in get_single_step_result_from_response
  File "agents/run_internal/turn_resolution.py", line 1828, in process_model_response
agents.exceptions.ModelBehaviorError: Tool exec_command<|channel|>commentary not found in agent strix
2026-06-26 14:40:03.128 WARNING 192-168-50-241_a517 - strix.core.execution: Restarting stream for e6b3c08b after ModelBehaviorError (1/3)

The list of exception types on lines 42–44 is based on the results of my internal testing. I’m sure other exception types could be added as well, but the current list fixed all issues of this kind that I experienced.

@greptile-apps

greptile-apps Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a bounded retry loop to _run_cycle that catches three specific exception types from the openai-agents/litellm stack and restarts the stream up to three times before propagating the error.

  • ModelBehaviorError and MidStreamFallbackError are clearly transient and make sense to retry; BadRequestError covers the full range of HTTP 400 responses (context overflow, content-policy rejections, malformed payloads) and will fail identically on every retry when the root cause is permanent.
  • Session state is not cleared between retries, so a mid-stream tool-hallucination error (ModelBehaviorError) may persist in session history and reproduce on the next attempt.
  • Tests are well-structured and cover both the successful-retry and limit-exceeded paths.

Confidence Score: 3/5

The retry loop is structurally correct and well-tested, but catching all BadRequestError exceptions may silently retry permanent API failures, and uncleared session state could cause ModelBehaviorError to recur on every attempt.

The BadRequestError catch is the sharpest concern: a context-length-exceeded or content-policy 400 will never recover, yet the code retries it three times. Combined with the session-state question for ModelBehaviorError, the two main retry cases each carry a scenario where the retry loop makes things worse rather than better.

strix/core/execution.py — specifically the _STREAM_RESTARTABLE_EXCEPTIONS tuple and the absence of session-state cleanup before each retry.

Important Files Changed

Filename Overview
strix/core/execution.py Adds stream restart logic with a counter + exception tuple; BadRequestError inclusion covers permanently-failing 400s, and session state is not reset before retrying.
tests/test_execution.py Two new async tests cover the retry-success and retry-limit paths with a fake stream; assertions on emitted events, log messages, and stream state are thorough.
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
strix/core/execution.py:43-44
**`BadRequestError` is likely not safely restartable**

`litellm.exceptions.BadRequestError` maps to HTTP 400, which covers genuinely permanent conditions such as context-window overflow, content-policy violations, and malformed payloads. Retrying these three times delays failure by seconds with no chance of recovery. A context-length-exceeded 400, for example, will fail on every retry with the same session history in place. `ModelBehaviorError` and `MidStreamFallbackError` both describe transient/recoverable states; `BadRequestError` does not share that property by default. If there is a specific litellm 400 sub-error that turned out to be transient in testing, it would be safer to match on a more specific condition (e.g., checking `exc.status_code` or `exc.message`) rather than catching the entire class.

### Issue 2 of 3
strix/core/execution.py:356
**`stream_restarts` counter is not reset after an image-strip recovery**

Both retry counters live at the same scope and neither is reset when the other path fires. After two stream restarts (`stream_restarts == 2`) followed by an image-strip recovery, only one stream-restart attempt remains before the exception propagates. Whether this asymmetry is intentional, a comment explaining the intended combined budget would prevent future confusion.

### Issue 3 of 3
strix/core/execution.py:400-420
**Session state is not cleared before retrying a `ModelBehaviorError`**

When `ModelBehaviorError` fires mid-stream (e.g., the model hallucinated a tool call), the agent framework may have already appended the invalid assistant turn to `session`. Restarting with the same session object means the retry runs with that corrupted context still in history, which can reproduce the exact same model behaviour on the next attempt. The `image_strips` path mitigates this by setting `input_data = []` so the full history comes from a sanitised session; a similar consideration may apply here depending on whether the session is mutated before the exception surfaces.

Reviews (1): Last reviewed commit: "feat: implement retry logic for restarta..." | Re-trigger Greptile

Comment thread strix/core/execution.py
Comment on lines +43 to +44
MidStreamFallbackError,
BadRequestError,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 BadRequestError is likely not safely restartable

litellm.exceptions.BadRequestError maps to HTTP 400, which covers genuinely permanent conditions such as context-window overflow, content-policy violations, and malformed payloads. Retrying these three times delays failure by seconds with no chance of recovery. A context-length-exceeded 400, for example, will fail on every retry with the same session history in place. ModelBehaviorError and MidStreamFallbackError both describe transient/recoverable states; BadRequestError does not share that property by default. If there is a specific litellm 400 sub-error that turned out to be transient in testing, it would be safer to match on a more specific condition (e.g., checking exc.status_code or exc.message) rather than catching the entire class.

Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/core/execution.py
Line: 43-44

Comment:
**`BadRequestError` is likely not safely restartable**

`litellm.exceptions.BadRequestError` maps to HTTP 400, which covers genuinely permanent conditions such as context-window overflow, content-policy violations, and malformed payloads. Retrying these three times delays failure by seconds with no chance of recovery. A context-length-exceeded 400, for example, will fail on every retry with the same session history in place. `ModelBehaviorError` and `MidStreamFallbackError` both describe transient/recoverable states; `BadRequestError` does not share that property by default. If there is a specific litellm 400 sub-error that turned out to be transient in testing, it would be safer to match on a more specific condition (e.g., checking `exc.status_code` or `exc.message`) rather than catching the entire class.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread strix/core/execution.py
hooks: RunHooks[dict[str, Any]] | None,
) -> RunResultBase | None:
image_strips = 0
stream_restarts = 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 stream_restarts counter is not reset after an image-strip recovery

Both retry counters live at the same scope and neither is reset when the other path fires. After two stream restarts (stream_restarts == 2) followed by an image-strip recovery, only one stream-restart attempt remains before the exception propagates. Whether this asymmetry is intentional, a comment explaining the intended combined budget would prevent future confusion.

Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/core/execution.py
Line: 356

Comment:
**`stream_restarts` counter is not reset after an image-strip recovery**

Both retry counters live at the same scope and neither is reset when the other path fires. After two stream restarts (`stream_restarts == 2`) followed by an image-strip recovery, only one stream-restart attempt remains before the exception propagates. Whether this asymmetry is intentional, a comment explaining the intended combined budget would prevent future confusion.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread strix/core/execution.py
Comment on lines +400 to +420
except _STREAM_RESTARTABLE_EXCEPTIONS as exc:
if stream_restarts >= _STREAM_RESTART_LIMIT:
raise
stream_restarts += 1
restart_stream = True
logger.warning(
"Restartable stream exception for %s",
agent_id,
exc_info=True,
)
logger.warning(
"Restarting stream for %s after %s (%d/%d)",
agent_id,
type(exc).__name__,
stream_restarts,
_STREAM_RESTART_LIMIT,
)
finally:
await coordinator.detach_stream(agent_id, stream)
if restart_stream:
continue

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Session state is not cleared before retrying a ModelBehaviorError

When ModelBehaviorError fires mid-stream (e.g., the model hallucinated a tool call), the agent framework may have already appended the invalid assistant turn to session. Restarting with the same session object means the retry runs with that corrupted context still in history, which can reproduce the exact same model behaviour on the next attempt. The image_strips path mitigates this by setting input_data = [] so the full history comes from a sanitised session; a similar consideration may apply here depending on whether the session is mutated before the exception surfaces.

Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/core/execution.py
Line: 400-420

Comment:
**Session state is not cleared before retrying a `ModelBehaviorError`**

When `ModelBehaviorError` fires mid-stream (e.g., the model hallucinated a tool call), the agent framework may have already appended the invalid assistant turn to `session`. Restarting with the same session object means the retry runs with that corrupted context still in history, which can reproduce the exact same model behaviour on the next attempt. The `image_strips` path mitigates this by setting `input_data = []` so the full history comes from a sanitised session; a similar consideration may apply here depending on whether the session is mutated before the exception surfaces.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant