feat: implement retry logic for restartable stream errors in agent execution by TheArtificialQ · Pull Request #613 · usestrix/strix

TheArtificialQ · 2026-07-01T10:41:41Z

This PR addresses issue #580 by adding retry logic to the agent execution loop when specific types of exceptions are thrown in the openai-agents package or any of its upstream packages.

When one of these predefined exceptions is caught, it is logged and the loop is restarted. The maximum number of retries is capped at 3.

This is how it looks in the strix.log file:

2026-06-26 14:40:03.126 WARNING 192-168-50-241_a517 - strix.core.execution: Restartable stream exception for e6b3c08b
Traceback (most recent call last):
  File "strix/core/execution.py", line 373, in _run_cycle
  File "agents/result.py", line 773, in stream_events
  File "agents/result.py", line 850, in _await_task_safely
  File "agents/result.py", line 608, in _await_run_and_cleanup
  File "agents/run_internal/run_loop.py", line 1014, in start_streaming
  File "agents/run_internal/run_loop.py", line 1635, in run_single_turn_streamed
  File "agents/run_internal/turn_resolution.py", line 1879, in get_single_step_result_from_response
  File "agents/run_internal/turn_resolution.py", line 1828, in process_model_response
agents.exceptions.ModelBehaviorError: Tool exec_command<|channel|>commentary not found in agent strix
2026-06-26 14:40:03.128 WARNING 192-168-50-241_a517 - strix.core.execution: Restarting stream for e6b3c08b after ModelBehaviorError (1/3)

The list of exception types on lines 42–44 is based on the results of my internal testing. I’m sure other exception types could be added as well, but the current list fixed all issues of this kind that I experienced.

…ecution

greptile-apps · 2026-07-01T10:44:53Z

Greptile Summary

This PR adds a bounded retry loop to _run_cycle that catches three specific exception types from the openai-agents/litellm stack and restarts the stream up to three times before propagating the error.

ModelBehaviorError and MidStreamFallbackError are clearly transient and make sense to retry; BadRequestError covers the full range of HTTP 400 responses (context overflow, content-policy rejections, malformed payloads) and will fail identically on every retry when the root cause is permanent.
Session state is not cleared between retries, so a mid-stream tool-hallucination error (ModelBehaviorError) may persist in session history and reproduce on the next attempt.
Tests are well-structured and cover both the successful-retry and limit-exceeded paths.

Confidence Score: 3/5

The retry loop is structurally correct and well-tested, but catching all BadRequestError exceptions may silently retry permanent API failures, and uncleared session state could cause ModelBehaviorError to recur on every attempt.

The BadRequestError catch is the sharpest concern: a context-length-exceeded or content-policy 400 will never recover, yet the code retries it three times. Combined with the session-state question for ModelBehaviorError, the two main retry cases each carry a scenario where the retry loop makes things worse rather than better.

strix/core/execution.py — specifically the _STREAM_RESTARTABLE_EXCEPTIONS tuple and the absence of session-state cleanup before each retry.

Important Files Changed

Filename	Overview
strix/core/execution.py	Adds stream restart logic with a counter + exception tuple; `BadRequestError` inclusion covers permanently-failing 400s, and session state is not reset before retrying.
tests/test_execution.py	Two new async tests cover the retry-success and retry-limit paths with a fake stream; assertions on emitted events, log messages, and stream state are thorough.

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
strix/core/execution.py:43-44
**`BadRequestError` is likely not safely restartable**

`litellm.exceptions.BadRequestError` maps to HTTP 400, which covers genuinely permanent conditions such as context-window overflow, content-policy violations, and malformed payloads. Retrying these three times delays failure by seconds with no chance of recovery. A context-length-exceeded 400, for example, will fail on every retry with the same session history in place. `ModelBehaviorError` and `MidStreamFallbackError` both describe transient/recoverable states; `BadRequestError` does not share that property by default. If there is a specific litellm 400 sub-error that turned out to be transient in testing, it would be safer to match on a more specific condition (e.g., checking `exc.status_code` or `exc.message`) rather than catching the entire class.

### Issue 2 of 3
strix/core/execution.py:356
**`stream_restarts` counter is not reset after an image-strip recovery**

Both retry counters live at the same scope and neither is reset when the other path fires. After two stream restarts (`stream_restarts == 2`) followed by an image-strip recovery, only one stream-restart attempt remains before the exception propagates. Whether this asymmetry is intentional, a comment explaining the intended combined budget would prevent future confusion.

### Issue 3 of 3
strix/core/execution.py:400-420
**Session state is not cleared before retrying a `ModelBehaviorError`**

When `ModelBehaviorError` fires mid-stream (e.g., the model hallucinated a tool call), the agent framework may have already appended the invalid assistant turn to `session`. Restarting with the same session object means the retry runs with that corrupted context still in history, which can reproduce the exact same model behaviour on the next attempt. The `image_strips` path mitigates this by setting `input_data = []` so the full history comes from a sanitised session; a similar consideration may apply here depending on whether the session is mutated before the exception surfaces.

_{Reviews (1): Last reviewed commit: "feat: implement retry logic for restarta..." | Re-trigger Greptile}

greptile-apps · 2026-07-01T10:44:57Z

+    MidStreamFallbackError,
+    BadRequestError,


BadRequestError is likely not safely restartable

litellm.exceptions.BadRequestError maps to HTTP 400, which covers genuinely permanent conditions such as context-window overflow, content-policy violations, and malformed payloads. Retrying these three times delays failure by seconds with no chance of recovery. A context-length-exceeded 400, for example, will fail on every retry with the same session history in place. ModelBehaviorError and MidStreamFallbackError both describe transient/recoverable states; BadRequestError does not share that property by default. If there is a specific litellm 400 sub-error that turned out to be transient in testing, it would be safer to match on a more specific condition (e.g., checking exc.status_code or exc.message) rather than catching the entire class.

Prompt To Fix With AI

This is a comment left during a code review. Path: strix/core/execution.py Line: 43-44 Comment: **`BadRequestError` is likely not safely restartable** `litellm.exceptions.BadRequestError` maps to HTTP 400, which covers genuinely permanent conditions such as context-window overflow, content-policy violations, and malformed payloads. Retrying these three times delays failure by seconds with no chance of recovery. A context-length-exceeded 400, for example, will fail on every retry with the same session history in place. `ModelBehaviorError` and `MidStreamFallbackError` both describe transient/recoverable states; `BadRequestError` does not share that property by default. If there is a specific litellm 400 sub-error that turned out to be transient in testing, it would be safer to match on a more specific condition (e.g., checking `exc.status_code` or `exc.message`) rather than catching the entire class. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-07-01T10:44:58Z

    hooks: RunHooks[dict[str, Any]] | None,
 ) -> RunResultBase | None:
    image_strips = 0
+    stream_restarts = 0


stream_restarts counter is not reset after an image-strip recovery

Both retry counters live at the same scope and neither is reset when the other path fires. After two stream restarts (stream_restarts == 2) followed by an image-strip recovery, only one stream-restart attempt remains before the exception propagates. Whether this asymmetry is intentional, a comment explaining the intended combined budget would prevent future confusion.

Prompt To Fix With AI

This is a comment left during a code review. Path: strix/core/execution.py Line: 356 Comment: **`stream_restarts` counter is not reset after an image-strip recovery** Both retry counters live at the same scope and neither is reset when the other path fires. After two stream restarts (`stream_restarts == 2`) followed by an image-strip recovery, only one stream-restart attempt remains before the exception propagates. Whether this asymmetry is intentional, a comment explaining the intended combined budget would prevent future confusion. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-07-01T10:44:59Z

+                except _STREAM_RESTARTABLE_EXCEPTIONS as exc:
+                    if stream_restarts >= _STREAM_RESTART_LIMIT:
+                        raise
+                    stream_restarts += 1
+                    restart_stream = True
+                    logger.warning(
+                        "Restartable stream exception for %s",
+                        agent_id,
+                        exc_info=True,
+                    )
+                    logger.warning(
+                        "Restarting stream for %s after %s (%d/%d)",
+                        agent_id,
+                        type(exc).__name__,
+                        stream_restarts,
+                        _STREAM_RESTART_LIMIT,
+                    )
            finally:
                await coordinator.detach_stream(agent_id, stream)
+            if restart_stream:
+                continue


Session state is not cleared before retrying a ModelBehaviorError

When ModelBehaviorError fires mid-stream (e.g., the model hallucinated a tool call), the agent framework may have already appended the invalid assistant turn to session. Restarting with the same session object means the retry runs with that corrupted context still in history, which can reproduce the exact same model behaviour on the next attempt. The image_strips path mitigates this by setting input_data = [] so the full history comes from a sanitised session; a similar consideration may apply here depending on whether the session is mutated before the exception surfaces.

Prompt To Fix With AI

This is a comment left during a code review. Path: strix/core/execution.py Line: 400-420 Comment: **Session state is not cleared before retrying a `ModelBehaviorError`** When `ModelBehaviorError` fires mid-stream (e.g., the model hallucinated a tool call), the agent framework may have already appended the invalid assistant turn to `session`. Restarting with the same session object means the retry runs with that corrupted context still in history, which can reproduce the exact same model behaviour on the next attempt. The `image_strips` path mitigates this by setting `input_data = []` so the full history comes from a sanitised session; a similar consideration may apply here depending on whether the session is mutated before the exception surfaces. How can I resolve this? If you propose a fix, please make it concise.

feat: implement retry logic for restartable stream errors in agent ex…

cea1e8a

…ecution

greptile-apps Bot reviewed Jul 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: implement retry logic for restartable stream errors in agent execution#613

feat: implement retry logic for restartable stream errors in agent execution#613
TheArtificialQ wants to merge 1 commit into
usestrix:mainfrom
TheArtificialQ:fix/retry-logic

TheArtificialQ commented Jul 1, 2026

Uh oh!

greptile-apps Bot commented Jul 1, 2026

Uh oh!

greptile-apps Bot Jul 1, 2026

Uh oh!

greptile-apps Bot Jul 1, 2026

Uh oh!

greptile-apps Bot Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

TheArtificialQ commented Jul 1, 2026

Uh oh!

greptile-apps Bot commented Jul 1, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Uh oh!

greptile-apps Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant