feat: implement retry logic for restartable stream errors in agent execution#613
feat: implement retry logic for restartable stream errors in agent execution#613TheArtificialQ wants to merge 1 commit into
Conversation
Greptile SummaryThis PR adds a bounded retry loop to
Confidence Score: 3/5The retry loop is structurally correct and well-tested, but catching all The strix/core/execution.py — specifically the Important Files Changed
Prompt To Fix All With AIFix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
strix/core/execution.py:43-44
**`BadRequestError` is likely not safely restartable**
`litellm.exceptions.BadRequestError` maps to HTTP 400, which covers genuinely permanent conditions such as context-window overflow, content-policy violations, and malformed payloads. Retrying these three times delays failure by seconds with no chance of recovery. A context-length-exceeded 400, for example, will fail on every retry with the same session history in place. `ModelBehaviorError` and `MidStreamFallbackError` both describe transient/recoverable states; `BadRequestError` does not share that property by default. If there is a specific litellm 400 sub-error that turned out to be transient in testing, it would be safer to match on a more specific condition (e.g., checking `exc.status_code` or `exc.message`) rather than catching the entire class.
### Issue 2 of 3
strix/core/execution.py:356
**`stream_restarts` counter is not reset after an image-strip recovery**
Both retry counters live at the same scope and neither is reset when the other path fires. After two stream restarts (`stream_restarts == 2`) followed by an image-strip recovery, only one stream-restart attempt remains before the exception propagates. Whether this asymmetry is intentional, a comment explaining the intended combined budget would prevent future confusion.
### Issue 3 of 3
strix/core/execution.py:400-420
**Session state is not cleared before retrying a `ModelBehaviorError`**
When `ModelBehaviorError` fires mid-stream (e.g., the model hallucinated a tool call), the agent framework may have already appended the invalid assistant turn to `session`. Restarting with the same session object means the retry runs with that corrupted context still in history, which can reproduce the exact same model behaviour on the next attempt. The `image_strips` path mitigates this by setting `input_data = []` so the full history comes from a sanitised session; a similar consideration may apply here depending on whether the session is mutated before the exception surfaces.
Reviews (1): Last reviewed commit: "feat: implement retry logic for restarta..." | Re-trigger Greptile |
| MidStreamFallbackError, | ||
| BadRequestError, |
There was a problem hiding this comment.
BadRequestError is likely not safely restartable
litellm.exceptions.BadRequestError maps to HTTP 400, which covers genuinely permanent conditions such as context-window overflow, content-policy violations, and malformed payloads. Retrying these three times delays failure by seconds with no chance of recovery. A context-length-exceeded 400, for example, will fail on every retry with the same session history in place. ModelBehaviorError and MidStreamFallbackError both describe transient/recoverable states; BadRequestError does not share that property by default. If there is a specific litellm 400 sub-error that turned out to be transient in testing, it would be safer to match on a more specific condition (e.g., checking exc.status_code or exc.message) rather than catching the entire class.
Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/core/execution.py
Line: 43-44
Comment:
**`BadRequestError` is likely not safely restartable**
`litellm.exceptions.BadRequestError` maps to HTTP 400, which covers genuinely permanent conditions such as context-window overflow, content-policy violations, and malformed payloads. Retrying these three times delays failure by seconds with no chance of recovery. A context-length-exceeded 400, for example, will fail on every retry with the same session history in place. `ModelBehaviorError` and `MidStreamFallbackError` both describe transient/recoverable states; `BadRequestError` does not share that property by default. If there is a specific litellm 400 sub-error that turned out to be transient in testing, it would be safer to match on a more specific condition (e.g., checking `exc.status_code` or `exc.message`) rather than catching the entire class.
How can I resolve this? If you propose a fix, please make it concise.| hooks: RunHooks[dict[str, Any]] | None, | ||
| ) -> RunResultBase | None: | ||
| image_strips = 0 | ||
| stream_restarts = 0 |
There was a problem hiding this comment.
stream_restarts counter is not reset after an image-strip recovery
Both retry counters live at the same scope and neither is reset when the other path fires. After two stream restarts (stream_restarts == 2) followed by an image-strip recovery, only one stream-restart attempt remains before the exception propagates. Whether this asymmetry is intentional, a comment explaining the intended combined budget would prevent future confusion.
Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/core/execution.py
Line: 356
Comment:
**`stream_restarts` counter is not reset after an image-strip recovery**
Both retry counters live at the same scope and neither is reset when the other path fires. After two stream restarts (`stream_restarts == 2`) followed by an image-strip recovery, only one stream-restart attempt remains before the exception propagates. Whether this asymmetry is intentional, a comment explaining the intended combined budget would prevent future confusion.
How can I resolve this? If you propose a fix, please make it concise.| except _STREAM_RESTARTABLE_EXCEPTIONS as exc: | ||
| if stream_restarts >= _STREAM_RESTART_LIMIT: | ||
| raise | ||
| stream_restarts += 1 | ||
| restart_stream = True | ||
| logger.warning( | ||
| "Restartable stream exception for %s", | ||
| agent_id, | ||
| exc_info=True, | ||
| ) | ||
| logger.warning( | ||
| "Restarting stream for %s after %s (%d/%d)", | ||
| agent_id, | ||
| type(exc).__name__, | ||
| stream_restarts, | ||
| _STREAM_RESTART_LIMIT, | ||
| ) | ||
| finally: | ||
| await coordinator.detach_stream(agent_id, stream) | ||
| if restart_stream: | ||
| continue |
There was a problem hiding this comment.
Session state is not cleared before retrying a
ModelBehaviorError
When ModelBehaviorError fires mid-stream (e.g., the model hallucinated a tool call), the agent framework may have already appended the invalid assistant turn to session. Restarting with the same session object means the retry runs with that corrupted context still in history, which can reproduce the exact same model behaviour on the next attempt. The image_strips path mitigates this by setting input_data = [] so the full history comes from a sanitised session; a similar consideration may apply here depending on whether the session is mutated before the exception surfaces.
Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/core/execution.py
Line: 400-420
Comment:
**Session state is not cleared before retrying a `ModelBehaviorError`**
When `ModelBehaviorError` fires mid-stream (e.g., the model hallucinated a tool call), the agent framework may have already appended the invalid assistant turn to `session`. Restarting with the same session object means the retry runs with that corrupted context still in history, which can reproduce the exact same model behaviour on the next attempt. The `image_strips` path mitigates this by setting `input_data = []` so the full history comes from a sanitised session; a similar consideration may apply here depending on whether the session is mutated before the exception surfaces.
How can I resolve this? If you propose a fix, please make it concise.
This PR addresses issue #580 by adding retry logic to the agent execution loop when specific types of exceptions are thrown in the
openai-agentspackage or any of its upstream packages.When one of these predefined exceptions is caught, it is logged and the loop is restarted. The maximum number of retries is capped at 3.
This is how it looks in the strix.log file:
The list of exception types on lines 42–44 is based on the results of my internal testing. I’m sure other exception types could be added as well, but the current list fixed all issues of this kind that I experienced.