Skip to content

Conversation

@Kwanghoon-Choi
Copy link

Problem (#343)

When running tests/benchmark/benchmark_store.py and pressing ctrl+c, every runner/algorithm logs an traceback for asyncio.CancelledError.
SIGINT (ctrl+c) ends up as asyncio.CancelledError, which is difficult to distinguish from internal cancellations.

Goal

Treat ctrl+c as a graceful shutdown (no traceback), while treating non-SIGINT asyncio.CancelledError as a failure with a traceback.

Fix

  • Added _run_with_sigint as an entry point for runner/algorithm coroutines.
    • Set a global flag (SIGINT_SEEN) from the SIGINT handler.
    • On receiving SIGINT, the handler calls main_task.cancel() to convert the signal into asyncio.CancelledError.
    • Classify asyncio.CancelledError based on SIGINT_SEEN.
      • If set, it is treated as an intentional shutdown (no traceback) and main_process raises KeyboardInterrupt.
      • If not, the original CancelledError is propagated as before.

Tests

  • test_run_with_sigint_child_runner_exits_cleanly_on_sigint
  • test_run_with_sigint_propagates_non_sigint_cancelled_error

The tests verify that _run_with_sigint() distinguishes between SIGINT-triggered cancellation and internal asyncio.CancelledError.

Reproduce

# Terminal 1
agl store --port 4747

# Terminal 2
python -m tests.benchmark.benchmark_store
# Press ctrl+c when AgentOps generates traces

Copilot AI review requested due to automatic review settings December 5, 2025 01:25
@Kwanghoon-Choi
Copy link
Author

@microsoft-github-policy-service agree

Copilot finished reviewing on behalf of Kwanghoon-Choi December 5, 2025 01:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes issue #343 where pressing ctrl+c during benchmark runs caused redundant asyncio.CancelledError tracebacks from every runner/algorithm. The fix introduces a new _run_with_sigint() method that distinguishes between SIGINT-triggered cancellations (which should be treated as graceful shutdowns) and internal asyncio.CancelledError exceptions (which should be treated as failures).

Key changes:

  • Added _run_with_sigint() method that sets up a SIGINT handler and uses a global flag SIGINT_SEEN to track whether cancellation was caused by SIGINT
  • Modified all asyncio.run() calls to use _run_with_sigint() instead, providing consistent SIGINT handling across algorithm and runner processes
  • Added exception handlers for asyncio.CancelledError in _execute_algorithm and _execute_runner to ensure stop events are set before propagating the error

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
agentlightning/execution/client_server.py Adds global SIGINT_SEEN flag, implements _run_with_sigint() method with SIGINT handler, adds CancelledError handlers to executor methods, and replaces all asyncio.run() calls with _run_with_sigint()
tests/execution/test_client_server.py Adds _cancel_in_runner() helper function and two new tests to verify SIGINT handling (graceful shutdown) and non-SIGINT CancelledError propagation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


SIGINT_SEEN: bool = False


Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global flag SIGINT_SEEN is not reset after use, which can cause incorrect behavior if _run_with_sigint is called multiple times in the same process. Once set to True, it will remain True for subsequent calls, causing non-SIGINT CancelledError exceptions to be incorrectly treated as SIGINT-triggered cancellations.

Consider using a context-local flag (e.g., passing a flag as a parameter or using a thread-local/task-local variable) or explicitly resetting the flag at the start of _run_with_sigint.

Suggested change
def _run_with_sigint(...):
global SIGINT_SEEN
SIGINT_SEEN = False
# ... rest of the function ...

Copilot uses AI. Check for mistakes.
try:
loop.run_until_complete(main_task)
except asyncio.CancelledError:
global SIGINT_SEEN
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global keyword is unnecessary here since you're only reading SIGINT_SEEN, not modifying it. The global declaration is only needed when assigning to a global variable within a function.

Remove the global SIGINT_SEEN declaration on line 256.

Suggested change
global SIGINT_SEEN

Copilot uses AI. Check for mistakes.
Comment on lines 232 to 241
make_coro = lambda: self._execute_runner(runner, worker_id, store, stop_evt)
else:
assert algorithm is not None
make_coro = lambda: self._execute_algorithm(algorithm, store, stop_evt)

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

# Explicitly create the main task so the SIGINT handler can do cancel()
main_task: asyncio.Task[None] = loop.create_task(make_coro())
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The lambda functions used here capture variables from the outer scope. Consider using a more explicit approach for better readability and to avoid potential closure-related issues:

if kind == "runner":
    assert runner is not None
    coro = self._execute_runner(runner, worker_id, store, stop_evt)
else:
    assert algorithm is not None
    coro = self._execute_algorithm(algorithm, store, stop_evt)

main_task: asyncio.Task[None] = loop.create_task(coro)
Suggested change
make_coro = lambda: self._execute_runner(runner, worker_id, store, stop_evt)
else:
assert algorithm is not None
make_coro = lambda: self._execute_algorithm(algorithm, store, stop_evt)
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
# Explicitly create the main task so the SIGINT handler can do cancel()
main_task: asyncio.Task[None] = loop.create_task(make_coro())
coro = self._execute_runner(runner, worker_id, store, stop_evt)
else:
assert algorithm is not None
coro = self._execute_algorithm(algorithm, store, stop_evt)
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
# Explicitly create the main task so the SIGINT handler can do cancel()
main_task: asyncio.Task[None] = loop.create_task(coro)

Copilot uses AI. Check for mistakes.
Comment on lines 1282 to 1286
assert p.pid is not None
os.kill(p.pid, signal.SIGINT)
p.join(timeout=5.0)
assert not p.is_alive()
assert p.exitcode == 0
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test may have a race condition. After sending SIGINT with os.kill(), there's no guarantee that the signal handler has completed setting the global flag before the process exits. Consider adding a small delay after sending the signal or checking that the process actually received and handled the signal.

Additionally, this test doesn't verify that SIGINT_SEEN was actually set in the child process, it only checks the exit code. If the global flag mechanism fails in a child process (e.g., in spawn mode), this test might still pass if the process exits cleanly for other reasons.

Copilot uses AI. Check for mistakes.
# the runner/algorithm receives asyncio.CancelledError.
main_task.cancel()

signal.signal(signal.SIGINT, _sigint_handler)
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signal handler doesn't restore the previous SIGINT handler after the function completes. If _run_with_sigint is called multiple times or if the caller expects to have their own SIGINT handler, this could cause issues.

Consider saving and restoring the previous handler:

previous_handler = signal.signal(signal.SIGINT, _sigint_handler)
try:
    # ... existing code ...
finally:
    signal.signal(signal.SIGINT, previous_handler)
    # ... existing cleanup code ...

Copilot uses AI. Check for mistakes.

import pytest

from agentlightning.execution import client_server as cs_mod
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'cs_mod' is not used.

Suggested change
from agentlightning.execution import client_server as cs_mod

Copilot uses AI. Check for mistakes.
assert algorithm is not None
make_coro = lambda: self._execute_algorithm(algorithm, store, stop_evt)

loop = asyncio.new_event_loop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to create a new event loop here? what will happen if we already have a event loop?

# the runner/algorithm receives asyncio.CancelledError.
main_task.cancel()

signal.signal(signal.SIGINT, _sigint_handler)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this registration is process wide. Let's be extra careful on that.

try: except KeyboardInterrupt might be a better idea if that works.

@ultmaster ultmaster linked an issue Dec 5, 2025 that may be closed by this pull request
@Kwanghoon-Choi
Copy link
Author

I removed the previous logic and shifted exception handling to the outer asyncio.run() wrapper in _spawn_runners & _spawn_algorithm_process.

When a SIGINT occurs, syncio.run() presents asyncio.CancelledError within the coroutine, but escalates it as a KeyboardInterrupt to the outer.

By raising CancelledError inside _execute_runner & _execute_algorithm, the outer handler can distinguish KeyboardInterrupt (graceful shutdown w/o traceback) from other crashes (w/ traceback).

Also added tests that verify if _spawn_runners distinguishes between SIGINT and asyncio.CancelledError.

@ultmaster
Copy link
Contributor

ultmaster commented Dec 5, 2025

@Kwanghoon-Choi have you tested what will happen if the algorithm crashes (in which case stop_evt is set)? will the runners emit a lot of logs and tracebacks?

@Kwanghoon-Choi
Copy link
Author

Since the runner handles the stop_evt gracefully, no runner tracebacks are emitted when the algorithm crashes and stop_evt is set.

Tested by sending ctrl+c to the terminal running agl store to cause the algorithm to crash.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Hitting KeyboardInterrupt produces too many error logs

2 participants