Fix redundant cancel tracebacks on ctrl+c (issue #343) #370

Kwanghoon-Choi · 2025-12-05T01:25:51Z

Problem (#343)

When running tests/benchmark/benchmark_store.py and pressing ctrl+c, every runner/algorithm logs a traceback for asyncio.CancelledError.
SIGINT (ctrl+c) ends up as asyncio.CancelledError, which is difficult to distinguish from internal cancellations.

Goal

Treat ctrl+c as a graceful shutdown (no traceback), while treating non-SIGINT asyncio.CancelledError as a failure with a traceback.

Fix

Added _run_with_sigint as an entry point for runner/algorithm coroutines.
- Set a global flag (SIGINT_SEEN) from the SIGINT handler.
- On receiving SIGINT, the handler calls main_task.cancel() to convert the signal into asyncio.CancelledError.
- Classify asyncio.CancelledError based on SIGINT_SEEN.
  - If set, it is treated as an intentional shutdown (no traceback) and main_process raises KeyboardInterrupt.
  - If not, the original CancelledError is propagated as before.

Tests

test_run_with_sigint_child_runner_exits_cleanly_on_sigint
test_run_with_sigint_propagates_non_sigint_cancelled_error

The tests verify that _run_with_sigint() distinguishes between SIGINT-triggered cancellation and internal asyncio.CancelledError.

Reproduce

# Terminal 1
agl store --port 4747

# Terminal 2
python -m tests.benchmark.benchmark_store
# Press ctrl+c when AgentOps generates traces

Kwanghoon-Choi · 2025-12-05T01:26:39Z

@microsoft-github-policy-service agree

Copilot

Pull request overview

This PR fixes issue #343 where pressing ctrl+c during benchmark runs caused redundant asyncio.CancelledError tracebacks from every runner/algorithm. The fix introduces a new _run_with_sigint() method that distinguishes between SIGINT-triggered cancellations (which should be treated as graceful shutdowns) and internal asyncio.CancelledError exceptions (which should be treated as failures).

Key changes:

- Added _run_with_sigint() method that sets up a SIGINT handler and uses a global flag SIGINT_SEEN to track whether cancellation was caused by SIGINT
- Modified all asyncio.run() calls to use _run_with_sigint() instead, providing consistent SIGINT handling across algorithm and runner processes
- Added exception handlers for asyncio.CancelledError in _execute_algorithm and _execute_runner to ensure stop events are set before propagating the error

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
agentlightning/execution/client_server.py	Adds global `SIGINT_SEEN` flag, implements `_run_with_sigint()` method with SIGINT handler, adds `CancelledError` handlers to executor methods, and replaces all `asyncio.run()` calls with `_run_with_sigint()`
tests/execution/test_client_server.py	Adds `_cancel_in_runner()` helper function and two new tests to verify SIGINT handling (graceful shutdown) and non-SIGINT `CancelledError` propagation

Copilot · 2025-12-05T01:40:02Z

agentlightning/execution/client_server.py


+SIGINT_SEEN: bool = False
+



The global flag SIGINT_SEEN is not reset after use, which can cause incorrect behavior if _run_with_sigint is called multiple times in the same process. Once set to True, it will remain True for subsequent calls, causing non-SIGINT CancelledError exceptions to be incorrectly treated as SIGINT-triggered cancellations.

Consider using a context-local flag (e.g., passing a flag as a parameter or using a thread-local/task-local variable) or explicitly resetting the flag at the start of _run_with_sigint.

Suggested change

+def _run_with_sigint(...):
+    global SIGINT_SEEN
+    SIGINT_SEEN = False
+    # ... rest of the function ...

Copilot · 2025-12-05T01:40:02Z

agentlightning/execution/client_server.py

+        try:
+            loop.run_until_complete(main_task)
+        except asyncio.CancelledError:
+            global SIGINT_SEEN


The global keyword is unnecessary here since you're only reading SIGINT_SEEN, not modifying it. The global declaration is only needed when assigning to a global variable within a function.

Remove the global SIGINT_SEEN declaration on line 256.

Suggested change

-            global SIGINT_SEEN

Copilot · 2025-12-05T01:40:02Z

agentlightning/execution/client_server.py

+            make_coro = lambda: self._execute_runner(runner, worker_id, store, stop_evt)
+        else:
+            assert algorithm is not None
+            make_coro = lambda: self._execute_algorithm(algorithm, store, stop_evt)
+
+        loop = asyncio.new_event_loop()
+        asyncio.set_event_loop(loop)
+
+        # Explicitly create the main task so the SIGINT handler can do cancel()
+        main_task: asyncio.Task[None] = loop.create_task(make_coro())


[nitpick] The lambda functions used here capture variables from the outer scope. Consider using a more explicit approach for better readability and to avoid potential closure-related issues:

    if kind == "runner":
        assert runner is not None
        coro = self._execute_runner(runner, worker_id, store, stop_evt)
    else:
        assert algorithm is not None
        coro = self._execute_algorithm(algorithm, store, stop_evt)
    main_task: asyncio.Task[None] = loop.create_task(coro)

Suggested change

-            make_coro = lambda: self._execute_runner(runner, worker_id, store, stop_evt)
-        else:
-            assert algorithm is not None
-            make_coro = lambda: self._execute_algorithm(algorithm, store, stop_evt)
-
-        loop = asyncio.new_event_loop()
-        asyncio.set_event_loop(loop)
-
-        # Explicitly create the main task so the SIGINT handler can do cancel()
-        main_task: asyncio.Task[None] = loop.create_task(make_coro())
+            coro = self._execute_runner(runner, worker_id, store, stop_evt)
+        else:
+            assert algorithm is not None
+            coro = self._execute_algorithm(algorithm, store, stop_evt)
+
+        loop = asyncio.new_event_loop()
+        asyncio.set_event_loop(loop)
+
+        # Explicitly create the main task so the SIGINT handler can do cancel()
+        main_task: asyncio.Task[None] = loop.create_task(coro)

Copilot · 2025-12-05T01:40:03Z

tests/execution/test_client_server.py

+        assert p.pid is not None
+        os.kill(p.pid, signal.SIGINT)
+        p.join(timeout=5.0)
+        assert not p.is_alive()
+        assert p.exitcode == 0


The test may have a race condition. After sending SIGINT with os.kill(), there's no guarantee that the signal handler has completed setting the global flag before the process exits. Consider adding a small delay after sending the signal or checking that the process actually received and handled the signal.

Additionally, this test doesn't verify that SIGINT_SEEN was actually set in the child process, it only checks the exit code. If the global flag mechanism fails in a child process (e.g., in spawn mode), this test might still pass if the process exits cleanly for other reasons.

Copilot · 2025-12-05T01:40:03Z

agentlightning/execution/client_server.py

+            # the runner/algorithm receives asyncio.CancelledError.
+            main_task.cancel()
+
+        signal.signal(signal.SIGINT, _sigint_handler)


The signal handler doesn't restore the previous SIGINT handler after the function completes. If _run_with_sigint is called multiple times or if the caller expects to have their own SIGINT handler, this could cause issues.

Consider saving and restoring the previous handler:

    previous_handler = signal.signal(signal.SIGINT, _sigint_handler)
    try:
        # ... existing code ...
    finally:
        signal.signal(signal.SIGINT, previous_handler)
        # ... existing cleanup code ...
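
One way to package the save-and-restore pattern above is a small context manager. This is only a sketch: temporary_sigint_handler is a hypothetical helper name, not something in the PR or in agentlightning.

```python
import contextlib
import signal
from typing import Any, Callable, Iterator


@contextlib.contextmanager
def temporary_sigint_handler(handler: Callable[[int, Any], None]) -> Iterator[None]:
    """Install handler for SIGINT and restore the previous handler on exit."""
    # signal.signal() returns the handler that was installed before this call.
    previous = signal.signal(signal.SIGINT, handler)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, so the caller's handler is never lost.
        signal.signal(signal.SIGINT, previous)
```

Because the registration is process-wide, restoring in a finally block keeps the change scoped to the body of the with statement. Like signal.signal() itself, this must be used from the main thread.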

Copilot · 2025-12-05T01:40:03Z

tests/execution/test_client_server.py


 import pytest

+from agentlightning.execution import client_server as cs_mod


Import of 'cs_mod' is not used.

Suggested change

-from agentlightning.execution import client_server as cs_mod

ultmaster · 2025-12-05T02:11:25Z

agentlightning/execution/client_server.py

+            assert algorithm is not None
+            make_coro = lambda: self._execute_algorithm(algorithm, store, stop_evt)
+
+        loop = asyncio.new_event_loop()


Why do we need to create a new event loop here? What will happen if we already have an event loop?

ultmaster · 2025-12-05T02:12:31Z

agentlightning/execution/client_server.py

+            # the runner/algorithm receives asyncio.CancelledError.
+            main_task.cancel()
+
+        signal.signal(signal.SIGINT, _sigint_handler)


I think this registration is process wide. Let's be extra careful on that.

A try/except KeyboardInterrupt might be a better idea, if that works.

Kwanghoon-Choi · 2025-12-05T13:02:50Z

I removed the previous logic and shifted exception handling to the outer asyncio.run() wrapper in _spawn_runners & _spawn_algorithm_process.

When a SIGINT occurs, asyncio.run() raises asyncio.CancelledError inside the coroutine, but surfaces it to the outer caller as a KeyboardInterrupt.

By raising CancelledError inside _execute_runner & _execute_algorithm, the outer handler can distinguish KeyboardInterrupt (graceful shutdown w/o traceback) from other crashes (w/ traceback).

Also added tests that verify if _spawn_runners distinguishes between SIGINT and asyncio.CancelledError.
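
A rough sketch of that outer wrapper, for illustration only. run_entrypoint, its coro_factory parameter, and the returned status strings are hypothetical names, not the actual _spawn_runners code. It relies on asyncio.run() turning SIGINT into a task cancellation followed by KeyboardInterrupt, which it does on Python 3.11+ when the default SIGINT handler is installed.

```python
import asyncio
import logging
from typing import Any, Callable, Coroutine

logger = logging.getLogger(__name__)


def run_entrypoint(coro_factory: Callable[[], Coroutine[Any, Any, None]]) -> str:
    """Run a coroutine with asyncio.run() and classify how it ended."""
    try:
        asyncio.run(coro_factory())
    except KeyboardInterrupt:
        # ctrl+c: graceful shutdown, one short message and no traceback.
        logger.warning("Interrupted by SIGINT; shutting down.")
        return "interrupted"
    except asyncio.CancelledError:
        # Cancellation that did not come from SIGINT: treat as a failure.
        logger.exception("Cancelled unexpectedly.")
        return "cancelled"
    except Exception:
        logger.exception("Crashed.")
        return "crashed"
    return "completed"
```

Since the coroutine itself only ever sees CancelledError, it can run its cleanup (such as setting the stop event) and re-raise, leaving the decision about tracebacks to this single outer handler.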

ultmaster · 2025-12-05T13:37:12Z

@Kwanghoon-Choi Have you tested what happens if the algorithm crashes (in which case stop_evt is set)? Will the runners emit a lot of logs and tracebacks?

Kwanghoon-Choi · 2025-12-06T07:33:05Z

Since the runner handles the stop_evt gracefully, no runner tracebacks are emitted when the algorithm crashes and stop_evt is set.

Tested by sending ctrl+c to the terminal running agl store to cause the algorithm to crash.

Kwanghoon-Choi added 4 commits December 5, 2025 02:11

add _run_with_sigint and SIGINT handling

03f08b4

route execute() through _run_with_sigint

2477c4b

add asyncio.CancelledError to runner/algorithm

5706da6

add _run_with_sigint tests

26cc501

Copilot AI review requested due to automatic review settings December 5, 2025 01:25

Copilot started reviewing on behalf of Kwanghoon-Choi December 5, 2025 01:26

Copilot finished reviewing on behalf of Kwanghoon-Choi December 5, 2025 01:39

Copilot AI reviewed Dec 5, 2025

ultmaster reviewed Dec 5, 2025

ultmaster linked an issue Dec 5, 2025 that may be closed by this pull request

Hitting KeyboardInterrupt produces too many error logs #343

Open

Kwanghoon-Choi added 2 commits December 5, 2025 21:27

exception handling using asyncio.run()

2bf90cd

add SIGINT/CancelledError handling tests

c39a84d
