Fix race condition causing stale pids in syn lookup by chrismccord · Pull Request #87 · ostinelli/syn

chrismccord · 2025-11-19T14:13:24Z

sync_register/sync_join messages sent by multicast_loop can arrive before the ack_sync sent by the gen_server: they come from different sender processes, so there is no ordering guarantee between them. When this happened, the broadcast was dropped because the remote node wasn't in nodes_map yet. The ack_sync that arrived just afterwards carried a snapshot that lacked the raced registrations, so the local data stayed stale.
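
The race can be reproduced in a small toy model. The sketch below is not syn's code: the `Scope` class, the tuple message shapes, and the node and pid names are assumptions made for illustration. It only shows that a receiver which drops broadcasts from undiscovered nodes ends up with different state depending on arrival order.

```python
# Toy model of the receiving scope process; NOT syn's actual implementation.
class Scope:
    def __init__(self):
        self.nodes_map = {}  # node -> remote scope pid (hypothetical shape)
        self.registry = {}   # name -> (pid, node)

    def handle(self, msg):
        kind = msg[0]
        if kind == "ack_sync":
            _, node, scope_pid, snapshot = msg
            self.nodes_map[node] = scope_pid
            for name, pid in snapshot.items():
                self.registry[name] = (pid, node)
        elif kind == "sync_register":
            _, node, name, pid = msg
            if node not in self.nodes_map:
                return  # dropped: the sending node is not discovered yet
            self.registry[name] = (pid, node)


def run(messages):
    scope = Scope()
    for msg in messages:
        scope.handle(msg)
    return scope.registry


# The snapshot was taken before "worker" registered, so it lacks that name.
ack = ("ack_sync", "a@host", "scope_pid_a", {})
reg = ("sync_register", "a@host", "worker", "pid_1")

in_order = run([ack, reg])  # ack_sync first: the registration is applied
raced = run([reg, ack])     # broadcast first: the registration is lost
```

In this model `in_order` contains the `worker` entry while `raced` is empty, which is the lost update described above.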

Fix: Include RemoteScopePid in broadcasts to allow inline discovery when sync arrives before ack_sync. Old message format still supported for rolling upgrades.
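
A rough sketch of how a receiver might accept both message shapes during a rolling upgrade follows. The tuple layouts and the `handle_sync_register` helper are hypothetical and are not taken from the syn source; only the idea of carrying the remote scope pid in the broadcast comes from the description above.

```python
def handle_sync_register(nodes_map, registry, msg):
    """Apply a registration broadcast in either the old or the new shape."""
    if len(msg) == 5:
        # New shape (assumed): the sender's scope pid travels with the broadcast,
        # so an unknown node can be recorded on the spot (inline discovery).
        _, node, scope_pid, name, pid = msg
        nodes_map.setdefault(node, scope_pid)
    else:
        # Old shape (assumed): no scope pid, so unknown senders are still dropped.
        _, node, name, pid = msg
        if node not in nodes_map:
            return False
    registry[name] = (pid, node)
    return True


nodes_map, registry = {}, {}
new_applied = handle_sync_register(
    nodes_map, registry, ("sync_register", "a@host", "scope_pid_a", "worker", "pid_1")
)
old_applied = handle_sync_register(
    {}, {}, ("sync_register", "a@host", "worker", "pid_1")
)
```

Here the new-format message is applied even though the node was unknown, while the old-format message from an unknown node is still dropped, so nodes that have not been upgraded keep their previous behaviour.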

Note: I wasn't able to run the multinode tests on any of OTP 25, 26, or 28. ct_slave was failing to connect nodes for whatever reason.

The alternative to including the scope pid in all broadcasts would be to buffer the broadcasts received from nodes whose ack_sync we are still awaiting, then "replay" them. That seemed like a more complex change, and it would require cleanup/sweeping to avoid an unbounded buffer if a node failed during the discover/ack handshake. Thanks!
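
For comparison, the buffering alternative might look roughly like the sketch below. `PendingBuffer`, its method names, and the per-node cap are all hypothetical. The sketch mainly illustrates the extra bookkeeping (a bound plus explicit cleanup) that this option would need.

```python
from collections import deque


class PendingBuffer:
    """Holds broadcasts from nodes whose ack_sync has not arrived yet."""

    def __init__(self, max_per_node=1000):
        self.max_per_node = max_per_node
        self.pending = {}  # node -> deque of buffered messages

    def hold(self, node, msg):
        # A bounded deque silently discards the oldest message once full,
        # which keeps memory bounded but can itself lose updates.
        queue = self.pending.setdefault(node, deque(maxlen=self.max_per_node))
        queue.append(msg)

    def replay(self, node, apply):
        # Intended to run right after ack_sync for `node` has been applied.
        for msg in self.pending.pop(node, ()):
            apply(msg)

    def discard(self, node):
        # Cleanup for a handshake that never completes, e.g. on nodedown.
        self.pending.pop(node, None)


buf = PendingBuffer(max_per_node=2)
for i in range(3):
    buf.hold("a@host", ("sync_register", f"name_{i}"))

replayed = []
buf.replay("a@host", replayed.append)

buf.hold("b@host", ("sync_register", "x"))
buf.discard("b@host")
```

With a cap of 2, the oldest of the three buffered messages is gone by the time of the replay, which shows why choosing a bound (or a sweep policy) is part of the complexity mentioned above.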

sync_register/sync_join messages from multicast_loop can arrive before ack_sync from gen_server since they're different senders (no ordering guarantee). When this happens, the message was dropped because the remote node wasn't in nodes_map yet, leaving stale data from ack_sync.

Fix: Include RemoteScopePid in broadcasts to allow inline discovery when sync arrives before ack_sync. Old message format still supported for rolling upgrades.

The previous fix attempted to handle sync_register arriving before ack_sync by including RemoteScopePid for inline discovery. However, the same race exists for sync_unregister vs ack_sync:

1. Node A sends ack_sync (direct from gen_server)
2. Process dies on Node A, broadcasts sync_unregister (via multicast_loop)
3. At Node B: sync_unregister arrives first (ignored - not in table yet)
4. ack_sync arrives second, adds the now-dead process
5. Stale entry persists forever

Root cause: ack_sync and broadcasts use different senders (gen_server vs multicast_loop), so FIFO ordering is not guaranteed.

Fix: Route ack_sync through multicast_loop via new send_to_node_ordered/3. All messages to remote nodes now flow through the same sender, guaranteeing FIFO delivery.

This fixes the root cause rather than patching symptoms. The previous inline discovery mechanism is removed as it's no longer needed.
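
The ordering argument can be checked with a small enumeration. The sketch below is a toy model, not syn's code: it assumes messages from one sender arrive in the order sent, while messages from different senders may interleave freely. It then asks whether any delivery order leaves a stale entry. The function names and the single `worker` entry are made up for illustration.

```python
def interleavings(queues):
    """Yield every delivery order that keeps each sender's own order."""
    if all(not q for q in queues):
        yield []
        return
    for i, q in enumerate(queues):
        if q:
            rest = queues[:i] + [q[1:]] + queues[i + 1:]
            for tail in interleavings(rest):
                yield [q[0]] + tail


def deliver(order):
    """Apply messages at Node B and return its resulting registry."""
    registry = {}
    known = False  # has ack_sync from Node A been applied yet?
    for msg in order:
        if msg == "ack_sync":
            known = True
            registry["worker"] = "pid_1"  # snapshot still lists the process
        elif msg == "sync_unregister" and known:
            registry.pop("worker", None)
        # a sync_unregister from a not-yet-known node is ignored
    return registry


def stale_possible(queues):
    # A non-empty final registry means the dead process is still listed.
    return any(deliver(order) for order in interleavings(queues))


# Two senders (gen_server and multicast_loop): one queue per sender.
two_senders = stale_possible([["ack_sync"], ["sync_unregister"]])
# One sender: both messages share a single FIFO queue.
one_sender = stale_possible([["ack_sync", "sync_unregister"]])
```

In this model `two_senders` is true and `one_sender` is false: once both messages share a sender, the only possible order is ack_sync followed by sync_unregister.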

chrismccord added 2 commits November 18, 2025 23:42

chrismccord wants to merge 2 commits into ostinelli:master from chrismccord:cm-fix
