Fix race condition causing stale pids in syn lookup#87
Open
chrismccord wants to merge 2 commits intoostinelli:masterfrom
Open
Fix race condition causing stale pids in syn lookup#87chrismccord wants to merge 2 commits intoostinelli:masterfrom
chrismccord wants to merge 2 commits intoostinelli:masterfrom
Conversation
sync_register/sync_join messages from multicast_loop can arrive before ack_sync from gen_server since they're different senders (no ordering guarantee). When this happens, the message was dropped because the remote node wasn't in nodes_map yet, leaving stale data from ack_sync. Fix: Include RemoteScopePid in broadcasts to allow inline discovery when sync arrives before ack_sync. Old message format still supported for rolling upgrades.
The previous fix attempted to handle sync_register arriving before ack_sync by including RemoteScopePid for inline discovery. However, the same race exists for sync_unregister vs ack_sync: 1. Node A sends ack_sync (direct from gen_server) 2. Process dies on Node A, broadcasts sync_unregister (via multicast_loop) 3. At Node B: sync_unregister arrives first (ignored - not in table yet) 4. ack_sync arrives second, adds the now-dead process 5. Stale entry persists forever Root cause: ack_sync and broadcasts use different senders (gen_server vs multicast_loop), so FIFO ordering is not guaranteed. Fix: Route ack_sync through multicast_loop via new send_to_node_ordered/3. All messages to remote nodes now flow through the same sender, guaranteeing FIFO delivery. This fixes the root cause rather than patching symptoms. The previous inline discovery mechanism is removed as it's no longer needed.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
sync_register/sync_join messages from multicast_loop can arrive before ack_sync from gen_server since they're different senders (no ordering guarantee). When this happens, the message was dropped because the remote node wasn't in nodes_map yet, leaving stale data from ack_sync which is just about to arrive (containing stale data that lacks the raced registrations).
Fix: Include RemoteScopePid in broadcasts to allow inline discovery when sync arrives before ack_sync. Old message format still supported for rolling upgrades.
Note: I wasn't able to run the multinode tests regardless of OTP 25/26/28. ct_slave was failing to connect nodes for whatever reason.
The other option than including the scope pid in all broadcasts would be to buffer the received broadcasts for nodes that we are awaiting ack_sync, then "replay" them, but that seemed like a more complex change and would require cleanup/sweeping to avoid unbounded buffer if a node failed during the discover/ack handshake. Thanks!