cluster_link: Schema Registry API-mode replication failover via paused #30984

Open
bartoszpiekny-redpanda wants to merge 6 commits into redpanda-data:dev from bartoszpiekny-redpanda:core-16751-sr-failover-paused

@bartoszpiekny-redpanda bartoszpiekny-redpanda commented Jul 1, 2026

Adds failover for API-mode SR replication. Topic-mode SR already fails over via the _schemas mirror status; API-mode had no equivalent, so after failover the sync task kept running and owned contexts stayed write-blocked.

A single user-settable paused field on the SR sync config drives both effects: it stops the sync task and lifts the per-context client write block. rpk shadow failover --all sets it for API-mode links; topic-mode is unchanged (paused is inert there).

No new controller command or feature bit. paused rides the existing config-update command (serde envelope v1→v2, version-gated read → pre-v2 records default to not-paused), so there's no rolling-upgrade hazard. API-mode SR is already gated by the existing shadow_link_sr_api_sync feature, and since paused only matters on an API-mode link, it needs no gate of its own. feature_table is untouched.

Tests: unit (serde/copy, converter, write-block incl. multi-link, task pause/resume) + ducktape (failover unblocks & pauses, manual toggle, survives restart).

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

Add a user-settable `paused` field to SchemaRegistrySyncOptions, matching
the sibling sync options (topic-metadata, consumer-offset, security, role).
Pausing stops the Schema Registry sync task and, for API-mode shadowing,
lifts the per-context client write protection; it is also set when the
link's Schema Registry replication is failed over.

Regenerates the ducktape Python bindings.

Add a `paused` bool to schema_registry_sync_config and wire the admin
converter both directions (proto get_paused/set_paused), mirroring the
sibling sync configs. The serde envelope is bumped to v2 with a
version-gated read so pre-v2 records default to not-paused.

No behavior change yet: later commits consume `paused` to lift the
API-mode client write block and pause the sync task.
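
To illustrate the version-gated read described above, here is a minimal sketch of the idea. It is not the real serde code: `sr_sync_config_sketch`, `some_v1_field`, `encode`, `decode`, and the raw byte layout are all hypothetical stand-ins, and the real type uses Redpanda's serde envelope rather than a hand-rolled buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical envelope version for this sketch (the PR bumps v1 -> v2).
constexpr std::uint8_t current_version = 2;

// Hypothetical stand-in for schema_registry_sync_config.
struct sr_sync_config_sketch {
    bool some_v1_field{false}; // illustrative field that already existed in v1
    bool paused{false};        // added in v2
};

// Writes always use the current version and include every field.
inline std::vector<std::uint8_t> encode(const sr_sync_config_sketch& c) {
    return {
      current_version,
      static_cast<std::uint8_t>(c.some_v1_field),
      static_cast<std::uint8_t>(c.paused)};
}

// Reads are gated on the version stored in the record itself.
inline sr_sync_config_sketch decode(const std::vector<std::uint8_t>& buf) {
    sr_sync_config_sketch c;
    std::size_t pos = 0;
    const std::uint8_t version = buf.at(pos++);
    c.some_v1_field = buf.at(pos++) != 0;
    if (version >= 2) {
        // Only v2+ records carry the paused byte.
        c.paused = buf.at(pos++) != 0;
    }
    // For a pre-v2 record the default (not paused) is kept.
    return c;
}
```

Decoding a v1 record such as `{1, 1}` leaves `paused` at its default of `false`, which is the property that makes the field safe to introduce without a feature gate.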
link_disables_client_writes now returns "not blocked" for an API-mode
Schema Registry link whose config is paused: replication has stopped, so
the contexts the link owned are handed back to clients. Topic-mode and
non-paused API-mode links are unchanged, and the any_of across links still
blocks a context owned by any non-paused link.
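
The multi-link behavior can be sketched as follows. This is a simplified model under stated assumptions, not the code in frontend.cc: it covers API-mode links only, and `api_mode_link_sketch`, `link_blocks_writes`, and `context_write_blocked` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified view of an API-mode Schema Registry link.
struct api_mode_link_sketch {
    std::set<std::string> owned_contexts;
    bool paused{false};
};

// A paused link hands its contexts back to clients, so it never blocks.
inline bool link_blocks_writes(
  const api_mode_link_sketch& link, const std::string& context) {
    if (link.paused) {
        return false;
    }
    return link.owned_contexts.count(context) > 0;
}

// A context stays blocked while any non-paused link still owns it.
inline bool context_write_blocked(
  const std::vector<api_mode_link_sketch>& links, const std::string& context) {
    return std::any_of(
      links.begin(), links.end(), [&](const api_mode_link_sketch& link) {
          return link_blocks_writes(link, context);
      });
}
```

With one paused link and one active link that both own the same context, the context remains blocked; a context owned only by the paused link becomes writable.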
mirroring_task::is_enabled() now also requires the config to not be paused.
A paused API-mode link disables the task, so the base reconciler pauses it
(should_pause = !is_enabled && should_start_impl) while the shard still
leads _schemas/0, and resumes it when un-paused.
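
A rough sketch of the reconciler decision follows. Only the expression `should_pause = !is_enabled && should_start_impl` comes from the commit message; the `task_inputs_sketch` fields, the `task_action` enum, and the behavior when the shard does not lead the partition are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical outcome of one reconciliation pass.
enum class task_action { stop, pause, run };

// Hypothetical inputs; the real task derives these from link state.
struct task_inputs_sketch {
    bool base_enabled;            // whatever is_enabled() required before this change
    bool paused;                  // the new config flag
    bool leads_schemas_partition; // stands in for should_start_impl
};

// is_enabled() now also requires the config to not be paused.
inline bool is_enabled(const task_inputs_sketch& in) {
    return in.base_enabled && !in.paused;
}

inline task_action reconcile(const task_inputs_sketch& in) {
    const bool enabled = is_enabled(in);
    const bool should_start_impl = in.leads_schemas_partition;
    const bool should_pause = !enabled && should_start_impl;
    if (should_pause) {
        return task_action::pause;
    }
    if (enabled && should_start_impl) {
        return task_action::run;
    }
    // Assumption: a shard that does not lead the partition runs no task.
    return task_action::stop;
}
```

Clearing `paused` while leadership is retained moves the result from `pause` back to `run`, which matches the resume behavior described above.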
A full-link failover (empty shadow_topic_name) of an API-mode Schema
Registry link now also pauses its SR config via a get-modify-write config
update, so replication stops and the client write protection on the link's
contexts is lifted. The update is idempotent (skipped if already paused)
and reuses update_cluster_link, so no dedicated command is needed.
Topic-mode SR failover is unchanged (handled by failover_link_topics).
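
The get-modify-write step and its idempotence can be sketched roughly as below. `link_metadata_sketch`, `link_store_sketch`, and `pause_sr_on_full_failover` are hypothetical; in the PR the write goes through update_cluster_link, which is modeled here as a plain in-memory update with a counter.

```cpp
#include <cassert>
#include <string>

// Hypothetical, reduced link metadata.
struct link_metadata_sketch {
    bool api_mode_sr{false};
    bool sr_paused{false};
};

// Hypothetical stand-in for the config store behind update_cluster_link.
struct link_store_sketch {
    link_metadata_sketch current;
    int updates_issued{0};

    link_metadata_sketch get() const { return current; }
    void update(const link_metadata_sketch& next) {
        current = next;
        ++updates_issued;
    }
};

// Returns true if a config update was issued.
inline bool pause_sr_on_full_failover(
  link_store_sketch& store, const std::string& shadow_topic_name) {
    if (!shadow_topic_name.empty()) {
        return false; // single-topic failover: SR config is left alone
    }
    link_metadata_sketch md = store.get(); // get
    if (!md.api_mode_sr || md.sr_paused) {
        return false; // topic-mode, or already paused: nothing to do
    }
    md.sr_paused = true; // modify
    store.update(md);    // write
    return true;
}
```

Calling the function twice for the same full-link failover issues only one update, since the second call sees `sr_paused` already set.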

Verified end to end by ducktape (see the SR write-blocking suite).

Add end-to-end coverage to the SR write-blocking suite for the paused
flag introduced across this stack:
- full-link failover (rpk shadow failover --all) sets paused and lifts
  client write protection on the link's owned contexts;
- paused is user-settable via update_shadow_link and toggles the block;
- the paused flag survives a full target-cluster restart.
bartoszpiekny-redpanda force-pushed the core-16751-sr-failover-paused branch from 7a99b20 to d6c84ab July 1, 2026 13:01
bartoszpiekny-redpanda marked this pull request as ready for review July 1, 2026 13:09
bartoszpiekny-redpanda requested review from a team as code owners July 1, 2026 13:09
bartoszpiekny-redpanda requested review from Copilot and pgellert and removed request for a team July 1, 2026 13:09

Copilot AI left a comment

Pull request overview

Adds a user-controlled paused flag for API-mode Schema Registry replication on cluster links, and uses it to implement failover behavior that both stops the SR sync task and lifts per-context client write blocking on the target cluster.

Changes:

  • Extend Schema Registry sync config (model + admin API/proto) with a durable paused flag, including serde v2 read/write behavior.
  • Update API-mode SR sync task enablement and write-blocking logic to respect paused, and set paused automatically on full-link failover.
  • Add/extend unit + ducktape coverage for pause/unpause, failover behavior, and durability across restart.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/rptest/tests/cluster_linking_schema_registry_write_blocking_test.py | Adds ducktape coverage for API-mode SR failover pausing + write-unblocking, manual pause toggling, and persistence across restart. |
| tests/rptest/clients/admin/proto/redpanda/core/admin/v2/shadow_link_pb2.pyi | Updates Python protobuf typings for new paused field. |
| tests/rptest/clients/admin/proto/redpanda/core/admin/v2/shadow_link_pb2.py | Updates generated Python protobuf code for new paused field. |
| src/v/redpanda/admin/services/shadow_link/tests/converter_test.cc | Verifies admin <-> metadata conversion includes paused for API-mode SR configs. |
| src/v/redpanda/admin/services/shadow_link/shadow_link.cc | Sets paused on full-link failover for API-mode SR links (ends replication + lifts blocking). |
| src/v/redpanda/admin/services/shadow_link/converter.cc | Converts paused between admin proto and internal model config. |
| src/v/cluster/cluster_link/tests/shadow_link_write_blocking.cc | Adds unit tests verifying paused lifts client write blocking for API-mode contexts and doesn’t override other links’ ownership. |
| src/v/cluster/cluster_link/frontend.cc | Makes API-mode client write-blocking logic return “allowed” when the link is paused. |
| src/v/cluster_link/schema_registry_sync/tests/mirroring_task_test.cc | Adds unit test asserting paused config transitions SR mirroring task to paused and back to active. |
| src/v/cluster_link/schema_registry_sync/mirroring_task.cc | Disables SR mirroring task when config is paused. |
| src/v/cluster_link/model/types.h | Bumps schema registry sync config serde envelope to v2 and adds paused field. |
| src/v/cluster_link/model/types.cc | Copies/serializes/deserializes paused and includes it in formatting. |
| src/v/cluster_link/model/tests/test_model.cc | Adds serde/copy round-trip coverage for paused, including defaulting for older versions. |
| proto/redpanda/core/admin/v2/shadow_link.proto | Adds paused to SchemaRegistrySyncOptions in the admin API. |

Comment thread src/v/redpanda/admin/services/shadow_link/shadow_link.cc
Comment thread src/v/cluster_link/model/types.cc
bartoszpiekny-redpanda self-assigned this Jul 1, 2026
@vbotbuildovich

CI test results

test results on build#86593
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_auto_prefix_trimming | {"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "tiered", "with_failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/86593#019f1de3-2e27-47c7-a332-bad067ca2f1b | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0399, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1150, p1=0.2946, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming |
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_replication_basic | {"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "cloud"} | integration | https://buildkite.com/redpanda/redpanda/builds/86593#019f1de1-66ce-47b6-b64d-b57230b99297 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0002, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic |
| FLAKY(PASS) | InternalTopicProtectionLargeClusterTest | test_consumer_offset_topic | null | integration | https://buildkite.com/redpanda/redpanda/builds/86593#019f1de1-66c9-4f05-b3de-bfda9cb0b5f9 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=InternalTopicProtectionLargeClusterTest&test_method=test_consumer_offset_topic |
