datalake: Support nesting value fields #30917

Open
wdberkeley wants to merge 3 commits into dev from wdb/nested-iceberg-value

@wdberkeley

Contributor

This series adds a layout key to the value section of the expanded iceberg mode config. It accepts two values: nested and flat (the default, matching current behavior). With layout=nested, value fields are placed inside a top-level value struct instead of at the top level of the row, as they are with layout=flat.

There are three commits:

  1. Config support and plumbing. No functional change from the config.
  2. Implementation of the nesting. This relies on the current structure of the translator, which will change a bit when key translation is introduced (that PR is waiting on Iceberg header translation, #30866). The commit is provisional and will change, though the actual nesting code should remain the same.
  3. A ducktape test for the nested vs. flat behavior. Should apply regardless of changes to commit 2.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Features

  • Adds a new layout key to the value section of the Iceberg mode config string. layout supports two values: nested and flat. When layout=nested, translated value fields are nested under a value key at the top level of the row schema. When layout=flat, translated value fields are placed at the top level, except that fields whose names collide with names in the redpanda metadata struct are relocated inside redpanda.data. layout=flat is the default and matches the previous behavior.
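
The two layouts in the release note can be pictured with a small Python sketch over plain dicts. This is only an illustration, not the C++ translator from this PR: `layout_row`, the `offset` system field, and the `reserved_names` set are hypothetical, and exactly which names count as a collision is an assumption here.

```python
def layout_row(value_fields, system_fields, layout="flat",
               reserved_names=frozenset({"redpanda"})):
    """Arrange translated value fields into a row dict.

    reserved_names is an assumed stand-in for "names that collide";
    the real collision rule lives in the translator and may differ.
    """
    if layout not in ("flat", "nested"):
        raise ValueError(f"unknown layout: {layout!r}")

    # The system struct is always present at the top level.
    row = {"redpanda": dict(system_fields)}

    if layout == "nested":
        # All value fields live under one top-level "value" struct,
        # so they cannot collide with the system struct.
        row["value"] = dict(value_fields)
        return row

    # Flat: promote value fields to the top level, relocating
    # colliding names under redpanda.data.
    for name, val in value_fields.items():
        if name in reserved_names:
            row["redpanda"].setdefault("data", {})[name] = val
        else:
            row[name] = val
    return row


flat = layout_row({"id": 1, "redpanda": "x"}, {"offset": 7}, "flat")
nested = layout_row({"id": 1, "redpanda": "x"}, {"offset": 7}, "nested")
```

In this sketch the nested layout never needs the collision path, because every user field sits under its own struct.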

Adds a value_layout enum (flat/nested) and a layout field to
value_config in the iceberg_mode DSL. The flat layout (default)
preserves existing behavior. The nested layout will eventually
wrap translated schema fields inside a "value" struct instead of
promoting them to the top level of the Iceberg row.
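
As a rough picture of the enum and its default, here is a Python sketch. The real change is C++ in the model code; `ValueLayout` and `parse_value_layout` are hypothetical names, and the sketch only handles a single `value:key=value` fragment of the form quoted in this PR, since the full iceberg_mode grammar is not shown here.

```python
from enum import Enum


class ValueLayout(Enum):
    FLAT = "flat"      # default, existing behavior
    NESTED = "nested"  # wrap value fields in a "value" struct


def parse_value_layout(fragment: str) -> ValueLayout:
    """Parse one 'value:<key>=<val>' fragment (assumed shape)."""
    section, sep, setting = fragment.partition(":")
    if not sep or section != "value":
        raise ValueError(f"not a value section: {fragment!r}")

    key, sep, raw = setting.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {setting!r}")

    if key != "layout":
        # Other keys (e.g. mode) leave the layout at its default.
        return ValueLayout.FLAT

    try:
        return ValueLayout(raw)
    except ValueError:
        raise ValueError(f"unknown layout: {raw!r}") from None
```

For example, `parse_value_layout("value:layout=nested")` yields `ValueLayout.NESTED`, while a fragment that sets only the mode falls back to `ValueLayout.FLAT`.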

This commit adds the config plumbing only; the translation logic
is not yet wired up.

When iceberg_mode includes value:layout=nested, user schema fields are
wrapped in a top-level "value" struct rather than being promoted to the
top level alongside the "redpanda" system struct.

The plumbing in this change to handle flat vs. nested schemas will
conflict with refactoring in the key translation change, so this change
is somewhat of a placeholder or POC until that work is merged and this
can be rebased on top. The first and last commits should apply without
change, though.

test_iceberg_value_layout creates one flat-layout and one nested-layout
topic (both using value:mode=schema_id_prefix), produces one Avro record
with two fields to each, and verifies via pyiceberg and Spark SQL that
the rows have the expected structure and their values can be queried as
expected.
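
The kind of check described above can be sketched in plain Python, without Spark or pyiceberg. This is not the Ducktape test itself: the rows are hand-written stand-ins, the field names `name` and `count` are hypothetical, and `select_path` only mimics the dotted struct access that Spark SQL allows (for example `value.name`).

```python
def value_column(field: str, layout: str) -> str:
    """Column path of a value field under the given layout."""
    return f"value.{field}" if layout == "nested" else field


def select_path(row: dict, path: str):
    """Follow a dotted path through nested dicts, like struct access."""
    node = row
    for part in path.split("."):
        node = node[part]
    return node


# Hand-written rows standing in for what each topic's table would hold.
flat_row = {"redpanda": {}, "name": "a", "count": 1}
nested_row = {"redpanda": {}, "value": {"name": "a", "count": 1}}

for layout, row in (("flat", flat_row), ("nested", nested_row)):
    assert select_path(row, value_column("name", layout)) == "a"
    assert select_path(row, value_column("count", layout)) == 1
```

The same two-field record is reachable in both layouts; only the column path changes.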
Copilot AI review requested due to automatic review settings June 24, 2026 22:41

Copilot AI left a comment

Pull request overview

Adds an Iceberg “value layout” option to control whether translated value fields are emitted at the row’s top-level (flat, existing behavior) or nested under a top-level value struct (nested), and wires this through translation + coordinator table schema creation with accompanying unit and e2e tests.

Changes:

  • Extend iceberg_mode parsing/formatting and wire-compat gating to support value:layout={flat|nested}.
  • Implement nested layout in the structured record translator and plumb layout into datalake manager/coordinator schema creation.
  • Add/extend C++ unit tests and a Ducktape e2e test to validate schema shapes and Spark SQL queryability.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

File | Description
tests/rptest/tests/datalake/datalake_e2e_test.py | Adds Ducktape coverage validating flat vs nested schema shape and Spark SQL access patterns.
src/v/model/tests/iceberg_mode_test.cc | Adds unit tests for parsing/formatting/roundtripping the new value:layout option and feature gating.
src/v/model/model.cc | Implements layout parsing in extended mode, formatting, and wire serialization compatibility behavior.
src/v/model/metadata.h | Introduces value_layout enum and threads legacy-compatibility into feature gating.
src/v/datalake/translation/deps.cc | Updates table creation plumbing to pass record_type and forward schema components.
src/v/datalake/tests/translation_task_test.cc | Adjusts test construction for updated direct_table_creator API.
src/v/datalake/tests/test_utils.h | Updates direct_table_creator interface to accept record_type and removes resolver dependency.
src/v/datalake/tests/test_utils.cc | Uses provided record_type rather than resolving/building types internally.
src/v/datalake/tests/record_multiplexer_test.cc | Updates fixture helpers for new table creator API and adds nested-layout schema assertion.
src/v/datalake/tests/record_multiplexer_bench.cc | Adjusts benchmark fixture to new direct_table_creator constructor.
src/v/datalake/tests/gtest_record_multiplexer_test.cc | Updates construction sites for direct_table_creator.
src/v/datalake/table_creator.h | Changes ensure_table signature to accept record_type and updates includes accordingly.
src/v/datalake/record_translator.h | Adds layout-aware structured translator constructor and stores layout choice.
src/v/datalake/record_translator.cc | Implements nested layout schema construction and value placement during translation.
src/v/datalake/record_multiplexer.cc | Passes record_type into table_creator::ensure_table.
src/v/datalake/datalake_manager.cc | Constructs structured translator with the configured value.layout.
src/v/datalake/coordinator/coordinator.cc | Ensures coordinator table schema generation uses the topic’s configured layout.
src/v/datalake/BUILD | Updates Bazel deps to reflect new include dependency on record_translator.

Comment on lines 12 to 14
#include "base/format_to.h"
#include "datalake/schema_identifier.h"
#include "datalake/record_translator.h"

@vbotbuildovich

Collaborator

CI test results

test results on build#86275
test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history
FAIL | DatalakeMultiplexerTest | WritesDataFiles | | unit | https://buildkite.com/redpanda/redpanda/builds/86275#019efbcc-bd42-4a51-beb0-520ecca502bd | 0/1 | |
FLAKY(INCONCLUSIVE) | NodeWiseRecoveryTest | test_node_wise_recovery | {"dead_node_count": 2} | integration | https://buildkite.com/redpanda/redpanda/builds/86275#019efbe6-95e9-4914-9b82-35b3dfa2b7c9 | 8/11 | Test is INCONCLUSIVE after retries. Inconclusive result before max retries(baseline=0.0215, p0=0.0185, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.9298, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_node_wise_recovery
FLAKY(PASS) | NodeWiseRecoveryTest | test_recovery_local_data_missing | {"wait_for_final_manifest_uploads": false} | integration | https://buildkite.com/redpanda/redpanda/builds/86275#019efbe6-95e9-493b-9ad0-f333b648159e | 9/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0611, p0=0.4677, reject_threshold=0.0100. adj_baseline=0.1723, p1=0.4649, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_recovery_local_data_missing

@nvartolomei nvartolomei self-requested a review June 30, 2026 17:51