datalake: Support nesting value fields #30917
Open
wdberkeley wants to merge 3 commits into
Adds a value_layout enum (flat/nested) and a layout field to value_config in the iceberg_mode DSL. The flat layout (default) preserves existing behavior. The nested layout will eventually wrap translated schema fields inside a "value" struct instead of promoting them to the top level of the Iceberg row. This commit adds the config plumbing only; the translation logic is not yet wired up.
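The config plumbing described in this commit can be pictured with a small standalone sketch. Only the `flat`/`nested` values and the `flat` default come from the description above; the function and struct names (`parse_value_layout`, `to_string`, `value_config_sketch`) are hypothetical and are not taken from the diff.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Stand-in for the value_layout enum named in the commit message.
enum class value_layout { flat, nested };

// Hypothetical parser for the right-hand side of a `layout=` setting.
// Returns std::nullopt for unrecognized input so a caller could reject
// the whole mode string.
inline std::optional<value_layout> parse_value_layout(std::string_view s) {
    if (s == "flat") {
        return value_layout::flat;
    }
    if (s == "nested") {
        return value_layout::nested;
    }
    return std::nullopt;
}

// Hypothetical formatter, the inverse of parse_value_layout.
inline std::string_view to_string(value_layout l) {
    switch (l) {
    case value_layout::flat:
        return "flat";
    case value_layout::nested:
        return "nested";
    }
    assert(false && "unknown value_layout");
    return "flat";
}

// Hypothetical holder for the value section of the config.
// The layout defaults to flat, which preserves existing behavior.
struct value_config_sketch {
    value_layout layout = value_layout::flat;
};
```

Keeping the parser and formatter as inverses of each other is what makes a parse/format round trip testable, which is the kind of check the unit tests in this PR are described as adding.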
When iceberg_mode includes value:layout=nested, user schema fields are wrapped in a top-level "value" struct rather than being promoted to the top level alongside the "redpanda" system struct. The plumbing in this change to handle flat vs. nested schemas will conflict with refactoring in the key translation change, so this change is somewhat of a placeholder or POC until that work is merged and this can be rebased on top. The first and last commits should apply without change, though.
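To make the two row shapes concrete, here is a minimal sketch that models a schema as a tree of named fields. It is not the translator code from this PR: the `field` struct and `build_row` function are hypothetical, the contents of the `redpanda` system struct are left empty, and name collisions in the flat layout are ignored.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

enum class value_layout { flat, nested };

// Hypothetical schema node: a name plus child fields (empty for leaves).
struct field {
    std::string name;
    std::vector<field> children;
};

// Build the top-level fields of a row for the given layout.
// Assumption: the system struct always comes first and is named "redpanda".
inline std::vector<field>
build_row(value_layout layout, std::vector<field> user_fields) {
    std::vector<field> row;
    row.push_back(field{"redpanda", {}});
    if (layout == value_layout::nested) {
        // Wrap all user fields in a single top-level "value" struct.
        row.push_back(field{"value", std::move(user_fields)});
    } else {
        // Promote each user field to the top level of the row.
        for (auto& f : user_fields) {
            row.push_back(std::move(f));
        }
    }
    return row;
}

// Usage example with two made-up user fields.
inline void layout_shape_example() {
    std::vector<field> user = {field{"id", {}}, field{"name", {}}};
    auto flat = build_row(value_layout::flat, user);
    assert(flat.size() == 3); // redpanda, id, name
    auto nested = build_row(value_layout::nested, user);
    assert(nested.size() == 2); // redpanda, value
    assert(nested[1].children.size() == 2);
}
```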
test_iceberg_value_layout creates one flat-layout and one nested-layout topic (both using value:mode=schema_id_prefix), produces one Avro record with two fields to each, and verifies via pyiceberg and Spark SQL that the rows have the expected structure and their values can be queried as expected.
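The practical difference for a query author is the column path. The sketch below builds the Spark SQL text one might use against each kind of topic, relying on Spark SQL's dot notation for struct members. The helper names, the table name, and the field names are hypothetical; the actual queries in the Ducktape test are not shown in this PR description.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class value_layout { flat, nested };

// Column reference for one user field: bare name under the flat layout,
// `value.<name>` under the nested layout.
inline std::string column_ref(value_layout layout, const std::string& name) {
    if (layout == value_layout::nested) {
        return "value." + name;
    }
    return name;
}

// Build a SELECT over the given user fields. Table and field names are
// inserted verbatim, so this is only suitable for trusted, test-style input.
inline std::string select_stmt(
  value_layout layout,
  const std::string& table,
  const std::vector<std::string>& fields) {
    assert(!fields.empty());
    std::string sql = "SELECT ";
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i > 0) {
            sql += ", ";
        }
        sql += column_ref(layout, fields[i]);
    }
    sql += " FROM " + table;
    return sql;
}
```

For fields `a` and `b` on a table `t`, this yields `SELECT a, b FROM t` for the flat layout and `SELECT value.a, value.b FROM t` for the nested one.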
Contributor
Pull request overview
Adds an Iceberg “value layout” option to control whether translated value fields are emitted at the row’s top-level (flat, existing behavior) or nested under a top-level value struct (nested), and wires this through translation + coordinator table schema creation with accompanying unit and e2e tests.
Changes:
- Extend `iceberg_mode` parsing/formatting and wire-compat gating to support `value:layout={flat|nested}`.
- Implement nested layout in the structured record translator and plumb layout into datalake manager/coordinator schema creation.
- Add/extend C++ unit tests and a Ducktape e2e test to validate schema shapes and Spark SQL queryability.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/rptest/tests/datalake/datalake_e2e_test.py | Adds Ducktape coverage validating flat vs nested schema shape and Spark SQL access patterns. |
| src/v/model/tests/iceberg_mode_test.cc | Adds unit tests for parsing/formatting/roundtripping the new value:layout option and feature gating. |
| src/v/model/model.cc | Implements layout parsing in extended mode, formatting, and wire serialization compatibility behavior. |
| src/v/model/metadata.h | Introduces value_layout enum and threads legacy-compatibility into feature gating. |
| src/v/datalake/translation/deps.cc | Updates table creation plumbing to pass record_type and forward schema components. |
| src/v/datalake/tests/translation_task_test.cc | Adjusts test construction for updated direct_table_creator API. |
| src/v/datalake/tests/test_utils.h | Updates direct_table_creator interface to accept record_type and removes resolver dependency. |
| src/v/datalake/tests/test_utils.cc | Uses provided record_type rather than resolving/building types internally. |
| src/v/datalake/tests/record_multiplexer_test.cc | Updates fixture helpers for new table creator API and adds nested-layout schema assertion. |
| src/v/datalake/tests/record_multiplexer_bench.cc | Adjusts benchmark fixture to new direct_table_creator constructor. |
| src/v/datalake/tests/gtest_record_multiplexer_test.cc | Updates construction sites for direct_table_creator. |
| src/v/datalake/table_creator.h | Changes ensure_table signature to accept record_type and updates includes accordingly. |
| src/v/datalake/record_translator.h | Adds layout-aware structured translator constructor and stores layout choice. |
| src/v/datalake/record_translator.cc | Implements nested layout schema construction and value placement during translation. |
| src/v/datalake/record_multiplexer.cc | Passes record_type into table_creator::ensure_table. |
| src/v/datalake/datalake_manager.cc | Constructs structured translator with the configured value.layout. |
| src/v/datalake/coordinator/coordinator.cc | Ensures coordinator table schema generation uses the topic’s configured layout. |
| src/v/datalake/BUILD | Updates Bazel deps to reflect new include dependency on record_translator. |
Comment on lines 12 to 14:

    #include "base/format_to.h"
    #include "datalake/schema_identifier.h"
    #include "datalake/record_translator.h"
Collaborator
CI test results: test results on build #86275
This series of changes adds support for a `layout` key in the `value` section of the expanded iceberg mode config, with possible values `nested` and `flat` (the default and current behavior). When `layout=nested`, value fields will be located inside a top-level `value` struct instead of at the top level (flat).

There are three commits:
Backports Required
Release Notes
Features
Adds a `layout` key to the Iceberg mode config string, in the value section. `layout` supports two values: `nested` and `flat`. When `layout=nested`, translated value fields are nested under a `value` struct at the top level of the row schema. When `layout=flat`, translated value fields are placed at the top level, except that fields whose names collide with names in the `redpanda` metadata struct are relocated inside `redpanda.data`. `layout=flat` is the default and matches the previous behavior.
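The collision rule for the flat layout can be sketched as a simple partition of field names. This is an illustration under stated assumptions, not the translator's logic: the `placement` struct and `place_flat_fields` function are hypothetical, and the set of reserved names is passed in by the caller because the release note does not enumerate which names count as collisions.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Where each user field ends up under the flat layout.
struct placement {
    std::vector<std::string> top_level; // emitted next to `redpanda`
    std::vector<std::string> relocated; // emitted under `redpanda.data`
};

// Partition user field names. `reserved` is an assumption supplied by the
// caller; a name found in it is treated as a collision and relocated.
inline placement place_flat_fields(
  const std::vector<std::string>& user_fields,
  const std::set<std::string>& reserved) {
    placement p;
    for (const auto& name : user_fields) {
        if (reserved.count(name) > 0) {
            p.relocated.push_back(name);
        } else {
            p.top_level.push_back(name);
        }
    }
    return p;
}

// Usage example with made-up names: one field collides, one does not.
inline void collision_example() {
    std::vector<std::string> fields{"id", "redpanda"};
    std::set<std::string> reserved{"redpanda"};
    placement p = place_flat_fields(fields, reserved);
    assert(p.top_level.size() == 1);
    assert(p.relocated.size() == 1);
}
```

Under the nested layout no such partition is needed in this model, since every user field sits inside `value` and cannot clash with a top-level name.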