[SPARK-57858][SQL] Emit BIN BY scaled DISTRIBUTE columns as produced attributes#56930

Open
vranes wants to merge 1 commit into apache:master from vranes:bin-by-distribute-produced-attrs

vranes commented Jul 1, 2026

Contributor

What changes were proposed in this pull request?

The BIN BY relation operator (SPARK-57133) proportionally rescales its DISTRIBUTE UNIFORM columns. Until now, the logical BinBy node passed those columns through child.output under the child's own ExprId, even though execution rewrites their values.

This PR turns the rescaled DISTRIBUTE columns into produced attributes with fresh ExprIds (same names, types, nullability, and positions) that shadow the inputs, mirroring Generate.generatorOutput:

  • BinBy gains a scaledDistributeColumns field; output replaces each DISTRIBUTE input slot with its scaled counterpart in place, and producedAttributes includes the scaled columns. The input distributeColumns stay on the node as the inputs the executor reads, but no longer appear in output.
  • BinBy.scaledDistributeAttributes mints the fresh attributes (qualifier and metadata are dropped, matching the computed-value semantics of expr AS value).
  • ResolveBinBy mints them; DeduplicateRelations renews them in both phases so self-joins over a shared BinBy subtree resolve.
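The output swap described in the bullets above can be sketched with a toy model. This is not Spark code: `Attr`, `Ids`, `ToyBinBy`, and `ToyBinByOps` are hypothetical stand-ins for Catalyst's attribute and plan classes, and only the qualifier (not metadata) is modelled. It is meant only to illustrate how DISTRIBUTE slots could be replaced in place while every other child attribute passes through untouched.

```scala
// Toy stand-in for a Catalyst attribute; NOT Spark's AttributeReference.
final case class Attr(
    name: String,
    dataType: String,
    nullable: Boolean,
    exprId: Long,
    qualifier: Seq[String] = Nil)

// Hypothetical id source standing in for fresh ExprId allocation.
object Ids {
  private val counter = new java.util.concurrent.atomic.AtomicLong(0L)
  def fresh(): Long = counter.getAndIncrement()
}

// Hypothetical simplified node: only the fields needed to show the swap.
final case class ToyBinBy(
    childOutput: Seq[Attr],
    distributeColumns: Seq[Attr],
    scaledDistributeColumns: Seq[Attr]) {

  require(distributeColumns.length == scaledDistributeColumns.length)

  // Map each input DISTRIBUTE id to its scaled replacement.
  private val replacement: Map[Long, Attr] =
    distributeColumns.map(_.exprId).zip(scaledDistributeColumns).toMap

  // Same names and positions as the child; DISTRIBUTE slots carry fresh ids.
  def output: Seq[Attr] =
    childOutput.map(a => replacement.getOrElse(a.exprId, a))

  def producedAttributes: Set[Attr] = scaledDistributeColumns.toSet
}

object ToyBinByOps {
  // Mint one fresh attribute per DISTRIBUTE column, dropping the qualifier.
  def scaledDistributeAttributes(cols: Seq[Attr]): Seq[Attr] =
    cols.map(c => c.copy(exprId = Ids.fresh(), qualifier = Nil))
}

object BinByOutputDemo {
  val ts: Attr = Attr("ts", "timestamp", nullable = false, Ids.fresh(), Seq("t"))
  val amount: Attr = Attr("amount", "double", nullable = true, Ids.fresh(), Seq("t"))

  val node: ToyBinBy = ToyBinBy(
    childOutput = Seq(ts, amount),
    distributeColumns = Seq(amount),
    scaledDistributeColumns = ToyBinByOps.scaledDistributeAttributes(Seq(amount)))
}
```

In this sketch, `BinByOutputDemo.node.output` keeps `ts` as-is and exposes an `amount` attribute with the same name, type, and nullability but a different id and an empty qualifier, while the original `amount` remains reachable only through `distributeColumns`.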

Why are the changes needed?

Catalyst relies on the invariant that the same ExprId everywhere implies the same value. No other operator edits a value under a retained child attribute (Generate / Window / Expand / Aggregate all mint fresh ids for changed columns). Carrying the rescaled DISTRIBUTE column under the child's ExprId violated that: any rule reasoning on ExprId (predicate pushdown, constraint propagation, common-subexpression elimination) could read the pre-scale value. It is harmless today only because no such rule lists BinBy, but that safety is incidental, not designed. Minting fresh identities restores the invariant and lets a filter or sort on a DISTRIBUTE column bind to the scaled output.
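The hazard can be illustrated with a deliberately small sketch. Nothing here is a real Catalyst rule: `Col`, `knownBelow`, and `fold` are hypothetical names standing in for any optimization that records facts keyed by expression id below an operator and applies them above it.

```scala
object StaleIdDemo {
  // Toy column reference: a name plus an expression id.
  final case class Col(name: String, id: Long)

  // Fact derived below the operator: the column with id 1 is always 100.0 (pre-scale).
  val knownBelow: Map[Long, Double] = Map(1L -> 100.0)

  // A rule that folds a column reference to a constant whenever its id has a known value.
  def fold(ref: Col): Option[Double] = knownBelow.get(ref.id)

  // Old shape: the rescaled column is re-exposed under the child's id.
  val retained: Col = Col("amount", 1L)

  // New shape: the rescaled column gets an id that no fact below the operator mentions.
  val fresh: Col = Col("amount", 2L)

  // The pre-scale constant is applied to a post-scale reference.
  val foldedOld: Option[Double] = fold(retained)

  // No fact matches the fresh id, so nothing is misapplied.
  val foldedNew: Option[Double] = fold(fresh)
}
```

With the retained id the toy rule returns the pre-scale constant for a reference that should see the rescaled value; with a fresh id the lookup simply misses, which is the behaviour the invariant is meant to guarantee.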

Does this PR introduce any user-facing change?

No. BIN BY is gated off by default (SPARK-57440) and its physical execution is still stubbed, so the operator is not usable end-to-end yet; this is an internal analyzer / plan-shape change. The output schema (column names, types, positions) is unchanged.

How was this patch tested?

ResolveBinBySuite (20 tests), including new cases asserting that the rescaled DISTRIBUTE columns are produced attributes shadowing the input, that multiple DISTRIBUTE columns are each replaced in place with distinct fresh ids, and that qualifier and metadata are dropped on the produced column; plus the existing self-join deduplication regression test.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic)

vranes force-pushed the bin-by-distribute-produced-attrs branch 2 times, most recently from e417f4b to 6be4341 on July 1, 2026 17:41
vranes marked this pull request as ready for review on July 1, 2026 17:42
vranes force-pushed the bin-by-distribute-produced-attrs branch from 6be4341 to 7e4c5c2 on July 1, 2026 20:10