[SPARK-57858][SQL] Emit BIN BY scaled DISTRIBUTE columns as produced attributes by vranes · Pull Request #56930 · apache/spark

vranes · 2026-07-01T16:19:58Z

What changes were proposed in this pull request?

The BIN BY relation operator (SPARK-57133) proportionally rescales its DISTRIBUTE UNIFORM columns. The logical BinBy node carried those columns through child.output with the child's own ExprId, even though execution rewrites their values.

This PR turns the rescaled DISTRIBUTE columns into produced attributes with fresh ExprIds (same names, types, nullability, and positions) that shadow the inputs, mirroring Generate.generatorOutput:

BinBy gains a scaledDistributeColumns field; output swaps each DISTRIBUTE input slot for its scaled counterpart in place, and producedAttributes includes them. The input distributeColumns stay on the node as the executor's read inputs but leave output.
BinBy.scaledDistributeAttributes mints the fresh attributes (qualifier and metadata are dropped, matching the computed-value semantics of expr AS value).
ResolveBinBy mints them; DeduplicateRelations renews them in both phases so self-joins over a shared BinBy subtree resolve.
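
The in-place swap described above can be sketched with a toy model. This is only an illustration under stated assumptions: Attr, BinBy, resolve, scaledAttribute, and freshExprId below are simplified stand-ins written for this example, not Spark's Catalyst classes, and metadata is omitted entirely.

```scala
// Toy model only: simplified stand-ins, not Spark's Catalyst classes.
import java.util.concurrent.atomic.AtomicLong

object BinByOutputSketch {
  private val nextId = new AtomicLong(0L)

  // Hypothetical id generator standing in for fresh ExprId allocation.
  def freshExprId(): Long = nextId.getAndIncrement()

  final case class Attr(
      name: String,
      dataType: String,
      nullable: Boolean,
      exprId: Long,
      qualifier: Seq[String] = Nil)

  // Mint a scaled counterpart: same name, type, and nullability,
  // a new id, and no qualifier.
  def scaledAttribute(in: Attr): Attr =
    Attr(in.name, in.dataType, in.nullable, freshExprId())

  final case class BinBy(
      childOutput: Seq[Attr],
      distributeColumns: Seq[Attr],
      scaledDistributeColumns: Seq[Attr]) {
    require(distributeColumns.length == scaledDistributeColumns.length)

    // Input id -> scaled attribute that replaces it in the output.
    private val replacement: Map[Long, Attr] =
      distributeColumns.map(_.exprId).zip(scaledDistributeColumns).toMap

    // Each DISTRIBUTE slot is swapped in place; other child columns pass through.
    def output: Seq[Attr] =
      childOutput.map(a => replacement.getOrElse(a.exprId, a))

    // Only the scaled columns are produced by this node.
    def producedAttributes: Set[Attr] = scaledDistributeColumns.toSet
  }

  // Stand-in for the resolution step that mints the scaled attributes.
  def resolve(childOutput: Seq[Attr], distributeColumns: Seq[Attr]): BinBy =
    BinBy(childOutput, distributeColumns, distributeColumns.map(scaledAttribute))
}
```

In this sketch the input distributeColumns remain on the node, so an executor could still read them, but they no longer appear in output; a parent operator can only bind to the scaled attribute.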

Why are the changes needed?

Catalyst relies on the invariant that the same ExprId everywhere implies the same value. No other operator edits a value under a retained child attribute (Generate / Window / Expand / Aggregate all mint fresh ids for changed columns). Carrying the rescaled DISTRIBUTE column under the child's ExprId violated that: any rule reasoning on ExprId (predicate pushdown, constraint propagation, common-subexpression elimination) could read the pre-scale value. It is harmless today only because no such rule lists BinBy, but that safety is incidental, not designed. Minting fresh identities restores the invariant and lets a filter or sort on a DISTRIBUTE column bind to the scaled output.
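
A small toy example may make the hazard concrete. Everything here is an assumption made for illustration: rows are maps keyed by a numeric id, binBy halves a value as a placeholder for the real proportional rescaling, and canPushBelow is a hypothetical generic pushdown check, not an existing Catalyst rule.

```scala
// Toy illustration only: none of these names exist in Spark.
object ExprIdInvariantSketch {
  // A row maps an attribute id to its value.
  type Row = Map[Long, Double]

  // Placeholder for rescaling: reads column `inId`, halves it,
  // and writes the result under `outId` (which may equal `inId`).
  def binBy(rows: Seq[Row], inId: Long, outId: Long): Seq[Row] =
    rows.map(r => (r - inId) + (outId -> r(inId) * 0.5))

  final case class GreaterThan(id: Long, bound: Double) {
    def eval(r: Row): Boolean = r(id) > bound
  }

  // Hypothetical generic rule: a filter may move below an operator
  // when the id it references is produced by that operator's child.
  def canPushBelow(p: GreaterThan, childIds: Set[Long]): Boolean =
    childIds.contains(p.id)
}
```

With rows holding 80.0 and 120.0 under id 1 and the predicate "id 1 > 50", reusing id 1 for the scaled column makes the hypothetical check pass, yet filtering below binBy keeps two rows while filtering above keeps one. If the scaled column instead gets a fresh id 2, a predicate on id 2 fails the check against a child that only exposes id 1, so the filter stays above and sees the scaled value.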

Does this PR introduce any user-facing change?

No. BIN BY is gated off by default (SPARK-57440) and its physical execution is still stubbed, so the operator is not usable end-to-end yet; this is an internal analyzer / plan-shape change. The output schema (column names, types, positions) is unchanged.

How was this patch tested?

ResolveBinBySuite (20 tests), including new cases asserting that the rescaled DISTRIBUTE columns are produced attributes shadowing the input, that multiple DISTRIBUTE columns are each replaced in place with distinct fresh ids, and that qualifier and metadata are dropped on the produced column; plus the existing self-join deduplication regression test.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic)

vranes force-pushed the bin-by-distribute-produced-attrs branch 2 times, most recently from e417f4b to 6be4341, July 1, 2026 17:41

vranes marked this pull request as ready for review July 1, 2026 17:42

vranes force-pushed the bin-by-distribute-produced-attrs branch from 6be4341 to 7e4c5c2, July 1, 2026 20:10

vranes wants to merge 1 commit into apache:master from vranes:bin-by-distribute-produced-attrs
