[SPARK-57867][CORE] Driver should not reserve off-heap memory in non-local mode #56945

Open
dongjoon-hyun wants to merge 1 commit into apache:master from dongjoon-hyun:SPARK-57867

@dongjoon-hyun dongjoon-hyun commented Jul 1, 2026

Member

What changes were proposed in this pull request?

This PR proposes to stop reserving off-heap memory pools (spark.memory.offHeap.size) in the driver's MemoryManager in non-local deployments. SparkEnv.initializeMemoryManager takes a new offHeapAllowed parameter, and SparkContext passes offHeapAllowed = isLocal for the driver. The executor path and local mode are unchanged.
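As a rough illustration of the proposal, the sketch below models the driver-side decision with a plain `Map` standing in for `SparkConf`, so it runs without Spark on the classpath. The helper name `memoryManagerConf` and the object name are hypothetical; only the two configuration keys and the `offHeapAllowed = isLocal` rule come from the description above.

```scala
object DriverOffHeapSketch {
  // A plain Map stands in for SparkConf (assumption, to keep the sketch standalone).
  type Conf = Map[String, String]

  val OffHeapEnabled = "spark.memory.offHeap.enabled"
  val OffHeapSize = "spark.memory.offHeap.size"

  // Hypothetical helper: returns the conf the memory manager should see.
  // The input map is immutable, so the caller's conf is never changed.
  def memoryManagerConf(conf: Conf, offHeapAllowed: Boolean): Conf =
    if (offHeapAllowed) conf
    else conf + (OffHeapEnabled -> "false") + (OffHeapSize -> "0")

  def main(args: Array[String]): Unit = {
    val userConf: Conf = Map(OffHeapEnabled -> "true", OffHeapSize -> "2g")
    val isLocal = false // e.g. a driver running on YARN or Kubernetes
    val driverConf = memoryManagerConf(userConf, offHeapAllowed = isLocal)
    println(driverConf(OffHeapSize)) // prints "0"
    println(userConf(OffHeapSize))   // prints "2g": the user's conf is untouched
  }
}
```

Executors would keep passing the unmodified conf, which is why only the driver call site needs the flag.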

Why are the changes needed?

Off-heap memory is accounted for only in executor resource sizing (ResourceProfile.OFFHEAP_MEM, YARN executor container size, K8s BasicExecutorFeatureStep). The driver's container memory request never includes spark.memory.offHeap.size, so the driver should not reserve it.

However, with spark.memory.offHeap.enabled=true, the Executors UI and REST API report the driver's Off Heap Storage Memory as spark.memory.offHeap.size, as shown below, which is very misleading.

BEFORE

Screenshot 2026-07-01 at 15 24 59

AFTER

Screenshot 2026-07-01 at 15 23 11

Does this PR introduce any user-facing change?

No. The driver in non-local deployments never uses it: it runs no tasks and stores no off-heap blocks.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Fable 5

@dongjoon-hyun

Member Author

cc @cloud-fan , @HyukjinKwon , @viirya

@cloud-fan cloud-fan left a comment

Contributor

0 blocking, 0 non-blocking, 0 nits.
Clean, well-scoped fix.

Verification

Verified independently: (1) both MEMORY_OFFHEAP_ENABLED=false and MEMORY_OFFHEAP_SIZE=0 are set on the local conf.clone — setting only the size while leaving enabled=true would trip MemoryManager.tungstenMemoryMode's require(size > 0), so setting both is necessary; (2) the disable is contained to the clone (a deep copy), so env.conf and the executor conf path (which sets off-heap size from the ResourceProfile) are unaffected; (3) Utils.isLocalMaster treats local-cluster as non-local, so the two tests exercise the intended branches — non-local asserts maxOffHeapStorageMemory === 0, local asserts > 0. The driver in non-local mode stores no off-heap blocks (broadcast uses on-heap MEMORY_AND_DISK).
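Point (3) can be made concrete with a small stand-in for the master-string check. The function below is written from memory of how `Utils.isLocalMaster` classifies masters and should be treated as an assumption rather than a copy of the Spark source; the master strings are illustrative.

```scala
object LocalMasterSketch {
  // Assumed to mirror Utils.isLocalMaster: only "local" and "local[...]" count
  // as local, so "local-cluster[...]" falls on the non-local side.
  def isLocalMaster(master: String): Boolean =
    master == "local" || master.startsWith("local[")

  def main(args: Array[String]): Unit = {
    val masters = Seq("local", "local[4]", "local-cluster[2,1,1024]", "yarn")
    // Under this PR, the driver would get offHeapAllowed = isLocalMaster(master).
    masters.foreach { m =>
      println(s"$m -> offHeapAllowed=${isLocalMaster(m)}")
    }
  }
}
```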

@dongjoon-hyun

Member Author

Thank you, @cloud-fan !

viirya commented Jul 2, 2026

Member

Thank you for the fix, @dongjoon-hyun. The motivation makes sense to me — the driver's container memory is never sized for spark.memory.offHeap.size, so reporting it in the Executors UI is misleading. Two issues, one of which I think needs to be addressed before merging:

1. This conflicts with SPARK-46947 (DriverPlugin memory override) and will break its test.

PluginContainerSuite."memory override in plugin" runs on local-cluster[2,1,1024] and its MemoryOverridePlugin.driverPlugin().init() sets spark.memory.offHeap.enabled=true / spark.memory.offHeap.size before the memory manager is created — SPARK-46947 delayed initializeMemoryManager until after the driver plugin loads precisely so this works. With this change, isLocal is false for local-cluster, so the clone overwrites the plugin's settings unconditionally and the driver's manager comes up with tungstenMemoryMode == ON_HEAP and maxOffHeapStorageMemory == 0, failing both driver-side assertions:

https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/internal/plugin/PluginContainerSuite.scala#L247-L248

(The core-module test job was still in progress when I looked, which is probably why CI hasn't flagged it yet.)

Beyond the test itself, this is a behavior question we should decide explicitly: should a non-local driver ignore off-heap even when a DriverPlugin deliberately enables it? If yes, the test needs updating and the "no user-facing change" section should mention that plugins enabling off-heap on the driver are affected. If no, the override should be skipped when the plugin has touched these keys.
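To make the ordering problem in issue 1 easier to see, here is a minimal model of the sequence: the plugin writes its settings first, then the driver-side override replaces them. Everything here is a stand-in (a mutable `Map` for the conf, a hypothetical `pluginInit`, and an illustrative size of `64m`); it is not the code from `PluginContainerSuite` or from this PR.

```scala
import scala.collection.mutable

object PluginOverrideSketch {
  val OffHeapEnabled = "spark.memory.offHeap.enabled"
  val OffHeapSize = "spark.memory.offHeap.size"

  // Hypothetical stand-in for a DriverPlugin.init() that enables off-heap.
  // The size value is illustrative only.
  def pluginInit(conf: mutable.Map[String, String]): Unit = {
    conf(OffHeapEnabled) = "true"
    conf(OffHeapSize) = "64m"
  }

  // Models the override applied to a copy of the conf for a non-local driver.
  def driverManagerConf(
      conf: scala.collection.Map[String, String],
      isLocal: Boolean): Map[String, String] = {
    val cloned = conf.toMap
    if (isLocal) cloned
    else cloned + (OffHeapEnabled -> "false") + (OffHeapSize -> "0")
  }

  def main(args: Array[String]): Unit = {
    val conf = mutable.Map("spark.master" -> "local-cluster[2,1,1024]")
    pluginInit(conf) // runs before the memory manager is created
    val seen = driverManagerConf(conf, isLocal = false)
    println(s"plugin asked for ${conf(OffHeapSize)}, manager sees ${seen(OffHeapSize)}")
  }
}
```

In this model the override is unconditional, so the plugin's request is lost without any signal; that is the behavior the question above asks the PR to decide on explicitly.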

2. The cloned conf leaves env.conf and the memory manager disagreeing.

After this change, on a non-local driver env.conf still reports spark.memory.offHeap.enabled=true while the manager was built off-heap-disabled. I checked the current driver-side readers of MEMORY_OFFHEAP_ENABLED (column-vector path, HashedRelation, TorrentBroadcast) and none of them crosses the two sources today, so nothing breaks now — but any future driver-side code that reads the conf and then assumes the manager matches will be silently wrong. I realize mutating env.conf directly isn't an option since executors inherit sc.conf.getAll via RetrieveSparkAppConfig, so the clone is forced by this approach. An alternative that avoids the divergence entirely: thread the flag into the manager instead, e.g. UnifiedMemoryManager(conf, numUsableCores, offHeapAllowed), consulted in the pool sizing and tungstenMemoryMode. That keeps env.conf authoritative and models "this process doesn't allow off-heap" explicitly rather than via a falsified conf. It would also make the interaction with issue 1 visible instead of silent. Larger diff, so I leave the trade-off to you.
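A rough sketch of that alternative, assuming a much-simplified manager: the conf is passed through unchanged and an explicit `offHeapAllowed` flag is consulted in both the pool sizing and the mode selection. The class `SketchMemoryManager`, the `Conf` case class, and the mode objects are hypothetical and do not reflect the real `UnifiedMemoryManager` signature.

```scala
object FlagInManagerSketch {
  sealed trait MemoryMode
  case object OnHeap extends MemoryMode
  case object OffHeap extends MemoryMode

  // Stand-in for the two off-heap settings read from SparkConf.
  final case class Conf(offHeapEnabled: Boolean, offHeapSizeBytes: Long)

  // Hypothetical manager: the conf stays authoritative, the flag gates its use.
  final class SketchMemoryManager(val conf: Conf, offHeapAllowed: Boolean) {
    // Pool sizing consults the flag instead of a rewritten conf.
    val maxOffHeapMemory: Long =
      if (offHeapAllowed) conf.offHeapSizeBytes else 0L

    // Mode selection consults the flag before the enabled setting.
    val tungstenMemoryMode: MemoryMode =
      if (offHeapAllowed && conf.offHeapEnabled) {
        require(conf.offHeapSizeBytes > 0, "off-heap size must be > 0 when enabled")
        OffHeap
      } else {
        OnHeap
      }
  }

  def main(args: Array[String]): Unit = {
    val conf = Conf(offHeapEnabled = true, offHeapSizeBytes = 1L << 30)
    val driver = new SketchMemoryManager(conf, offHeapAllowed = false)
    // The conf still says enabled=true, but the manager reports no off-heap pool.
    println(s"${driver.conf.offHeapEnabled} ${driver.maxOffHeapMemory} ${driver.tungstenMemoryMode}")
  }
}
```

With this shape, any code that wants to know whether off-heap is usable would ask the manager rather than the conf, so the two cannot silently disagree.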

One small note: if the clone approach stays, a short comment on why both keys are set would help — MEMORY_OFFHEAP_SIZE=0 is what actually zeroes the pools (MemoryManager sizes them from the size regardless of the enabled flag), while MEMORY_OFFHEAP_ENABLED=false avoids the require(size > 0) in tungstenMemoryMode. A future reader may otherwise drop one of them, since HashedRelation gets away with setting only the enabled flag.
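The roles of the two keys can be shown with a simplified model of the two checks described in this note. The functions below are stand-ins written for illustration, under the assumption stated above that the pool size is read from the size setting alone and that the `require` fires only when the enabled flag is true.

```scala
import scala.util.Try

object TwoKeysSketch {
  final case class Conf(offHeapEnabled: Boolean, offHeapSizeBytes: Long)

  // Pool sizing model: reads the size only and ignores the enabled flag.
  def maxOffHeapMemory(conf: Conf): Long = conf.offHeapSizeBytes

  // Mode selection model: the size check applies only when enabled is true.
  def usesOffHeap(conf: Conf): Boolean =
    if (conf.offHeapEnabled) {
      require(conf.offHeapSizeBytes > 0, "off-heap size must be > 0 when enabled")
      true
    } else {
      false
    }

  def main(args: Array[String]): Unit = {
    val onlySizeZeroed = Conf(offHeapEnabled = true, offHeapSizeBytes = 0L)
    val onlyDisabled = Conf(offHeapEnabled = false, offHeapSizeBytes = 1024L)
    val both = Conf(offHeapEnabled = false, offHeapSizeBytes = 0L)

    println(Try(usesOffHeap(onlySizeZeroed)).isFailure) // true: the require trips
    println(maxOffHeapMemory(onlyDisabled))             // 1024: the pool is still sized
    println((usesOffHeap(both), maxOffHeapMemory(both))) // (false,0)
  }
}
```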

dongjoon-hyun commented Jul 2, 2026

Member Author

Thank you for the review, @viirya .

On the question quoted below, my answer is yes, because Spark did not allocate the requested resources for that configuration and the driver JVM is already started. I revised the test case.

"should a non-local driver ignore off-heap even when a DriverPlugin deliberately enables it?"

For the second issue and the note, let me look into them further.
