#811 Add read properties hash code as index key to avoid false cache #812

yruslan · 2025-12-22T08:06:32Z

…retrieval when read options are different.

Closes #811

Summary by CodeRabbit

Bug Fixes
- Added bounds to index size calculations for non‑compressed inputs, preventing excessive memory use.
- Improved cache invalidation to account for reader option changes, ensuring correct index retrieval when configurations differ.
Refactor
- Replaced unbounded in‑memory cache with a bounded LRU cache and switched to a composite cache key that includes reader options.
- Updated index-building API to accept additional context parameters.
Tests
- Added unit tests covering LRU cache behavior.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…retrieval when read options are different.

coderabbitai · 2025-12-22T08:06:57Z

Walkthrough

Replaces an unbounded index cache with a size-bounded LRUCache keyed by an IndexKey that includes reader options hash, updates IndexBuilder to use the new cache and adjusted method signature, caps a fallback index-size calculation in VarLenNestedReader, and adds LRUCache tests and a new IndexKey type.

Changes

Cohort / File(s)	Summary
Cache implementation `spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala`	New thread-safe LRUCache[K,V] with synchronized APIs: `apply`, `containsKey`, `get`, `getMap`, `put`, `remove`; evicts eldest entries when size > maxSize.
Cache key type `spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala`	Added `case class IndexKey(fileName: String, optionsHashCode: Long)` to represent file + reader-options identity for cache keys.
Cache usage & API `spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala`	Replaced ConcurrentHashMap with `LRUCache[IndexKey, Array[SparseIndexEntry]]`; compute `readerOptionsHashCode` and use `IndexKey` for contains/put/get; buildIndex signature updated to accept `sqlContext: SQLContext` (kept localityParams).
Index-size fallback `cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala`	Adjusted `getSplitSizeMB` non-compressed branch to compute a capped default index size by multiplying `fsDefaultBlockSize` by `DEFAULT_FS_INDEX_SIZE_MULTIPLIER` and limiting the result (cap at 256), then fall back to `inputSplitSizeMB` or this capped default.
Tests — integration adjustment `spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala`	Reworked test cache access to a simplified filename-keyed view (`indexCacheSimplified`) derived from `indexCache.getMap` and updated presence/length checks accordingly.
Tests — unit `spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/LRUCacheSuite.scala`	New unit tests for LRUCache: capacity retention, LRU eviction behavior, access-touch preservation, removal/invalidation, and `getMap` verification.

Sequence Diagram

sequenceDiagram
    participant Client as Reader / Caller
    participant IndexBuilder as IndexBuilder
    participant Cache as LRUCache
    participant FS as Filesystem

    Client->>IndexBuilder: buildIndex(filesList, cobolReader, sqlContext, ...)
    IndexBuilder->>IndexBuilder: compute IndexKey(filePath, readerOptionsHashCode)
    IndexBuilder->>Cache: containsKey(IndexKey)?
    alt cache hit (same reader options)
        Cache-->>IndexBuilder: true
        IndexBuilder->>Cache: apply(IndexKey)
        Cache-->>IndexBuilder: cached SparseIndexEntry[]
        IndexBuilder-->>Client: return cached index
    else cache miss or different options
        IndexBuilder->>FS: read file contents
        FS-->>IndexBuilder: file data / parsed entries
        IndexBuilder->>IndexBuilder: build SparseIndexEntry[]
        IndexBuilder->>Cache: put(IndexKey, SparseIndexEntry[])
        IndexBuilder-->>Client: return new index
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pay attention to: IndexBuilder API change (call sites must match new signature), correctness of readerOptionsHashCode (ensure it covers relevant options), equality/hash behavior of IndexKey, thread-safety and eviction correctness in LRUCache, and updated tests that map old cache semantics to the new keying scheme.

Possibly related PRs

#809 Add support for reading compressed EBCDIC files #810 — Modifies the same VarLenNestedReader sizing/compression-related code; likely overlap on index-size and compression handling.
#805 Enable the option to cache VRL indexes for better performance when same files are processed multiple times #806 — Related prior work on index caching and buildIndex behavior; directly connected to cache refactor here.

Poem

🐰
I stash indexes neat, hop lightly through each key,
Old crumbs get nudged away — the freshest stay with me.
Reader options match, the cache knows what to keep,
Caps on sizes hum, no surprises in the heap.
Hop, tap, and index — cache tidy, fast, and sweet.

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding reader properties hash code as an index cache key to prevent false cache hits.
Linked Issues check	✅ Passed	The PR fully implements the requirement from #811: reader parameters are now incorporated into the cache key via readerOptionsHashCode, ensuring cache hits only occur when reader options match.
Out of Scope Changes check	✅ Passed	All changes are directly related to the cache key enhancement objective. Minor adjustments to VarLenNestedReader are within scope as they refine related indexing logic.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/811-improve-index-caching

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f29acc9 and 6b82b30.

📒 Files selected for processing (3)

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala

🚧 Files skipped from review as they are similar to previous changes (3)

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)

GitHub Check: Spark 3.5.7 on Scala 2.12.20
GitHub Check: Spark 3.4.4 on Scala 2.12.20
GitHub Check: Spark 2.4.8 on Scala 2.11.12
GitHub Check: Spark 3.5.7 on Scala 2.13.17
GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2025-12-22T08:09:21Z

JaCoCo code coverage report - 'cobol-parser'

Overall Project	91.51%	🍏
Files changed	85%	🍏

File	Coverage
VarLenNestedReader.scala	68.95% `-0.34%`	🍏

github-actions · 2025-12-22T08:09:22Z

JaCoCo code coverage report - 'spark-cobol'

File	Coverage [96.61%]	🍏
IndexKey.scala	100%	🍏
LRUCache.scala	97.85%	🍏
IndexBuilder.scala	96.4%	🍏

Total Project Coverage	81.96%	🍏

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (1)
224-226: Consider extracting the magic number 256 to a named constant.

The hardcoded value 256 represents the maximum split size (in MB) for uncompressed files based on the filesystem block size. Extracting it to a named constant (e.g., DEFAULT_MAX_FS_BASED_SPLIT_SIZE_MB) would improve readability and align with the existing pattern used for DEFAULT_INDEX_SIZE_COMPRESSED_FILES_MB on line 45.
🔎 Suggested refactor

Add the constant near line 46:
 private val DEFAULT_INDEX_SIZE_COMPRESSED_FILES_MB = 1024
 private val DEFAULT_FS_INDEX_SIZE_MULTIPLIER = 4
+private val DEFAULT_MAX_FS_BASED_SPLIT_SIZE_MB = 256
Then use it in the calculation:
       val defaultIndexSizeBofFsBlock = readerProperties.fsDefaultBlockSize.map { size =>
-        Math.min(size * DEFAULT_FS_INDEX_SIZE_MULTIPLIER, 256)
+        Math.min(size * DEFAULT_FS_INDEX_SIZE_MULTIPLIER, DEFAULT_MAX_FS_BASED_SPLIT_SIZE_MB)
       }
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (1)
200-205: Consider using more idiomatic Option handling.

The cache transformation and assertions work correctly, but the null check on Line 204 could be more idiomatic.
🔎 Suggested refactor
     val indexCacheSimplified = IndexBuilder.indexCache.getMap.map {
       case (k, v) => (k.fileName, v)
     }

-    assert(indexCacheSimplified.get(pathNameAsCached) != null)
-    assert(indexCacheSimplified(pathNameAsCached).length == 2)
+    assert(indexCacheSimplified.contains(pathNameAsCached))
+    assert(indexCacheSimplified(pathNameAsCached).length == 2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (1)
21-49: Solid LRU cache implementation with proper thread safety.

The implementation correctly uses LinkedHashMap with access-order mode and synchronized blocks for thread safety. The eviction logic (size() > maxSize) is the standard pattern for removeEldestEntry.
Minor optimization: Consider setting initial capacity closer to maxSize

With initialCapacity = 16 and maxSize = 10000, the internal map will resize many times as it grows. While not critical, you could optimize by using a larger initial capacity:
-  private val cache = new java.util.LinkedHashMap[K, V](16, loadFactor, true) {
+  private val cache = new java.util.LinkedHashMap[K, V](Math.min(maxSize, 1024), loadFactor, true) {
     override def removeEldestEntry(eldest: java.util.Map.Entry[K, V]): Boolean = size() > maxSize
   }
This reduces resize overhead during cache warm-up.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0136c0 and f29acc9.

📒 Files selected for processing (6)

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/LRUCacheSuite.scala

🧰 Additional context used

🧬 Code graph analysis (5)

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/LRUCacheSuite.scala (1)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (6)

LRUCache (21-49)

put (42-44)

get (34-36)

remove (46-48)

containsKey (30-32)

getMap (38-40)

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (1)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (2)

getMap (38-40)

get (34-36)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (2)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (3)

LRUCache (21-49)

containsKey (30-32)

put (42-44)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala (1)

IndexKey (19-19)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala (1)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (11)

cobol (47-397)

cobol (76-115)

cobol (121-176)

cobol (181-196)

cobol (198-223)

cobol (225-253)

cobol (255-327)

cobol (329-349)

cobol (354-359)

cobol (364-376)

cobol (378-380)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (1)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (11)

cobol (47-397)

cobol (76-115)

cobol (121-176)

cobol (181-196)

cobol (198-223)

cobol (225-253)

cobol (255-327)

cobol (329-349)

cobol (354-359)

cobol (364-376)

cobol (378-380)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)

GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
GitHub Check: Spark 3.4.4 on Scala 2.12.20
GitHub Check: Spark 2.4.8 on Scala 2.11.12
GitHub Check: Spark 3.5.7 on Scala 2.12.20
GitHub Check: Spark 3.5.7 on Scala 2.13.17

🔇 Additional comments (5)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala (1)

19-19: LGTM! Clean cache key implementation.

The case class provides proper equals and hashCode implementations automatically, making it suitable for use as a cache key. The combination of fileName and optionsHashCode correctly addresses the PR objective of preventing false cache hits when reader options differ.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/LRUCacheSuite.scala (1)

22-91: Excellent test coverage for the LRU cache.

The test suite comprehensively validates:

Basic storage and retrieval

Capacity-based eviction (oldest entries removed)

Access-order preservation (frequently accessed entries retained)

Removal semantics and key presence checks

Map view exposure via getMap

All test scenarios are well-structured and assertions are correct.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (3)

48-50: Good cache size limit to prevent unbounded growth.

Replacing the unbounded ConcurrentHashMap with a size-limited LRUCache prevents potential memory issues in long-running jobs with many unique file/option combinations.

131-131: Cache key usage correctly prevents false cache hits.

All cache operations consistently use IndexKey(filePath, readerOptionsHashCode), ensuring that cached indexes are only reused when both the file path and reader options match. This directly addresses the PR objective of preventing false cache retrieval when read options differ.

Also applies to: 162-162, 170-170

127-127: Verify that ReaderProperties.hashCode() is deterministic within the same JVM session.

The cache key implementation is sound. ReaderParameters is a case class with immutable fields, which means Scala automatically generates equals and hashCode methods, which let you compare objects and easily use them as keys in maps. The hashCode will be consistent for the same field values within a single JVM session. Ensure that reader configuration is not modified after initialization, as equals and hashCode algorithms that depend on mutable state can be problematic for users in collections.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala

#811 Add read properties hash code as index key to avoid false cache …

f29acc9

…retrieval when read options are different.

coderabbitai bot reviewed Dec 22, 2025

View reviewed changes

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala Outdated Show resolved Hide resolved

#811 Fix PR suggestions (Thanks @coderabbitai).

6b82b30

yruslan merged commit 4f3f20d into master Dec 22, 2025
7 checks passed

yruslan deleted the feature/811-improve-index-caching branch December 22, 2025 08:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

#811 Add read properties hash code as index key to avoid false cache #812

#811 Add read properties hash code as index key to avoid false cache #812

Uh oh!

yruslan commented Dec 22, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 22, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Dec 22, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Dec 22, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

#811 Add read properties hash code as index key to avoid false cache #812

#811 Add read properties hash code as index key to avoid false cache #812

Uh oh!

Conversation

yruslan commented Dec 22, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

Pre-merge checks and finishing touches

Uh oh!

github-actions bot commented Dec 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

JaCoCo code coverage report - 'cobol-parser'

Uh oh!

github-actions bot commented Dec 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

JaCoCo code coverage report - 'spark-cobol'

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yruslan commented Dec 22, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 22, 2025 •

edited

Loading

github-actions bot commented Dec 22, 2025 •

edited

Loading

github-actions bot commented Dec 22, 2025 •

edited

Loading