Skip to content

Conversation

@yruslan
Copy link
Collaborator

@yruslan yruslan commented Dec 22, 2025

…retrieval when read options are different.

Closes #811

Summary by CodeRabbit

  • Bug Fixes

    • Added bounds to index size calculations for non‑compressed inputs, preventing excessive memory use.
    • Improved cache invalidation to account for reader option changes, ensuring correct index retrieval when configurations differ.
  • Refactor

    • Replaced unbounded in‑memory cache with a bounded LRU cache and switched to a composite cache key that includes reader options.
    • Updated index-building API to accept additional context parameters.
  • Tests

    • Added unit tests covering LRU cache behavior.

✏️ Tip: You can customize this high-level summary in your review settings.

…retrieval when read options are different.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 22, 2025

Walkthrough

Replaces an unbounded index cache with a size-bounded LRUCache keyed by an IndexKey that includes reader options hash, updates IndexBuilder to use the new cache and adjusted method signature, caps a fallback index-size calculation in VarLenNestedReader, and adds LRUCache tests and a new IndexKey type.

Changes

Cohort / File(s) Summary
Cache implementation
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala
New thread-safe LRUCache[K,V] with synchronized APIs: apply, containsKey, get, getMap, put, remove; evicts eldest entries when size > maxSize.
Cache key type
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala
Added case class IndexKey(fileName: String, optionsHashCode: Long) to represent file + reader-options identity for cache keys.
Cache usage & API
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala
Replaced ConcurrentHashMap with LRUCache[IndexKey, Array[SparseIndexEntry]]; compute readerOptionsHashCode and use IndexKey for contains/put/get; buildIndex signature updated to accept sqlContext: SQLContext (kept localityParams).
Index-size fallback
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala
Adjusted getSplitSizeMB non-compressed branch to compute a capped default index size by multiplying fsDefaultBlockSize by DEFAULT_FS_INDEX_SIZE_MULTIPLIER and limiting the result (cap at 256), then fall back to inputSplitSizeMB or this capped default.
Tests — integration adjustment
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala
Reworked test cache access to a simplified filename-keyed view (indexCacheSimplified) derived from indexCache.getMap and updated presence/length checks accordingly.
Tests — unit
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/LRUCacheSuite.scala
New unit tests for LRUCache: capacity retention, LRU eviction behavior, access-touch preservation, removal/invalidation, and getMap verification.

Sequence Diagram

sequenceDiagram
    participant Client as Reader / Caller
    participant IndexBuilder as IndexBuilder
    participant Cache as LRUCache
    participant FS as Filesystem

    Client->>IndexBuilder: buildIndex(filesList, cobolReader, sqlContext, ...)
    IndexBuilder->>IndexBuilder: compute IndexKey(filePath, readerOptionsHashCode)
    IndexBuilder->>Cache: containsKey(IndexKey)?
    alt cache hit (same reader options)
        Cache-->>IndexBuilder: true
        IndexBuilder->>Cache: apply(IndexKey)
        Cache-->>IndexBuilder: cached SparseIndexEntry[]
        IndexBuilder-->>Client: return cached index
    else cache miss or different options
        IndexBuilder->>FS: read file contents
        FS-->>IndexBuilder: file data / parsed entries
        IndexBuilder->>IndexBuilder: build SparseIndexEntry[]
        IndexBuilder->>Cache: put(IndexKey, SparseIndexEntry[])
        IndexBuilder-->>Client: return new index
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay attention to: IndexBuilder API change (call sites must match new signature), correctness of readerOptionsHashCode (ensure it covers relevant options), equality/hash behavior of IndexKey, thread-safety and eviction correctness in LRUCache, and updated tests that map old cache semantics to the new keying scheme.

Possibly related PRs

Poem

🐰
I stash indexes neat, hop lightly through each key,
Old crumbs get nudged away — the freshest stay with me.
Reader options match, the cache knows what to keep,
Caps on sizes hum, no surprises in the heap.
Hop, tap, and index — cache tidy, fast, and sweet.

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding reader properties hash code as an index cache key to prevent false cache hits.
Linked Issues check ✅ Passed The PR fully implements the requirement from #811: reader parameters are now incorporated into the cache key via readerOptionsHashCode, ensuring cache hits only occur when reader options match.
Out of Scope Changes check ✅ Passed All changes are directly related to the cache key enhancement objective. Minor adjustments to VarLenNestedReader are within scope as they refine related indexing logic.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/811-improve-index-caching

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f29acc9 and 6b82b30.

📒 Files selected for processing (3)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala
🚧 Files skipped from review as they are similar to previous changes (3)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Spark 3.5.7 on Scala 2.12.20
  • GitHub Check: Spark 3.4.4 on Scala 2.12.20
  • GitHub Check: Spark 2.4.8 on Scala 2.11.12
  • GitHub Check: Spark 3.5.7 on Scala 2.13.17
  • GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

github-actions bot commented Dec 22, 2025

JaCoCo code coverage report - 'cobol-parser'

Overall Project 91.51% 🍏
Files changed 85% 🍏

File Coverage
VarLenNestedReader.scala 68.95% -0.34% 🍏

@github-actions
Copy link

github-actions bot commented Dec 22, 2025

JaCoCo code coverage report - 'spark-cobol'

File Coverage [96.61%] 🍏
IndexKey.scala 100% 🍏
LRUCache.scala 97.85% 🍏
IndexBuilder.scala 96.4% 🍏
Total Project Coverage 81.96% 🍏

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (1)

224-226: Consider extracting the magic number 256 to a named constant.

The hardcoded value 256 represents the maximum split size (in MB) for uncompressed files based on the filesystem block size. Extracting it to a named constant (e.g., DEFAULT_MAX_FS_BASED_SPLIT_SIZE_MB) would improve readability and align with the existing pattern used for DEFAULT_INDEX_SIZE_COMPRESSED_FILES_MB on line 45.

🔎 Suggested refactor

Add the constant near line 46:

 private val DEFAULT_INDEX_SIZE_COMPRESSED_FILES_MB = 1024
 private val DEFAULT_FS_INDEX_SIZE_MULTIPLIER = 4
+private val DEFAULT_MAX_FS_BASED_SPLIT_SIZE_MB = 256

Then use it in the calculation:

       val defaultIndexSizeBofFsBlock = readerProperties.fsDefaultBlockSize.map { size =>
-        Math.min(size * DEFAULT_FS_INDEX_SIZE_MULTIPLIER, 256)
+        Math.min(size * DEFAULT_FS_INDEX_SIZE_MULTIPLIER, DEFAULT_MAX_FS_BASED_SPLIT_SIZE_MB)
       }
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (1)

200-205: Consider using more idiomatic Option handling.

The cache transformation and assertions work correctly, but the null check on Line 204 could be more idiomatic.

🔎 Suggested refactor
     val indexCacheSimplified = IndexBuilder.indexCache.getMap.map {
       case (k, v) => (k.fileName, v)
     }

-    assert(indexCacheSimplified.get(pathNameAsCached) != null)
-    assert(indexCacheSimplified(pathNameAsCached).length == 2)
+    assert(indexCacheSimplified.contains(pathNameAsCached))
+    assert(indexCacheSimplified(pathNameAsCached).length == 2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (1)

21-49: Solid LRU cache implementation with proper thread safety.

The implementation correctly uses LinkedHashMap with access-order mode and synchronized blocks for thread safety. The eviction logic (size() > maxSize) is the standard pattern for removeEldestEntry.

Minor optimization: Consider setting initial capacity closer to maxSize

With initialCapacity = 16 and maxSize = 10000, the internal map will resize many times as it grows. While not critical, you could optimize by using a larger initial capacity:

-  private val cache = new java.util.LinkedHashMap[K, V](16, loadFactor, true) {
+  private val cache = new java.util.LinkedHashMap[K, V](Math.min(maxSize, 1024), loadFactor, true) {
     override def removeEldestEntry(eldest: java.util.Map.Entry[K, V]): Boolean = size() > maxSize
   }

This reduces resize overhead during cache warm-up.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0136c0 and f29acc9.

📒 Files selected for processing (6)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/LRUCacheSuite.scala
🧰 Additional context used
🧬 Code graph analysis (5)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/LRUCacheSuite.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (6)
  • LRUCache (21-49)
  • put (42-44)
  • get (34-36)
  • remove (46-48)
  • containsKey (30-32)
  • getMap (38-40)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (2)
  • getMap (38-40)
  • get (34-36)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (3)
  • LRUCache (21-49)
  • containsKey (30-32)
  • put (42-44)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala (1)
  • IndexKey (19-19)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (11)
  • cobol (47-397)
  • cobol (76-115)
  • cobol (121-176)
  • cobol (181-196)
  • cobol (198-223)
  • cobol (225-253)
  • cobol (255-327)
  • cobol (329-349)
  • cobol (354-359)
  • cobol (364-376)
  • cobol (378-380)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (11)
  • cobol (47-397)
  • cobol (76-115)
  • cobol (121-176)
  • cobol (181-196)
  • cobol (198-223)
  • cobol (225-253)
  • cobol (255-327)
  • cobol (329-349)
  • cobol (354-359)
  • cobol (364-376)
  • cobol (378-380)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
  • GitHub Check: Spark 3.4.4 on Scala 2.12.20
  • GitHub Check: Spark 2.4.8 on Scala 2.11.12
  • GitHub Check: Spark 3.5.7 on Scala 2.12.20
  • GitHub Check: Spark 3.5.7 on Scala 2.13.17
🔇 Additional comments (5)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala (1)

19-19: LGTM! Clean cache key implementation.

The case class provides proper equals and hashCode implementations automatically, making it suitable for use as a cache key. The combination of fileName and optionsHashCode correctly addresses the PR objective of preventing false cache hits when reader options differ.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/LRUCacheSuite.scala (1)

22-91: Excellent test coverage for the LRU cache.

The test suite comprehensively validates:

  • Basic storage and retrieval
  • Capacity-based eviction (oldest entries removed)
  • Access-order preservation (frequently accessed entries retained)
  • Removal semantics and key presence checks
  • Map view exposure via getMap

All test scenarios are well-structured and assertions are correct.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (3)

48-50: Good cache size limit to prevent unbounded growth.

Replacing the unbounded ConcurrentHashMap with a size-limited LRUCache prevents potential memory issues in long-running jobs with many unique file/option combinations.


131-131: Cache key usage correctly prevents false cache hits.

All cache operations consistently use IndexKey(filePath, readerOptionsHashCode), ensuring that cached indexes are only reused when both the file path and reader options match. This directly addresses the PR objective of preventing false cache retrieval when read options differ.

Also applies to: 162-162, 170-170


127-127: Verify that ReaderProperties.hashCode() is deterministic within the same JVM session.

The cache key implementation is sound. ReaderParameters is a case class with immutable fields, which means Scala automatically generates equals and hashCode methods, which let you compare objects and easily use them as keys in maps. The hashCode will be consistent for the same field values within a single JVM session. Ensure that reader configuration is not modified after initialization, as equals and hashCode algorithms that depend on mutable state can be problematic for users in collections.

@yruslan yruslan merged commit 4f3f20d into master Dec 22, 2025
7 checks passed
@yruslan yruslan deleted the feature/811-improve-index-caching branch December 22, 2025 08:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make sure indexes from cache are used only if reader parameters are the same

2 participants