#811 Add read properties hash code as index key to avoid false cache #812
Conversation
…retrieval when read options are different.
Walkthrough

Replaces an unbounded index cache with a size-bounded LRUCache keyed by an IndexKey that includes the reader options hash, updates IndexBuilder to use the new cache with an adjusted method signature, caps a fallback index-size calculation in VarLenNestedReader, and adds LRUCache tests and a new IndexKey type.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Reader / Caller
    participant IndexBuilder as IndexBuilder
    participant Cache as LRUCache
    participant FS as Filesystem
    Client->>IndexBuilder: buildIndex(filesList, cobolReader, sqlContext, ...)
    IndexBuilder->>IndexBuilder: compute IndexKey(filePath, readerOptionsHashCode)
    IndexBuilder->>Cache: containsKey(IndexKey)?
    alt cache hit (same reader options)
        Cache-->>IndexBuilder: true
        IndexBuilder->>Cache: apply(IndexKey)
        Cache-->>IndexBuilder: cached SparseIndexEntry[]
        IndexBuilder-->>Client: return cached index
    else cache miss or different options
        IndexBuilder->>FS: read file contents
        FS-->>IndexBuilder: file data / parsed entries
        IndexBuilder->>IndexBuilder: build SparseIndexEntry[]
        IndexBuilder->>Cache: put(IndexKey, SparseIndexEntry[])
        IndexBuilder-->>Client: return new index
    end
```
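The diagram's flow can be sketched in a few lines (a simplified, hypothetical model: the `IndexKey`, `SparseIndexEntry`, and `buildIndex` shapes below follow the diagram, not the actual cobrix API):

```scala
import scala.collection.mutable

// Hypothetical simplified types mirroring the diagram, not the real cobrix classes.
case class IndexKey(fileName: String, optionsHashCode: Int)
case class SparseIndexEntry(offsetFrom: Long, offsetTo: Long)

object IndexBuilderSketch {
  // Stand-in for the cache; the real implementation is a size-bounded LRUCache.
  private val cache = mutable.Map.empty[IndexKey, Array[SparseIndexEntry]]

  def buildIndex(filePath: String, readerOptionsHashCode: Int): Array[SparseIndexEntry] = {
    val key = IndexKey(filePath, readerOptionsHashCode)
    cache.get(key) match {
      case Some(index) => index // cache hit: same file AND same reader options
      case None =>
        // Stand-in for reading the file and building the sparse index entries.
        val index = Array(SparseIndexEntry(0L, 100L))
        cache.put(key, index)
        index
    }
  }
}
```

Because the options hash is part of the key, reading the same file with different reader options misses the cache and rebuilds the index, which is exactly the false-hit fix this PR describes.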
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
JaCoCo code coverage report - 'cobol-parser'
JaCoCo code coverage report - 'spark-cobol'
Actionable comments posted: 1
🧹 Nitpick comments (3)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (1)
224-226: Consider extracting the magic number 256 to a named constant.

The hardcoded value `256` represents the maximum split size (in MB) for uncompressed files based on the filesystem block size. Extracting it to a named constant (e.g., `DEFAULT_MAX_FS_BASED_SPLIT_SIZE_MB`) would improve readability and align with the existing pattern used for `DEFAULT_INDEX_SIZE_COMPRESSED_FILES_MB` on line 45.

🔎 Suggested refactor

Add the constant near line 46:

```diff
 private val DEFAULT_INDEX_SIZE_COMPRESSED_FILES_MB = 1024
 private val DEFAULT_FS_INDEX_SIZE_MULTIPLIER = 4
+private val DEFAULT_MAX_FS_BASED_SPLIT_SIZE_MB = 256
```

Then use it in the calculation:

```diff
 val defaultIndexSizeBofFsBlock = readerProperties.fsDefaultBlockSize.map { size =>
-  Math.min(size * DEFAULT_FS_INDEX_SIZE_MULTIPLIER, 256)
+  Math.min(size * DEFAULT_FS_INDEX_SIZE_MULTIPLIER, DEFAULT_MAX_FS_BASED_SPLIT_SIZE_MB)
 }
```

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (1)
200-205: Consider using more idiomatic Option handling.

The cache transformation and assertions work correctly, but the null check on line 204 could be more idiomatic.

🔎 Suggested refactor

```diff
 val indexCacheSimplified = IndexBuilder.indexCache.getMap.map { case (k, v) => (k.fileName, v) }
-assert(indexCacheSimplified.get(pathNameAsCached) != null)
-assert(indexCacheSimplified(pathNameAsCached).length == 2)
+assert(indexCacheSimplified.contains(pathNameAsCached))
+assert(indexCacheSimplified(pathNameAsCached).length == 2)
```

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (1)
21-49: Solid LRU cache implementation with proper thread safety.

The implementation correctly uses `LinkedHashMap` with access-order mode and `synchronized` blocks for thread safety. The eviction logic (`size() > maxSize`) is the standard pattern for `removeEldestEntry`.

Minor optimization: consider setting the initial capacity closer to maxSize.

With `initialCapacity = 16` and `maxSize = 10000`, the internal map will resize many times as it grows. While not critical, you could optimize by using a larger initial capacity:

```diff
-private val cache = new java.util.LinkedHashMap[K, V](16, loadFactor, true) {
+private val cache = new java.util.LinkedHashMap[K, V](Math.min(maxSize, 1024), loadFactor, true) {
   override def removeEldestEntry(eldest: java.util.Map.Entry[K, V]): Boolean = size() > maxSize
 }
```

This reduces resize overhead during cache warm-up.
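The pattern the reviewer describes is self-contained enough to sketch in full: `java.util.LinkedHashMap` constructed with `accessOrder = true`, plus an overridden `removeEldestEntry`, yields an LRU cache with almost no code. The class and method names here are illustrative stand-ins, not the actual cobrix `LRUCache`:

```scala
class LruCacheSketch[K, V](maxSize: Int) {
  // accessOrder = true: iteration order is least-recently-accessed first,
  // so the "eldest" entry is exactly the LRU candidate for eviction.
  private val cache = new java.util.LinkedHashMap[K, V](16, 0.75f, true) {
    override def removeEldestEntry(eldest: java.util.Map.Entry[K, V]): Boolean =
      size() > maxSize // evict once the cap is exceeded
  }

  def put(key: K, value: V): Unit = cache.synchronized { cache.put(key, value) }
  def get(key: K): Option[V] = cache.synchronized { Option(cache.get(key)) }
  def containsKey(key: K): Boolean = cache.synchronized { cache.containsKey(key) }
  def size: Int = cache.synchronized { cache.size() }
}
```

Note that a `get` counts as an access and refreshes the entry's recency, which is why frequently read entries survive eviction while stale ones drop out.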
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/LRUCacheSuite.scala
🧰 Additional context used
🧬 Code graph analysis (5)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/LRUCacheSuite.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (6)
`LRUCache` (21-49), `put` (42-44), `get` (34-36), `remove` (46-48), `containsKey` (30-32), `getMap` (38-40)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test37RecordLengthMappingSpec.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (2)
`getMap` (38-40), `get` (34-36)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (3)
`LRUCache` (21-49), `containsKey` (30-32), `put` (42-44)

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala (1)

`IndexKey` (19-19)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (11)
`cobol` (47-397), `cobol` (76-115), `cobol` (121-176), `cobol` (181-196), `cobol` (198-223), `cobol` (225-253), `cobol` (255-327), `cobol` (329-349), `cobol` (354-359), `cobol` (364-376), `cobol` (378-380)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/LRUCache.scala (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (11)
`cobol` (47-397), `cobol` (76-115), `cobol` (121-176), `cobol` (181-196), `cobol` (198-223), `cobol` (225-253), `cobol` (255-327), `cobol` (329-349), `cobol` (354-359), `cobol` (364-376), `cobol` (378-380)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Spark 3.5.7 on Scala 2.12.20
- GitHub Check: Spark 3.5.7 on Scala 2.13.17
🔇 Additional comments (5)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexKey.scala (1)
19-19: LGTM! Clean cache key implementation.

The case class provides proper `equals` and `hashCode` implementations automatically, making it suitable for use as a cache key. The combination of `fileName` and `optionsHashCode` correctly addresses the PR objective of preventing false cache hits when reader options differ.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/LRUCacheSuite.scala (1)
22-91: Excellent test coverage for the LRU cache.

The test suite comprehensively validates:
- Basic storage and retrieval
- Capacity-based eviction (oldest entries removed)
- Access-order preservation (frequently accessed entries retained)
- Removal semantics and key presence checks
- Map view exposure via `getMap`

All test scenarios are well-structured and assertions are correct.
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (3)
48-50: Good cache size limit to prevent unbounded growth.

Replacing the unbounded `ConcurrentHashMap` with a size-limited `LRUCache` prevents potential memory issues in long-running jobs with many unique file/option combinations.

131-131: Cache key usage correctly prevents false cache hits.

All cache operations consistently use `IndexKey(filePath, readerOptionsHashCode)`, ensuring that cached indexes are only reused when both the file path and reader options match. This directly addresses the PR objective of preventing false cache retrieval when read options differ.

Also applies to: 162-162, 170-170

127-127: Verify that ReaderProperties.hashCode() is deterministic within the same JVM session.

The cache key implementation is sound. ReaderParameters is a case class with immutable fields, so Scala automatically generates `equals` and `hashCode` from those field values, making instances comparable and safe to use as map keys. The hash code will be consistent for the same field values within a single JVM session. Ensure that the reader configuration is not modified after initialization, since `equals` and `hashCode` that depend on mutable state are problematic for keys held in collections.
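The determinism point can be demonstrated with a toy stand-in (the fields below are hypothetical; the real ReaderParameters has many more): a case class's generated `hashCode` depends only on its field values, so two equal instances always hash identically within a JVM session.

```scala
// Hypothetical stand-in for ReaderParameters: immutable fields only, so the
// compiler-generated equals/hashCode depend solely on the field values.
case class ReaderParamsSketch(recordFormat: String, isEbcdic: Boolean, recordLength: Int)

// Assumed helper mirroring how a reader-options hash could feed the cache key.
def readerOptionsHash(params: ReaderParamsSketch): Int = params.hashCode
```

Two instances built from the same option values yield the same hash, while changing any field produces a different key, which is what makes the `IndexKey(filePath, readerOptionsHashCode)` scheme safe as long as the fields stay immutable.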
Closes #811