[core] Support generating splits with finer granularity than file level

Purpose

Linked issue: close #5012

This PR adds support for generating splits at a finer granularity than the file level (e.g., row groups for Parquet, stripes for ORC), significantly increasing read concurrency for large files. This follows the proven pattern used by Spark and Flink, where files are split at natural boundaries (row groups/stripes) rather than at file boundaries.

The implementation leverages the existing RawFile infrastructure, whose offset and length fields already describe byte ranges, ensuring backward compatibility while enabling improved parallelism for large file reads.
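
For intuition, here is a minimal sketch of the idea (the file name, sizes, and the ReadRange type are invented for illustration; only the offset/length notion comes from this PR):

```java
import java.util.List;

// Hypothetical illustration: a 512 MB Parquet file with four 128 MB row groups
// yields four independent read ranges instead of one file-level split. Each
// range maps onto the offset/length fields that RawFile already carries.
public class RowGroupRangesExample {
    record ReadRange(String filePath, long offset, long length) {}

    public static void main(String[] args) {
        long rowGroup = 128L << 20; // 128 MB per row group
        List<ReadRange> ranges = List.of(
                new ReadRange("data-0.parquet", 0 * rowGroup, rowGroup),
                new ReadRange("data-0.parquet", 1 * rowGroup, rowGroup),
                new ReadRange("data-0.parquet", 2 * rowGroup, rowGroup),
                new ReadRange("data-0.parquet", 3 * rowGroup, rowGroup));
        // Four readers can now consume the same file concurrently.
        ranges.forEach(r -> System.out.printf(
                "read %s offset=%d length=%d%n", r.filePath(), r.offset(), r.length()));
    }
}
```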

Tests

Unit tests and integration tests are needed for the following (a sketch of the kind of assertion involved appears after the list):

  • ParquetMetadataReader: Verify row group boundary extraction
  • OrcMetadataReader: Verify stripe boundary extraction
  • FineGrainedSplitGenerator: Verify split generation logic
  • ParquetReaderFactory.createReader(offset, length): Verify range-based reading
  • OrcReaderFactory.createReader(offset, length): Verify range-based reading
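
As a sketch of what these tests would assert, assuming a stand-in boundary type and splitting helper (the real FileSplitBoundary and FineGrainedSplitGenerator APIs may differ):

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;
import org.junit.jupiter.api.Test;

// Hypothetical test sketch: Boundary and split(...) stand in for
// FileSplitBoundary and FineGrainedSplitGenerator, whose APIs may differ.
class FineGrainedSplitSketchTest {
    record Boundary(long offset, long length, long rowCount) {}

    // Stand-in splitter: one split per boundary, capped at maxSplits.
    static List<Boundary> split(List<Boundary> boundaries, int maxSplits) {
        return boundaries.subList(0, Math.min(boundaries.size(), maxSplits));
    }

    @Test
    void splitsAreContiguousAndCapped() {
        List<Boundary> boundaries = List.of(
                new Boundary(0, 100, 10),
                new Boundary(100, 200, 20),
                new Boundary(300, 50, 5));
        List<Boundary> splits = split(boundaries, 2);
        assertThat(splits).hasSize(2); // file-max-splits cap is respected
        assertThat(splits.get(0).offset() + splits.get(0).length())
                .isEqualTo(splits.get(1).offset()); // no gap or overlap
    }
}
```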

API and Format

New Configuration Options (an example of setting them follows the list):

  • source.split.file-enabled: Enable finer-grained file splitting (default: false)
  • source.split.file-threshold: Minimum file size to consider splitting (default: 128MB)
  • source.split.file-max-splits: Maximum splits per file (default: 100)
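
The option keys below come from this PR; how they reach a table (and the exact value syntax) depends on the engine, so the plain map is illustrative only:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: option keys are from this PR; the value syntax and the
// way options are passed to a table (Flink/Spark) are assumptions.
public class SplitOptionsExample {
    public static void main(String[] args) {
        Map<String, String> tableOptions = new HashMap<>();
        tableOptions.put("source.split.file-enabled", "true");    // opt in (default: false)
        tableOptions.put("source.split.file-threshold", "128mb"); // only split files >= 128 MB
        tableOptions.put("source.split.file-max-splits", "100");  // cap splits per file
        tableOptions.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```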

New Interfaces/Classes (a sketch of their likely shape follows the list):

  • FormatMetadataReader: Interface for reading format-specific metadata
  • FileSplitBoundary: Represents split boundaries (offset, length, rowCount)
  • ParquetMetadataReader: Extracts row group boundaries from Parquet files
  • OrcMetadataReader: Extracts stripe boundaries from ORC files
  • FineGrainedSplitGenerator: Decorator for SplitGenerator that enables fine-grained splitting
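
A plausible shape for the two central abstractions; only the names come from the PR, every signature below is an assumption:

```java
import java.io.IOException;
import java.util.List;

// Sketch only: class names are from the PR description, but all method
// signatures here are guesses at what such an interface could look like.
interface FormatMetadataReader {
    // Read the format's natural split boundaries (Parquet row groups,
    // ORC stripes) from the file footer without scanning the data pages.
    List<FileSplitBoundary> readBoundaries(String filePath) throws IOException;
}

// Immutable value describing one splittable range inside a file.
final class FileSplitBoundary {
    private final long offset;   // byte offset where the row group / stripe starts
    private final long length;   // byte length of the row group / stripe
    private final long rowCount; // number of rows in the range

    FileSplitBoundary(long offset, long length, long rowCount) {
        this.offset = offset;
        this.length = length;
        this.rowCount = rowCount;
    }

    long offset()   { return offset; }
    long length()   { return length; }
    long rowCount() { return rowCount; }
}
```

ParquetMetadataReader and OrcMetadataReader would then be the two format-specific implementations, and FineGrainedSplitGenerator would consume the boundaries to emit one split per range, up to the configured cap.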

Extended Interfaces (see the range-assignment sketch after the list):

  • FormatReaderFactory.createReader(Context, offset, length): Now implemented for Parquet and ORC
  • DataSplit: Added transient fileSplitBoundaries field (not serialized for backward compatibility)
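
One detail a range-based createReader must pin down is which row groups or stripes belong to a given (offset, length) range. A common convention is to assign a unit to the range that contains its starting byte, so adjacent ranges never read the same unit twice and never skip one; the Unit type below is hypothetical:

```java
import java.util.List;

// Sketch of a range-assignment convention for createReader(context, offset,
// length); illustrative, not the PR's implementation.
public class RangeAssignmentExample {
    record Unit(long offset, long length) {} // a row group or stripe

    // A unit belongs to the range iff its starting byte falls inside it.
    static boolean belongsTo(Unit u, long rangeOffset, long rangeLength) {
        return u.offset() >= rangeOffset && u.offset() < rangeOffset + rangeLength;
    }

    public static void main(String[] args) {
        List<Unit> units = List.of(new Unit(0, 100), new Unit(100, 150), new Unit(250, 80));
        // The range [100, 250) picks up exactly the second unit.
        units.forEach(u -> System.out.println(u + " -> " + belongsTo(u, 100, 150)));
    }
}
```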

Storage Format: No changes to storage format. This is a read-time optimization that doesn't affect how data is written or stored.

Documentation

This change introduces a new feature that should be documented:

  1. Configuration Guide: Document the new source.split.file-enabled and related options
  2. Performance Tuning Guide: Explain when and how to use fine-grained splitting for optimal performance
  3. API Documentation: Document the new FormatMetadataReader interface and implementations

The feature is disabled by default to maintain backward compatibility. Users can enable it by setting source.split.file-enabled=true in their table options.
