# [core] Support generating splits with finer granularity than file level #6917
## Purpose
Linked issue: close #5012
This PR implements support for generating splits at finer granularity than file level (e.g., row groups for Parquet, stripes for ORC) to significantly enhance concurrency when reading large files. This follows the proven pattern used by Spark and Flink, where files are split at natural boundaries (row groups/stripes) rather than file boundaries.
The implementation leverages the existing `RawFile` infrastructure with its `offset` and `length` fields, ensuring backward compatibility while enabling improved parallelism for large file reads.
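To make the mechanism concrete, here is a minimal sketch (not this PR's code) of how row-group boundaries can be read from a Parquet footer and turned into `(offset, length)` ranges; the `RowGroupRanges` class and `toSplitRanges` method are illustrative names. The ORC side is analogous via `Reader.getStripes()`.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class RowGroupRanges {

    /** Byte range of one row group: a candidate unit for a finer-grained split. */
    public static final class Range {
        public final long offset;
        public final long length;
        public final long rowCount;

        Range(long offset, long length, long rowCount) {
            this.offset = offset;
            this.length = length;
            this.rowCount = rowCount;
        }
    }

    public static List<Range> toSplitRanges(Path file, Configuration conf) throws IOException {
        List<Range> ranges = new ArrayList<>();
        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
            // Each row group recorded in the footer is an independently readable byte range.
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                ranges.add(new Range(
                        block.getStartingPos(),    // byte offset of the row group
                        block.getCompressedSize(), // compressed length in bytes
                        block.getRowCount()));
            }
        }
        return ranges;
    }
}
```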
## Tests

Unit tests and integration tests are needed for:
- `ParquetMetadataReader`: Verify row group boundary extraction
- `OrcMetadataReader`: Verify stripe boundary extraction
- `FineGrainedSplitGenerator`: Verify split generation logic
- `ParquetReaderFactory.createReader(offset, length)`: Verify range-based reading
- `OrcReaderFactory.createReader(offset, length)`: Verify range-based reading
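For instance, a boundary-extraction test could assert that consecutive row groups are laid out back to back; `ParquetMetadataReader#readBoundaries` and `FileSplitBoundary`'s accessors are assumptions about the new API's shape, and the fixture path is a placeholder.

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;
import org.junit.jupiter.api.Test;

class ParquetMetadataReaderTest {

    @Test
    void boundariesShouldBeContiguous() throws Exception {
        // Placeholder fixture: a Parquet file written with more than one row group.
        String testFile = "/tmp/multi-row-group.parquet";

        // Hypothetical API from this PR: one boundary per row group.
        List<FileSplitBoundary> boundaries =
                new ParquetMetadataReader().readBoundaries(testFile);

        assertThat(boundaries).hasSizeGreaterThan(1);
        for (int i = 1; i < boundaries.size(); i++) {
            FileSplitBoundary prev = boundaries.get(i - 1);
            // Row groups are written back to back, so each boundary should
            // start exactly where the previous one ends.
            assertThat(boundaries.get(i).offset())
                    .isEqualTo(prev.offset() + prev.length());
        }
    }
}
```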
## API and Format

**New Configuration Options:**
- `source.split.file-enabled`: Enable finer-grained file splitting (default: `false`)
- `source.split.file-threshold`: Minimum file size to consider splitting (default: `128MB`)
- `source.split.file-max-splits`: Maximum splits per file (default: `100`)
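Read together, these options suggest the following gating logic; this is an assumption about how the threshold and cap compose, not the PR's actual code.

```java
/** Illustrative gating logic for the options above. */
final class SplitGating {

    /** Files below the threshold stay as a single split: splitting them buys nothing. */
    static boolean shouldSplit(boolean enabled, long fileSizeBytes, long thresholdBytes) {
        return enabled && fileSizeBytes >= thresholdBytes;
    }

    /** Never exceed the natural boundary count, and cap the per-file fan-out. */
    static int targetSplitCount(int naturalBoundaryCount, int maxSplitsPerFile) {
        return Math.min(naturalBoundaryCount, maxSplitsPerFile);
    }
}
```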
**New Interfaces/Classes:**

- `FormatMetadataReader`: Interface for reading format-specific metadata
- `FileSplitBoundary`: Represents split boundaries (offset, length, rowCount)
- `ParquetMetadataReader`: Extracts row group boundaries from Parquet files
- `OrcMetadataReader`: Extracts stripe boundaries from ORC files
- `FineGrainedSplitGenerator`: Decorator for `SplitGenerator` that enables fine-grained splitting
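The class names above come from this PR, but their exact signatures are not shown here, so the following is a guess at a plausible shape (two files condensed into one listing):

```java
// FormatMetadataReader.java -- possible shape of the metadata abstraction.
import java.io.IOException;
import java.util.List;

public interface FormatMetadataReader {
    /** Returns the natural split boundaries (row groups or stripes) of one file. */
    List<FileSplitBoundary> readBoundaries(String filePath) throws IOException;
}

// FileSplitBoundary.java -- immutable (offset, length, rowCount) triple.
public final class FileSplitBoundary {
    private final long offset;
    private final long length;
    private final long rowCount;

    public FileSplitBoundary(long offset, long length, long rowCount) {
        this.offset = offset;
        this.length = length;
        this.rowCount = rowCount;
    }

    public long offset() { return offset; }
    public long length() { return length; }
    public long rowCount() { return rowCount; }
}
```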
**Extended Interfaces:**

- `FormatReaderFactory.createReader(Context, offset, length)`: Now implemented for Parquet and ORC (usage sketched after this section)
- `DataSplit`: Added a transient `fileSplitBoundaries` field (not serialized, for backward compatibility)

**Storage Format:** No changes to the storage format. This is a read-time optimization that doesn't affect how data is written or stored.
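As a usage illustration, the extended factory method lets each boundary become its own reader, and hence its own concurrent task; `RecordReader`, `InternalRow`, and `FormatReaderFactory.Context` are existing Paimon types, while the three-argument overload comes from this PR and the surrounding wiring reuses the hypothetical types sketched above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.paimon.data.InternalRow;
import org.apache.paimon.format.FormatReaderFactory;
import org.apache.paimon.reader.RecordReader;

final class RangeReaders {

    /** Fan one file out into one reader per natural boundary (sketch only). */
    static List<RecordReader<InternalRow>> openRangeReaders(
            FormatMetadataReader metadataReader,
            FormatReaderFactory readerFactory,
            FormatReaderFactory.Context context,
            String filePath) throws IOException {
        List<RecordReader<InternalRow>> readers = new ArrayList<>();
        for (FileSplitBoundary b : metadataReader.readBoundaries(filePath)) {
            // Each (offset, length) range covers exactly one row group or stripe,
            // so the resulting readers can be consumed by independent tasks.
            readers.add(readerFactory.createReader(context, b.offset(), b.length()));
        }
        return readers;
    }
}
```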
## Documentation
This change introduces a new feature that should be documented:
- `source.split.file-enabled` and related options
- `FormatMetadataReader` interface and implementations

The feature is disabled by default to maintain backward compatibility. Users can enable it by setting `source.split.file-enabled=true` in their table options.
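For example, with dynamic table options (a sketch; the option name comes from this PR, and `Table.copy` is Paimon's existing hook for overriding options on a table instance):

```java
import java.util.Collections;

import org.apache.paimon.table.Table;

final class EnableFineGrainedSplits {

    /** Returns a copy of the table with fine-grained splitting enabled for reads. */
    static Table withFineGrainedSplits(Table table) {
        // Dynamic options override the stored table options for this instance only.
        return table.copy(
                Collections.singletonMap("source.split.file-enabled", "true"));
    }
}
```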