perf/direct zarr io #92
Conversation
d-v-b commented on Dec 8, 2025:
- eagerly compute multiscales
- directly copy chunk bytes and metadata documents
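The second bullet can be sketched as a raw store-to-store copy: assuming Zarr v2 stores behave like mappings of key to bytes (as zarr-python's store interface does), chunk payloads and metadata documents can be moved without any decode/encode round trip. The names below are illustrative, not the PR's actual API.

```python
# Hypothetical sketch: copy chunk bytes and metadata documents directly
# between two store-like mappings, instead of decoding and re-encoding
# arrays through xarray.

def copy_raw(src: dict, dst: dict) -> dict:
    """Copy every key from src to dst without decompressing chunk bytes.

    Chunk keys (e.g. "0.0", "1.2") and metadata documents (".zarray",
    ".zattrs") are moved as-is, so no compute or memory is spent on
    decode/encode round trips.
    """
    for key, value in src.items():
        dst[key] = value  # raw bytes pass through untouched
    return dst

# toy "store": one metadata document plus two chunk payloads
src = {".zarray": b"{}", "0.0": b"\x00\x01", "0.1": b"\x02\x03"}
dst = copy_raw(src, {})
```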
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
…to perf/direct-zarr-io
```python
s2_parser.add_argument(
    "--omit-nodes", help="The names of groups or arrays to skip.", default="", type=str
)
```
This argument solves #81. You would pass `--omit-nodes "quality/l2a_quicklook"` to omit that group.
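A minimal sketch of how such a flag might be consumed, assuming the value is a comma-separated list of node paths (the actual parsing logic lives in the PR):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--omit-nodes", help="The names of groups or arrays to skip.", default="", type=str
)
# simulate: tool --omit-nodes "quality/l2a_quicklook"
args = parser.parse_args(["--omit-nodes", "quality/l2a_quicklook"])

# assume a comma-separated list; the empty default yields an empty set
omit = {name for name in args.omit_nodes.split(",") if name}
```

Conversion code can then check each group or array path against `omit` and skip matching nodes.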
@emmanuelmathot this is ready for review. At a high level, the conversion process in this PR is redesigned to use a more functional and explicit pattern with less reliance on xarray APIs, which are not very transparent about how data is being moved.

Architecture

I created a new module that contains utilities specific to Zarr IO operations. That module contains functions that all work toward the goal of re-encoding Zarr v2 groups into Zarr v3. The main routine is … Because we are not relying on xarray for the basic copy procedure, we have to do more work on the encoding / attributes side, which is reflected in the array re-encoder used for S2 conversion.

Performance

Memory usage is improved on this branch, with peak memory down to ~4.5 GB from ~11 GB. Downsampling only adds a few GB of peak memory, which isn't too surprising.

Testing

I added quite a few tests, but we need to see how the new output composes with the consuming code. @emmanuelmathot, if you could try this branch out and check the output I would greatly appreciate it.
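For illustration, here is a minimal, hypothetical sketch of the metadata side of re-encoding a Zarr v2 array into Zarr v3. The real re-encoder in this PR handles codecs, attributes, and more fields; the dtype translation below is deliberately crude.

```python
# Hypothetical sketch: translate a Zarr v2 ".zarray" metadata document
# into the corresponding Zarr v3 "zarr.json" array document.

def v2_to_v3_array_meta(zarray: dict) -> dict:
    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": zarray["shape"],
        # crude dtype translation: drop the v2 byte-order prefix ("<", ">", "|")
        "data_type": zarray["dtype"].lstrip("<>|"),
        # v2 "chunks" becomes a regular chunk grid in v3
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": zarray["chunks"]},
        },
        "fill_value": zarray["fill_value"],
    }

meta = v2_to_v3_array_meta(
    {"shape": [4, 4], "chunks": [2, 2], "dtype": "<u2", "fill_value": 0}
)
```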
@emmanuelmathot the redundant multiscales calculation is now fixed, and the chunk sizes / sharding are now consistent with the design goal (use as few objects as possible). On my local system, using dask for rechunking was much slower than what I am currently doing (plain assignment via zarr-python).
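The "plain assignment" approach can be sketched as follows, with in-memory NumPy arrays standing in for zarr arrays (zarr-python arrays support the same slice-assignment syntax); the block size here is an arbitrary illustration, not the PR's actual chunking.

```python
# Sketch of plain-assignment rechunking: read a block from the source and
# assign it into a destination array, block by block, without a task graph.
import numpy as np

src = np.arange(16).reshape(4, 4)
dst = np.zeros_like(src)

block = 2  # hypothetical block size, ideally aligned to destination chunks
for i in range(0, src.shape[0], block):
    for j in range(0, src.shape[1], block):
        # with zarr arrays this would stream one block at a time,
        # keeping peak memory bounded by the block size
        dst[i:i + block, j:j + block] = src[i:i + block, j:j + block]
```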
Since we are re-encoding the Zarr groups here, I can also handle the NaN conversion in this branch, unless that's better in a separate branch, @emmanuelmathot.
With a1375b7 we have an option (defaulting to false) of allowing invalid values (NaN and inf) in the output. When set to …
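A sketch of what that option might do when invalid values are disallowed, assuming NaN and inf are replaced with the array's fill value before writing; the function name, signature, and default below are illustrative, not the PR's actual API.

```python
# Hypothetical sketch: sanitize invalid floating-point values before writing.
import numpy as np

def sanitize(data: np.ndarray, fill_value: float, allow_invalid: bool) -> np.ndarray:
    """Return data unchanged if invalid values are allowed; otherwise
    replace NaN and +/-inf with the provided fill value."""
    if allow_invalid:
        return data
    out = data.copy()
    out[~np.isfinite(out)] = fill_value  # NaN and inf become the fill value
    return out

arr = np.array([1.0, np.nan, np.inf, 2.0])
clean = sanitize(arr, fill_value=0.0, allow_invalid=False)
```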
Just tested the last version of this PR: https://api.explorer.eopf.copernicus.eu/raster/collections/sentinel-2-l2a-staging/items/S2B_MSIL2A_20251115T091139_N0511_R050_T35SLU_20251115T111807/viewer |
I'll have a look later today!