GeoZarr Specification Contribution¶
This document outlines our contribution to the GeoZarr specification based on our implementation experience with the EOPF GeoZarr data model.
Overview¶
Our implementation of GeoZarr-compliant data conversion for Earth Observation data has revealed several areas where the current specification could be improved to better support scientific use cases. We have contributed feedback to the GeoZarr specification development process through detailed GitHub issues.
Key Issues Identified and Reported¶
1. Arbitrary Coordinate Systems Support¶
Issue: zarr-developers/geozarr-spec#81
Problem: The current specification has an implicit bias toward web mapping tile schemes (WebMercatorQuad), which may discourage scientific applications that work with native coordinate reference systems.
Our Solution: Our implementation successfully demonstrates:
- Creation of "Native CRS Tile Matrix Sets" for arbitrary projections
- Multiscale pyramids working with UTM and other scientific projections
- Proper scale denominator calculations for non-web CRS
- Chunking strategies optimized for native coordinate systems
Impact: This is critical for Earth observation data that often comes in UTM zones, polar stereographic, or other scientific projections where preserving the native CRS maintains scientific accuracy.
2. Chunking Performance Optimization¶
Issue: zarr-developers/geozarr-spec#82
Problem: The specification requires strict 1:1 mapping between Zarr chunks and tile matrix tiles, which prevents optimal chunking strategies for different data types and storage backends.
Our Solution: We implemented sophisticated chunk alignment logic:
def calculate_aligned_chunk_size(dimension_size: int, target_chunk_size: int) -> int:
"""Calculate a chunk size that divides evenly into the dimension size."""
if target_chunk_size >= dimension_size:
return dimension_size
# Find the largest divisor that is <= target_chunk_size
for chunk_size in range(target_chunk_size, 0, -1):
if dimension_size % chunk_size == 0:
return chunk_size
return 1
Impact: This approach prevents chunk overlap issues with Dask while optimizing for actual data dimensions rather than arbitrary tile sizes, significantly improving performance.
3. Multiscale Hierarchy Structure Clarification¶
Issue: zarr-developers/geozarr-spec#83
Problem: The specification describes multiscale encoding but doesn't clearly define the exact hierarchical structure and relationship between parent groups and zoom level children.
Our Solution: We implemented a clear hierarchy structure:
/measurements/r10m/ # Parent group with multiscales metadata
├── 0/ # Native resolution (zoom level 0)
│ ├── band1
│ ├── band2
│ └── spatial_ref
├── 1/ # First overview level
│ ├── band1
│ ├── band2
│ └── spatial_ref
└── 2/ # Second overview level
├── band1
├── band2
└── spatial_ref
Impact: This provides a concrete, tested pattern for implementing multiscale hierarchies that other implementations can follow.
Implementation Evidence¶
Our implementation provides concrete evidence for these improvements:
Native CRS Preservation¶
- Function:
create_native_crs_tile_matrix_set()
- Purpose: Creates custom tile matrix sets for arbitrary coordinate reference systems
- Benefit: Maintains scientific accuracy without unnecessary reprojection
Robust Processing¶
- Function:
write_dataset_band_by_band_with_validation()
- Purpose: Handles large datasets with retry logic and validation
- Benefit: Production-ready robustness for real-world data processing
Comprehensive Metadata Handling¶
- Function:
_add_coordinate_metadata()
- Purpose: Handles diverse coordinate types (time, angle, band, detector)
- Benefit: Supports the full range of Earth observation data structures
Cloud Storage Optimization¶
- Features: S3 support with credential validation, storage options handling
- Benefit: Enables cloud-native workflows with proper error handling
Specification Sections Addressed¶
Our contributions target specific sections of the GeoZarr specification:
- Section 9.7.3 (Tile Matrix Set Representation) - Native CRS support
- Section 9.7.4 (Chunk Layout Alignment) - Flexible chunking
- Section 9.7.1 (Hierarchical Layout) - Clear structure definition
- Section 9.7.2 (Metadata Encoding) - Metadata placement guidance
Benefits for the Earth Observation Community¶
These contributions specifically benefit Earth observation and scientific data applications:
- Scientific Accuracy: Preserving native CRS prevents distortion from unnecessary reprojections
- Performance: Optimized chunking improves processing speed and reduces memory usage
- Clarity: Clear hierarchy definitions enable consistent implementations
- Robustness: Production patterns support real-world deployment scenarios
Future Work¶
We continue to monitor the specification development and will contribute additional feedback as our implementation evolves. Areas for potential future contribution include:
- Cloud storage optimization patterns
- Coordinate variable handling for diverse data types
- Integration with STAC metadata standards
- Guidance for time dimension handling
Related Documentation¶
- Converter Documentation - Technical details of our implementation
- Architecture - Technical architecture and design principles
- API Reference - Complete Python API documentation