Frequently Asked Questions¶
Common questions and solutions for using the EOPF GeoZarr library.
General Questions¶
What is EOPF GeoZarr?¶
EOPF GeoZarr is a Python library that converts EOPF (Earth Observation Processing Framework) datasets to GeoZarr-spec 0.4 compliant format. It maintains scientific accuracy while optimizing for cloud-native workflows and performance.
What makes this different from standard Zarr?¶
GeoZarr is a specification that extends Zarr with geospatial metadata standards. Our library specifically:
- Ensures GeoZarr 0.4 specification compliance
- Preserves native coordinate reference systems
- Creates multiscale pyramids for efficient visualization
- Maintains CF conventions for scientific interoperability
- Optimizes chunking for Earth observation data patterns
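In practice, a conversion is a single call to `create_geozarr_dataset`, the entry point used throughout this FAQ. The sketch below wraps it in a function so the imports are deferred; the exact import path for `create_geozarr_dataset` is an assumption here and may differ in your installed version:

```python
# Minimal conversion sketch; assumes eopf-geozarr is installed and an
# EOPF dataset exists at input_path, so imports are deferred into the
# function body rather than executed at definition time.
def convert_to_geozarr(input_path: str, output_path: str):
    import xarray as xr
    from eopf_geozarr import create_geozarr_dataset  # import path is an assumption

    dt = xr.open_datatree(input_path, engine="zarr")
    return create_geozarr_dataset(
        dt_input=dt,
        groups=["/measurements/r10m"],  # native 10m Sentinel-2 bands
        output_path=output_path,
        spatial_chunk=4096,
    )
```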
Which satellite missions are supported?¶
Currently, the library is optimized for:
- Sentinel-2 (L1C and L2A products)
- Sentinel-1 (planned support)
The architecture is designed to support additional missions with minimal modifications.
Installation and Setup¶
Why do I need Python 3.11 or higher?¶
The library uses modern Python features and depends on recent versions of scientific libraries (xarray, zarr, dask) that require Python 3.11+.
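A quick way to confirm your interpreter meets the requirement before installing:

```python
import sys

# eopf-geozarr requires Python 3.11 or newer; check up front
meets_requirement = sys.version_info >= (3, 11)
print(meets_requirement)
```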
Can I use conda instead of pip?¶
While the library is primarily distributed via PyPI, you can install it in a conda environment:
```bash
conda create -n eopf-geozarr python=3.11
conda activate eopf-geozarr
pip install eopf-geozarr
```
How do I set up AWS credentials?¶
Multiple options are available:
```bash
# Option 1: AWS CLI
aws configure

# Option 2: Environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1

# Option 3: IAM roles (for EC2/ECS)
# No setup needed - automatic detection
```
Usage Questions¶
How do I know which groups to convert?¶
Inspect your EOPF dataset first:
```python
# test: skip
import xarray as xr

dt = xr.open_datatree("input.zarr", engine="zarr")
print(dt)  # Shows the full structure

# Common Sentinel-2 groups:
groups = [
    "/measurements/r10m",  # 10m bands: B02, B03, B04, B08
    "/measurements/r20m",  # 20m bands: B05, B06, B07, B8A, B11, B12
    "/measurements/r60m",  # 60m bands: B01, B09, B10
]
```
What chunk size should I use?¶
The optimal chunk size depends on your data and use case:
```python
# test: skip
from eopf_geozarr.conversion.utils import calculate_aligned_chunk_size

# For typical Sentinel-2 data (10980x10980)
optimal_chunk = calculate_aligned_chunk_size(10980, 4096)
print(optimal_chunk)  # Returns 3660

# General guidelines:
# - 4096: good default for most cases
# - 2048: better for memory-constrained environments
# - 8192: for high-memory systems and large datasets
```
How do I process very large datasets?¶
Use Dask for distributed processing:
```python
# test: skip
from dask.distributed import Client

# Connect to an existing Dask cluster
client = Client("scheduler-address:8786")

# Use smaller chunks for distributed processing
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m"],
    output_path="output.zarr",
    spatial_chunk=2048,  # Smaller chunks work better with Dask
)
```
Can I convert only specific bands?¶
Currently, the library processes all bands within a group. To process specific bands, you would need to create a subset of your input dataset first:
```python
# test: skip
# Create subset with specific bands
dt_subset = dt.copy()
ds_10m = dt_subset["/measurements/r10m"].ds
ds_subset = ds_10m[["b02", "b03", "b04"]]  # Only RGB bands
dt_subset["/measurements/r10m"].ds = ds_subset

# Then convert
dt_geozarr = create_geozarr_dataset(
    dt_input=dt_subset,
    groups=["/measurements/r10m"],
    output_path="rgb_only.zarr",
)
```
Error Troubleshooting¶
"ImportError: No module named 'eopf_geozarr'"¶
Cause: Library not installed or wrong Python environment.
Solutions:
```bash
# test: skip
# Verify installation
pip list | grep eopf-geozarr

# Reinstall if missing
pip install eopf-geozarr

# Check Python environment
which python
python --version
```
"ValueError: Invalid groups specified"¶
Cause: Specified groups don't exist in the input dataset.
Solution:
```python
# test: skip
# Check available groups
dt = xr.open_datatree("input.zarr", engine="zarr")
print("Available groups:", list(dt.groups))

# Use correct group paths
groups = [g for g in dt.groups if "measurements" in g]
```
"MemoryError" during conversion¶
Cause: Dataset too large for available memory.
Solutions:
```python
# test: skip
# 1. Use smaller chunks
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m"],
    output_path="output.zarr",
    spatial_chunk=1024,  # Smaller chunks
)

# 2. Use Dask for out-of-core processing
from dask.distributed import Client
client = Client()

# 3. Process groups one at a time
for group in ["/measurements/r10m", "/measurements/r20m"]:
    dt_geozarr = create_geozarr_dataset(
        dt_input=dt,
        groups=[group],
        output_path=f"output_{group.split('/')[-1]}.zarr",
    )
```
"PermissionError" with S3¶
Cause: Insufficient S3 permissions or incorrect credentials.
Solutions:
```python
# test: skip
# 1. Verify credentials
from eopf_geozarr.conversion.fs_utils import get_s3_credentials_info
print(get_s3_credentials_info())

# 2. Test S3 access
from eopf_geozarr.conversion.fs_utils import validate_s3_access
is_valid, error = validate_s3_access("s3://your-bucket/test.zarr")
print(f"Valid: {is_valid}, Error: {error}")

# 3. Check IAM permissions (need s3:GetObject, s3:PutObject, s3:ListBucket)
```
"zarr.errors.ArrayNotFoundError"¶
Cause: Corrupted or incomplete Zarr dataset.
Solutions:
```python
# test: skip
# 1. Validate input dataset
try:
    dt = xr.open_datatree("input.zarr", engine="zarr")
    print("Dataset loaded successfully")
except Exception as e:
    print(f"Dataset error: {e}")

# 2. Check for missing arrays
import zarr
store = zarr.open("input.zarr", mode="r")
print("Available arrays:", list(store.array_keys()))

# 3. Consolidate metadata if needed
zarr.consolidate_metadata("input.zarr")
```
"CRS not found" or projection errors¶
Cause: Missing or invalid coordinate reference system information.
Solutions:
```python
# test: skip
# Check CRS information
ds = dt["/measurements/r10m"].ds
print("CRS variables:", [v for v in ds.data_vars if "crs" in v.lower() or "spatial_ref" in v])

# Check coordinate attributes
print("X coord attrs:", ds.x.attrs)
print("Y coord attrs:", ds.y.attrs)

# Verify rioxarray can read CRS
import rioxarray  # importing registers the .rio accessor
try:
    crs = ds.rio.crs
    print(f"CRS: {crs}")
except Exception as e:
    print(f"CRS error: {e}")
```
Performance Questions¶
Why is conversion slow?¶
Several factors affect performance:
- Chunk size: Too small = many operations, too large = memory issues
- Network: S3 operations depend on bandwidth and latency
- CPU: Overview generation is CPU-intensive
- Memory: Insufficient RAM causes swapping
Optimization strategies:
```python
# test: skip
# 1. Optimal chunking
chunk_size = calculate_aligned_chunk_size(data_width, 4096)

# 2. Use Dask for parallelization
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=2)

# 3. Process in batches
for group in groups:
    # Process one group at a time
    pass

# 4. Use SSD storage for temporary files
import os
os.environ["TMPDIR"] = "/path/to/fast/storage"
```
How can I monitor progress?¶
Enable verbose logging:
```python
# test: skip
import logging
logging.basicConfig(level=logging.INFO)

# Or use the CLI with the verbose flag:
#   eopf-geozarr convert input.zarr output.zarr --verbose
```
What's the expected output size?¶
GeoZarr datasets are typically larger than input due to:
- Multiscale overviews (adds ~33% for 2 overview levels)
- Additional metadata
- Chunk alignment padding
Estimation:
```python
# Rough estimate: output ≈ input_size * 1.4 (with 2 overview levels)
input_mb = 400  # e.g. a single Sentinel-2 10m band
estimated_mb = input_mb * 1.4
print(f"~{estimated_mb:.0f} MB GeoZarr")  # ~400MB input → ~560MB output
```
Cloud Storage Questions¶
Which cloud providers are supported?¶
- AWS S3: Full support
- S3-compatible: MinIO, DigitalOcean Spaces, etc.
- Google Cloud Storage: Via S3 compatibility mode
- Azure Blob Storage: Via S3 compatibility (limited)
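For S3-compatible services, the common fsspec/s3fs pattern is to point the client at a custom endpoint via `client_kwargs`. A sketch of such a configuration; the credentials and endpoint URL below are placeholders:

```python
# Storage options in the shape fsspec/s3fs expects; endpoint_url is what
# redirects requests to MinIO, DigitalOcean Spaces, and similar services.
storage_options = {
    "key": "your_access_key",      # placeholder credential
    "secret": "your_secret_key",   # placeholder credential
    "client_kwargs": {"endpoint_url": "https://minio.example.com"},
}
print(sorted(storage_options))
```

These options can then be passed wherever your tooling accepts fsspec `storage_options` (for example, when opening a dataset over s3).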
How do I optimize S3 performance?¶
```python
# test: skip
# 1. Use appropriate region
import os
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"  # Close to your data

# 2. Configure multipart uploads
s3_config = {
    "config_kwargs": {
        "max_pool_connections": 50,
        "multipart_threshold": 64 * 1024 * 1024,  # 64MB
        "multipart_chunksize": 16 * 1024 * 1024,  # 16MB
    }
}

# 3. Use VPC endpoints for EC2 instances
# 4. Consider S3 Transfer Acceleration for global access
```
Can I use different storage for input and output?¶
Yes, the library supports mixed storage:
```python
# test: skip
# Local input, S3 output
dt = xr.open_datatree("local_input.zarr", engine="zarr")
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m"],
    output_path="s3://bucket/output.zarr",
)

# S3 input, local output
dt = xr.open_datatree("s3://bucket/input.zarr", engine="zarr")
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m"],
    output_path="local_output.zarr",
)
```
Validation and Quality¶
How do I verify the conversion worked correctly?¶
```python
# test: skip
# 1. Use built-in validation
import argparse
from eopf_geozarr.cli import validate_command

args = argparse.Namespace()
args.input_path = "output.zarr"
args.verbose = True
validate_command(args)

# 2. Manual checks
dt = xr.open_datatree("output.zarr", engine="zarr")

# Check multiscales metadata
print("Multiscales:", dt.attrs.get("multiscales", "Missing"))

# Check overview levels
for level in ["0", "1", "2"]:
    path = f"/measurements/r10m/{level}"
    if path in dt.groups:
        ds = dt[path].ds
        print(f"Level {level}: {dict(ds.dims)}")

# Check required attributes
ds = dt["/measurements/r10m/0"].ds
for var_name in ds.data_vars:
    var = ds[var_name]
    print(f"{var_name}: grid_mapping={var.attrs.get('grid_mapping', 'Missing')}")
```
What should I do if validation fails?¶
- Check the error message - it usually indicates the specific issue
- Verify input data - ensure the EOPF dataset is complete and valid
- Check dependencies - ensure all required libraries are up to date
- Try with verbose logging - get more detailed error information
- Report issues - if it seems like a bug, please report it
How do I compare input and output data?¶
```python
# test: skip
import numpy as np

# Load both datasets
dt_input = xr.open_datatree("input.zarr", engine="zarr")
dt_output = xr.open_datatree("output.zarr", engine="zarr")

# Compare native resolution data
ds_input = dt_input["/measurements/r10m"].ds
ds_output = dt_output["/measurements/r10m/0"].ds

# Check data values (should be identical)
for band in ["b02", "b03", "b04"]:
    if band in ds_input and band in ds_output:
        diff = np.abs(ds_input[band].values - ds_output[band].values)
        print(f"{band} max difference: {diff.max()}")  # Should be 0 or very close to 0
```
Integration Questions¶
Can I use this with STAC?¶
Yes, you can create STAC items for GeoZarr datasets. See the Examples section for detailed code.
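As a rough illustration of the idea, the hedged sketch below builds a minimal STAC item pointing at a GeoZarr store with `pystac`. The item id, missing geometry/bbox, and media type are all placeholders, not the library's own STAC integration; imports are deferred so `pystac` is only needed when the function is called:

```python
# Hypothetical sketch: a bare-bones STAC item referencing a GeoZarr store.
# Assumes pystac is installed; fill in real geometry, bbox, and metadata.
def make_geozarr_stac_item(zarr_href: str):
    import datetime
    import pystac

    item = pystac.Item(
        id="example-geozarr-item",        # placeholder id
        geometry=None,                     # supply real footprint geometry
        bbox=None,                         # supply real bounding box
        datetime=datetime.datetime.now(datetime.timezone.utc),
        properties={},
    )
    item.add_asset(
        "data",
        pystac.Asset(href=zarr_href, media_type="application/vnd+zarr", roles=["data"]),
    )
    return item
```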
How does this work with Jupyter notebooks?¶
The library works well in Jupyter environments. See Examples for interactive visualization patterns.
Can I integrate this into my processing pipeline?¶
Absolutely! The library is designed for integration:
```python
# test: skip
# Example pipeline integration
def process_sentinel2_scene(input_path: str, output_path: str):
    """Process a single Sentinel-2 scene to GeoZarr."""
    try:
        dt = xr.open_datatree(input_path, engine="zarr")
        dt_geozarr = create_geozarr_dataset(
            dt_input=dt,
            groups=["/measurements/r10m", "/measurements/r20m"],
            output_path=output_path,
            spatial_chunk=4096,
        )
        return True, "Success"
    except Exception as e:
        return False, str(e)

# Use in batch processing
results = []
for scene in scene_list:
    success, message = process_sentinel2_scene(scene.input, scene.output)
    results.append((scene.id, success, message))
```
Getting Help¶
Where can I get more help?¶
- Documentation: Check the User Guide and API Reference
- Examples: See Examples for common use cases
- GitHub Issues: Report bugs or request features at the GitHub repository
- Community: Join discussions in the GeoZarr community
How do I report a bug?¶
When reporting issues, please include:
- Version information:

```bash
# test: skip
eopf-geozarr --version
python --version
pip list | grep -E "(xarray|zarr|dask)"
```
- Error message: full traceback if available
- Minimal example: code that reproduces the issue
- Environment: OS, Python version, installation method
- Data information: dataset type, size, structure (if shareable)
How can I contribute?¶
Contributions are welcome! See the project repository for contribution guidelines. Areas where help is needed:
- Additional satellite mission support
- Performance optimizations
- Documentation improvements
- Test coverage expansion
- Bug fixes and feature enhancements