Performance Analysis
For measured data, see Benchmarks. This page explains why operations have their observed characteristics and provides scaling guidance.
Why Full Volume is Slower Than Raw NIfTI
RadiObject full volume load (705ms axial, 99ms isotropic) is slower than nibabel (33ms uncompressed) because TileDB adds tile assembly and metadata overhead. This is the trade-off for enabling random access.
- Full-volume sequential processing has similar performance across formats
- RadiObject's advantage is partial reads (200-560x faster for slices/patches)
- Use RadiObject for I/O, then apply MONAI/TorchIO transforms to loaded data
Why Axial Tiling Gives 200-600x Speedup for Slices
With axial tiling (tile_extent=[X, Y, 1]), each axial slice is one contiguous tile (~230 KB). NIfTI requires decompressing the entire 35.6 MB volume.
NIfTI: Load 35,600 KB → decompress → extract 230 KB
RadiObject: Load 230 KB tile directly
Ratio: ~155x theoretical, ~310x observed vs MONAI (includes decompression savings)
Why Isotropic is Best for 3D Patches
Isotropic tiling (tile_extent=[64, 64, 64]) matches typical patch sizes. A 64^3 patch spans 1-8 isotropic tiles vs 64 axial tiles.
| operation | axial_tiles_read | isotropic_tiles_read |
|---|---|---|
| axial_slice | 1 | ~16 |
| 64^3_patch | 64 | 1-8 |
| full_volume | 155 | ~60 |
This is a tile-count analysis — see Benchmarks: Tiling Strategy Impact for measured numbers.
Tiling Strategy Guide

| use_case | tiling | reason |
|---|---|---|
| 2D slice viewer | Axial | Single tile per slice (3.4ms) |
| 3D patch training | Isotropic | Patches match tile size (1.5ms) |
| Full volume analysis | Isotropic | Fewer total tiles (99ms) |
| Mixed workload | Isotropic | Best general-purpose |
Why S3 is ~14x Slower for Full Volumes
Each tile request incurs ~50-150ms S3 round-trip latency. For 155 axial tiles, this compounds to ~8.3s.
Mitigation: Use patch-based reads (~203-238ms per patch from S3) or parallel workers to amortize latency.
Why Multi-Worker DataLoaders Slow Down Small Datasets
For small datasets, num_workers=0 beats multi-worker because IPC serialization (~100-200ms per 35MB tensor) and process spawn overhead (~1-2s) dominate.
Recommendation: num_workers=0 for <100 volumes, num_workers=4-8 for >1000 volumes.
See Benchmarks: Multi-Worker Scaling for measured numbers.
Scaling Guidance
What Scales Well
-
Metadata operations: O(1) —
iloc[i],loc["id"],obs_subject_idsare constant-time regardless of dataset size. -
Selective access: O(accessed data) — Loading 1 volume from 10,000 subjects costs the same as from 3 subjects.
-
Parallel writes — Throughput scales linearly (~140 MB/s per worker on local SSD). See Benchmarks: Parallel Write Scaling.
Recommendations by Scale
| scale | storage | approach |
|---|---|---|
| <1,000 subjects | Local SSD | Direct access, in-memory processing |
| 1,000-10,000 | S3 + local cache | Metadata-first filtering, parallel workers |
| >10,000 | S3 | Streaming, patch-based training |
TileDB vs Zarr: Chunked Format Comparison
Both TileDB and Zarr v3 support chunk-based partial reads with ZSTD compression. The benchmarks use identical settings (ZSTD clevel=3, same chunk shapes) to isolate format-level differences.
Key architectural differences:
- Storage layer: TileDB uses its own VFS (Virtual File System) layer with built-in S3 support and tile caching. Zarr uses fsspec/s3fs for remote storage.
- Metadata model: TileDB uses a fragment-based metadata model optimized for concurrent writes. Zarr v3 stores chunks as files within a directory hierarchy, with array metadata in
zarr.json. - Caching: TileDB maintains a built-in LRU tile cache (
sm.tile_cache_size) within its context, amortizing repeated reads — 85-95% hit rates for sequential access. Zarr v3 has no built-in chunk cache: local reads benefit from OS page cache, but S3 reads re-fetch every chunk on every access. For training loops with overlapping patches or repeated epoch access, TileDB's cache gives a structural advantage. - Partial reads: Both formats read only the chunks/tiles intersecting a query region. TileDB's VFS can issue parallel range requests; Zarr issues one request per chunk via fsspec.
- Raw throughput: For single isolated reads (no caching benefit), Zarr is slightly faster due to lower per-chunk overhead — e.g. 1.4ms vs 3.4ms for axial slice, 138 vs 128 samples/sec in DataLoader benchmarks.
See Benchmarks: Framework Speedups and Cache Hit Rates for measured numbers.