Architecture
Inspired by the SOMA specification, RadiObject is a hierarchical composition of entities aligned on shared indexes. The hierarchy maps to TileDB primitives (Groups and Arrays).
Organisation
RadiObject (TileDB Group)
│
├── obs_meta (Sparse Array)
│     dim: obs_subject_id
│     attrs: obs_ids (system), age, sex, diagnosis, ...
│
└── collections/
    ├── T1w (VolumeCollection Group)
    │   ├── obs (Sparse Array)
    │   │     dims: obs_subject_id (FK), obs_id (unique)
    │   │     attrs: series_type, voxel_spacing, dimensions, ...
    │   └── volumes/
    │       ├── 0 (Dense Array: x, y, z [, t] → voxels)
    │       ├── 1 ...
    │       └── N ...
    ├── FLAIR (VolumeCollection) ...
    └── seg (VolumeCollection) ...
Relationships: obs_meta → VolumeCollection.obs is 1:N via obs_subject_id (one subject, many volumes). Each obs row maps 1:1 to a Volume via obs_id.
Mapping to Radiology Standards
RadiObject's data model maps directly to established radiology data standards:
| RadiObject | DICOM | BIDS |
|---|---|---|
| obs_subject_id | PatientID | sub-XX |
| VolumeCollection | Series Description / Modality | Suffix (T1w, FLAIR, seg) |
| obs_id | SeriesInstanceUID (unique) | Full filename stem |
| obs_meta | Patient-level demographics | participants.tsv |
| obs | Series-level metadata | Sidecar JSON |
obs_id uniqueness: Like DICOM's SeriesInstanceUID, obs_id is globally unique across the entire RadiObject — not just within a single collection. The formula is {obs_subject_id}_{collection_name} (e.g., sub-01_T1w, sub-01_seg). This enables unambiguous single-key lookup across all collections while obs_subject_id handles the subject-level grouping (analogous to PatientID linking multiple series).
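The naming rule above can be sketched in a few lines. This is only an illustration of the `{obs_subject_id}_{collection_name}` pattern; the helper name `make_obs_id` is hypothetical and is not taken from the RadiObject API.

```python
def make_obs_id(obs_subject_id: str, collection_name: str) -> str:
    """Build an obs_id following the {obs_subject_id}_{collection_name} pattern."""
    return f"{obs_subject_id}_{collection_name}"


subjects = ["sub-01", "sub-02"]
collections = ["T1w", "FLAIR", "seg"]

# One obs_id per (subject, collection) pair.
obs_ids = [make_obs_id(s, c) for s in subjects for c in collections]

# Global uniqueness: no two pairs produce the same key.
all_unique = len(obs_ids) == len(set(obs_ids))

# Subject-level grouping: every volume of a subject shares its obs_subject_id.
by_subject = {s: [make_obs_id(s, c) for c in collections] for s in subjects}
print(by_subject["sub-01"])  # ['sub-01_T1w', 'sub-01_FLAIR', 'sub-01_seg']
```

Because the collection name is part of the key, a single obs_id is enough to locate a volume without first naming its collection.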
VolumeCollections as layers: Each collection represents a distinct imaging "layer" for the same set of subjects — analogous to how a DICOM study contains multiple series (CT, segmentation, MR) for one patient, or how BIDS organizes different suffixes (T1w, FLAIR, bold) under the same subject.
Component Summary
| Component | TileDB Type | Dimensions (Indexes) | Attributes (Data) |
|---|---|---|---|
| RadiObject | Group | — | metadata: subject_count, n_collections |
| obs_meta | Sparse Array | obs_subject_id | obs_ids (system), user-defined (age, labels, etc.) |
| VolumeCollection | Group | — | metadata: n_volumes, name, [shape]? |
| obs | Sparse Array | obs_subject_id, obs_id | series_type, voxel_spacing, dimensions, etc. |
| Volume | Dense Array | x, y, z [, t] | voxels (intensity values) |
Index Design
Index is an immutable, named dataclass providing bidirectional mapping between string IDs and integer positions:
- RadiObject.index: Index(name="obs_subject_id"), subject-level ordering
- VolumeCollection.index: Index(name="obs_id"), volume-level ordering
- VolumeCollection.subjects: Index(name="obs_subject_id"), deduplicated subject IDs
Index supports set algebra (&, |, -, ^) with order preservation from the left operand, positional selection (take, mask), alignment checking (is_aligned), and subset/superset comparison (<=, >=):
radi.T1w.subjects.is_aligned(radi.seg.subjects) # True
common = radi.T1w.subjects & radi.seg.subjects
train.index | val.index # all subjects
train.index & val.index # empty = no overlap
The standalone align(*indexes) function computes the intersection of multiple indexes, preserving order from the first.
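To make the behaviour described above concrete, here is a minimal sketch of an immutable index with bidirectional lookup and order-preserving set operations. It is not RadiObject's implementation: the class `MiniIndex`, its methods `position` and `key`, and the helper `align_sketch` are hypothetical names chosen for illustration, and only `&` and `|` are covered.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class MiniIndex:
    """Hypothetical stand-in for Index: ordered, unique string IDs."""

    name: str
    ids: tuple[str, ...]
    _pos: dict[str, int] = field(init=False, repr=False, compare=False)

    def __post_init__(self) -> None:
        if len(set(self.ids)) != len(self.ids):
            raise ValueError("IDs must be unique")
        # frozen=True blocks normal assignment, so set the lookup table directly.
        object.__setattr__(self, "_pos", {k: i for i, k in enumerate(self.ids)})

    def position(self, key: str) -> int:
        """String ID -> integer position."""
        return self._pos[key]

    def key(self, position: int) -> str:
        """Integer position -> string ID."""
        return self.ids[position]

    def __and__(self, other: MiniIndex) -> MiniIndex:
        # Intersection, keeping the left operand's order.
        return MiniIndex(self.name, tuple(k for k in self.ids if k in other._pos))

    def __or__(self, other: MiniIndex) -> MiniIndex:
        # Union: left operand first, then unseen IDs from the right.
        extra = tuple(k for k in other.ids if k not in self._pos)
        return MiniIndex(self.name, self.ids + extra)


def align_sketch(*indexes: MiniIndex) -> MiniIndex:
    """Intersect several indexes, preserving order from the first."""
    result = indexes[0]
    for idx in indexes[1:]:
        result = result & idx
    return result


t1w = MiniIndex("obs_subject_id", ("sub-03", "sub-01", "sub-02"))
seg = MiniIndex("obs_subject_id", ("sub-01", "sub-02", "sub-04"))
common = t1w & seg
print(common.ids)  # ('sub-01', 'sub-02')
```

Keeping the left operand's order means positions in the result can be mapped back to the left collection without re-sorting.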
Shapes:
- Uniform collections: x_dim, y_dim, z_dim stored in group metadata; is_uniform=True
- Heterogeneous collections: No shape in group metadata; each volume's shape stored in obs.dimensions
- 4D volumes: Temporal dimension (t) is per-volume; not tracked at collection level
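The three cases above could be decided with a check along these lines. This is a sketch under assumptions: the function name `summarize_shapes` is hypothetical, and the rule that only the first three axes are compared follows from the statement that the temporal dimension is not tracked at collection level.

```python
def summarize_shapes(shapes):
    """Return (is_uniform, shared_spatial_shape_or_None).

    Each shape is (x, y, z) or (x, y, z, t); only x, y, z are compared,
    so 4D volumes with different t can still count as uniform.
    """
    spatial = {tuple(shape[:3]) for shape in shapes}
    if len(spatial) == 1:
        return True, next(iter(spatial))
    return False, None


# Same spatial grid, one volume is 4D: uniform.
print(summarize_shapes([(240, 240, 155), (240, 240, 155, 10)]))

# Different spatial grids: heterogeneous, shape kept per volume instead.
print(summarize_shapes([(240, 240, 155), (256, 256, 180)]))
```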
Shapes
Radiology dimensions are irregular across datasets (different scanners, protocols, preprocessing). VolumeCollection groups volumes with consistent spatial (X/Y/Z) dimensions — 4D volumes with different time dimensions but the same spatial grid share a collection. RadiObject organizes heterogeneous collections (e.g., T1w at 1mm^3, fMRI at 3mm^3) under a unified structure.
Composition
Each entity exposes its underlying TileDB object as a public property. This gives power users direct access to TileDB while keeping the main API surface simple. Individual entities are stateless: file handles are not cached in memory, which prevents file handle exhaustion.
Anatomical Orientation
Medical images encode spatial orientation via an affine matrix mapping voxel indices to physical (world) coordinates. RadiObject preserves this information and optionally standardizes orientation during ingestion.
Orientation is described by three-letter codes (RAS, LPS, LAS) indicating which anatomical direction each axis points. See Lexicon: Coordinate Systems for terminology.
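As an illustration of how a three-letter code relates to the affine, the sketch below derives a code from a 4x4 voxel-to-world matrix. It is a simplification, not RadiObject's method: it assumes world coordinates follow the RAS+ convention used by NIfTI, assumes the affine is close to axis-aligned (it picks the dominant world axis per voxel axis and would be unreliable for strongly oblique acquisitions), and the function name `axis_codes` is hypothetical.

```python
def axis_codes(affine):
    """Approximate orientation code for a 4x4 voxel-to-world affine.

    Assumes world axes are (Right, Anterior, Superior) positive.
    """
    labels = (("L", "R"), ("P", "A"), ("I", "S"))  # (negative, positive) per world axis
    codes = []
    for voxel_axis in range(3):
        # Direction in world space of one step along this voxel axis.
        column = [affine[row][voxel_axis] for row in range(3)]
        world_axis = max(range(3), key=lambda r: abs(column[r]))
        positive = column[world_axis] > 0
        codes.append(labels[world_axis][1 if positive else 0])
    return "".join(codes)


identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
flipped_xy = [[-1, 0, 0, 0], [0, -1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(axis_codes(identity))    # RAS
print(axis_codes(flipped_xy))  # LPS
```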
Tile orientation vs anatomical orientation — distinct concepts:
- Anatomical orientation (orientation_info): Physical coordinate system (RAS/LPS)
- Tile orientation (tile_orientation): Storage chunking strategy for I/O performance
Choose tile orientation based on access patterns (see Benchmarks), not anatomical convention.
For reorientation configuration, see Ingest Data: Handling Orientation. For tile options, see Configuration: TileConfig.
Concurrency Model
RadiObject operates across four concurrency layers:
┌─────────────────────────────────────────────────────────────────┐
│ Layer 4: PyTorch DataLoader Workers (PROCESSES via fork) │
│ num_workers=4, persistent_workers=True │
├─────────────────────────────────────────────────────────────────┤
│ Layer 3: Python ThreadPoolExecutor (THREADS) │
│ max_workers from ReadConfig (default: 4) │
├─────────────────────────────────────────────────────────────────┤
│ Layer 2: TileDB Internal Threads │
│ sm.compute_concurrency_level, sm.io_concurrency_level │
├─────────────────────────────────────────────────────────────────┤
│ Layer 1: S3/VFS Level │
│ vfs.s3.max_parallel_ops (default: 8) │
└─────────────────────────────────────────────────────────────────┘
Global State Management
get_tiledb_ctx() lazily initializes a global TileDB context from configure() settings. All data objects accept an optional ctx parameter — if None, they fall back to the global context.
Context Injection
class Volume:
    def __init__(self, uri: str, ctx: tiledb.Ctx | None = None):
        self._ctx = ctx  # None = use global

    def _effective_ctx(self) -> tiledb.Ctx:
        return self._ctx if self._ctx else get_tiledb_ctx()
Threads vs Processes
According to the TileDB wiki, libtiledb is thread-safe, and sharing one Ctx across a thread pool is optimal because schema and fragment metadata are cached per Ctx.
RadiObject provides two semantically distinct context functions:
def ctx_for_threads(ctx=None) -> tiledb.Ctx:
    """Return context for thread pool workers. Shares caching."""
    return ctx if ctx else get_tiledb_ctx()

def ctx_for_process(base_ctx=None) -> tiledb.Ctx:
    """Create new context for forked process. Isolated memory."""
    if base_ctx is not None:
        return tiledb.Ctx(base_ctx.config())
    return get_radiobject_config().to_tiledb_ctx()
| Scenario | Function | Behavior |
|---|---|---|
| ThreadPoolExecutor | ctx_for_threads() | Returns same context (shared caching) |
| multiprocessing.Pool | ctx_for_process() | Creates new isolated context |
| DataLoader (num_workers>0) | ctx_for_process() | Creates new isolated context |
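The difference between the two rows of behaviour can be demonstrated without TileDB installed. The sketch below uses a stand-in class instead of tiledb.Ctx, so everything here (`StandInCtx`, `get_ctx`, and the simplified fallback in `ctx_for_process`) is hypothetical and only mirrors the pattern, not the library's code. A lock is added around lazy initialization so that concurrent first calls cannot create two global contexts; whether RadiObject does the same is not stated in this document.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StandInCtx:
    """Hypothetical stand-in for tiledb.Ctx: a config plus a per-context cache."""

    def __init__(self, config=None):
        self.config = dict(config or {})
        self.cache = {}


_global_ctx = None
_global_lock = threading.Lock()


def get_ctx():
    """Lazily create the global context on first use."""
    global _global_ctx
    with _global_lock:
        if _global_ctx is None:
            _global_ctx = StandInCtx({"sm.tile_cache_size": "536870912"})
        return _global_ctx


def ctx_for_threads(ctx=None):
    """Threads reuse the given (or global) context and its cache."""
    return ctx if ctx is not None else get_ctx()


def ctx_for_process(base_ctx=None):
    """Processes get a fresh context built from the same config."""
    base = base_ctx if base_ctx is not None else get_ctx()
    return StandInCtx(base.config)


with ThreadPoolExecutor(max_workers=4) as pool:
    thread_ctxs = list(pool.map(lambda _: ctx_for_threads(), range(8)))

shared = all(c is thread_ctxs[0] for c in thread_ctxs)  # one object for all threads
isolated = ctx_for_process() is not thread_ctxs[0]      # new object, empty cache
print(shared, isolated)  # True True
```

With a PyTorch DataLoader, the per-process variant would typically be called once inside each worker, for example from a worker initialization hook, rather than in the parent process.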
For practical tuning recipes, see ML Training: Performance Tuning.
Consistency Model
RadiObject writes span multiple TileDB groups and arrays (volumes, obs DataFrame, group metadata). TileDB guarantees per-array atomicity but does not provide cross-array transactions. This has practical implications:
| Operation | Consistency | Notes |
|---|---|---|
| Volume.from_numpy() | Atomic | Single dense array write |
| VolumeCollectionWriter | Best-effort | Volumes written individually; obs flushed on __exit__. A crash mid-write leaves partial volumes without obs rows. |
| RadiObject.append() | Best-effort | Modifies collections, obs_meta, and group metadata in sequence. Not transactional across these steps. |
| RadiObject.from_images() | Best-effort | Creates collections sequentially; group metadata updated last. |
Implications for concurrent access:
- Single-writer: RadiObject is designed for single-writer, multiple-reader workflows. Concurrent writers to the same URI are not supported and may corrupt group metadata.
- Crash recovery: Use validate() after unexpected interruptions to detect inconsistencies (orphan volumes, missing obs rows, metadata count mismatches).
- Idempotent re-creation: If a write fails, delete the URI and re-create rather than attempting partial recovery.
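The kinds of inconsistency listed above can be pictured as simple set comparisons. The sketch below is not the implementation of validate(); the function `find_inconsistencies`, its arguments, and the report keys are hypothetical, and it works on plain Python collections rather than TileDB arrays.

```python
def find_inconsistencies(volume_ids, obs_ids, recorded_n_volumes):
    """Compare stored volumes, obs rows, and the n_volumes metadata count."""
    volumes, rows = set(volume_ids), set(obs_ids)
    return {
        # Volumes written to storage that no obs row describes.
        "orphan_volumes": sorted(volumes - rows),
        # obs rows that point at a volume which does not exist.
        "obs_without_volume": sorted(rows - volumes),
        # Group metadata count disagrees with what is actually stored.
        "count_mismatch": recorded_n_volumes != len(volumes),
    }


# Simulated crash: a third volume was written, but obs and metadata were not updated.
report = find_inconsistencies(
    volume_ids=["sub-01_T1w", "sub-02_T1w", "sub-03_T1w"],
    obs_ids=["sub-01_T1w", "sub-02_T1w"],
    recorded_n_volumes=2,
)
print(report)
```

A report with any non-empty list or a count mismatch is the signal to delete the URI and re-create it, as recommended above.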
Why not transactional? TileDB Groups are metadata containers, not transactional databases. Cross-group atomicity would require a write-ahead log or two-phase commit, adding complexity disproportionate to the current use case (batch ETL, not OLTP).
Cache
RadiObject delegates data caching to TileDB Embedded. There is no application-level data cache — Python @cached_property is used only for immutable schema and metadata objects.
TileDB's Caching Layers
TileDB Embedded maintains two in-memory caches, both scoped to a tiledb.Ctx:
| Layer | Config Parameter | Default | Scope | Purpose |
|---|---|---|---|---|
| Tile cache | sm.tile_cache_size | 512 MB (RadiObject) | Per-context, across queries | LRU cache of uncompressed tiles; avoids re-reading and re-decompressing on repeated access |
| VFS read-ahead | vfs.read_ahead_cache_size | 10 MB | Per-context | Cloud-only (S3/GCS/Azure). Speculatively fetches extra bytes on small reads to reduce round-trips |
Tile Cache vs Memory Budget
These are distinct mechanisms:
- sm.tile_cache_size (default 512 MB): LRU cache that persists uncompressed tiles across queries within the same context. Repeated reads of the same tile region hit this cache instead of disk/S3.
- sm.memory_budget (default 1 GB): Per-query throttle that caps how much data TileDB will fetch in a single read. Prevents OOM on large subarray reads. Does not cache anything for reuse.
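For reference, the defaults quoted in this section translate into byte values as follows. This is a sketch: the parameter names come from the text above, TileDB config values are given as strings, and treating MB and GB as binary units (1024-based) is an assumption. How RadiObject's own configure() exposes these settings is not shown here.

```python
MB = 1024 ** 2
GB = 1024 ** 3

# Plain dict of TileDB parameters; values are strings holding byte counts.
cache_config = {
    "sm.tile_cache_size": str(512 * MB),        # LRU tile cache, reused across queries
    "sm.memory_budget": str(1 * GB),            # per-query read cap, nothing is retained
    "vfs.read_ahead_cache_size": str(10 * MB),  # cloud read-ahead buffer
}

for key, value in cache_config.items():
    print(f"{key} = {value}")
```

Raising the tile cache helps workloads that revisit the same tiles, while raising the memory budget only helps single reads that are larger than the current cap.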
Context Sharing and Cache Lifetime
The tile cache lives inside tiledb.Ctx. This is why context sharing matters:
- Threads (ctx_for_threads): Workers share one context, so they share one tile cache. A tile read by thread A is available to thread B.
- Processes (ctx_for_process): Each forked process gets an isolated context with its own empty cache. No cross-process cache sharing.
See Concurrency Model for context injection details.