Architecture

Inspired by the SOMA specification, RadiObject is a hierarchical composition of entities aligned on shared indexes. The hierarchy maps to TileDB primitives (Groups and Arrays).

Organization

RadiObject (TileDB Group)
├── obs_meta (Sparse Array)
│   dim: obs_subject_id
│   attrs: obs_ids (system), age, sex, diagnosis, ...
└── collections/
    ├── T1w (VolumeCollection Group)
    │   ├── obs (Sparse Array)
    │   │   dims: obs_subject_id (FK), obs_id (unique)
    │   │   attrs: series_type, voxel_spacing, dimensions, ...
    │   └── volumes/
    │       ├── 0 (Dense Array: x, y, z [, t] → voxels)
    │       ├── 1 ...
    │       └── N ...
    ├── FLAIR (VolumeCollection) ...
    └── seg (VolumeCollection) ...

Relationships: obs_meta → VolumeCollection.obs is 1:N via obs_subject_id (one subject, many volumes). Each obs row maps 1:1 to a Volume via obs_id.
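
A hedged sketch of traversing this hierarchy (the constructor and attribute access paths are assumptions based on the structure above):

radi = RadiObject("s3://bucket/dataset")
demographics = radi.obs_meta  # one row per subject
t1w_obs = radi.T1w.obs        # one row per T1w volume
# obs_subject_id links them 1:N; obs_id maps each obs row to one Volume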

Mapping to Radiology Standards

RadiObject's data model maps directly to established radiology data standards:

| RadiObject | DICOM | BIDS |
| --- | --- | --- |
| obs_subject_id | PatientID | sub-XX |
| VolumeCollection | Series Description / Modality | Suffix (T1w, FLAIR, seg) |
| obs_id | SeriesInstanceUID (unique) | Full filename stem |
| obs_meta | Patient-level demographics | participants.tsv |
| obs | Series-level metadata | Sidecar JSON |

obs_id uniqueness: Like DICOM's SeriesInstanceUID, obs_id is globally unique across the entire RadiObject — not just within a single collection. The formula is {obs_subject_id}_{collection_name} (e.g., sub-01_T1w, sub-01_seg). This enables unambiguous single-key lookup across all collections while obs_subject_id handles the subject-level grouping (analogous to PatientID linking multiple series).
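
A minimal sketch of this formula (the helper is illustrative, not a library function):

def make_obs_id(obs_subject_id: str, collection_name: str) -> str:
    # {obs_subject_id}_{collection_name}, per the formula above
    return f"{obs_subject_id}_{collection_name}"

make_obs_id("sub-01", "T1w")  # 'sub-01_T1w'
make_obs_id("sub-01", "seg")  # 'sub-01_seg'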

VolumeCollections as layers: Each collection represents a distinct imaging "layer" for the same set of subjects — analogous to how a DICOM study contains multiple series (CT, segmentation, MR) for one patient, or how BIDS organizes different suffixes (T1w, FLAIR, bold) under the same subject.

Component Summary

| Component | TileDB Type | Dimensions (Indexes) | Attributes (Data) |
| --- | --- | --- | --- |
| RadiObject | Group | | metadata: subject_count, n_collections |
| obs_meta | Sparse Array | obs_subject_id | obs_ids (system), user-defined (age, labels, etc.) |
| VolumeCollection | Group | | metadata: n_volumes, name, [shape]? |
| obs | Sparse Array | obs_subject_id, obs_id | series_type, voxel_spacing, dimensions, etc. |
| Volume | Dense Array | x, y, z [, t] | voxels (intensity values) |

Index Design

Index is an immutable, named dataclass providing bidirectional mapping between string IDs and integer positions:

  • RadiObject.index: Index(name="obs_subject_id") — subject-level ordering
  • VolumeCollection.index: Index(name="obs_id") — volume-level ordering
  • VolumeCollection.subjects: Index(name="obs_subject_id") — deduplicated subject IDs

Index supports set algebra (&, |, -, ^) with order preservation from the left operand, positional selection (take, mask), alignment checking (is_aligned), and subset/superset comparison (<=, >=):

radi.T1w.subjects.is_aligned(radi.seg.subjects)  # True
common = radi.T1w.subjects & radi.seg.subjects
train.index | val.index  # all subjects
train.index & val.index  # empty = no overlap

The standalone align(*indexes) function computes the intersection of multiple indexes, preserving order from the first.
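
A sketch combining these operations (collection accessors follow the examples above; take's return type is an assumption):

# Intersection of three subject indexes, ordered like the first operand
common = align(radi.T1w.subjects, radi.FLAIR.subjects, radi.seg.subjects)
assert common <= radi.T1w.subjects  # subset comparison
head = common.take([0, 1, 2])       # positional selection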

Shapes

Radiology dimensions are irregular across datasets (different scanners, protocols, preprocessing). VolumeCollection groups volumes with consistent spatial (X/Y/Z) dimensions — 4D volumes with different time dimensions but the same spatial grid share a collection. RadiObject organizes heterogeneous collections (e.g., T1w at 1mm^3, fMRI at 3mm^3) under a unified structure. Shape metadata is stored as follows (see the sketch after this list):

  • Uniform collections: x_dim, y_dim, z_dim stored in group metadata; is_uniform=True
  • Heterogeneous collections: no shape in group metadata; each volume's shape stored in obs.dimensions
  • 4D volumes: temporal dimension (t) is per-volume; not tracked at collection level
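
A hedged sketch of reading shape information under this layout (the attribute access paths are assumptions based on the names above):

col = radi.T1w
if col.is_uniform:
    shape = (col.x_dim, col.y_dim, col.z_dim)  # from group metadata
else:
    shapes = col.obs["dimensions"]             # per-volume, from the obs array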

Composition

Each entity exposes its underlying TileDB object as a public property. This gives power users direct access to TileDB while keeping the API surface simple. Entities are stateless: file handles are not cached in memory, which prevents file-handle exhaustion.
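
The open-per-access pattern, using TileDB Embedded's tiledb.open (URI illustrative):

import tiledb

# Open, read, and release per operation; no handle is held on the entity.
with tiledb.open("s3://bucket/dataset/collections/T1w/volumes/0") as arr:
    block = arr[0:64, 0:64, 0:64]  # handle closed on exit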

Anatomical Orientation

Medical images encode spatial orientation via an affine matrix mapping voxel indices to physical (world) coordinates. RadiObject preserves this information and optionally standardizes orientation during ingestion.

Orientation is described by three-letter codes (RAS, LPS, LAS) indicating which anatomical direction each axis points. See Lexicon: Coordinate Systems for terminology.
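
For example, nibabel derives the code from an affine (independent of RadiObject, shown only to illustrate the notation):

import nibabel as nib

img = nib.load("sub-01_T1w.nii.gz")
nib.aff2axcodes(img.affine)  # ('R', 'A', 'S') -> "RAS"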

Tile orientation vs anatomical orientation — distinct concepts:

  • Anatomical orientation (orientation_info): Physical coordinate system (RAS/LPS)
  • Tile orientation (tile_orientation): Storage chunking strategy for I/O performance

Choose tile orientation based on access patterns (see Benchmarks), not anatomical convention.
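
Concretely, tile orientation is just the tile extents in the dense array schema. A sketch with TileDB's schema API, using extents that favor axial (x/y plane) slab reads (values illustrative, not RadiObject defaults):

import numpy as np
import tiledb

# A small z extent keeps axial slabs within few tiles; the anatomical
# orientation (RAS/LPS) is unaffected by this storage choice.
dom = tiledb.Domain(
    tiledb.Dim(name="x", domain=(0, 255), tile=128, dtype=np.uint32),
    tiledb.Dim(name="y", domain=(0, 255), tile=128, dtype=np.uint32),
    tiledb.Dim(name="z", domain=(0, 255), tile=8, dtype=np.uint32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[tiledb.Attr(name="voxels", dtype=np.float32)],
    sparse=False,
)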

For reorientation configuration, see Ingest Data: Handling Orientation. For tile options, see Configuration: TileConfig.

Concurrency Model

RadiObject operates across four concurrency layers:

┌─────────────────────────────────────────────────────────────────┐
│ Layer 4: PyTorch DataLoader Workers (PROCESSES via fork)       │
│          num_workers=4, persistent_workers=True                 │
├─────────────────────────────────────────────────────────────────┤
│ Layer 3: Python ThreadPoolExecutor (THREADS)                   │
│          max_workers from ReadConfig (default: 4)              │
├─────────────────────────────────────────────────────────────────┤
│ Layer 2: TileDB Internal Threads                               │
│          sm.compute_concurrency_level, sm.io_concurrency_level │
├─────────────────────────────────────────────────────────────────┤
│ Layer 1: S3/VFS Level                                          │
│          vfs.s3.max_parallel_ops (default: 8)                  │
└─────────────────────────────────────────────────────────────────┘
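
A hedged sketch wiring layers 1-3 together (values are illustrative; read_volume is a hypothetical worker function):

from concurrent.futures import ThreadPoolExecutor
import tiledb

# Layers 1-2: TileDB-internal parallelism, configured on one shared context
ctx = tiledb.Ctx(tiledb.Config({
    "sm.compute_concurrency_level": "8",  # Layer 2
    "sm.io_concurrency_level": "8",       # Layer 2
    "vfs.s3.max_parallel_ops": "8",       # Layer 1
}))

uris = ["s3://bucket/dataset/collections/T1w/volumes/0"]  # illustrative

def read_volume(uri):
    with tiledb.open(uri, ctx=ctx) as arr:  # one Ctx shared across threads
        return arr[:]

with ThreadPoolExecutor(max_workers=4) as pool:  # Layer 3
    volumes = list(pool.map(read_volume, uris))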

Global State Management

get_tiledb_ctx() lazily initializes a global TileDB context from configure() settings. All data objects accept an optional ctx parameter — if None, they fall back to the global context.

Context Injection

import tiledb

class Volume:
    def __init__(self, uri: str, ctx: tiledb.Ctx | None = None):
        self._uri = uri
        self._ctx = ctx  # None = use global context

    def _effective_ctx(self) -> tiledb.Ctx:
        # An explicit ctx wins; otherwise fall back to the lazily
        # initialized global context from configure().
        return self._ctx if self._ctx else get_tiledb_ctx()
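
Callers can rely on the global context or inject their own (URIs illustrative):

vol = Volume("s3://bucket/dataset/collections/T1w/volumes/0")  # global ctx
vol_private = Volume("s3://bucket/dataset/collections/seg/volumes/0",
                     ctx=tiledb.Ctx())                         # injected ctx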

Threads vs Processes

From TileDB Wiki: libtiledb is thread-safe, and sharing one Ctx across a thread pool is optimal because schema and fragment metadata is cached per-Ctx.

RadiObject provides two semantically distinct context functions:

import tiledb

def ctx_for_threads(ctx=None) -> tiledb.Ctx:
    """Return context for thread pool workers. Shares caching."""
    return ctx if ctx else get_tiledb_ctx()

def ctx_for_process(base_ctx=None) -> tiledb.Ctx:
    """Create a new context for a forked process. Isolated memory."""
    if base_ctx is not None:
        # Clone the configuration, not the context: the child gets the
        # same settings but its own caches and thread pools.
        return tiledb.Ctx(base_ctx.config())
    return get_radiobject_config().to_tiledb_ctx()

| Scenario | Function | Behavior |
| --- | --- | --- |
| ThreadPoolExecutor | ctx_for_threads() | Returns the same context (shared caching) |
| multiprocessing.Pool | ctx_for_process() | Creates a new isolated context |
| DataLoader (num_workers>0) | ctx_for_process() | Creates a new isolated context |
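
For example, a DataLoader can rebuild an isolated context inside each forked worker via worker_init_fn (the dataset and module-level variable are illustrative):

from torch.utils.data import DataLoader

def _worker_init(worker_id: int) -> None:
    # Runs once in each forked worker: replace the inherited context
    # with a fresh one built from the same configuration.
    global _CTX
    _CTX = ctx_for_process()

loader = DataLoader(dataset, num_workers=4, persistent_workers=True,
                    worker_init_fn=_worker_init)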

For practical tuning recipes, see ML Training: Performance Tuning.

Consistency Model

RadiObject writes span multiple TileDB groups and arrays (volumes, obs DataFrame, group metadata). TileDB guarantees per-array atomicity but does not provide cross-array transactions. This has practical implications:

| Operation | Consistency | Notes |
| --- | --- | --- |
| Volume.from_numpy() | Atomic | Single dense array write |
| VolumeCollectionWriter | Best-effort | Volumes written individually; obs flushed on __exit__. A crash mid-write leaves partial volumes without obs rows. |
| RadiObject.append() | Best-effort | Modifies collections, obs_meta, and group metadata in sequence. Not transactional across these steps. |
| RadiObject.from_images() | Best-effort | Creates collections sequentially; group metadata updated last. |

Implications for concurrent access:

  • Single-writer: RadiObject is designed for single-writer, multiple-reader workflows. Concurrent writers to the same URI are not supported and may corrupt group metadata.
  • Crash recovery: Use validate() after unexpected interruptions to detect inconsistencies (orphan volumes, missing obs rows, metadata count mismatches); see the sketch after this list.
  • Idempotent re-creation: If a write fails, delete the URI and re-create rather than attempting partial recovery.
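
A minimal recovery sketch, assuming validate() returns a list of findings (the method name is from this page; its return type is an assumption):

radi = RadiObject("s3://bucket/dataset")
issues = radi.validate()  # assumed: list of inconsistency reports
if issues:
    # Per the guidance above, prefer delete-and-recreate over
    # patching partially written state.
    for issue in issues:
        print(issue)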

Why not transactional? TileDB Groups are metadata containers, not transactional databases. Cross-group atomicity would require a write-ahead log or two-phase commit, adding complexity disproportionate to the current use case (batch ETL, not OLTP).

Cache

RadiObject delegates data caching to TileDB Embedded. There is no application-level data cache — Python @cached_property is used only for immutable schema and metadata objects.

TileDB's Caching Layers

TileDB Embedded maintains two in-memory caches, both scoped to a tiledb.Ctx:

| Layer | Config Parameter | Default | Scope | Purpose |
| --- | --- | --- | --- | --- |
| Tile cache | sm.tile_cache_size | 512 MB (RadiObject) | Per-context, across queries | LRU cache of uncompressed tiles; avoids re-reading and re-decompressing on repeated access |
| VFS read-ahead | vfs.read_ahead_cache_size | 10 MB | Per-context | Cloud-only (S3/GCS/Azure). Speculatively fetches extra bytes on small reads to reduce round-trips |

Tile Cache vs Memory Budget

These are distinct mechanisms:

  • sm.tile_cache_size (default 512 MB): LRU cache that persists uncompressed tiles across queries within the same context. Repeated reads of the same tile region hit this cache instead of disk/S3.
  • sm.memory_budget (default 1 GB): Per-query throttle that caps how much data TileDB will fetch in a single read. Prevents OOM on large subarray reads. Does not cache anything for reuse.
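
Both knobs are ordinary TileDB configuration parameters, so they can be set on the context directly (values illustrative, not RadiObject's defaults):

import tiledb

ctx = tiledb.Ctx(tiledb.Config({
    "sm.tile_cache_size": str(512 * 1024**2),  # cross-query LRU tile cache
    "sm.memory_budget": str(1024**3),          # per-query read throttle
}))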

Context Sharing and Cache Lifetime

The tile cache lives inside tiledb.Ctx. This is why context sharing matters:

  • Threads (ctx_for_threads): Workers share one context, so they share one tile cache. A tile read by thread A is available to thread B.
  • Processes (ctx_for_process): Each forked process gets an isolated context with its own empty cache. No cross-process cache sharing.

See Concurrency Model for context injection details.