# Ingest Data
RadiObject ingests NIfTI and DICOM files into TileDB arrays for efficient storage and retrieval. For terminology, see the Lexicon.
## Choosing an Ingestion Method

| Method | Best For | Memory |
|---|---|---|
| `RadiObject.from_images()` | Small-to-medium datasets (<1000 subjects) | Fits in memory |
| `VolumeCollectionWriter` | Large single-collection datasets | Constant |
| `RadiObjectWriter` | Complex multi-collection builds | Controlled batches |

Decision guide:

- Can all data fit in memory? Use `RadiObject.from_images()`.
- Single collection, too large for memory? Use `VolumeCollectionWriter`.
- Multiple collections with custom logic? Use `RadiObjectWriter`.
## Creating a RadiObject

`from_images()` auto-detects NIfTI and DICOM sources per collection:

```python
from radiobject import RadiObject

radi = RadiObject.from_images(
    uri="./my-dataset",
    images={
        "CT": "./imagesTr/*.nii.gz",  # Glob pattern (NIfTI)
        "seg": "./labelsTr",          # Directory path (NIfTI)
    },
    validate_alignment=True,  # Ensure matching subject IDs
    obs_meta=metadata_df,     # Optional subject metadata (pandas DataFrame)
    progress=True,
)
```
Source formats (values in the `images` dict):

| Format | Auto-detected as | Example |
|---|---|---|
| Glob pattern | NIfTI | `"./data/*.nii.gz"` |
| Directory (contains `.nii`) | NIfTI | `"./data/imagesTr"` |
| Directory (contains DICOM subdirs) | DICOM | `"./data/dicom_series"` |
| Pre-resolved list | Inspects first path | `[(path, subject_id), ...]` |
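The pre-resolved list form gives explicit control over the path-to-subject mapping; the file paths below are illustrative:

```python
from radiobject import RadiObject

pairs = [
    ("./imagesTr/sub-01_ct.nii.gz", "sub-01"),
    ("./imagesTr/sub-02_ct.nii.gz", "sub-02"),
]

radi = RadiObject.from_images(
    uri="./explicit-dataset",
    images={"CT": pairs},  # Explicit (path, subject_id) pairs
)
```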
Options:

| Parameter | Description |
|---|---|
| `validate_alignment` | Verify all collections have matching subject IDs |
| `obs_meta` | DataFrame with subject-level metadata (must have an `obs_subject_id` column) |
| `reorient` | Reorient volumes to canonical orientation during ingestion |
| `format_hint` | Dict mapping collection names to `"nifti"` or `"dicom"` for ambiguous sources |
| `progress` | Show a progress bar |
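For reference, a valid `obs_meta` frame needs only the required `obs_subject_id` key column; the other columns here are arbitrary examples:

```python
import pandas as pd

metadata_df = pd.DataFrame({
    "obs_subject_id": ["sub-01", "sub-02"],  # Required key column
    "age": [45, 52],
    "diagnosis": ["tumor", "healthy"],
})
```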
## DICOM Sources

DICOM directories work through the same `images` dict. Each immediate subdirectory is treated as one DICOM series:
```python
radi = RadiObject.from_images(
    uri="./dicom-dataset",
    images={
        "CT_head": [
            ("./dicom/sub01/CT_HEAD", "sub-01"),  # (series directory, subject ID)
            ("./dicom/sub02/CT_HEAD", "sub-02"),
        ],
    },
    progress=True,
)
```
For ambiguous directories, use `format_hint`:

```python
radi = RadiObject.from_images(
    uri="./study",
    images={"CT": "/ambiguous/directory/"},
    format_hint={"CT": "dicom"},  # Force DICOM parsing for this collection
)
```
## Streaming Writes

For datasets too large to fit in memory, use streaming writers to build RadiObjects incrementally.

### VolumeCollectionWriter

Incremental writes to a single VolumeCollection:
```python
from radiobject import VolumeCollection
from radiobject.writers import VolumeCollectionWriter

with VolumeCollectionWriter("./large-collection", name="CT") as writer:
    # discover_niftis / load_nifti stand in for your own discovery and loading code
    for path, subject_id in discover_niftis("./source"):
        data = load_nifti(path)
        writer.write_volume(
            data=data,
            obs_id=f"{subject_id}_CT",
            obs_subject_id=subject_id,
            voxel_spacing=(1.0, 1.0, 1.0),
        )

# Open the completed collection
collection = VolumeCollection("./large-collection")
```
`write_volume()` accepts `(X, Y, Z)` or `(X, Y, Z, T)` numpy arrays. Additional keyword arguments become volume-level obs attributes. `write_batch()` accepts a list of `(data, obs_id, obs_subject_id, attrs_dict)` tuples.
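A minimal `write_batch()` sketch following that tuple layout; the zero-filled arrays are placeholders for real volume data:

```python
import numpy as np
from radiobject.writers import VolumeCollectionWriter

# Synthetic placeholder volumes, shape (X, Y, Z)
vol_a = np.zeros((64, 64, 32), dtype=np.float32)
vol_b = np.zeros((64, 64, 32), dtype=np.float32)

# Each entry: (data, obs_id, obs_subject_id, attrs_dict)
batch = [
    (vol_a, "sub-01_CT", "sub-01", {"voxel_spacing": (1.0, 1.0, 1.0)}),
    (vol_b, "sub-02_CT", "sub-02", {"voxel_spacing": (1.0, 1.0, 1.0)}),
]

with VolumeCollectionWriter("./batched-collection", name="CT") as writer:
    writer.write_batch(batch)
```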
### RadiObjectWriter

Builds complete RadiObjects with multiple collections:
```python
import pandas as pd

from radiobject import RadiObject
from radiobject.writers import RadiObjectWriter

with RadiObjectWriter("./large-dataset") as writer:
    # Subject-level metadata, keyed by obs_subject_id
    writer.write_obs_meta(pd.DataFrame({
        "obs_subject_id": ["sub-01", "sub-02"],
        "age": [45, 52],
        "diagnosis": ["tumor", "healthy"],
    }))

    # load_t1w_data / load_flair_data stand in for your own loaders
    with writer.add_collection("T1w") as t1_writer:
        for subject_id, data in load_t1w_data():
            t1_writer.write_volume(data, obs_id=f"{subject_id}_T1w", obs_subject_id=subject_id)

    with writer.add_collection("FLAIR") as flair_writer:
        for subject_id, data in load_flair_data():
            flair_writer.write_volume(data, obs_id=f"{subject_id}_FLAIR", obs_subject_id=subject_id)

# Open the completed RadiObject
radi = RadiObject("./large-dataset")
```
Streaming writers work with S3 URIs. For S3 write performance, configure parallel uploads:

```python
from radiobject import configure, S3Config

configure(s3=S3Config(max_parallel_ops=16, multipart_part_size_mb=100))
```
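With parallel uploads configured, the same writer pattern targets S3 directly; the bucket name below is illustrative:

```python
import numpy as np
from radiobject.writers import VolumeCollectionWriter

data = np.zeros((64, 64, 32), dtype=np.float32)  # Placeholder volume

# Same writer pattern, with an S3 URI instead of a local path
with VolumeCollectionWriter("s3://my-bucket/large-collection", name="CT") as writer:
    writer.write_volume(data=data, obs_id="sub-01_CT", obs_subject_id="sub-01")
```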
## Appending Data

Add new subjects to existing RadiObjects without rebuilding:
```python
import pandas as pd

radi = RadiObject("./my-dataset")

new_obs_meta = pd.DataFrame({
    "obs_subject_id": ["sub-100", "sub-101"],
    "age": [38, 45],
    "diagnosis": ["healthy", "tumor"],
})

radi.append(
    images={
        "T1w": [
            ("sub-100_T1w.nii.gz", "sub-100"),
            ("sub-101_T1w.nii.gz", "sub-101"),
        ],
    },
    obs_meta=new_obs_meta,
    progress=True,
)
```
Appends are atomic: both `obs_meta` and volumes are written together. Cached properties are automatically invalidated.
You can also append directly to a `VolumeCollection`.
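A minimal sketch, assuming `VolumeCollection` exposes an `append()` method mirroring `RadiObject.append()`; the method name, signature, and collection path are assumptions, not confirmed by this page:

```python
from radiobject import VolumeCollection

collection = VolumeCollection("./my-dataset/T1w")

# Assumed API: append() taking (path, subject_id) pairs, as RadiObject.append() does
collection.append(
    images=[
        ("sub-100_T1w.nii.gz", "sub-100"),
        ("sub-101_T1w.nii.gz", "sub-101"),
    ],
)
```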
Constraints:

- Append adds to existing collections only. To add new collections, use `add_collection()`.
- Appended `obs_subject_id` values must not already exist in the RadiObject.
- New subjects require corresponding metadata in the `obs_meta` DataFrame.
## Handling Orientation

Medical images use coordinate systems (RAS, LPS) to map voxels to physical space. RadiObject can reorient volumes during ingestion.
```python
from radiobject import configure, WriteConfig, OrientationConfig

configure(write=WriteConfig(
    orientation=OrientationConfig(canonical_target="RAS", reorient_on_load=True)
))

radi = RadiObject.from_images(uri, images={"CT": "./data"})
```
| Scenario | Recommendation |
|---|---|
| Multi-site studies with inconsistent orientations | Reorient |
| Preserving original acquisition geometry | Don't reorient |
| ML training (standardized input expected) | Reorient |
| Clinical review (native orientation expected) | Don't reorient |
See Lexicon: Coordinate Systems for terminology.
## 4D Data (fMRI, DTI)

RadiObject natively supports 4D NIfTI volumes. No special configuration is needed:
```python
radi = RadiObject.from_images(
    uri="./fmri-study",
    images={"bold": "./func/*_bold.nii.gz"},
)

vol = radi.bold.iloc[0]
print(vol.shape)  # (64, 64, 32, 200): 200 timepoints
```
How dimensions work:

- Collection shape reports spatial dimensions only: `(64, 64, 32)`
- Volume shape reports the full shape, including time: `(64, 64, 32, 200)`
- Shape validation compares spatial dims only, so volumes with different timepoint counts pass `validate_dimensions=True` (see the sketch below)
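A quick check of the spatial-versus-full distinction; the collection-level `shape` attribute is assumed to match the behavior described above:

```python
print(radi.bold.shape)          # (64, 64, 32): spatial dims only
print(radi.bold.iloc[0].shape)  # (64, 64, 32, 200): full shape including time
```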
## Managing Existing Data

### Check If Data Exists

```python
from radiobject import uri_exists

if uri_exists("s3://bucket/my-dataset"):
    print("Dataset already exists")
```
### Delete and Reingest

```python
from radiobject import delete_tiledb_uri

delete_tiledb_uri("s3://bucket/my-dataset")
radi = RadiObject.from_images("s3://bucket/my-dataset", images={"CT": "./data"})
```
### Create from Existing VolumeCollections

Assemble a RadiObject from pre-existing VolumeCollections:

```python
# e.g. existing_collection = VolumeCollection("./large-collection")
new_radi = RadiObject.from_collections(
    uri="./new-dataset",
    collections={"T1w": existing_collection},
    obs_meta=metadata_df,
)
```
For full API details, see RadiObject API.