BraTS 2020 Data Ingestion
Ingests BraTS 2020 challenge data into a RadiObject. Run this once before notebooks 01-04.
- Check if RadiObject exists (skip if so)
- Create RadiObject with 5 collections: FLAIR, T1w, T1gd, T2w, seg
- Include subject metadata: age, survival days, resection status
Data Source: BraTS 2020 Challenge via Kaggle. Requires Kaggle API setup (instructions). For a no-auth alternative, set with_metadata=False in the download cell to use the public MSD bucket (no clinical metadata).
Configuration: See S3 Setup for cloud storage options.
import json
from pathlib import Path
import pandas as pd
from radiobject import (
    CompressionConfig,
    Compressor,
    RadiObject,
    S3Config,
    SliceOrientation,
    TileConfig,
    WriteConfig,
    configure,
    uri_exists,
)
from radiobject.data import get_brats_nifti_path
# ── Storage URI ──────────────────────────────────────────────────
# Default: S3 (requires AWS credentials)
BRATS_URI = "s3://souzy-scratch/radiobject/brats-tutorial"
# For local storage, comment out the line above and uncomment:
# BRATS_URI = "./data/brats_radiobject"
# ─────────────────────────────────────────────────────────────────
print(f"Target URI: {BRATS_URI}")
Target URI: s3://souzy-scratch/radiobject/brats-tutorial
# Configure TileDB storage
configure(
    s3=S3Config(region="us-east-2"),
    write=WriteConfig(
        tile=TileConfig(orientation=SliceOrientation.AXIAL),
        compression=CompressionConfig(algorithm=Compressor.ZSTD, level=3),
    ),
)
if uri_exists(BRATS_URI):
    print(f"RadiObject already exists at {BRATS_URI}")
    print("Skipping ingestion. Delete the URI to re-ingest.")
    SKIP_INGESTION = True
else:
    print(f"No RadiObject found at {BRATS_URI}")
    print("Proceeding with ingestion...")
    SKIP_INGESTION = False
RadiObject already exists at s3://souzy-scratch/radiobject/brats-tutorial
Skipping ingestion. Delete the URI to re-ingest.
if not SKIP_INGESTION:
    # Get BraTS 2020 data with metadata (downloads from Kaggle if not cached)
    # Set with_metadata=False to use MSD version without Kaggle API
    NIFTI_DIR = get_brats_nifti_path(with_metadata=True)
    # Load manifest - contains paths and metadata for each subject
    manifest_path = NIFTI_DIR / "manifest.json"
    with open(manifest_path) as f:
        manifest = json.load(f)
    print(f"Found {len(manifest)} BraTS subjects")
    print("Sample entry:")
    print(json.dumps(manifest[0], indent=2))
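The manifest is a JSON list with one entry per subject. As a rough, self-contained sketch of how such a manifest can be checked before ingestion, the snippet below writes a tiny hypothetical manifest into a temporary directory and reports which required files are missing for each subject. The key names follow the ones this notebook reads from the manifest; the `demo_*` IDs, file names, and the `missing_files` helper are invented for illustration and are not part of `radiobject`.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ["t1_path", "t1ce_path", "t2_path", "flair_path", "seg_path"]


def missing_files(entry: dict, base_dir: Path) -> list[str]:
    """Return the required keys that are absent or whose file does not exist."""
    return [
        key
        for key in REQUIRED_KEYS
        if key not in entry or not (base_dir / entry[key]).exists()
    ]


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # Hypothetical two-subject manifest; values are illustrative only.
    demo_manifest = [
        {"sample_id": sid, **{k: f"{sid}_{k}.nii.gz" for k in REQUIRED_KEYS}}
        for sid in ("demo_001", "demo_002")
    ]
    # Create every file for demo_001, but omit the segmentation for demo_002.
    for entry in demo_manifest:
        for key in REQUIRED_KEYS:
            if entry["sample_id"] == "demo_002" and key == "seg_path":
                continue
            (base / entry[key]).touch()
    (base / "manifest.json").write_text(json.dumps(demo_manifest))

    # Reload from disk, as the notebook does, and build a per-subject report.
    loaded = json.loads((base / "manifest.json").read_text())
    report = {e["sample_id"]: missing_files(e, base) for e in loaded}

print(report)
```

Reporting the missing keys, rather than only a boolean, makes it easier to see why a subject would be dropped.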
Subject Metadata (obs_meta)
The obs_meta DataFrame provides subject-level metadata — one row per patient.
- obs_subject_id: Unique subject identifier (required); links volumes across modalities
- Additional columns: age, survival days, and resection status from the BraTS challenge
- obs_ids: System-managed JSON list of volume obs_ids per subject (auto-populated after ingestion)
The BraTS 2020 dataset includes real clinical metadata for survival prediction tasks.
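To make the "one row per patient" shape concrete, here is a minimal pandas sketch that builds a subject-level table from manifest-style records. The records are hypothetical (including the non-numeric `"ALIVE"` value, added only to show coercion) and do not come from the BraTS data.

```python
import pandas as pd

# Hypothetical manifest-style records; values are illustrative, not BraTS data.
records = [
    {"sample_id": "demo_001", "age": "60.5", "survival_days": "289", "resection_status": "GTR"},
    {"sample_id": "demo_002", "age": "52.3", "survival_days": "ALIVE", "resection_status": None},
    {"sample_id": "demo_003", "age": None, "survival_days": None},
]

demo_obs_meta = pd.DataFrame(
    {
        "obs_subject_id": [r["sample_id"] for r in records],
        "age": [r.get("age") for r in records],
        "survival_days": [r.get("survival_days") for r in records],
        # Missing or None resection status becomes an empty string.
        "resection_status": [r.get("resection_status") or "" for r in records],
    }
)

# errors="coerce" turns unparseable or missing values into NaN instead of raising.
for col in ("age", "survival_days"):
    demo_obs_meta[col] = pd.to_numeric(demo_obs_meta[col], errors="coerce")

# One row per subject implies the identifier column must be unique.
ids_unique = bool(demo_obs_meta["obs_subject_id"].is_unique)

print(demo_obs_meta)
print("unique subject IDs:", ids_unique)
```

Checking uniqueness of `obs_subject_id` before ingestion is a cheap way to catch duplicated manifest entries early.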
if not SKIP_INGESTION:
    # Filter to subjects with complete data (all modalities + segmentation files exist)
    def has_complete_files(entry: dict, base_dir: Path) -> bool:
        """Check that all required NIfTI files exist for this subject."""
        required_keys = ["t1_path", "t1ce_path", "t2_path", "flair_path", "seg_path"]
        for key in required_keys:
            if key not in entry:
                return False
            if not (base_dir / entry[key]).exists():
                return False
        return True

    complete_entries = [e for e in manifest if has_complete_files(e, NIFTI_DIR)]
    print(f"Using {len(complete_entries)} subjects with complete data")
    # Build obs_meta from manifest metadata
    obs_meta = pd.DataFrame(
        {
            "obs_subject_id": [entry["sample_id"] for entry in complete_entries],
            "age": [entry.get("age") for entry in complete_entries],
            "survival_days": [entry.get("survival_days") for entry in complete_entries],
            "resection_status": [
                entry.get("resection_status", "") or "" for entry in complete_entries
            ],
            "dataset": "BraTS2020",
        }
    )
    # Convert numeric columns
    obs_meta["age"] = pd.to_numeric(obs_meta["age"], errors="coerce")
    obs_meta["survival_days"] = pd.to_numeric(obs_meta["survival_days"], errors="coerce")
    print(f"Created obs_meta with {len(obs_meta)} subjects")
    print("Metadata summary:")
    age_col = obs_meta["age"].dropna()
    print(f" Age: {age_col.min():.0f} - {age_col.max():.0f} years")
    resection_counts = obs_meta["resection_status"].value_counts().to_dict()
    print(f" Resection status: {resection_counts}")
    display(obs_meta.head(10))
if not SKIP_INGESTION:
    # Build images dict mapping collection names to (path, subject_id) lists
    images = {
        "T1w": [(NIFTI_DIR / entry["t1_path"], entry["sample_id"]) for entry in complete_entries],
        "T1gd": [
            (NIFTI_DIR / entry["t1ce_path"], entry["sample_id"]) for entry in complete_entries
        ],
        "T2w": [(NIFTI_DIR / entry["t2_path"], entry["sample_id"]) for entry in complete_entries],
        "FLAIR": [
            (NIFTI_DIR / entry["flair_path"], entry["sample_id"]) for entry in complete_entries
        ],
        "seg": [(NIFTI_DIR / entry["seg_path"], entry["sample_id"]) for entry in complete_entries],
    }
    print("Collections to ingest:")
    for name, paths in images.items():
        print(f" {name}: {len(paths)} volumes")
if not SKIP_INGESTION:
    print(f"Creating RadiObject at: {BRATS_URI}")
    radi = RadiObject.from_images(
        uri=BRATS_URI,
        images=images,
        obs_meta=obs_meta,
        validate_alignment=True,
        progress=True,
    )
    print(f"Created: {radi}")
if not SKIP_INGESTION:
    radi.validate()
    print("Validation passed")
    print(f"Collections: {radi.collection_names}")
    print(f"Subjects: {len(radi)}")
# Load from URI (works whether we just created it or it already existed)
radi = RadiObject(BRATS_URI)
print(f"Loaded: {radi}")
print(f"Collections: {radi.collection_names}")
print(f"Subjects: {len(radi)}")
# Quick data check
vol = radi.FLAIR.iloc[0]
print(f"Sample volume: {vol}")
print(f"Axial slice shape: {vol.axial(z=77).shape}")
Loaded: RadiObject(368 subjects, 5 collections: [seg, T2w, FLAIR, T1gd, T1w])
Collections: ('seg', 'T2w', 'FLAIR', 'T1gd', 'T1w')
Subjects: 368
Sample volume: Volume(shape=240x240x155, dtype=int16, obs_id='BraTS20_Training_001_FLAIR')
Axial slice shape: (240, 240)
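The shapes printed above (a 240x240x155 volume and a 240x240 axial slice) are consistent with an axial slice fixing the third axis. The NumPy sketch below illustrates that indexing on a synthetic array; it assumes an (x, y, z) axis order and is not how `vol.axial` is implemented in RadiObject.

```python
import numpy as np

# Synthetic stand-in for a BraTS-sized volume; zeros, not real image data.
volume = np.zeros((240, 240, 155), dtype=np.int16)

# Assumption: axes are ordered (x, y, z), so an axial slice fixes the last index.
axial = volume[:, :, 77]
# For comparison, fixing the first or second axis yields the other two planes.
sagittal = volume[120, :, :]
coronal = volume[:, 120, :]

print(axial.shape, sagittal.shape, coronal.shape)
```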
obs_meta vs obs: Subject vs Volume Metadata
RadiObject has two levels of metadata:
| Level | Accessor | Scope | Example Fields |
|---|---|---|---|
| Subject | radi.obs_meta | One row per patient | obs_subject_id, age, survival_days, obs_ids (system-managed) |
| Volume | radi.FLAIR.obs | One row per volume | obs_id, obs_subject_id, voxel_spacing, dimensions |
The obs_subject_id column links these levels - each subject can have multiple volumes across collections. The obs_ids column in obs_meta is a JSON list of all volume obs_ids linked to that subject (auto-populated).
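The linkage between the two levels can be sketched with plain pandas. The tables below are toy stand-ins with made-up `demo_*` identifiers, not output from RadiObject; they only show how a shared `obs_subject_id` column allows a join and how a JSON-encoded `obs_ids` value can be decoded.

```python
import json

import pandas as pd

# Toy subject-level table; obs_ids holds a JSON-encoded list of volume IDs.
subjects = pd.DataFrame(
    {
        "obs_subject_id": ["demo_001", "demo_002"],
        "age": [60.5, 52.3],
        "obs_ids": [
            json.dumps(["demo_001_FLAIR", "demo_001_T1w"]),
            json.dumps(["demo_002_FLAIR"]),
        ],
    }
)

# Toy volume-level table for a single collection.
flair_obs = pd.DataFrame(
    {
        "obs_id": ["demo_001_FLAIR", "demo_002_FLAIR"],
        "obs_subject_id": ["demo_001", "demo_002"],
    }
)

# Attach each volume's subject-level metadata via the shared key.
joined = flair_obs.merge(
    subjects[["obs_subject_id", "age"]], on="obs_subject_id", how="left"
)

# Decode the JSON list to count the volumes linked to each subject.
volume_counts = subjects["obs_ids"].map(lambda s: len(json.loads(s))).tolist()

print(joined)
print(volume_counts)
```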
# Subject-level metadata (one row per patient)
print("Subject metadata (obs_meta):")
display(radi.obs_meta.read().head())
# Volume-level metadata (one row per volume in a collection)
print("Volume metadata (FLAIR.obs):")
display(
    radi.FLAIR.obs.read(columns=["obs_id", "obs_subject_id", "dimensions", "voxel_spacing"]).head()
)
Subject metadata (obs_meta):
| | obs_subject_id | age | survival_days | resection_status | dataset | obs_ids |
|---|---|---|---|---|---|---|
| 0 | BraTS20_Training_001 | 60.463 | 289.0 | GTR | BraTS2020 | ["BraTS20_Training_001_FLAIR", "BraTS20_Traini... |
| 1 | BraTS20_Training_002 | 52.263 | 616.0 | GTR | BraTS2020 | ["BraTS20_Training_002_FLAIR", "BraTS20_Traini... |
| 2 | BraTS20_Training_003 | 54.301 | 464.0 | GTR | BraTS2020 | ["BraTS20_Training_003_FLAIR", "BraTS20_Traini... |
| 3 | BraTS20_Training_004 | 39.068 | 788.0 | GTR | BraTS2020 | ["BraTS20_Training_004_FLAIR", "BraTS20_Traini... |
| 4 | BraTS20_Training_005 | 68.493 | 465.0 | GTR | BraTS2020 | ["BraTS20_Training_005_FLAIR", "BraTS20_Traini... |
Volume metadata (FLAIR.obs):
| | obs_subject_id | obs_id | dimensions | voxel_spacing |
|---|---|---|---|---|
| 0 | BraTS20_Training_001 | BraTS20_Training_001_FLAIR | (240, 240, 155) | (1.0, 1.0, 1.0) |
| 1 | BraTS20_Training_002 | BraTS20_Training_002_FLAIR | (240, 240, 155) | (1.0, 1.0, 1.0) |
| 2 | BraTS20_Training_003 | BraTS20_Training_003_FLAIR | (240, 240, 155) | (1.0, 1.0, 1.0) |
| 3 | BraTS20_Training_004 | BraTS20_Training_004_FLAIR | (240, 240, 155) | (1.0, 1.0, 1.0) |
| 4 | BraTS20_Training_005 | BraTS20_Training_005_FLAIR | (240, 240, 155) | (1.0, 1.0, 1.0) |
# Filter subjects by clinical metadata
# Example: subjects over 50 with gross total resection (GTR)
filtered = radi.filter("age > 50 and resection_status == 'GTR'")
print(f"Subjects over 50 with GTR: {len(filtered)}")
subject_ids = filtered.obs_subject_ids[:5]
print(f"Subject IDs: {subject_ids}...")
Subjects over 50 with GTR: 101
Subject IDs: ['BraTS20_Training_001', 'BraTS20_Training_002', 'BraTS20_Training_003', 'BraTS20_Training_005', 'BraTS20_Training_006']...
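The filter expression used above reads like a pandas query string. As an illustration only, the same expression can be run against a toy DataFrame with `DataFrame.query`; whether `radi.filter` evaluates expressions with exactly these semantics is an assumption, and the rows below are invented.

```python
import pandas as pd

# Toy subject table; IDs and values are illustrative only.
demo = pd.DataFrame(
    {
        "obs_subject_id": ["demo_001", "demo_002", "demo_003", "demo_004"],
        "age": [60.5, 39.1, 68.5, 55.0],
        "resection_status": ["GTR", "GTR", "STR", "GTR"],
    }
)

# Same expression string as in the notebook cell above.
expr = "age > 50 and resection_status == 'GTR'"
selected = demo.query(expr)["obs_subject_id"].tolist()

print(selected)
```

Here `demo_002` is excluded by the age condition and `demo_003` by the resection status.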
Next Steps
The RadiObject is now available at BRATS_URI. Proceed to the tutorial notebooks:
- 01_explore_data.ipynb - Explore RadiObject, collections, and volumes
- 02_configuration.ipynb - Write settings, read tuning, S3 config