Performance & memory safety
tile_process is built so a run adapts to whatever machine it lands on and
can't run out of RAM/VRAM or freeze the box — without you tuning anything.
Automatic, machine-aware concurrency
The staging step (running your fn once per tile to a temp store) and the
merge step are sized to the host automatically:
- GPU (
use_gpu=True) → one tile at a time, so concurrent evaluations can never exhaust VRAM. - CPU → as many tiles in flight as fit 80 % of available RAM (estimated from the tile size), and always leaving one core free so the machine stays responsive — it never pins every core.
The RAM figure is read live via psutil; without it, a conservative default is
used instead of guessing high.
Live progress dashboard (GPU runs)
A single-GPU run still gets a Dask dashboard: patchworks spins up a tiny 1-worker / 1-thread in-process cluster, which keeps GPU evaluations serial (no VRAM contention) while exposing the dashboard so you can watch tiles stream through. The URL is logged at the start of staging:
This needs distributed (and bokeh for the UI) installed; if they are
missing, patchworks logs a warning and falls back to the threaded scheduler
(no dashboard, same result). A cluster you start yourself
(make_local_cluster) is used as-is instead.
Overriding the worker count
from patchworks import tile_process
# let patchworks pick (recommended)
tile_process("scan.zarr", fn)
# or cap it yourself (staging threads + merge processes)
tile_process("scan.zarr", fn, max_workers=8)
max_workers bounds both staging and merging. A running distributed client
manages its own concurrency, so the override is skipped there — configure the
cluster's memory limits instead.
Why it won't OOM or freeze
| Resource | Guard |
|---|---|
| RAM | concurrent tiles × tile size × overhead ≤ 80 % of available RAM |
| VRAM | GPU path runs one tile at a time |
| CPU | always leaves at least one core free |
| Disk I/O | each pyramid/stage level is streamed chunk-by-chunk; no whole volume in memory |
The staging graph itself is kept small — a single fused map_overlap
(halo → fn → trim) rather than three separate passes — and there is no
extra read-back of the staged data.
Getting more speed
tile_shape="auto"sizes tiles to free RAM (or VRAM withuse_gpu=True).skip_empty=Truewithestimate_empty_tiles()skips background tiles.- A Dask distributed cluster (
make_local_cluster) parallelises across workers/GPUs; patchworks then defers concurrency to the cluster.
What doesn't help here
The merge and relabel steps are already vectorised NumPy + SciPy (C-level)
with no per-voxel Python loop, and the pipeline is I/O-bound — so numba,
cupy, arrow and xarray bring essentially nothing. The real levers are
tile size, concurrency (above) and zarr chunking.