Performance & memory safety

tile_process is built so a run adapts to whatever machine it lands on and can't run out of RAM/VRAM or freeze the box — without you tuning anything.

Automatic, allocation-aware concurrency

The staging step (running your fn once per tile and writing each result to a temp store) and the merge step are sized automatically:

GPU (use_gpu=True) → one tile at a time, so concurrent evaluations can never exhaust VRAM.
CPU → as many tiles in flight as fit in 80 % of available RAM (estimated from the tile size), always leaving one core free so the machine stays responsive; it never pins every core.
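
As a rough illustration of the sizing rule above, the sketch below combines the RAM bound and the CPU bound into one worker count. It is not patchworks' actual code: the function name auto_workers, its parameters, and the overhead default of 3.0 are assumptions made for the example. Only the 80 % fraction, the one-free-core rule and the one-tile GPU rule come from the text.

```python
import math


def auto_workers(tile_bytes, available_bytes, n_cores, use_gpu=False,
                 overhead=3.0, ram_fraction=0.8):
    """Illustrative worker count (hypothetical helper, not the library API).

    `overhead` is an assumed multiplier for temporaries created while one
    tile is evaluated; the real value is not documented here.
    """
    if use_gpu:
        # GPU path: strictly one tile at a time, so VRAM is never shared.
        return 1
    # RAM bound: concurrent tiles x tile size x overhead <= 80 % of available RAM.
    by_ram = math.floor(ram_fraction * available_bytes / (tile_bytes * overhead))
    # CPU bound: always leave one core free.
    by_cpu = n_cores - 1
    # Never drop below one worker, even on a single-core or low-memory machine.
    return max(1, min(by_ram, by_cpu))


GIB = 1024 ** 3
print(auto_workers(tile_bytes=1 * GIB, available_bytes=64 * GIB, n_cores=16))
print(auto_workers(tile_bytes=1 * GIB, available_bytes=8 * GIB, n_cores=16))
```

With 64 GiB available the CPU bound wins (15 workers on 16 cores); with 8 GiB the RAM bound wins (2 workers).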

"Available" means available to this process, not to the machine. On a shared cluster node those differ wildly — a 32-core, 128 GB job on a 128-core, 512 GB node would otherwise size itself for the whole box and get OOM-killed while its own accounting said it had room. patchworks takes the smallest of SLURM_MEM_PER_NODE, SLURM_MEM_PER_CPU × cpus, the cgroup limit (the one that actually triggers the kill) and psutil's free RAM, and reads the core count from SLURM_CPUS_PER_TASK or the process' CPU affinity mask.

Without any of those signals, a conservative default is used instead of guessing high.
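
The "smallest of all signals" logic can be sketched as follows. This is a simplified approximation under stated assumptions, not the library's implementation: the helper names, the 2 GiB fallback, and the choice to read only the root cgroup paths are all made up for the example (a real process may live in a nested cgroup). Slurm memory variables are treated as MiB, and free_bytes would typically come from something like psutil.virtual_memory().available.

```python
import os

MIB = 1024 ** 2
GIB = 1024 ** 3


def read_cgroup_limit():
    """Return a cgroup memory limit in bytes, or None if none is readable."""
    for path in ("/sys/fs/cgroup/memory.max",                     # cgroup v2
                 "/sys/fs/cgroup/memory/memory.limit_in_bytes"):  # cgroup v1
        try:
            with open(path) as fh:
                raw = fh.read().strip()
        except OSError:
            continue
        if raw.isdigit():  # cgroup v2 reports the literal "max" when unlimited
            return int(raw)
    return None


def usable_cores(env=os.environ):
    """Core count from Slurm, else the affinity mask, else os.cpu_count()."""
    raw = env.get("SLURM_CPUS_PER_TASK", "")
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    if hasattr(os, "sched_getaffinity"):  # available on Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def usable_memory(env, n_cores, cgroup_limit=None, free_bytes=None,
                  fallback=2 * GIB):
    """Smallest known memory bound in bytes; `fallback` is an assumed default."""
    candidates = []
    per_node = env.get("SLURM_MEM_PER_NODE", "")
    if per_node.isdigit():
        candidates.append(int(per_node) * MIB)
    per_cpu = env.get("SLURM_MEM_PER_CPU", "")
    if per_cpu.isdigit():
        candidates.append(int(per_cpu) * MIB * n_cores)
    if cgroup_limit:
        candidates.append(cgroup_limit)
    if free_bytes:
        candidates.append(free_bytes)
    # No signal at all: stay conservative rather than guessing high.
    return min(candidates) if candidates else fallback


# The 128 GB job on a 512 GB node from the text: the Slurm limit wins.
env = {"SLURM_MEM_PER_NODE": "131072", "SLURM_CPUS_PER_TASK": "32"}
cores = usable_cores(env)
print(cores, usable_memory(env, cores, free_bytes=400 * GIB) // GIB)
```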

Live progress dashboard (GPU runs)

A single-GPU run still gets a Dask dashboard: patchworks spins up a tiny 1-worker / 1-thread in-process cluster, which keeps GPU evaluations serial (no VRAM contention) while exposing the dashboard so you can watch tiles stream through. The URL is logged at the start of staging:

INFO:patchworks._core:Dask dashboard for this run: http://127.0.0.1:8787/status

This needs distributed (and bokeh for the UI) installed; if they are missing, patchworks logs a warning and falls back to the threaded scheduler (no dashboard, same result). A cluster you start yourself (make_local_cluster) is used as-is instead.
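
For readers who want to reproduce this setup by hand, a 1-worker / 1-thread in-process cluster can be started with dask.distributed roughly as below. The function name start_dashboard_client and the wording of the warning are assumptions for the example; patchworks' internal code may differ.

```python
import logging

log = logging.getLogger(__name__)


def start_dashboard_client():
    """Return (client, cluster), or None when dask.distributed is missing."""
    try:
        from dask.distributed import Client, LocalCluster
    except ImportError:
        # Same result without a dashboard: the caller keeps the threaded scheduler.
        log.warning("distributed is not installed; no dashboard for this run")
        return None
    # processes=False keeps the single worker inside the current process;
    # one worker with one thread means tasks are evaluated strictly serially.
    cluster = LocalCluster(n_workers=1, threads_per_worker=1, processes=False)
    client = Client(cluster)
    log.info("Dask dashboard for this run: %s", client.dashboard_link)
    return client, cluster


handle = start_dashboard_client()
if handle is not None:
    client, cluster = handle
    print(client.dashboard_link)
    client.close()
    cluster.close()
```

Because the worker shares the process with the caller, GPU state and loaded models stay where they are; only the scheduling and the dashboard are added.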

Overriding the worker count

from patchworks import tile_process

# let patchworks pick (recommended)
tile_process("scan.zarr", fn)

# or cap it yourself (staging threads + merge processes)
tile_process("scan.zarr", fn, max_workers=8)

max_workers bounds both staging and merging. A running distributed client manages its own concurrency, so the override is skipped there — configure the cluster's memory limits instead.

Why it won't OOM or freeze

Resource	Guard
RAM	concurrent tiles × tile size × overhead ≤ 80 % of available RAM
VRAM	GPU path runs one tile at a time
CPU	always leaves at least one core free
Disk I/O	each pyramid/stage level is streamed chunk-by-chunk; no whole volume in memory

The staging graph itself is kept small — a single fused map_overlap (halo → fn → trim) rather than three separate passes — and there is no extra read-back of the staged data.
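
To make the halo → fn → trim idea concrete, here is a minimal one-dimensional sketch in plain Python. It stands in for what a single overlap-aware pass does per tile and is not Dask or patchworks code; smooth3 and process_tiled are hypothetical names chosen for the example.

```python
def smooth3(values):
    """3-point mean with clamped edges; outputs at a cut edge are unreliable."""
    n = len(values)
    return [
        (values[max(i - 1, 0)] + values[i] + values[min(i + 1, n - 1)]) / 3.0
        for i in range(n)
    ]


def process_tiled(data, tile, halo, fn):
    """Apply fn tile by tile, doing halo, fn and trim in one pass per tile."""
    out = []
    for start in range(0, len(data), tile):
        stop = min(start + tile, len(data))
        lo = max(start - halo, 0)            # halo: extend the read window
        hi = min(stop + halo, len(data))
        result = fn(data[lo:hi])             # fn: evaluate once on the padded tile
        offset = start - lo
        out.extend(result[offset:offset + (stop - start)])  # trim the halo
    return out


data = [float(x * x) for x in range(10)]
# With a halo as wide as fn's reach, tiling is invisible in the result.
print(process_tiled(data, tile=4, halo=1, fn=smooth3) == smooth3(data))
# Without a halo, values at tile seams are computed from clamped neighbours.
print(process_tiled(data, tile=4, halo=0, fn=smooth3) == smooth3(data))
```

The halo only needs to be as wide as the neighbourhood fn looks at; anything larger costs extra reads without changing the trimmed output.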

Getting more speed

tile_shape="auto" sizes tiles to free RAM (or VRAM with use_gpu=True).
skip_empty=True with estimate_empty_tiles() skips background tiles. estimate_empty_tiles() is only a preview that samples a centred window, so use build_occupancy_map() + tile_occupancy() when the result must serve as an authoritative skip list: they are exact, and they are what the Snakemake workflow uses.
A Dask distributed cluster (make_local_cluster) parallelises across workers/GPUs; patchworks then defers concurrency to the cluster.

What doesn't help here

The merge and relabel steps are already vectorised NumPy + SciPy (C-level) with no per-voxel Python loop, and the pipeline is I/O-bound — so numba, cupy, arrow and xarray bring essentially nothing. The real levers are tile size, concurrency (above) and zarr chunking.