
Choosing a metric

A run yields up to four numbers, not one. peak is the default and the headline; allocated and allocations often catch what peak hides; and rss — opt-in — is the whole-process physical peak. Pick by the question you're asking:

| Metric | What it is | Reach for it when |
| --- | --- | --- |
| peak | high-water of live bytes: the most held at once | headline footprint; "how big did it get?" |
| allocated | sum of every allocation over the run | churn / temporary spikes peak smooths over |
| allocations | count of allocation calls | a near-deterministic, low-noise CI tripwire |
| rss (opt-in) | whole-process resident high-water, OS-level, interpreter baseline included | "will it fit in RAM?": OOM/capacity headroom, the physical number a logical heap can't give |

The first three come from one memray pass — your code's allocator demand, in-process and byte-exact, so they see native (numpy / C-extension) allocations, not just Python objects. rss is the OS counterpart: the whole process's resident peak (interpreter + arenas + extensions), the figure the OOM killer actually watches. It rides a separate isolated pass and is opt-in per test — mark the build-plus-operate benchmarks you want it on with @pytest.mark.benchmem(isolate=True) (there's deliberately no suite-wide flag; see the marker reference for the picklable-callable rule).

Three readings of one run

Take a workload that allocates a lot of temporary memory but holds little at its peak — the place peak and allocated diverge most:

# test_churn.py
def test_churn(benchmark):
    def work():
        total = 0
        for _ in range(200):
            total += sum([i * i for i in range(20_000)])
        return total
    benchmark(work)

Show all three in the table at once with --benchmark-memory-columns:

pytest test_churn.py --benchmark-only --benchmark-memory --benchmark-memory-columns=peak,allocated,allocations

Name (time in ms)        Min │ peak (MiB)  allocated (MiB)  allocations
──────────────────────────────────────────────────────────────────────────────
test_churn           73.9735 │       1.16           294.07        9,001

memory (right of │): a separate, untimed pass, not the timed rounds

peak stays small (one list lives at a time) while allocated is far larger (every list summed) and allocations counts the calls. A peak gate would wave this churn through; allocated catches it.
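To make the three definitions concrete, here is a minimal sketch that folds a made-up allocation log into the same three numbers. It is not how memray or pytest-benchmem track memory; the `Metrics` class, the `summarize` helper, and the event format are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    peak: int         # high-water mark of live bytes
    allocated: int    # sum of every allocation's size
    allocations: int  # number of allocation calls


def summarize(events):
    """Fold ("alloc" | "free", size_in_bytes) events into the three metrics.

    Hypothetical helper for illustration; not part of any library.
    """
    live = peak = allocated = allocations = 0
    for kind, size in events:
        if kind == "alloc":
            live += size
            allocated += size
            allocations += 1
            peak = max(peak, live)
        elif kind == "free":
            live -= size
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return Metrics(peak, allocated, allocations)


# Churn: 200 temporaries of 1,000 bytes, each freed before the next is made.
churn = [("alloc", 1_000), ("free", 1_000)] * 200
# Hoard: the same 200 allocations, but nothing is freed.
hoard = [("alloc", 1_000)] * 200

print(summarize(churn))
print(summarize(hoard))
```

Both logs make the same 200 calls and allocate the same total, so allocated and allocations agree; only peak tells them apart. Conversely, a change that doubles the churn while keeping one temporary alive at a time would move allocated and allocations but leave peak untouched.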

Where to go next

  • See these metrics in a delta table or plot, and gate CI on them → Compare & gate CI
  • Every flag and the blob schema → Reference

Going further

Picking one for a CI gate

allocations is often the best tripwire — it's near-deterministic, so a change there is almost always real behaviour, not measurement noise. peak answers the capacity question; allocated catches churn regressions a peak gate would miss. You can gate on several at once — see Compare & gate CI.
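As a rough illustration of gating on several metrics at once, the sketch below compares a baseline against a new run with a per-metric relative tolerance. Everything here is an assumption for the sake of the example: the `gate` function, the tolerance values, and the `head` numbers are invented, and this is not the plugin's own gating mechanism (that is covered in Compare & gate CI).

```python
def gate(base, head, tolerances):
    """Return a list of failure messages; an empty list means the gate passes.

    Hypothetical helper: `tolerances` maps a metric name to the largest
    relative growth (0.10 == +10%) allowed before the gate fails.
    """
    failures = []
    for metric, limit in tolerances.items():
        before, after = base[metric], head[metric]
        if before == 0:
            growth = float("inf") if after > 0 else 0.0
        else:
            growth = (after - before) / before
        if growth > limit:
            failures.append(
                f"{metric}: {before} -> {after} (+{growth:.1%}, limit +{limit:.0%})"
            )
    return failures


# Baseline taken from the test_churn run above; the head values are made up
# to model a change that doubles the churn without raising the peak.
base = {"peak": 1_221_536, "allocated": 308_357_376, "allocations": 9_001}
head = {"peak": 1_221_536, "allocated": 616_714_752, "allocations": 18_001}

# Illustrative tolerances: tight on the near-deterministic count, looser on bytes.
tolerances = {"peak": 0.10, "allocated": 0.10, "allocations": 0.01}

for message in gate(base, head, tolerances):
    print("FAIL", message)
```

With these numbers a peak-only gate passes, while the allocated and allocations checks both fail, which is the case the paragraph above describes.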

Repeats & adaptive sampling

Before any measured pass, an untracked warmup run sheds one-time costs; each benchmark is then memory-profiled more than once, and the headline peak is the minimum across those passes — the cleanest floor. How many passes run is adaptive by default: pytest-benchmem keeps measuring until that floor settles, then stops.

Warmup, then the minimum

Peak memory has two sources of run-to-run variation, and they're handled separately:

  • One-time cold-start — lazy imports, first-touch caches, the initial arena allocation — inflates only the very first run in a process (it can be much larger — e.g. ~2.7 GiB vs a ~1.4 GiB steady write). An untracked warmup run pays it before measuring, so it never lands in the recorded series (keeping max / whiskers / --stat honest).
  • A cold → warm plateau. Even after warmup, the first measured write lands on a clean heap (≈ the fresh-process cost) while repeated writes grow/fragment the allocator's arenas and plateau ~7% higher. These passes are not i.i.d. samples of one number — there are two regimes — so a central stat (mean or median) reports the warm plateau, ~7% above the cold cost, and that bias grows with the pass count. The minimum picks the cold regime: the reproducible, fresh-process floor, independent of how many passes ran.
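The bias described in the second bullet can be seen with a toy series: one cold pass followed by warm passes about 7% higher. The numbers are illustrative units, not measurements, and `series` is a hypothetical helper.

```python
import statistics

COLD = 1000  # first measured pass on a clean heap (illustrative units)
WARM = 1070  # later passes, about 7% higher once the arenas have grown


def series(passes):
    """One cold pass followed by (passes - 1) warm passes."""
    return [COLD] + [WARM] * (passes - 1)


for n in (2, 5, 10):
    s = series(n)
    print(
        f"passes={n:2d}  min={min(s)}  "
        f"mean={statistics.mean(s)}  median={statistics.median(s)}"
    )
```

The minimum is the cold value for every pass count, while the mean creeps toward the warm plateau as passes are added and the median sits on the plateau from three passes onward.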

Want the warm plateau?

The warm steady-state (what a long-lived process that reuses the operation actually holds) is a legitimate but different quantity — and it's still right there in the series. Ask for it with --stat mean / --stat median. The headline just defaults to the floor because that's the reproducible one for regression gating.

How it decides

The loop runs a pass, then asks whether to run another:

  • Always take at least 2. A single pass can't show any run-to-run spread.
  • Stop when the floor settles — once 2 consecutive passes set no new low, the min has converged and more passes won't lower it.
  • Never exceed 10 passes (a hard cap), or an optional --benchmark-memory-max-time wall-clock budget, whichever comes first.

In practice: deterministic code settles in ~3 passes; noisy code runs more, up to the cap — exactly where extra passes pay off.
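The stopping rule above can be sketched in a few lines. This is a reading of the three rules as stated, not the plugin's source: the function name, parameter names, and the choice to apply the time budget only after the two mandatory passes are all assumptions.

```python
import time


def adaptive_floor(measure, min_passes=2, patience=2, max_passes=10, max_time=None):
    """Call measure() repeatedly; return (floor, series).

    Sketch of the rules described above:
    - always take at least `min_passes`;
    - stop once `patience` consecutive passes set no new low;
    - never exceed `max_passes`, or `max_time` seconds if given.
    """
    series = []
    floor = None
    stale = 0  # consecutive passes that did not lower the floor
    start = time.monotonic()
    while True:
        value = measure()
        series.append(value)
        if floor is None or value < floor:
            floor, stale = value, 0
        else:
            stale += 1

        passes = len(series)
        if passes < min_passes:
            continue
        if stale >= patience or passes >= max_passes:
            break
        # Assumption: the wall-clock budget is only checked after min_passes.
        if max_time is not None and time.monotonic() - start >= max_time:
            break
    return floor, series


# Deterministic workload: every pass reports the same peak.
print(adaptive_floor(lambda: 100))

# Noisy workload: a lower value shows up on the third pass.
noisy = iter([5, 7, 4, 6, 6, 9, 9, 9, 9, 9])
print(adaptive_floor(lambda: next(noisy)))
```

Under these rules a deterministic workload stops after three passes (one that sets the floor, two that fail to lower it), and a workload that keeps finding new lows runs until the cap.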

The warmup defaults to one run; tune it suite-wide with --benchmark-memory-warmup=N (or per test with @pytest.mark.benchmem(warmup=N)), and set 0 to disable it.

Adaptive sampling, not calibration

The UX echoes pytest-benchmark — you don't pick a count — but the mechanism is different, so it's worth being precise. pytest-benchmark calibrates: it runs the function many times to tune its measurement against the timer's coarse resolution. memray has no such problem — a single pass gives the exact peak for that run. So passes here aren't fighting measurement error; they're sequential sampling to find the floor and measure its spread. That's why the minimum is just 2, not pytest-benchmark's min_rounds=5. A fixed count would be simultaneously too many for deterministic code (every pass identical → wasted runtime) and too few for noisy code (a rare floor needs many tries to catch) — hence: adapt.

Forcing a fixed count

For reproducible numbers — CI gating against a saved baseline, where the pass count must not vary run-to-run — pin it:

pytest --benchmark-memory --benchmark-memory-repeats=5     # suite-wide, exactly 5 passes

@pytest.mark.benchmem(repeats=5)        # per test; overrides the suite-wide flag
def test_build(benchmark_memory):
    ...

An explicit count runs exactly that many passes — no adaptation, no cap, no time budget.

Bounding cost

Adaptive sampling runs the action up to 10 times, so its cost is up to 10× the action's runtime. For a benchmark that's both slow and noisy (so it never settles and keeps hitting the cap), cap the wall-clock instead:

pytest --benchmark-memory --benchmark-memory-max-time=3    # ≤3 s of adaptive sampling per benchmark

Noisy peak? Remove the noise, don't average it

A wide peak·min to peak·max spread usually means the workload allocates many short-lived temporaries whose high-water mark depends on timing. This is classic with multithreaded allocators (e.g. a library that builds output through polars, whose worker pool holds a varying number of buffers at once). Pinning the source of non-determinism (POLARS_MAX_THREADS=1) collapses the spread far more effectively, and more honestly, than piling on repeats. Extra passes only sharpen the min; they can't make a genuinely bursty workload reproducible.

A systematic gap is different: if peak·min sits near the cold floor and peak·max is a roughly constant step above it (rather than scattered), that's the cold→warm arena plateau from Warmup, then the minimum — deterministic, not noise, so thread-pinning won't close it and it needs no fixing. The headline min is still the floor.

Reading the distribution

Every pass is kept as a flat series in the blob, so any stat is available after the fact. Ask for one over the series with --stat:

benchmem compare base.json head.json --columns peak --stat stddev   # how noisy is peak?
benchmem compare v1.json v2.json --columns allocated --stat mean

--stat takes min / max / mean / median / stddev and applies to any metric. Peak is the noisy one (GC timing, page cache); stddev tells you how much.

The terminal table shows the spread too: when a benchmark ran more than once, every shown metric expands into min / mean / max columns (peak·min, peak·mean, peak·max) — always, so the columns don't shift between runs; a single (forced) pass stays one column. The table shows peak only by default; add the rest with --benchmark-memory-columns=peak,allocated,allocations and pick the spread stats with --benchmark-memory-stats=min,stddev.

Scaling plots show the spread too: benchmem plot --view scaling draws each point at the headline (the min floor) with min-to-max whiskers reaching up to the worst pass, so a noisy series is visible at a glance. The whiskers are drawn wherever there's spread; --band minmax forces them on, --band none hides them.

The raw blob

Each measured benchmark stores all three as flat per-repeat series under extra_info["benchmem"] — one entry per memray pass. Every reported number derives from these:

{"peak_bytes": [1221536], "allocations": [9001], "total_bytes": [308357376]}
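As a quick check that the table really derives from these series, the sketch below parses the blob shown above and recomputes the test_churn row. Two things are assumptions here: that total_bytes is the series behind the allocated column (the byte count matches 294.07 MiB), and that the headline for every metric is the minimum of its series (the text only states this for peak; with a single pass it makes no difference).

```python
import json

# The blob exactly as shown above: one entry per memray pass.
blob = json.loads(
    '{"peak_bytes": [1221536], "allocations": [9001], "total_bytes": [308357376]}'
)

MIB = 1024 * 1024


def headline(series):
    """Assumed headline stat: the minimum across passes."""
    return min(series)


peak_mib = headline(blob["peak_bytes"]) / MIB
allocated_mib = headline(blob["total_bytes"]) / MIB
allocations = headline(blob["allocations"])

row = f"peak {peak_mib:.2f} MiB, allocated {allocated_mib:.2f} MiB, allocations {allocations:,}"
print(row)  # peak 1.16 MiB, allocated 294.07 MiB, allocations 9,001
```

The same approach extends to multi-pass series: swap `min` for `statistics.mean`, `statistics.median`, or another function to get a different stat from the stored data.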