Compare & gate CI¶
Two (or more) saved runs in, one comparison out: a table (benchmem compare) or an
interactive view (benchmem plot). The table shows time and peak across every stat by
default (pick metrics with --columns, a stat with --stat); the plot takes a single metric
(--columns). Both group by the dims your tests carry.
A regression to catch¶
A benchmark builds a table of (i, i²) rows. On main, each row is a lightweight tuple:
import pytest

@pytest.mark.parametrize("n", [10_000, 50_000, 200_000, 500_000])
def test_build_rows(benchmark, n):
    benchmark(lambda: [(i, i * i) for i in range(n)])
A branch switches the rows to dicts for readability — a classic memory regression, since a dict is several times heavier than a 2-tuple:
@pytest.mark.parametrize("n", [10_000, 50_000, 200_000, 500_000])
def test_build_rows(benchmark, n):
    benchmark(lambda: [{"x": i, "sq": i * i} for i in range(n)])
Run each on its branch and save the --benchmark-json — here, baseline.json (main) and
candidate.json (the branch).
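To get a feel for why the dict rows are heavier before running anything, a quick standard-library check is enough. This is a minimal sketch, separate from benchmem itself: it assumes CPython, measures only the shallow container size (not the integers inside), and the exact byte counts vary by Python version.

```python
import sys

# One row in each representation, as in the two benchmark variants above.
tuple_row = (3, 9)
dict_row = {"x": 3, "sq": 9}

# getsizeof reports the shallow size of the container only (CPython-specific).
tuple_bytes = sys.getsizeof(tuple_row)
dict_bytes = sys.getsizeof(dict_row)

print(f"tuple row: {tuple_bytes} B, dict row: {dict_bytes} B")
print(f"dict/tuple ratio: {dict_bytes / tuple_bytes:.1f}x")
```

The per-row gap is what the benchmark multiplies by n, which is why the peak columns below move so much more than the timing columns.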
benchmem compare — the comparison table¶
Modelled on pytest-benchmark's own table: one row per (benchmark × run), columns are
metric × stat, and each cell carries a relative (N.NN) multiplier vs the column's best
run (best green, worst red). Rows are grouped into sub-tables by --group-by (default
fullname, so each benchmark's runs sit together and the multiplier reads as the cross-run
ratio). By default it shows time and peak, each across the full stat spread
(min/max/mean/median/stddev) — so no single statistic is privileged:
!benchmem compare {baseline} {candidate}
test_rows.py::test_build_rows[10000]
             time (s)          time (s)          time (s)          time (s)          time (s)            peak (KiB)        peak (KiB)        peak (KiB)        peak (KiB)        peak (B)
name         min               max               mean              median            stddev            │ min               max               mean              median            stddev
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(baseline)   0.001167 (1.0)    0.001525 (1.0)    0.001208 (1.0)    0.001193 (1.0)    4.264e-05 (1.0)   │ 83.12 (1.0)       83.12 (1.0)       83.12 (1.0)       83.12 (1.0)       0
(candidate)  0.001787 (1.53)   0.002441 (1.60)   0.002061 (1.71)   0.002047 (1.72)   6.096e-05 (1.43)  │ 2,131.12 (25.64)  2,131.12 (25.64)  2,131.12 (25.64)  2,131.12 (25.64)  0

test_rows.py::test_build_rows[200000]
             time (s)          time (s)          time (s)          time (s)          time (s)            peak (MiB)        peak (MiB)        peak (MiB)        peak (MiB)        peak (B)
name         min               max               mean              median            stddev            │ min               max               mean              median            stddev
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(baseline)   0.03563 (1.0)     0.0368 (1.0)      0.03614 (1.0)     0.03608 (1.0)     0.0003251 (1.0)   │ 23.55 (1.0)       23.55 (1.0)       23.55 (1.0)       23.55 (1.0)       0
(candidate)  0.05537 (1.55)    0.05739 (1.56)    0.05604 (1.55)    0.05587 (1.55)    0.000582 (1.79)   │ 46.55 (1.98)      46.55 (1.98)      46.55 (1.98)      46.55 (1.98)      0

test_rows.py::test_build_rows[500000]
             time (s)          time (s)          time (s)          time (s)          time (s)            peak (MiB)        peak (MiB)        peak (MiB)        peak (MiB)        peak (KiB)
name         min               max               mean              median            stddev            │ min               max               mean              median            stddev
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(baseline)   0.09234 (1.0)     0.09472 (1.0)     0.09369 (1.0)     0.09367 (1.0)     0.000823 (1.0)    │ 60.97 (1.0)       60.97 (1.0)       60.97 (1.0)       60.97 (1.0)       0.00
(candidate)  0.1446 (1.57)     0.1472 (1.55)     0.1454 (1.55)     0.145 (1.55)      0.000942 (1.14)   │ 119.97 (1.97)     120.97 (1.98)     120.22 (1.97)     119.97 (1.97)     443.41

test_rows.py::test_build_rows[50000]
             time (s)          time (s)          time (s)          time (s)          time (s)            peak (MiB)        peak (MiB)        peak (MiB)        peak (MiB)        peak (B)
name         min               max               mean              median            stddev            │ min               max               mean              median            stddev
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(baseline)   0.007577 (1.0)    0.008282 (1.0)    0.007768 (1.0)    0.007745 (1.0)    0.0001223 (1.0)   │ 4.42 (1.0)        4.42 (1.0)        4.42 (1.0)        4.42 (1.0)        0
(candidate)  0.01185 (1.56)    0.01284 (1.55)    0.01204 (1.55)    0.01199 (1.55)    0.0001808 (1.48)  │ 10.42 (2.36)      10.42 (2.36)      10.42 (2.36)      10.42 (2.36)      0
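The (N.NN) annotation in each cell can be reproduced by hand. The sketch below is an illustration rather than benchmem's actual code: `relative_to_best` is a hypothetical helper, and it assumes "best" means the smallest value in the column, which fits time and peak.

```python
def relative_to_best(values):
    """Map each run label to (value, ratio vs the smallest value in the column)."""
    best = min(values.values())  # assumption: lower is better for this metric
    return {label: (v, round(v / best, 2)) for label, v in values.items()}


# Peak (KiB) for test_build_rows[10000], taken from the table above.
peak_kib = {"baseline": 83.12, "candidate": 2131.12}
cells = relative_to_best(peak_kib)

for label, (value, ratio) in cells.items():
    print(f"({label}) {value:,.2f} ({ratio})")
```

With the values from the first sub-table this gives 1.0 for the baseline and 25.64 for the candidate, matching the peak columns.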
Pick the metrics with --columns (a comma list of time / peak / allocated /
allocations; a metric absent from every run is dropped) and the stat with --stat
(min / max / mean / median / stddev, or all):
!benchmem compare {baseline} {candidate} --columns peak --stat min
test_rows.py::test_build_rows[10000]
             peak (KiB)
name         min
──────────────────────────────
(baseline)   83.12 (1.0)
(candidate)  2,131.12 (25.64)

test_rows.py::test_build_rows[200000]
             peak (MiB)
name         min
──────────────────────────────
(baseline)   23.55 (1.0)
(candidate)  46.55 (1.98)

test_rows.py::test_build_rows[500000]
             peak (MiB)
name         min
──────────────────────────────
(baseline)   60.97 (1.0)
(candidate)  119.97 (1.97)

test_rows.py::test_build_rows[50000]
             peak (MiB)
name         min
──────────────────────────────
(baseline)   4.42 (1.0)
(candidate)  10.42 (2.36)
--group-by follows pytest-benchmark's grammar (fullname | name | func | group |
module | class | param:NAME, comma-composable); pass it to cluster param-variants
(--group-by func) or collapse everything into one table.
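To make the grouping keys concrete, here is a rough sketch of how a few of them could be derived from a pytest node id. `group_key` is a hypothetical helper, not part of benchmem or pytest-benchmark, and it covers only the keys that can be read off the id string; the real tools work from richer metadata.

```python
import re


def group_key(fullname, by):
    """Derive a grouping key from a pytest node id (illustrative only)."""
    parts = fullname.split("::")
    module, name = parts[0], parts[-1]
    func = re.sub(r"\[.*\]$", "", name)  # strip the [param-id] suffix
    keys = {"fullname": fullname, "name": name, "func": func, "module": module}
    return keys[by]


node = "test_rows.py::test_build_rows[10000]"
for by in ("fullname", "name", "func", "module"):
    print(f"{by:>8}: {group_key(node, by)}")
```

Grouping by func puts all four sizes of test_build_rows into one sub-table, whereas the default fullname keeps one sub-table per size.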
One run, two configs — --pivot¶
Everything above compares run-files: compare a.json b.json lays one file against
another. But the file is just the series axis — the thing whose values sit side by side and
get the (N.NN) multiplier. The series axis is a dim, and the run-file is only
the default one.
--pivot DIM re-points it at a data dim, so a single combined run is A/B'd along that dim —
the role a file-pair plays today, without splitting into N runs. Say semantics is a param
(so it's in the node id, test_build[legacy-100], and a dim):
@pytest.mark.parametrize("n", [100, 200])
@pytest.mark.parametrize("semantics", ["legacy", "v1"])
def test_build(benchmark, semantics, n):
    # pytest puts the decorator nearest the function first in the id: test_build[legacy-100]
    benchmark(build, semantics=semantics, n=n)
One pytest invocation produces one build.json; --pivot param:semantics folds it so rows
that differ only in semantics pair up (legacy ↔ v1) per n:
pytest benchmarks/ --benchmark-only --benchmark-memory --benchmark-json=build.json
benchmem compare build.json --pivot param:semantics --columns time,peak # the A/B table
benchmem plot build.json --pivot param:semantics --columns peak --view compare # the A/B bars
benchmem plot build.json --x n --facet semantics --columns peak # --pivot optional here
Don't confuse it with --group-by, which touches the other axis: --group-by partitions
rows into sub-tables along a dim while the compared series stays the run-files (legacy and
v1 would sit in separate sections, never set against each other); --pivot makes the dim
itself the compared series, folding rows that differ only in it into one ranked A/B row. They
compose on orthogonal axes — --pivot param:semantics --group-by node.func folds legacy ↔ v1
and clusters the folded rows into per-function sub-tables. In short: --group-by says which
rows belong together; --pivot says what you're comparing.
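The folding step can be pictured as a group-by on every dim except the pivoted one. The sketch below is an assumption-laden illustration, not benchmem's implementation: `fold_on`, the row layout, and the peak numbers are all invented for the example.

```python
from collections import defaultdict


def fold_on(rows, dim):
    """Pair up rows that differ only in `dim`: {other dims: {dim value: peak}}."""
    folded = defaultdict(dict)
    for row in rows:
        # The remaining dims identify the folded row; the pivoted dim names the series.
        rest = tuple(sorted((k, v) for k, v in row["dims"].items() if k != dim))
        folded[rest][row["dims"][dim]] = row["peak"]
    return dict(folded)


# Hypothetical measurements from one combined run (values are made up).
rows = [
    {"dims": {"semantics": "legacy", "n": 100}, "peak": 10.0},
    {"dims": {"semantics": "v1", "n": 100}, "peak": 12.0},
    {"dims": {"semantics": "legacy", "n": 200}, "peak": 20.0},
    {"dims": {"semantics": "v1", "n": 200}, "peak": 26.0},
]

pairs = fold_on(rows, "semantics")
for rest, series in pairs.items():
    print(dict(rest), series)
```

Each folded row now carries a legacy value and a v1 value side by side, which is the same shape a baseline/candidate file pair produces.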
--pivot accepts the same dims as --group-by/--facet — param:NAME or a bare extra_info
name — and composes with --columns, --stat, --facet, --where, and --sort unchanged.
It's mutually exclusive with multiple runs (an A/B view has one series axis: comparing files
and folding a dim would be a 2-D matrix), and only the paired views consume it — on plot
it applies to --view compare/scatter; for scaling/sweep the dim is a normal --x /
--facet axis instead. One combined run then drives the A/B table, the A/B plot, the scaling
plot, and an external per-id gate (e.g. CodSpeed, which wants the config value in the node id
of one run) — no file-splitting, no per-config reruns.
--fail-on follows the same axis: it normally gates the first run vs the last, but with
--pivot it gates the first dim value vs the last (e.g. legacy → v1) out of the one
run — so the combined run also gates itself in CI, no second file required:
# fail if v1's peak grew >10% over legacy, for any benchmark, from one run:
pytest benchmarks/ --benchmark-only --benchmark-memory --benchmark-json=build.json
benchmem compare build.json --pivot param:semantics --fail-on peak:10%
The base (the --fail-on reference, and the first row) is the dim's first value in
parametrize / collection order — the --pivot analogue of file argument order. So
parametrize("semantics", ["legacy", "v1"]) gates v1 against legacy; reorder the list to
flip it. (The table's (1.0) annotation is separate — it marks each column's best value,
green, exactly as the run-file table does, not the base.)
A --pivot param is lifted out of the id when rows fold (test_build[legacy-100] → test_build[100]), matched by value. A param given a custom pytest.param(id=…) whose label differs from its value won't fold; use a plain parametrize value, or an extra_info dim (which isn't in the id at all), for the comparison axis.
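The matching-by-value rule is easier to see in code. This is a naive sketch under stated assumptions: `lift_param` is a hypothetical helper, and it splits the id on "-", which would mis-handle values that themselves contain a dash.

```python
def lift_param(node_id, value):
    """Drop one param value from the bracketed part of a pytest id, matching by value."""
    head, sep, tail = node_id.partition("[")
    if not sep or not tail.endswith("]"):
        return node_id  # no parametrized suffix to edit
    parts = tail[:-1].split("-")
    if str(value) not in parts:
        return None  # the id label differs from the value, so nothing folds
    parts.remove(str(value))
    return f"{head}[{'-'.join(parts)}]" if parts else head


print(lift_param("test_build[legacy-100]", "legacy"))  # test_build[100]
print(lift_param("test_build[old-100]", "legacy"))     # None
```

The second call mimics a pytest.param("legacy", id="old"): the value never appears in the id, so there is nothing to lift.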
Gate CI on a regression¶
--fail-on exits non-zero past a threshold — drop it into CI after the run. Thresholds are
percent (peak:10%) or absolute (peak:5MiB), on peak, allocated, or allocations
(repeatable):
# on the PR branch, against a baseline saved from main:
pytest --benchmark-only --benchmark-memory --benchmark-json=pr.json
benchmem compare main.json pr.json --fail-on peak:10% --fail-on allocations:5%
The dict rows blow past the threshold on every size, so the offending ids print and it exits
1 — failing the CI job:
!benchmem compare {baseline} {candidate} --columns peak --fail-on peak:10% --fail-on allocations:5%; echo "exit: $?"
test_rows.py::test_build_rows[10000]
             peak (KiB)        peak (KiB)        peak (KiB)        peak (KiB)        peak (B)
name         min               max               mean              median            stddev
─────────────────────────────────────────────────────────────────────────────────────────────
(baseline)   83.12 (1.0)       83.12 (1.0)       83.12 (1.0)       83.12 (1.0)       0
(candidate)  2,131.12 (25.64)  2,131.12 (25.64)  2,131.12 (25.64)  2,131.12 (25.64)  0

test_rows.py::test_build_rows[200000]
             peak (MiB)        peak (MiB)        peak (MiB)        peak (MiB)        peak (B)
name         min               max               mean              median            stddev
─────────────────────────────────────────────────────────────────────────────────────────────
(baseline)   23.55 (1.0)       23.55 (1.0)       23.55 (1.0)       23.55 (1.0)       0
(candidate)  46.55 (1.98)      46.55 (1.98)      46.55 (1.98)      46.55 (1.98)      0

test_rows.py::test_build_rows[500000]
             peak (MiB)        peak (MiB)        peak (MiB)        peak (MiB)        peak (KiB)
name         min               max               mean              median            stddev
─────────────────────────────────────────────────────────────────────────────────────────────
(baseline)   60.97 (1.0)       60.97 (1.0)       60.97 (1.0)       60.97 (1.0)       0.00
(candidate)  119.97 (1.97)     120.97 (1.98)     120.22 (1.97)     119.97 (1.97)     443.41

test_rows.py::test_build_rows[50000]
             peak (MiB)        peak (MiB)        peak (MiB)        peak (MiB)        peak (B)
name         min               max               mean              median            stddev
─────────────────────────────────────────────────────────────────────────────────────────────
(baseline)   4.42 (1.0)        4.42 (1.0)        4.42 (1.0)        4.42 (1.0)        0
(candidate)  10.42 (2.36)      10.42 (2.36)      10.42 (2.36)      10.42 (2.36)      0

8 regression(s) over threshold:
  peak test_build_rows[10000]: 83.1 KiB → 2.08 MiB (+2463.8%)
  peak test_build_rows[200000]: 23.5 MiB → 46.5 MiB (+97.7%)
  peak test_build_rows[500000]: 61 MiB → 120 MiB (+96.8%)
  peak test_build_rows[50000]: 4.42 MiB → 10.4 MiB (+135.6%)
  allocations test_build_rows[10000]: 39 → 41 (+5.1%)
  allocations test_build_rows[200000]: 86 → 109 (+26.7%)
  allocations test_build_rows[500000]: 129 → 188 (+45.7%)
  allocations test_build_rows[50000]: 57 → 63 (+10.5%)
exit: 1
allocations is usually the steadiest tripwire — see Choosing a metric.
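The two threshold forms, percent and absolute, boil down to a small parse-then-compare step. The sketch below shows one way that logic could look; `parse_threshold`, `exceeds`, and the unit table are assumptions made for illustration, not benchmem's real parser.

```python
import re

# Binary size units; a bare number is treated as a plain count or bytes.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_threshold(spec):
    """Split 'peak:10%' or 'peak:5MiB' into (metric, kind, amount)."""
    metric, _, limit = spec.partition(":")
    if limit.endswith("%"):
        return metric, "percent", float(limit[:-1])
    match = re.fullmatch(r"([\d.]+)\s*([KMG]iB|B)?", limit)
    if match is None:
        raise ValueError(f"unrecognised threshold: {spec!r}")
    number, unit = match.groups()
    return metric, "absolute", float(number) * UNITS[unit or "B"]


def exceeds(base, new, kind, amount):
    """True when the growth from base to new is past the threshold."""
    growth = new - base
    if kind == "percent":
        return base > 0 and growth / base * 100 > amount
    return growth > amount


metric, kind, amount = parse_threshold("peak:10%")
# Peak (KiB) for test_build_rows[10000] from the table above.
print(metric, exceeds(83.12, 2131.12, kind, amount))
```

A CI wrapper would exit non-zero as soon as any benchmark id returns True for any of the repeated thresholds.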
benchmem plot — interactive views¶
benchmem plot writes an interactive plotly view to standalone HTML, picking the view by run
count. Scaling (one run) draws cost vs. input size — the baseline's peak-memory curve:
benchmem plot baseline.json --columns peak -o scaling.html
Scatter (two runs) puts baseline cost on x (log) and the candidate/baseline ratio on y, colour = absolute Δ. The top-right corner is "big and got bigger" — where a regression actually costs you:
benchmem plot baseline.json candidate.json --columns peak -o scatter.html
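For readers who want the scatter's geometry spelled out, each benchmark id maps to one point as sketched below. `scatter_point` is a hypothetical helper, not a benchmem function; the log scaling is applied by the x axis itself, so the function returns the raw baseline value.

```python
def scatter_point(base, cand):
    """One point per benchmark id: (x = baseline cost, y = ratio, colour = absolute delta)."""
    return base, cand / base, cand - base


# Peak (MiB) minimums from the comparison table above.
peaks = {
    "test_build_rows[50000]": (4.42, 10.42),
    "test_build_rows[500000]": (60.97, 119.97),
}
points = {name: scatter_point(b, c) for name, (b, c) in peaks.items()}

for name, (x, y, delta) in points.items():
    print(f"{name}: x={x} MiB, ratio={y:.2f}, delta={delta:.2f} MiB")
```

The 50000 case has the larger ratio, but the 500000 case sits further right and carries the larger absolute delta, which is the distinction the plot is built to surface.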
Where to go next¶
- Which metric to gate on → Choosing a metric
- Compare across versions of a package → Cross-version sweeps
- Every CLI flag and option → Reference
Going further¶
For timing comparisons you can also use pytest-benchmark's own tooling directly —
pytest-benchmark compare, --benchmark-histogram. pytest-benchmem doesn't reimplement those; it adds the memory-aware, dims-aware views. Add time to --columns (or use --columns time on the plot) to put both on the same footing.
More from compare¶
Order rows with --sort (name | value — largest last-run first — | change), and write
raw numbers for another tool with --csv out.csv:
benchmem compare baseline.json candidate.json --columns peak --sort value --csv peak.csv
Gating without separate files¶
The approach above keeps two JSON files. Alternatively, gate inline against
pytest-benchmark's own storage — save a baseline once, then fail the next run against it.
--benchmark-memory-compare-fail implies --benchmark-memory-compare:
# on main — record the baseline into .benchmarks/ storage:
pytest --benchmark-only --benchmark-memory --benchmark-save=main
# on the PR branch — fail if peak grows >10% vs that baseline:
pytest --benchmark-only --benchmark-memory --benchmark-memory-compare-fail=peak:10%
Without a prior saved run, the inline gate is a no-op — it prints "no prior run with memory to compare against" and passes. Save a baseline first.
Profile the offenders¶
A peak +20% number says that a benchmark regressed, not where. Add
--benchmark-memory-profile DIR to keep the memray profile (.bin) for each regressing id,
so you can render the allocating call paths after the fact:
pytest --benchmark-only --benchmark-memory \
--benchmark-memory-compare-fail=peak:10% \
--benchmark-memory-profile profiles/
# -> profiles/<id>.bin for every id over threshold; clean ids get nothing
The run prints a ready-to-paste command per saved profile:
benchmem: saved 1 memory profile(s) to profiles/ — render with:
memray flamegraph profiles/test_solve.bin
The .bin is memray's raw capture, so the same file also feeds memray tree / summary /
stats — pick the lens you want. Off by default (retaining .bins costs disk), and in CI
it's the natural artifact to upload and render locally on the PR.
One-step render — benchmem flamegraph¶
Instead of finding the right .bin and remembering the memray subcommand, point benchmem flamegraph at the profile dir and name the test (an exact id or a unique substring):
benchmem flamegraph profiles/ test_solve # → profiles/test_solve.flamegraph.html
benchmem flamegraph profiles/ --worst peak --open # auto-pick the heaviest, open it
benchmem flamegraph profiles/ test_solve --report tree # terminal lens instead of HTML
--worst peak|allocated|allocations reads the metric straight from each .bin and renders the
heaviest, so you don't have to look up the id. --report passes through to any memray reporter
(flamegraph default, plus table / tree / summary / stats); HTML reports land next to
the .bin (override with -o, overwrite with -f). --native asserts the profile actually
carries native traces (see below) and errors with the fix if it doesn't.
Native-backed workloads: attribute the C/Rust memory¶
By default the capture records Python frames only. For a native-backed workload
(polars/Rust, numpy/C, solver bindings) the bulk of peak memory is allocated inside the
extension, so the flamegraph collapses it into one unresolved ??? at ??? bucket — exactly
the part you wanted to localize. Add --benchmark-memory-profile-native to also capture native
stacks:
pytest --benchmark-only --benchmark-memory \
--benchmark-memory-profile profiles/ \
--benchmark-memory-profile-native
Now the flamegraph attributes memory to the real frames (e.g. jemalloc _rjem_je_* under
rayon workers, reached via the polars write path) instead of an opaque native bucket. It's
opt-in — native traces cost runtime and produce bigger .bins — and only applies on the
--benchmark-memory-profile path (without a kept profile there's nothing to enrich, so the
flag errors rather than silently doing nothing). Scope it to one test with
@pytest.mark.benchmem(profile_native=True) instead of the suite-wide flag.
!!! note "Symbols sharpen native frames"
    Native traces resolve against interpreter/library symbols. On a stripped build memray warns
    "No symbol information was found for the Python interpreter"; frames then show as mangled
    Rust/<unknown> but stay attributable by symbol name (_rjem_je_*, rayon_*). A
    debug-symbol interpreter sharpens the picture.
Which benchmarks get a profile follows the gate:
- with --benchmark-memory-compare-fail → only the regressing ids (keep the failing run cheap and the output small);
- without a fail-gate → every measured benchmark: drop the gate and keep --benchmark-memory-profile DIR alone to archive them all, regressing or not.
A minimal GitHub Actions job using the two-file approach, caching the baseline across runs:
- uses: actions/cache@v4
  with:
    path: main.json
    key: benchmem-baseline-${{ github.base_ref }}
- run: pytest --benchmark-only --benchmark-memory --benchmark-json=pr.json
- run: benchmem compare main.json pr.json --fail-on peak:10% --fail-on allocations:5%
The other two views¶
plot auto-selects by run count; override with --view:
| Runs | Default view | Answers |
|---|---|---|
| 1 | scaling | how does cost grow with input size? |
| 2 | scatter | which ids moved, and were they already big? |
| 2 | compare (--view compare) | ranked: what moved most, in native units? |
| 3+ | sweep | fold-change across versions, one cell per (id, run) |
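The auto-selection in the table amounts to a lookup on the number of run files. A minimal sketch, assuming nothing beyond what the table states (`default_view` is a hypothetical name, and --view would simply bypass it):

```python
def default_view(n_runs):
    """Pick the plot view from the number of run files, as in the table above."""
    if n_runs < 1:
        raise ValueError("at least one run file is required")
    if n_runs == 1:
        return "scaling"
    if n_runs == 2:
        return "scatter"
    return "sweep"


for n in (1, 2, 3, 5):
    print(n, default_view(n))
```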
--pivot re-points the series axis of the paired views (compare/scatter) from the run-file
onto a dim, folding a single run along it (see One run, two configs).
--facet splits any view into small multiples by a dim (including node.*),
--where keeps only rows matching a dim=value filter (repeatable, AND-combined),
--free-axes unmatches a faceted axis from the shared default, and --label/-l names
the series per run (defaulting to the file stems):
benchmem plot run.json --columns peak --facet node.func # one panel per operation
benchmem plot run.json --facet node.func --free-axes y # ...each on its own cost scale
benchmem plot run.json --where axis=n # one sweep at a time
benchmem plot v1.json v2.json v3.json -l 0.6 -l 0.7 -l 0.8 # name series, not file stems
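Repeated --where filters are AND-combined, which can be modelled as a row filter over dim values. The sketch below is illustrative only: `apply_where` and the row layout are assumptions, and values are compared as strings because they arrive from the command line.

```python
def apply_where(rows, filters):
    """Keep rows whose dims match every 'dim=value' filter (AND-combined)."""
    wanted = dict(f.split("=", 1) for f in filters)
    return [
        row for row in rows
        if all(str(row.get(dim)) == value for dim, value in wanted.items())
    ]


# Hypothetical rows: each dict holds the dims of one measured benchmark.
rows = [
    {"axis": "n", "func": "solve"},
    {"axis": "n", "func": "build"},
    {"axis": "severity", "func": "solve"},
]

print(apply_where(rows, ["axis=n"]))
print(apply_where(rows, ["axis=n", "func=solve"]))
```

One filter keeps both n sweeps; adding a second narrows the result to the single row that satisfies both.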
Faceted panels share axes by default — right when they're commensurable (the same n
grid across functions). Two cases want them unmatched, and they free different axes:
- Different cost scales: --facet node.func where one function is far costlier flattens the cheap panels on a shared y. --free-axes y gives each its own cost range.
- Incommensurable sweeps: a run mixing sizes n (10–10⁴) and a 0–100 severity under one numeric dim squishes both onto one x. Use --free-axes x (or filter to one with --where axis=n).
--free-axes both frees each panel entirely. --where is the cleaner reach when you only
care about one slice — it drops the rest rather than just rescaling.