Compare & gate CI¶

Two (or more) saved runs in, one comparison out — a table (benchmem compare) or an interactive view (benchmem plot). The table shows time and peak across every stat by default (pick metrics with --columns, a stat with --stat); the plot takes one (--columns). Both group by the dims your tests carry.

A regression to catch¶

A benchmark builds a table of (i, i²) rows. On main, each row is a lightweight tuple:

@pytest.mark.parametrize("n", [10_000, 50_000, 200_000, 500_000])
def test_build_rows(benchmark, n):
    benchmark(lambda: [(i, i * i) for i in range(n)])

A branch switches the rows to dicts for readability — a classic memory regression, since a dict is several times heavier than a 2-tuple:

@pytest.mark.parametrize("n", [10_000, 50_000, 200_000, 500_000])
def test_build_rows(benchmark, n):
    benchmark(lambda: [{"x": i, "sq": i * i} for i in range(n)])

Run each on its branch and save the --benchmark-json — here, baseline.json (main) and candidate.json (the branch).

`benchmem compare` — the comparison table¶

Modelled on pytest-benchmark's own table: one row per (benchmark × run), columns are metric × stat, and each cell carries a relative (N.NN) multiplier vs the column's best run (best green, worst red). Rows are grouped into sub-tables by --group-by (default fullname, so each benchmark's runs sit together and the multiplier reads as the cross-run ratio). By default it shows time and peak, each across the full stat spread (min/max/mean/median/stddev) — so no single statistic is privileged:

In [2]:

Copied!

!benchmem compare {baseline} {candidate}
!benchmem compare {baseline} {candidate}

test_rows.py::test_build_rows[10000]

                      time (s)          time (s)          time (s)          time (s)           time (s)             peak (KiB)         peak (KiB)         peak (KiB)         peak (KiB)   peak (B) 
 name                      min               max              mean            median             stddev   │                min                max               mean             median     stddev 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 (baseline)     0.001167 (1.0)    0.001525 (1.0)    0.001208 (1.0)    0.001193 (1.0)    4.264e-05 (1.0)   │        83.12 (1.0)        83.12 (1.0)        83.12 (1.0)        83.12 (1.0)          0 
 (candidate)   0.001787 (1.53)   0.002441 (1.60)   0.002061 (1.71)   0.002047 (1.72)   6.096e-05 (1.43)   │   2,131.12 (25.64)   2,131.12 (25.64)   2,131.12 (25.64)   2,131.12 (25.64)          0 

test_rows.py::test_build_rows[200000]
                     time (s)         time (s)         time (s)         time (s)          time (s)         peak (MiB)     peak (MiB)     peak (MiB)     peak (MiB)   peak (B) 
 name                     min              max             mean           median            stddev   │            min            max           mean         median     stddev 
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 (baseline)     0.03563 (1.0)     0.0368 (1.0)    0.03614 (1.0)    0.03608 (1.0)   0.0003251 (1.0)   │    23.55 (1.0)    23.55 (1.0)    23.55 (1.0)    23.55 (1.0)          0 
 (candidate)   0.05537 (1.55)   0.05739 (1.56)   0.05604 (1.55)   0.05587 (1.55)   0.000582 (1.79)   │   46.55 (1.98)   46.55 (1.98)   46.55 (1.98)   46.55 (1.98)          0 

test_rows.py::test_build_rows[500000]
                    time (s)        time (s)        time (s)        time (s)          time (s)          peak (MiB)      peak (MiB)      peak (MiB)      peak (MiB)   peak (KiB) 
 name                    min             max            mean          median            stddev   │             min             max            mean          median       stddev 
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 (baseline)    0.09234 (1.0)   0.09472 (1.0)   0.09369 (1.0)   0.09367 (1.0)    0.000823 (1.0)   │     60.97 (1.0)     60.97 (1.0)     60.97 (1.0)     60.97 (1.0)         0.00 
 (candidate)   0.1446 (1.57)   0.1472 (1.55)   0.1454 (1.55)    0.145 (1.55)   0.000942 (1.14)   │   119.97 (1.97)   120.97 (1.98)   120.22 (1.97)   119.97 (1.97)       443.41 

test_rows.py::test_build_rows[50000]
                     time (s)         time (s)         time (s)         time (s)           time (s)         peak (MiB)     peak (MiB)     peak (MiB)     peak (MiB)   peak (B) 
 name                     min              max             mean           median             stddev   │            min            max           mean         median     stddev 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 (baseline)    0.007577 (1.0)   0.008282 (1.0)   0.007768 (1.0)   0.007745 (1.0)    0.0001223 (1.0)   │     4.42 (1.0)     4.42 (1.0)     4.42 (1.0)     4.42 (1.0)          0 
 (candidate)   0.01185 (1.56)   0.01284 (1.55)   0.01204 (1.55)   0.01199 (1.55)   0.0001808 (1.48)   │   10.42 (2.36)   10.42 (2.36)   10.42 (2.36)   10.42 (2.36)          0

Pick the metrics with --columns (a comma list of time / peak / allocated / allocations; a metric absent from every run is dropped) and the stat with --stat (min / max / mean / median / stddev, or all):

In [3]:

Copied!

!benchmem compare {baseline} {candidate} --columns peak --stat min
!benchmem compare {baseline} {candidate} --columns peak --stat min

test_rows.py::test_build_rows[10000]
                     peak (KiB) 
 name                       min 
────────────────────────────────
 (baseline)         83.12 (1.0) 
 (candidate)   2,131.12 (25.64) 

test_rows.py::test_build_rows[200000]
                 peak (MiB) 
 name                   min 
────────────────────────────
 (baseline)     23.55 (1.0) 
 (candidate)   46.55 (1.98) 

test_rows.py::test_build_rows[500000]
                  peak (MiB) 
 name                    min 
─────────────────────────────
 (baseline)      60.97 (1.0) 
 (candidate)   119.97 (1.97) 

test_rows.py::test_build_rows[50000]
                 peak (MiB) 
 name                   min 
────────────────────────────
 (baseline)      4.42 (1.0) 
 (candidate)   10.42 (2.36)

One run, two configs — `--pivot`¶

Everything above compares run-files: compare a.json b.json lays one file against another. But the file is just the series axis — the thing whose values sit side by side and get the (N.NN) multiplier. The series axis is a dim, and the run-file is only the default one.

--pivot DIM re-points it at a data dim, so a single combined run is A/B'd along that dim — the role a file-pair plays today, without splitting into N runs. Say semantics is a param (so it's in the node id, test_build[legacy-100], and a dim):

@pytest.mark.parametrize("semantics", ["legacy", "v1"])
@pytest.mark.parametrize("n", [100, 200])
def test_build(benchmark, semantics, n):
    benchmark(build, semantics=semantics, n=n)

One pytest invocation produces one build.json; --pivot param:semantics folds it so rows that differ only in semantics pair up (legacy ↔ v1) per n:

pytest benchmarks/ --benchmark-only --benchmark-memory --benchmark-json=build.json
benchmem compare build.json --pivot param:semantics --columns time,peak   # the A/B table
benchmem plot    build.json --pivot param:semantics --columns peak --view compare   # the A/B bars
benchmem plot    build.json --x n --facet semantics --columns peak     # --pivot optional here

Don't confuse it with --group-by, which touches the other axis: --group-by partitions rows into sub-tables along a dim while the compared series stays the run-files (legacy and v1 would sit in separate sections, never set against each other); --pivot makes the dim itself the compared series, folding rows that differ only in it into one ranked A/B row. They compose on orthogonal axes — --pivot param:semantics --group-by node.func folds legacy ↔ v1 and clusters the folded rows into per-function sub-tables. In short: --group-by says which rows belong together; --pivot says what you're comparing.

--pivot accepts the same dims as --group-by/--facet — param:NAME or a bare extra_info name — and composes with --columns, --stat, --facet, --where, and --sort unchanged. It's mutually exclusive with multiple runs (an A/B view has one series axis: comparing files and folding a dim would be a 2-D matrix), and only the paired views consume it — on plot it applies to --view compare/scatter; for scaling/sweep the dim is a normal --x / --facet axis instead. One combined run then drives the A/B table, the A/B plot, the scaling plot, and an external per-id gate (e.g. CodSpeed, which wants the config value in the node id of one run) — no file-splitting, no per-config reruns.

--fail-on follows the same axis: it normally gates the first run vs the last, but with --pivot it gates the first dim value vs the last (e.g. legacy → v1) out of the one run — so the combined run also gates itself in CI, no second file required:

# fail if v1's peak grew >10% over legacy, for any benchmark, from one run:
pytest benchmarks/ --benchmark-only --benchmark-memory --benchmark-json=build.json
benchmem compare build.json --pivot param:semantics --fail-on peak:10%

The base (the --fail-on reference, and the first row) is the dim's first value in parametrize / collection order — the --pivot analogue of file argument order. So parametrize("semantics", ["legacy", "v1"]) gates v1 against legacy; reorder the list to flip it. (The table's (1.0) annotation is separate — it marks each column's best value, green, exactly as the run-file table does, not the base.)

A --pivot param is lifted out of the id when rows fold (test_build[legacy-100] → test_build[100]), matched by value. A param given a custom pytest.param(id=…) whose label differs from its value won't fold — use a plain parametrize value, or an extra_info dim (which isn't in the id at all), for the comparison axis.

Gate CI on a regression¶

--fail-on exits non-zero past a threshold — drop it into CI after the run. Thresholds are percent (peak:10%) or absolute (peak:5MiB), on peak, allocated, or allocations (repeatable):

# on the PR branch, against a baseline saved from main:
pytest --benchmark-only --benchmark-memory --benchmark-json=pr.json
benchmem compare main.json pr.json --fail-on peak:10% --fail-on allocations:5%

The dict rows blow past the threshold on every size, so the offending ids print and it exits 1 — failing the CI job:

In [4]:

Copied!

!benchmem compare {baseline} {candidate} --columns peak --fail-on peak:10% --fail-on allocations:5%; echo "exit: $?"
!benchmem compare {baseline} {candidate} --columns peak --fail-on peak:10% --fail-on allocations:5%; echo "exit: $?"

test_rows.py::test_build_rows[10000]
                     peak (KiB)         peak (KiB)         peak (KiB)         peak (KiB)   peak (B) 
 name                       min                max               mean             median     stddev 
────────────────────────────────────────────────────────────────────────────────────────────────────
 (baseline)         83.12 (1.0)        83.12 (1.0)        83.12 (1.0)        83.12 (1.0)          0 
 (candidate)   2,131.12 (25.64)   2,131.12 (25.64)   2,131.12 (25.64)   2,131.12 (25.64)          0 

test_rows.py::test_build_rows[200000]
                 peak (MiB)     peak (MiB)     peak (MiB)     peak (MiB)   peak (B) 
 name                   min            max           mean         median     stddev 
────────────────────────────────────────────────────────────────────────────────────
 (baseline)     23.55 (1.0)    23.55 (1.0)    23.55 (1.0)    23.55 (1.0)          0 
 (candidate)   46.55 (1.98)   46.55 (1.98)   46.55 (1.98)   46.55 (1.98)          0 

test_rows.py::test_build_rows[500000]
                  peak (MiB)      peak (MiB)      peak (MiB)      peak (MiB)   peak (KiB) 
 name                    min             max            mean          median       stddev 
──────────────────────────────────────────────────────────────────────────────────────────
 (baseline)      60.97 (1.0)     60.97 (1.0)     60.97 (1.0)     60.97 (1.0)         0.00 
 (candidate)   119.97 (1.97)   120.97 (1.98)   120.22 (1.97)   119.97 (1.97)       443.41 

test_rows.py::test_build_rows[50000]
                 peak (MiB)     peak (MiB)     peak (MiB)     peak (MiB)   peak (B) 
 name                   min            max           mean         median     stddev 
────────────────────────────────────────────────────────────────────────────────────
 (baseline)      4.42 (1.0)     4.42 (1.0)     4.42 (1.0)     4.42 (1.0)          0 
 (candidate)   10.42 (2.36)   10.42 (2.36)   10.42 (2.36)   10.42 (2.36)          0 

8 regression(s) over threshold:
  peak         test_build_rows[10000]: 83.1 KiB → 2.08 MiB (+2463.8%)
  peak         test_build_rows[200000]: 23.5 MiB → 46.5 MiB (+97.7%)
  peak         test_build_rows[500000]: 61 MiB → 120 MiB (+96.8%)
  peak         test_build_rows[50000]: 4.42 MiB → 10.4 MiB (+135.6%)
  allocations  test_build_rows[10000]: 39 → 41 (+5.1%)
  allocations  test_build_rows[200000]: 86 → 109 (+26.7%)
  allocations  test_build_rows[500000]: 129 → 188 (+45.7%)
  allocations  test_build_rows[50000]: 57 → 63 (+10.5%)

exit: 1

allocations is usually the steadiest tripwire — see Choosing a metric.

`benchmem plot` — interactive views¶

benchmem plot writes an interactive plotly view to standalone HTML, picking the view by run count. Scaling (one run) draws cost vs. input size — the baseline's peak-memory curve:

benchmem plot baseline.json --columns peak -o scaling.html

Scatter (two runs) puts baseline cost on x (log) and the candidate/baseline ratio on y, colour = absolute Δ. The top-right corner is "big and got bigger" — where a regression actually costs you:

benchmem plot baseline.json candidate.json --columns peak -o scatter.html

Where to go next¶

Which metric to gate on → Choosing a metric
Compare across versions of a package → Cross-version sweeps
Every CLI flag and option → Reference

Going further¶

For timing comparisons you can also use pytest-benchmark's own tooling directly — pytest-benchmark compare, --benchmark-histogram. pytest-benchmem doesn't reimplement those; it adds the memory-aware, dims-aware views. Add time to --columns (or use --columns time on the plot) to put both on the same footing.

More from `compare`¶

Order rows with --sort (name | value — largest last-run first — | change), and write raw numbers for another tool with --csv out.csv:

benchmem compare baseline.json candidate.json --columns peak --sort value --csv peak.csv

Gating without separate files¶

The approach above keeps two JSON files. Alternatively, gate inline against pytest-benchmark's own storage — save a baseline once, then fail the next run against it. --benchmark-memory-compare-fail implies --benchmark-memory-compare:

# on main — record the baseline into .benchmarks/ storage:
pytest --benchmark-only --benchmark-memory --benchmark-save=main

# on the PR branch — fail if peak grows >10% vs that baseline:
pytest --benchmark-only --benchmark-memory --benchmark-memory-compare-fail=peak:10%

Without a prior saved run, the inline gate is a no-op — it prints "no prior run with memory to compare against" and passes. Save a baseline first.

Profile the offenders¶

A peak +20% number says that a benchmark regressed, not where. Add --benchmark-memory-profile DIR to keep the memray profile (.bin) for each regressing id, so you can render the allocating call paths after the fact:

pytest --benchmark-only --benchmark-memory \
       --benchmark-memory-compare-fail=peak:10% \
       --benchmark-memory-profile profiles/
# -> profiles/<id>.bin for every id over threshold; clean ids get nothing

The run prints a ready-to-paste command per saved profile:

benchmem: saved 1 memory profile(s) to profiles/ — render with:
    memray flamegraph profiles/test_solve.bin

The .bin is memray's raw capture, so the same file also feeds memray tree / summary / stats — pick the lens you want. Off by default (retaining .bins costs disk), and in CI it's the natural artifact to upload and render locally on the PR.

One-step render — `benchmem flamegraph`¶

Instead of finding the right .bin and remembering the memray subcommand, point benchmem flamegraph at the profile dir and name the test (an exact id or a unique substring):

benchmem flamegraph profiles/ test_solve          # → profiles/test_solve.flamegraph.html
benchmem flamegraph profiles/ --worst peak --open # auto-pick the heaviest, open it
benchmem flamegraph profiles/ test_solve --report tree   # terminal lens instead of HTML

--worst peak|allocated|allocations reads the metric straight from each .bin and renders the heaviest, so you don't have to look up the id. --report passes through to any memray reporter (flamegraph default, plus table / tree / summary / stats); HTML reports land next to the .bin (override with -o, overwrite with -f). --native asserts the profile actually carries native traces (see below) and errors with the fix if it doesn't.

Native-backed workloads: attribute the C/Rust memory¶

By default the capture records Python frames only. For a native-backed workload (polars/Rust, numpy/C, solver bindings) the bulk of peak memory is allocated inside the extension, so the flamegraph collapses it into one unresolved ??? at ??? bucket — exactly the part you wanted to localize. Add --benchmark-memory-profile-native to also capture native stacks:

pytest --benchmark-only --benchmark-memory \
       --benchmark-memory-profile profiles/ \
       --benchmark-memory-profile-native

Now the flamegraph attributes memory to the real frames (e.g. jemalloc _rjem_je_* under rayon workers, reached via the polars write path) instead of an opaque native bucket. It's opt-in — native traces cost runtime and produce bigger .bins — and only applies on the --benchmark-memory-profile path (without a kept profile there's nothing to enrich, so the flag errors rather than silently doing nothing). Scope it to one test with @pytest.mark.benchmem(profile_native=True) instead of the suite-wide flag.

!!! note "Symbols sharpen native frames" Native traces resolve against interpreter/library symbols. On a stripped build memray warns No symbol information was found for the Python interpreter; frames then show as mangled Rust/<unknown> but stay attributable by symbol name (_rjem_je_*, rayon_*). A debug-symbol interpreter sharpens the picture.

Which benchmarks get a profile follows the gate:

with --benchmark-memory-compare-fail → only the regressing ids (keep the failing run cheap and the output small);
without a fail-gate → every measured benchmark — drop the gate and keep --benchmark-memory-profile DIR alone to archive them all, regressing or not.

A minimal GitHub Actions job using the two-file approach, caching the baseline across runs:

- uses: actions/cache@v4
  with:
    path: main.json
    key: benchmem-baseline-${{ github.base_ref }}
- run: pytest --benchmark-only --benchmark-memory --benchmark-json=pr.json
- run: benchmem compare main.json pr.json --fail-on peak:10% --fail-on allocations:5%

The other two views¶

plot auto-selects by run count; override with --view:

Runs	Default view	Answers
1	`scaling`	how does cost grow with input size?
2	`scatter`	which ids moved, and were they already big?
2	`compare` (`--view compare`)	ranked — what moved most, in native units?
3+	`sweep`	fold-change across versions, one cell per (id, run)

--pivot re-points the series axis of the paired views (compare/scatter) from the run-file onto a dim, folding a single run along it (see One run, two configs). --facet splits any view into small multiples by a dim (including node.*), --where keeps only rows matching a dim=value filter (repeatable, AND-combined), --free-axes unmatches a faceted axis from the shared default, and --label/-l names the series per run (defaulting to the file stems):

benchmem plot run.json --columns peak --facet node.func             # one panel per operation
benchmem plot run.json --facet node.func --free-axes y             # ...each on its own cost scale
benchmem plot run.json --where axis=n                              # one sweep at a time
benchmem plot v1.json v2.json v3.json -l 0.6 -l 0.7 -l 0.8         # name series, not file stems

Faceted panels share axes by default — right when they're commensurable (the same n grid across functions). Two cases want them unmatched, and they free different axes:

Different cost scales — --facet node.func where one function is far costlier flattens the cheap panels on a shared y. --free-axes y gives each its own cost range.
Incommensurable sweeps — a run mixing sizes n (10–10⁴) and a 0–100 severity under one numeric dim squishes both onto one x. --free-axes x (or filter to one with --where axis=n).

--free-axes both frees each panel entirely. --where is the cleaner reach when you only care about one slice — it drops the rest rather than just rescaling.