Reference¶
Every flag, marker, fixture, CLI command, and public function. The CLI and Python API below are rendered live from the source; the pytest surface (flags, marker, fixture, blob schema) is curated here. For the narrative versions see Quickstart, Choosing a metric, Grouping by dims, and Compare & gate CI.
pytest command-line flags¶
The plugin adds these to any pytest run (alongside pytest-benchmark's own flags). This
table is generated from the plugin's own --help text, so it can't drift from the code:
| Flag | Default | What |
|---|---|---|
| --benchmark-memory | off | Record peak memory (a memray pass) for every benchmark() call, not just the benchmark_memory fixture, with no test changes. The fixture is always measured, with or without this flag. |
| --benchmark-memory-repeats=N | adaptive | Force a fixed number of memray passes per benchmark, suite-wide; the reported peak is the min across them. Overridden per-test by @pytest.mark.benchmem(repeats=N). When unset, passes run adaptively until the min floor settles (≥2, cap 10). Set this for a fixed, reproducible count (e.g. CI gating against a saved baseline). |
| --benchmark-memory-warmup=N | 1 | Untracked dry-runs of the action before measuring, suite-wide, to shed one-time costs (lazy imports, first-touch caches) so the measured passes aren't inflated by cold start. Overridden per-test by @pytest.mark.benchmem(warmup=N). Set 0 to disable. |
| --benchmark-memory-max-time=SECONDS | no limit | Wall-clock budget for the adaptive memory passes (the analogue of --benchmark-max-time): caps how long adaptive sampling spends per benchmark. Ignored when --benchmark-memory-repeats forces a fixed count. With no budget set, the pass cap alone bounds it. |
| --benchmark-memory-compare=REF | off | Compare this run's peak memory against a prior saved run (a pytest-benchmark storage ref like 0001, or the latest if no value is given); folds base and delta-peak columns into the table. |
| --benchmark-memory-compare-fail=FIELD:THRESHOLD | off | Fail the session on a memory regression, e.g. peak:10%, peak:5MiB, allocations:5% (repeatable). Fields: peak, allocated, allocations, rss (rss needs isolated runs). Implies --benchmark-memory-compare. |
| --benchmark-memory-profile=DIR | off | Save the memray profile (.bin) under DIR, for rendering with memray flamegraph (or tree/summary). Scope follows the gate: with --benchmark-memory-compare-fail only the regressing ids, otherwise every measured benchmark. Off by default (disk cost). |
| --benchmark-memory-profile-native | off | Capture native (C/C++/Rust) stacks in the kept profile, so the flamegraph attributes memory inside extension code (polars/numpy/solver bindings) instead of one opaque ??? at ??? bucket. Only affects --benchmark-memory-profile runs; opt-in (slower, bigger .bin). Per-test override: @pytest.mark.benchmem(profile_native=True). |
| --benchmark-memory-table | combined | Layout for the memory metrics: combined folds them into pytest-benchmark's timing table; split prints a separate memory table. |
| --benchmark-memory-columns=peak,allocated,allocations,rss | peak | Which memory metrics the table shows, comma-separated and in order: peak, allocated, allocations, rss (rss only shows for isolated runs). |
| --benchmark-memory-stats=min,mean,max | min,mean,max | With repeats > 1, the stats each shown metric spreads into: min, mean, max, median, stddev. A single pass stays one column. |
Timing regressions still use pytest-benchmark's own --benchmark-compare /
--benchmark-compare-fail; the --benchmark-memory-compare* flags are the memory
mirror. Their baseline comes from pytest-benchmark's storage (.benchmarks/) — save
one first with --benchmark-save=NAME or --benchmark-autosave, or the gate finds
nothing and passes. See Gate CI on a regression.
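To make the FIELD:THRESHOLD syntax concrete, here is a minimal sketch of how a spec such as peak:10% or peak:5MiB could be parsed and checked against a baseline. The helper names (parse_threshold, regressed) and the "strictly greater than" comparison are assumptions made for illustration; the plugin's actual parsing and rounding rules are not shown on this page.

```python
import re

# Binary size units as used in the examples above (5MiB). Illustrative table.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_threshold(spec: str) -> tuple[str, str, float]:
    """Split 'FIELD:THRESHOLD' into (field, kind, amount).

    kind is 'percent' for '10%' and 'absolute' for '5MiB' or a bare number.
    """
    field, _, raw = spec.partition(":")
    if not field or not raw:
        raise ValueError(f"expected FIELD:THRESHOLD, got {spec!r}")
    if raw.endswith("%"):
        return field, "percent", float(raw[:-1])
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(B|KiB|MiB|GiB)?", raw)
    if match is None:
        raise ValueError(f"unrecognised threshold {raw!r}")
    number, unit = match.groups()
    return field, "absolute", float(number) * _UNITS[unit or "B"]


def regressed(base: float, new: float, kind: str, amount: float) -> bool:
    """True when the increase from base to new exceeds the threshold."""
    delta = new - base
    if kind == "percent":
        # Multiply instead of dividing so a zero baseline cannot raise.
        return delta * 100 > amount * base
    return delta > amount


field, kind, amount = parse_threshold("peak:10%")
print(field, kind, amount, regressed(100.0, 120.0, kind, amount))
```

A percent threshold scales with the baseline, while an absolute one (5MiB) does not, which is why the flag is repeatable: both kinds can gate the same field.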
The benchmem marker¶
| Kwarg | Default | What |
|---|---|---|
| repeats | auto | Force a fixed N memray passes for this test (default: adaptive, see below). Every pass is kept (the blob stores the whole series); the headline peak is the minimum across them, and --stat reports any other. Overrides the suite-wide --benchmark-memory-repeats. |
| warmup | 1 | Untracked dry-runs of the action before measuring, to shed one-time costs (lazy imports, first-touch caches). 0 disables. Overrides the suite-wide --benchmark-memory-warmup. |
| isolate | False | Run each memray pass in a fresh process and also record whole-process resident memory as the rss metric: the physical/OOM-relevant peak that memray's logical heap can't give. Per-test only (no suite-wide flag): rss is a whole-job capacity number, meaningful only for build+operate benchmarks, so you mark the specific ones. Needs a top-level, picklable benchmarked function (see the whole-job warning below). |
| profile_native | False | On the --benchmark-memory-profile path, capture native (C/C++/Rust) stacks in the kept .bin, so a flamegraph attributes extension-code memory (polars/numpy/solver bindings) instead of one opaque ??? at ??? bucket. Opt-in (slower, bigger .bin). Overrides the suite-wide --benchmark-memory-profile-native. |
| max_peak | none | Fail the test if the headline peak exceeds this absolute ceiling. A size string ("100MiB", units B/KiB/MiB/GiB) or a bare int (bytes). |
| max_allocated | none | As max_peak, on allocated (total bytes). |
| max_allocations | none | As above, on the allocations count: a bare number (no unit). |
Isolated rss measures the whole job — build the state inside the callable
The rss metric (isolate=True) runs the action in a fresh, empty process. Two
consequences:
The build must happen inside the measured callable, and the callable must be a top-level,
picklable function. The child starts with nothing, so it must construct whatever it
operates on; and spawn serializes the call with standard pickle, so a lambda or closure
is rejected (we don't use cloudpickle) — pass a module-level function plus lightweight args.
# ✅ ships only the spec (~bytes); the child builds + writes cold = the whole job's RSS
benchmark_memory(build_and_write, spec, n)
# ❌ a lambda/closure — rejected; std pickle can't serialize it (even build-inside)
benchmark_memory(lambda: write(build(spec, n)))
# ❌ a top-level partial over a *pre-built* model pickles fine, but ships the model and
# measures *deserializing* it, not building it — the build never re-runs in the child
model = build(spec, n)
benchmark_memory(partial(write, model))
You can't isolate a single sub-phase. Since the child must build before it can operate,
isolated rss is a build-plus-operate capacity number by construction, never a per-phase
figure (e.g. write-only). For per-phase memory, use the in-process peak metric, which
can measure a write given an already-built model. So the rule is two-part: use a
top-level function (no lambdas), and don't pass heavy pre-built state — build it inside.
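The pickling rules above can be checked with the standard library alone. This sketch uses json.dumps and len as stand-ins for a module-level benchmarked function, and a plain list as a stand-in for a heavy pre-built model; none of these names come from pytest-benchmem.

```python
import json
import pickle
from functools import partial

# A small spec versus heavy pre-built state (both hypothetical stand-ins).
spec = {"rows": 1_000_000}
prebuilt_model = list(range(100_000))

# A module-level function plus light args pickles to a small payload.
light_call = pickle.dumps(partial(json.dumps, spec))

# A partial over pre-built state also pickles, but it carries the whole state,
# so the child would deserialize the model instead of building it.
heavy_call = pickle.dumps(partial(len, prebuilt_model))

# A lambda cannot be serialized by standard pickle at all.
try:
    pickle.dumps(lambda: json.dumps(spec))
    lambda_ok = True
except (pickle.PicklingError, AttributeError):
    lambda_ok = False

print(len(light_call), len(heavy_call), lambda_ok)
```

The size gap between the two payloads is the practical signal: if the pickled call is large, the build has already happened in the parent and will not be part of the child's rss.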
Absolute ceilings — max_peak / max_allocated / max_allocations¶
@pytest.mark.benchmem(max_peak="100MiB", max_allocations=5000)
def test_build(benchmark_memory):
benchmark_memory(build_model, 1000)
A baseline-free guardrail: the test fails if the measured metric exceeds the
ceiling (test_build: peak 117 MiB exceeds max_peak 100 MiB). Thresholds are absolute
only — there's no saved run to take a percent of; for relative gating against a prior run
use --benchmark-memory-compare-fail or benchmem compare --fail-on. A ceiling is a
worst-case budget, so with repeats > 1 (including adaptive sampling) the gate reads the
worst pass — not the headline min — and fails if any pass breaches it; the two coincide
for a single pass. The ceiling is enforced wherever memory is measured — the benchmark_memory
fixture and the --benchmark-memory patch — but a plain benchmark() call without
--benchmark-memory measures no memory, so the marker is a no-op there.
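As a rough illustration of the worst-pass rule, the sketch below parses a size string and gates on the maximum per-pass peak rather than the headline min. parse_size and breaches_ceiling are hypothetical helpers written for this example, and the pass values are invented.

```python
# Binary units accepted by the ceilings described above, longest suffix first.
_UNITS = {"GiB": 1024**3, "MiB": 1024**2, "KiB": 1024, "B": 1}


def parse_size(value: "int | str") -> int:
    """Turn "100MiB" (or a bare int of bytes) into a byte count."""
    if isinstance(value, int):
        return value
    text = value.strip()
    # Dicts keep insertion order, so "MiB" is tried before the bare "B".
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    raise ValueError(f"no size unit in {value!r}")


def breaches_ceiling(per_pass_peaks: "list[int]", max_peak: "int | str") -> bool:
    """A ceiling is a worst-case budget: gate on the worst pass, not the min."""
    return max(per_pass_peaks) > parse_size(max_peak)


MIB = 1024**2
peaks = [96 * MIB, 99 * MIB, 117 * MIB]  # three passes, invented values

headline_over = min(peaks) > parse_size("100MiB")  # headline min stays under
worst_over = breaches_ceiling(peaks, "100MiB")     # worst pass breaches
print(headline_over, worst_over)
```

With these numbers the headline min (96 MiB) would pass a naive check, but the 117 MiB pass fails the ceiling, which matches the failure message quoted above.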
Scope: the benchmarked action only
This gates the benchmarked action only (the isolated call pytest-benchmem
measures), not the whole test. For a whole-test limit or leak check, use
pytest-memray's limit_memory / limit_leaks;
see the README's "With pytest-memray".
How many passes? By default pytest-benchmem samples adaptively — after an untracked warmup
run, it runs the memray pass until the min floor settles (≥2 passes; capped at 10, or a
--benchmark-memory-max-time budget). Deterministic code settles in ~3 passes; noisy code runs
more. Set repeats=N (marker) or --benchmark-memory-repeats=N (suite) to force a fixed,
reproducible count — what CI gating against a saved baseline wants. Full rationale and the
noisy-workload guidance are in the guide: Repeats & adaptive sampling.
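The stopping rule can be sketched in a few lines. This is an illustration of the idea described above (stop once the running min has gone a number of passes without a new low, within a floor and a cap), not the plugin's implementation. The patience value of 2 is an assumption; the floor of 2 and the cap of 10 come from the text.

```python
import time
from typing import Callable, Optional


def sample_adaptively(
    measure_once: Callable[[], int],
    *,
    min_passes: int = 2,   # documented floor (>= 2)
    max_passes: int = 10,  # documented cap
    patience: int = 2,     # assumed value for this sketch
    max_time: Optional[float] = None,
) -> "list[int]":
    """Run passes until the running min has not improved for `patience` passes."""
    samples: "list[int]" = []
    stale = 0  # consecutive passes without a new low
    started = time.monotonic()
    while len(samples) < max_passes:
        value = measure_once()
        if not samples or value < min(samples):
            stale = 0
        else:
            stale += 1
        samples.append(value)
        if len(samples) < min_passes:
            continue
        if stale >= patience:
            break
        if max_time is not None and time.monotonic() - started >= max_time:
            break
    return samples


# Simulated peaks stand in for real memray passes.
steady = iter([100] * 20)
noisy = iter([120, 110, 105, 100, 100, 100, 100])
steady_passes = sample_adaptively(lambda: next(steady))
noisy_passes = sample_adaptively(lambda: next(noisy))
print(len(steady_passes), len(noisy_passes))
```

Under these assumptions a deterministic workload stops after three passes, while one that keeps finding new lows runs longer, up to the cap.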
The benchmark_memory fixture¶
Depends on pytest-benchmark's benchmark fixture; measures peak in a separate untimed
pass, then times via pytest-benchmark.
Order — memory first (cold), then timing
Every call form runs the memray pass first, then pytest-benchmark's timing
(calibration + all rounds). This matters: timing runs the function thousands of times,
which grows and fragments the allocator's arenas — so measuring memory after timing
would report the warm plateau, not the fresh-process floor the headline min is meant
to be. Memory-first measures the cold cost (the warmup pass still sheds the one-time
cold-start within it); timing then runs cleanly, with no memray hooks active. This holds
for __call__, pedantic, and the --benchmark-memory patch alike. The standalone
measure_peak / measure_memory have no timing phase at all; warmup=0 skips the warmup,
repeats=N forces a fixed count.
Explicit control, like pytest-benchmark's pedantic plus a memory pass:
benchmark_memory.pedantic(target, args=(), kwargs=None, setup=None,
rounds=1, warmup_rounds=0, iterations=1)
- setup: a callable run untracked before each measured call; if it returns (args, kwargs), those supply the call's arguments. It is used for both the timed rounds and each (adaptive) memory sample, so one setup rebuilds fresh state for both and a stateful action's memory samples stay independent. The same applies to benchmark.pedantic(setup=…) under --benchmark-memory: no extra changes.
- rounds, warmup_rounds, iterations: as in pytest-benchmark.
Mostly memory, little timing? There's no memory-only switch — the entry rides
pytest-benchmark's timing. To trim it: --benchmark-min-rounds=1 --benchmark-max-time=0
(no test changes), or pedantic(rounds=1, warmup_rounds=0) for a single call. For pure
memory outside pytest, use measure_peak / measure_memory.
Attributes (available after a call):
| Attribute | What |
|---|---|
| extra_info | pytest-benchmark's per-benchmark dict. Set scalars here to attach analysis dims; the memory blob lands here under the benchmem key. |
| peak_bytes | Peak memory (bytes) from the last call, or None before any call. |
| result | The full MemoryResult from the last call, or None. |
The extra_info.benchmem blob¶
Each measured benchmark stores this dict under extra_info["benchmem"] — three flat
per-repeat series, one entry per memray pass. Every reported number (headline peak =
min, any --stat) derives from these on read:
| Key | What |
|---|---|
| peak_bytes | Per-repeat high-water of live bytes: the peak metric (headline = min). |
| allocations | Per-repeat allocation count: the allocations metric. |
| total_bytes | Per-repeat total bytes allocated: the allocated metric (the churn that peak hides). |
| rss_bytes | Per-repeat whole-process resident high-water (ru_maxrss): the rss metric. Only present under isolate=True (each pass in a fresh process); absent otherwise. |
See Choosing a metric for when to reach for each, and --stat for distributions.
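Because the blob stores only the raw series, every reported number can be recomputed from it. The sketch below does that with the standard library on an invented three-pass blob. The summarize helper is hypothetical, and using the sample standard deviation for stddev is an assumption, since this page does not say which variant the tool reports.

```python
import statistics

# A hypothetical blob with three memray passes (values invented).
blob = {
    "peak_bytes": [1_200_000, 1_000_000, 1_100_000],
    "allocations": [510, 500, 505],
    "total_bytes": [4_000_000, 3_900_000, 3_950_000],
}

STATS = {
    "min": min,
    "max": max,
    "mean": statistics.fmean,
    "median": statistics.median,
    "stddev": statistics.stdev,  # assumption: sample standard deviation
}


def summarize(series):
    """Spread one per-repeat series into stats; a single pass stays one value."""
    if len(series) == 1:
        return {"value": series[0]}
    return {name: fn(series) for name, fn in STATS.items()}


repeats = len(blob["peak_bytes"])      # the pass count is the series length
headline_peak = min(blob["peak_bytes"])
peak_stats = summarize(blob["peak_bytes"])
print(repeats, headline_peak, peak_stats["mean"], peak_stats["median"])
```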
CLI — benchmem¶
Installed with pytest-benchmem[plot]. The full command tree and every option, captured
live from the typer app as it actually renders in a terminal:
Usage: benchmem [OPTIONS] COMMAND [ARGS]...
pytest-benchmem — plot and compare benchmark runs.
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or customize the │
│ installation. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────╮
│ plot Render an interactive plotly view from one or more pytest-benchmark runs. │
│ compare Print a per-id table for one run, or compare two or more (and optionally gate CI). │
│ sweep Run a benchmark suite across several installed versions of a package. │
│ flamegraph Render a kept memory profile in one step — resolve the ``.bin`` for a test and run │
│ memray. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Usage: benchmem plot [OPTIONS] RUNS...
Render an interactive plotly view from one or more pytest-benchmark runs.
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ * runs... PATH pytest-benchmark JSON file(s). [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --columns [time|peak|allocated|allocation Metric to plot: time | peak | │
│ s|rss] allocated | allocations | rss │
│ (rss = isolated runs only). One │
│ per figure (a plot has a single │
│ value axis) — same flag as │
│ `compare`; the spread shows as │
│ whiskers via --band. │
│ [default: time] │
│ --view TEXT compare | scatter | sweep | │
│ scaling (default: by count). │
│ --facet TEXT Dim to facet by. │
│ --pivot TEXT Comparison axis for --view │
│ compare/scatter: fold a single │
│ run along this dim instead of │
│ across run-files (param:NAME or a │
│ bare extra_info name); its values │
│ become the compared series. Like │
│ --group-by but it sets what's │
│ *compared*, not how rows cluster. │
│ Mutually exclusive with multiple │
│ runs. │
│ --x TEXT scaling: dim for the x-axis. │
│ --clip FLOAT Clamp the colour scale. │
│ --where TEXT Filter rows by dim: KEY=VALUE │
│ (repeatable, AND-combined). │
│ --free-axes [x|y|both] Free facet axes: x | y | both │
│ (needs --facet). │
│ --band [auto|minmax|none] scaling: spread whiskers on │
│ memory metrics — auto | minmax | │
│ none. │
│ [default: auto] │
│ --label -l TEXT Series label per run, in order │
│ (repeat). Default: stem. │
│ --output -o PATH HTML out. │
│ --open --no-open [default: no-open] │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Usage: benchmem compare [OPTIONS] RUNS...
Print a per-id table for one run, or compare two or more (and optionally gate CI).
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ * runs... PATH One or more pytest-benchmark runs, oldest → newest. One prints a plain │
│ table; two or more compare (a sweep is N). │
│ [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --columns TEXT Comma list of metrics: time | peak | allocated | allocations | rss (rss │
│ = isolated runs only; e.g. peak or time,peak,rss). Default: time,peak. │
│ Each is shown across every --stat; a metric absent from every run is │
│ dropped. │
│ --group-by TEXT Group rows into sub-tables: fullname | name | func | group | module | │
│ class | param:NAME (comma-composable). │
│ [default: fullname] │
│ --stat TEXT Which stat column(s) per metric: min | max | mean | median | stddev, or │
│ all (the default) for the full spread side by side. │
│ --sort TEXT Row order: name (id) | value (largest in the last run) | change. │
│ [default: name] │
│ --pivot TEXT Comparison axis: fold a single run along this dim instead of across │
│ run-files — param:NAME or a bare extra_info name. Rows differing only in │
│ it pair up and its values become the compared series. Like --group-by │
│ but it sets what's *compared*, not how rows cluster. Mutually exclusive │
│ with multiple runs. │
│ --csv PATH Also write the raw (unscaled) comparison to this CSV file. │
│ --fail-on TEXT Exit non-zero on a regression of the first run vs the last (or, with │
│ --pivot, the first dim value vs the last). FIELD:THRESHOLD, repeatable — │
│ e.g. --fail-on peak:10% --fail-on peak:5MiB --fail-on rss:10% (rss gates │
│ only isolated runs). │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Usage: benchmem sweep [OPTIONS] PACKAGE VERSIONS...
Run a benchmark suite across several installed versions of a package.
Provisions one fresh uv venv per version, runs 'pytest <suite> --benchmark-only'
in each writing <out>/<version>.json, then prints the next step. --memory adds
the memory pass; forward any other pytest flag with --pytest-arg, e.g.
benchmem sweep mypkg 1.2.0 1.3.0 --suite benchmarks/ --memory --pytest-arg=-k.
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ * package TEXT Package under test; each plain version installs `<package>==<v>`. │
│ [required] │
│ * versions... TEXT Versions or pip specs to sweep, e.g. 1.2.0 1.3.0 │
│ git+https://github.com/me/pkg@main. │
│ [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ * --suite PATH Benchmark suite (dir or file) to run in each version's │
│ venv. │
│ [required] │
│ --out PATH Directory for the per-version JSON runs. │
│ [default: .benchmarks/sweep] │
│ --memory --no-memory Add --benchmark-memory to each pytest run. │
│ [default: no-memory] │
│ --pytest-arg TEXT Arg forwarded to pytest, one token each, repeatable │
│ (e.g. --pytest-arg=-k). │
│ --pin TEXT Extra pip spec installed alongside (repeatable). │
│ --as-of TEXT YYYY-MM-DD for uv --exclude-newer (reproducible │
│ resolve). │
│ --import-check TEXT Module asserted to resolve to the venv (isolation │
│ preflight). │
│ --copy-dir PATH Directory copied into each venv's cwd (the suite │
│ imports from here). │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Usage: benchmem flamegraph [OPTIONS] PROFILE_DIR [TEST_ID]
Render a kept memory profile in one step — resolve the ``.bin`` for a test and run memray.
Closes the "regressed → *where*?" loop after ``--benchmark-memory-profile``: instead of
finding the right ``.bin`` and remembering the memray subcommand, point at the profile dir
and name the test (or ``--worst peak`` to auto-pick the heaviest). Defaults to an HTML
flamegraph written next to the ``.bin``; ``--report tree|summary|stats`` prints to the
terminal instead.
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ * profile_dir PATH Directory of kept .bin profiles (--benchmark-memory-profile). │
│ [required] │
│ [test_id] TEXT Test id (exact, or a unique substring) to render; omit with --worst. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --worst TEXT Auto-pick the heaviest: peak | allocated | allocations │
│ --report TEXT memray reporter: flamegraph | table | tree | summary | stats │
│ [default: flamegraph] │
│ --native --no-native Require the profile to carry native traces (captured via │
│ --benchmark-memory-profile-native); error if it doesn't. │
│ [default: no-native] │
│ --output -o PATH HTML out path (default: next to the .bin). │
│ --open --no-open Open the rendered HTML. [default: no-open] │
│ --force -f Overwrite an existing render. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Public Python API¶
Light to import — pytest_benchmem re-exports only the engine and the readers;
pytest_benchmem.plotting pulls plotly and pytest_benchmem.sweep shells to uv,
so import those submodules directly.
Engine¶
measure_peak ¶
Run action() under memray.Tracker and return peak bytes.
The bare one-liner for a REPL or notebook; measure_memory returns the
full result (allocation count, spread). repeats behaves as it does there: None
(default) samples adaptively, an int forces a fixed pass count.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| action | Action | The zero-argument callable to measure. | required |
| repeats | int \| None | Fixed pass count, or None to sample adaptively. | None |

Returns:
| Type | Description |
|---|---|
| int | Peak bytes (the headline min across passes). |
measure_memory ¶
measure_memory(
action: Action,
repeats: int | None = None,
*,
warmup: int = _DEFAULT_WARMUP,
isolate: bool = False,
max_time: float | None = None,
min_passes: int = _ADAPTIVE_MIN_PASSES,
max_passes: int = _ADAPTIVE_MAX_PASSES,
patience: int = _ADAPTIVE_PATIENCE,
keep_bin: Path | None = None,
native: bool = False,
setup: Action | None = None,
) -> MemoryResult
Run action() under memray.Tracker and return a MemoryResult, one pass per repeat.
warmup untracked dry-runs run first to shed one-time costs; then each measured pass
gets a fresh tracker. The headline is the min across passes (see MemoryResult);
every pass's Measurement is kept for spread stats.
With isolate=True each measured pass runs in a fresh spawned process (each warming
itself), and that child's whole-process resident high-water (ru_maxrss) is recorded as
Measurement.rss_bytes, a physical-memory reading attributable to the action, which
an in-process pass can't give. The action (and setup) must be picklable (a top-level
callable, not a lambda/closure); keep_bin is ignored in this mode.
Two modes, by repeats:
- repeats=N (an int): run exactly N passes. Fixed and reproducible; what CI gating and saved-baseline comparisons want.
- repeats=None (default): sample adaptively, running passes until the min stops moving (no new low for patience passes), bounded by min_passes (≥2), max_passes, and an optional max_time budget. Deterministic code settles in a few passes; noisy code runs more.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| action | Action | The zero-argument callable to measure. | required |
| repeats | int \| None | Fixed pass count, or None to sample adaptively. | None |
| warmup | int | Untracked dry-runs before measuring (0 skips the warmup). | _DEFAULT_WARMUP |
| isolate | bool | Run each pass in a fresh spawned process and record its ru_maxrss as rss_bytes. | False |
| max_time | float \| None | Wall-clock budget (seconds) for adaptive sampling; None means no time bound. | None |
| min_passes | int | Minimum passes when sampling adaptively. | _ADAPTIVE_MIN_PASSES |
| max_passes | int | Hard ceiling on passes when sampling adaptively. | _ADAPTIVE_MAX_PASSES |
| patience | int | Stop adaptive sampling after this many consecutive passes with no new min. | _ADAPTIVE_PATIENCE |
| keep_bin | Path \| None | If set, the first pass's profile (.bin) is kept at this path; ignored when isolate=True. | None |
| native | bool | Capture native (C/C++/Rust) stacks in the kept .bin. | False |
| setup | Action \| None | Optional zero-arg callable run untracked before each pass (and each warmup run); its allocations are not measured. Use it to rebuild fresh state so a stateful action's samples stay independent. | None |

Returns:
| Type | Description |
|---|---|
| MemoryResult | A MemoryResult holding one Measurement per pass. |
MemoryResult
dataclass
¶
A memory measurement across repeats passes, derived from the per-repeat samples.
The per-repeat samples are the single source of truth: that's all the blob
stores (the series); everything else is derived from them on read.
The headline peak_bytes is the minimum peak across passes: the fresh-process
floor, unbiased by the in-process warm plateau (repeated runs fragment/grow arenas and
allocate more) that a central stat would report. allocations / total_bytes
come from that same min-peak run (a coherent snapshot); peak_bytes_max is the worst
peak, so the spread is visible. A warm-plateau / steady-state read is available via the
mean / median --stat. A single pass collapses all of these to its own values.
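The "coherent snapshot" point is easiest to see with numbers. In this sketch PassSample is a hypothetical stand-in for Measurement and the values are invented; it only illustrates why the headline figures are taken from one run (the min-peak one) instead of minimizing each series independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassSample:
    """Hypothetical stand-in for one pass's raw numbers."""
    peak_bytes: int
    allocations: int
    total_bytes: int


def representative(samples):
    """The min-peak run, from which all headline numbers are read."""
    return min(samples, key=lambda s: s.peak_bytes)


samples = [
    PassSample(1_200_000, 520, 4_100_000),
    PassSample(1_000_000, 500, 3_900_000),
    PassSample(1_100_000, 490, 4_000_000),
]

rep = representative(samples)
headline = (rep.peak_bytes, rep.allocations, rep.total_bytes)
peak_bytes_max = max(s.peak_bytes for s in samples)

# Minimizing each series on its own mixes passes and matches no real run.
mixed = (
    min(s.peak_bytes for s in samples),
    min(s.allocations for s in samples),
    min(s.total_bytes for s in samples),
)
print(headline, peak_bytes_max, mixed == headline)
```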
representative
property
¶
The min-peak run — the one the headline peak/allocations/total_bytes come from.
peak_bytes
property
¶
The headline peak — the minimum high-water across passes (the fresh-process floor).
peak_bytes_max
property
¶
The worst peak across repeats (equals :attr:peak_bytes with one repeat).
rss_bytes
property
¶
Headline whole-process RSS — the minimum ru_maxrss across isolated passes
(the cold floor, like :attr:peak_bytes), or None if memory wasn't measured in
isolation (in-process has no attributable process-global RSS).
series ¶
The per-repeat values of one series field (SERIES_FIELDS or optional).
as_dict ¶
The JSON blob stored under pytest-benchmark extra_info["benchmem"].
The three core per-repeat series, flat, plus any :data:OPTIONAL_SERIES_FIELDS
that were measured (all-or-nothing per result). No denormalized scalars and no
repeats (it's len of any series). Everything else derives on read.
from_blob
classmethod
¶
Rebuild from a blob's per-repeat series. Core columns are required; any
:data:OPTIONAL_SERIES_FIELDS are read when present (else left None).
Measurement
dataclass
¶
One repeat's raw numbers — memray's peak high-water, allocation count, and total bytes allocated (cumulative churn, incl. temporaries GC later frees), plus an optional whole-process resident high-water.
rss_bytes is getrusage's ru_maxrss from an isolated pass (a fresh child
process); None in-process, where a process-global RSS isn't attributable to the action.
Readers & loader¶
from_pytest_benchmark reads timing (seconds, from stats);
memory_from_pytest_benchmark reads memory (bytes, from extra_info.benchmem).
load_samples is the unified reader; load_long_df stacks runs into the tidy frame the
plots pivot. discover_runs collects saved runs from pytest-benchmark's .benchmarks/
storage, so you can hand the readers a directory instead of listing files.
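For orientation, this is roughly where the two readers look inside a pytest-benchmark JSON file. The sketch works on an in-memory dict with invented values, and read_timing / read_memory are hypothetical helpers, not the library's functions. Reducing the series with min is an assumption chosen to match the headline described above.

```python
# A trimmed, hypothetical pytest-benchmark document (values invented).
doc = {
    "benchmarks": [
        {
            "fullname": "bench.py::test_build[100]",
            "params": {"n": 100},
            "stats": {"min": 0.012, "mean": 0.013},
            "extra_info": {
                "benchmem": {
                    "peak_bytes": [2048, 1024],
                    "allocations": [12, 10],
                    "total_bytes": [4096, 4000],
                }
            },
        },
        {
            "fullname": "bench.py::test_timing_only",
            "params": None,
            "stats": {"min": 0.002, "mean": 0.002},
            "extra_info": {},
        },
    ]
}


def read_timing(doc, stat="min"):
    """Timing comes from each benchmark's stats block, in seconds."""
    return [(b["fullname"], b["stats"][stat]) for b in doc["benchmarks"]]


def read_memory(doc, field="peak_bytes", reduce=min):
    """Memory comes from extra_info["benchmem"]; timing-only tests are skipped."""
    out = []
    for bench in doc["benchmarks"]:
        blob = bench.get("extra_info", {}).get("benchmem")
        if blob is None or field not in blob:
            continue
        out.append((bench["fullname"], reduce(blob[field])))
    return out


timing = read_timing(doc)
memory = read_memory(doc)
print(timing, memory)
```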
from_pytest_benchmark ¶
Read timing out of a pytest-benchmark file → (label, samples, "s").
Dims come from each benchmark's parametrize params and extra_info, plus
the structural node.* dims (see :func:_node_dims).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | A pytest-benchmark JSON file. | required |
| metric | str | Which pytest-benchmark stat to read. | 'min' |

Returns:
| Type | Description |
|---|---|
| tuple[str, list[Sample], str] | (label, samples, unit), where the unit is "s". |
memory_from_pytest_benchmark ¶
memory_from_pytest_benchmark(
path: str | Path,
*,
field: str = "peak_bytes",
reduce: Callable[[list[float]], float] | None = None,
) -> tuple[str, list[Sample], str]
Read memory out of a pytest-benchmark file → (label, samples, unit).
The benchmark_memory fixture stores each run's memory blob under
extra_info["benchmem"] (a flat per-repeat series per field), keyed by the same
benchmark id pytest-benchmark uses. Benchmarks lacking the blob (timing-only tests)
are skipped. Dims come from parametrize params and extra_info, plus the
structural node.* dims (see :func:_node_dims).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | A pytest-benchmark JSON file. | required |
| field | str | Which per-repeat series of the blob to read. | 'peak_bytes' |
| reduce | Callable[[list[float]], float] \| None | Reduce the per-repeat series to one scalar. | None |

Returns:
| Type | Description |
|---|---|
| tuple[str, list[Sample], str] | (label, samples, unit); samples cover only the benchmarks with the blob. |
load_samples ¶
load_samples(
path: str | Path,
*,
metric: Metric = "time",
stat: str | None = None,
) -> tuple[str, list[Sample], str]
Read one pytest-benchmark file for the chosen metric → (label, samples, unit).
The unified reader over :func:from_pytest_benchmark (timing) and
:func:memory_from_pytest_benchmark (memory).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | A pytest-benchmark JSON file. | required |
| metric | Metric | Which metric to read. | 'time' |
| stat | str \| None | Distribution stat over the metric's per-repeat series. | None |

Returns:
| Type | Description |
|---|---|
| tuple[str, list[Sample], str] | (label, samples, unit). |
load_long_df ¶
load_long_df(
runs: str | Path | Sequence[str | Path],
*,
metric: Metric = "time",
stat: str | None = None,
labels: Sequence[str] | None = None,
pivot: str | None = None,
) -> tuple[pd.DataFrame, str]
Stack pytest-benchmark files (one path or a sequence) into one long frame → (df, unit).
One row per (run, id) for the chosen metric. Columns: snapshot
(the series axis — see below), id (the pairing key), value, then one column
per dim key seen (missing dims are NaN). Every plot view and the compare table pivots
this frame, pairing rows on id and laying snapshot values side by side.
The series axis is just a dim. By default it's the run-file (snapshot = each file's
label), which is why a run-file is a comparison axis: compare a.json b.json ranks
one file against another. pivot re-points that axis at a real data dim instead — its
values become snapshot and it's lifted out of each row's identity (dropped from the
dims and stripped from the id) so rows differing only in it pair up. That lets one
combined run be A/B'd along a config dim (--pivot param:semantics) exactly as two files
are A/B'd today — the run-file is an external series axis, pivot promotes an
internal dim to the same role. The two are mutually exclusive (an A/B view has one series
axis), so pivot with more than one run is an error.
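The effect of pivot can be sketched without pandas. In this simplified illustration each row is a (dims, value) pair from one combined run, and fold lifts the pivot dim out of the row's identity so that rows differing only in it pair up. The fold helper, the dim names, and the values are all invented for the example; the real frame also rewrites the id string, which is omitted here.

```python
from collections import defaultdict

# Rows of one combined run: (dims, value). Dims and values are invented.
rows = [
    ({"semantics": "old", "n": 100}, 10.0),
    ({"semantics": "new", "n": 100}, 12.0),
    ({"semantics": "old", "n": 1000}, 95.0),
    ({"semantics": "new", "n": 1000}, 90.0),
]


def fold(rows, pivot):
    """Use the pivot dim's value as the series and the remaining dims as identity."""
    table = defaultdict(dict)
    for dims, value in rows:
        series = dims[pivot]
        identity = tuple(sorted((k, v) for k, v in dims.items() if k != pivot))
        table[identity][series] = value
    return dict(table)


paired = fold(rows, "semantics")
for identity, by_series in paired.items():
    print(identity, by_series)
```

Each identity now holds one value per series ("old" and "new"), which is the same side-by-side shape two run-files would produce.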
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| runs | str \| Path \| Sequence[str \| Path] | One path or a sequence of pytest-benchmark JSON files. | required |
| metric | Metric | Which metric to read. | 'time' |
| stat | str \| None | Distribution stat over the per-repeat series. | None |
| labels | Sequence[str] \| None | Overrides the snapshot labels (default: the file stems). | None |
| pivot | str \| None | Use this dim as the series axis instead of the run-file. | None |

Returns:
| Type | Description |
|---|---|
| tuple[DataFrame, str] | (df, unit). |
discover_runs ¶
Return pytest-benchmark JSON files under root (for CLI suggestions).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str \| Path | Directory to search (default: pytest-benchmark's .benchmarks/ storage). | '.benchmarks' |

Returns:
| Type | Description |
|---|---|
| list[Path] | The JSON file paths found under root. |
Sample ¶
Bases: NamedTuple
One measured result: an opaque id, a value, and analysis dims.
Plotting — pytest_benchmem.plotting¶
Every plot_* returns (figure, n_ids). snapshots is a list of run JSON paths;
labels names the series per run (defaults to the file stems) — the API behind plot's
-l/--label. plot_compare's sort is "absolute" (native units) or "relative"
(percent).
plot_scaling ¶
plot_scaling(
snapshots: Snapshots,
*,
metric: Metric = "time",
x: str | None = None,
color: str | None = None,
facet: str | None = None,
log: bool | Literal["auto"] = "auto",
band: Literal["auto", "minmax", "none"] = "auto",
where: Mapping[str, str] | None = None,
free_axes: FreeAxes | None = None,
labels: Sequence[str] | None = None,
) -> tuple[Figure, int]
Cost vs a numeric dim, coloured/faceted by other dims.
x/color/facet default to inference from the dims (the lone numeric
dim → x); pass them to override.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| snapshots | Snapshots | Run JSON path(s). | required |
| metric | Metric | Which metric to plot ( | 'time' |
| x | str \| None | Dim for the x-axis (default: the lone numeric dim). | None |
| color | str \| None | Dim to colour by (default: inferred). | None |
| facet | str \| None | Dim to split into subplots (default: inferred). | None |
| log | bool \| Literal['auto'] | | 'auto' |
| band | Literal['auto', 'minmax', 'none'] | Spread whiskers ( | 'auto' |
| where | Mapping[str, str] \| None | Keep only rows matching these | None |
| free_axes | FreeAxes \| None | | None |
| labels | Sequence[str] \| None | Names the snapshot in the title (default: file stem). | None |

Returns:
| Type | Description |
|---|---|
| tuple[Figure, int] | |
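The "lone numeric dim → x" default can be pictured as follows. This is one plausible reading of that rule, not the library's actual inference: the function infer_x, the dict-per-row input, and the "parses as float" test are all assumptions made for this sketch.

```python
from typing import Mapping, Optional, Sequence


def _is_number(text: str) -> bool:
    """True when the string parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def infer_x(rows: Sequence[Mapping[str, str]]) -> Optional[str]:
    """Hypothetical: return the only dim whose values are all numeric."""
    dims = sorted({name for row in rows for name in row})
    numeric = [
        d for d in dims
        if all(d in row and _is_number(row[d]) for row in rows)
    ]
    # Only an unambiguous choice is returned; otherwise the caller must pass x.
    return numeric[0] if len(numeric) == 1 else None


rows = [
    {"n": "100", "backend": "numpy"},
    {"n": "1000", "backend": "numpy"},
    {"n": "1000", "backend": "polars"},
]
x = infer_x(rows)
ambiguous = infer_x([{"n": "10", "m": "2"}])
```

When two dims are numeric there is no lone candidate, which is the case where passing x explicitly is needed.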
plot_scatter ¶
plot_scatter(
snapshots: Snapshots,
*,
metric: Metric = "time",
facet: str | None = None,
clip: float | None = None,
where: Mapping[str, str] | None = None,
free_axes: FreeAxes | None = None,
labels: Sequence[str] | None = None,
pivot: str | None = None,
) -> tuple[Figure, int]
Baseline cost (log-x) vs candidate/baseline ratio (log-y).
Top-right = slow and slower (the regressed corner). The first series is the baseline;
with 3+, the rest animate. Colour encodes the absolute Δ. The series axis is the run-file
by default; pivot re-points it at a data dim, folding a single run so its dim-values are
the series (the first being the baseline) instead of the files (see load_long_df).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| snapshots | Snapshots | Run JSON path(s); the first is the baseline, extras animate (one run when | required |
| metric | Metric | Which metric to plot ( | 'time' |
| facet | str \| None | Dim to split into subplots. | None |
| clip | float \| None | Clamp the colour scale (default p95). | None |
| where | Mapping[str, str] \| None | Keep only rows matching these | None |
| free_axes | FreeAxes \| None | Give each facet its own axes instead of sharing. | None |
| labels | Sequence[str] \| None | Series names per run (default: file stems). Ignored when | None |
| pivot | str \| None | Use this dim as the series axis instead of the run-file ( | None |

Returns:
| Type | Description |
|---|---|
| tuple[Figure, int] | |
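The three quantities plot_scatter encodes (baseline cost on x, candidate/baseline ratio on y, absolute Δ as colour) can be computed by hand as in the sketch below. The function scatter_points and the plain-dict inputs are assumptions for illustration, as is the choice to keep only ids present in both runs; the library may treat unmatched ids differently.

```python
from typing import Dict, Tuple


def scatter_points(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> Dict[str, Tuple[float, float, float]]:
    """Hypothetical: id -> (baseline cost, candidate/baseline ratio, delta)."""
    points = {}
    for key in sorted(baseline.keys() & candidate.keys()):
        base, cand = baseline[key], candidate[key]
        if base <= 0 or cand <= 0:
            continue  # non-positive values cannot be placed on log axes
        points[key] = (base, cand / base, cand - base)
    return points


baseline = {"a": 1.0, "b": 2.0, "c": 0.5}
candidate = {"a": 1.5, "b": 1.0, "d": 3.0}
points = scatter_points(baseline, candidate)
```

Here "a" has a ratio above 1 (slower than the baseline) and "b" has a ratio below 1 (faster).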
plot_compare ¶
plot_compare(
snapshots: Snapshots,
*,
metric: Metric = "time",
sort: SortMode = "absolute",
facet: str | None = None,
clip: float | None = None,
where: Mapping[str, str] | None = None,
free_axes: FreeAxes | None = None,
labels: Sequence[str] | None = None,
pivot: str | None = None,
) -> tuple[Figure, int]
Bar chart of per-id delta, sorted by the chosen Δ (biggest regressions on top).
The first two series are compared; the first is the baseline. The series axis is the
run-file by default (the first two files); pivot re-points it at a data dim, folding a
single run so its first two dim-values become the A and B series instead (see
load_long_df).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| snapshots | Snapshots | Run JSON path(s); only the first two are used (one run when | required |
| metric | Metric | Which metric to plot ( | 'time' |
| sort | SortMode | | 'absolute' |
| facet | str \| None | Dim to split into subplots. | None |
| clip | float \| None | Clamp the colour scale (default symmetric p95). | None |
| where | Mapping[str, str] \| None | Keep only rows matching these | None |
| free_axes | FreeAxes \| None | Give each facet its own axes instead of sharing. | None |
| labels | Sequence[str] \| None | Series names for the two runs (default: file stems). Ignored when | None |
| pivot | str \| None | Use this dim as the series axis instead of the run-file ( | None |

Returns:
| Type | Description |
|---|---|
| tuple[Figure, int] | |
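The difference between the two sort modes is easiest to see on a small example. The sketch below is not the library's code: sorted_deltas is a hypothetical helper that ranks ids by absolute Δ (native units) or relative Δ (percent of baseline), largest regression first, and it assumes every baseline value is positive.

```python
from typing import Dict, List


def sorted_deltas(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    sort: str = "absolute",
) -> List[str]:
    """Hypothetical: ids ordered by the chosen delta, biggest regression first."""
    if sort not in ("absolute", "relative"):
        raise ValueError(f"unknown sort mode: {sort!r}")
    rows = []
    for key in sorted(baseline.keys() & candidate.keys()):
        base, cand = baseline[key], candidate[key]
        absolute = cand - base               # native units
        relative = 100.0 * absolute / base   # percent of baseline (base > 0)
        rows.append((key, absolute, relative))
    column = 1 if sort == "absolute" else 2
    return [row[0] for row in sorted(rows, key=lambda r: r[column], reverse=True)]


baseline = {"parse": 10.0, "render": 1.0, "load": 4.0}
candidate = {"parse": 12.0, "render": 2.0, "load": 3.0}

by_absolute = sorted_deltas(baseline, candidate, sort="absolute")
by_relative = sorted_deltas(baseline, candidate, sort="relative")
```

parse gains the most in native units (+2.0), but render doubles, so it leads once the deltas are expressed in percent.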
plot_sweep ¶
plot_sweep(
snapshots: Snapshots,
*,
metric: Metric = "time",
clip: float | None = None,
where: Mapping[str, str] | None = None,
labels: Sequence[str] | None = None,
) -> tuple[Figure, int]
Heatmap of per-id fold-change (log2 ratio) vs the first snapshot.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| snapshots | Snapshots | Run JSON paths; columns in order, the first is the reference. | required |
| metric | Metric | Which metric to plot ( | 'time' |
| clip | float \| None | Clamp the colour scale. | None |
| where | Mapping[str, str] \| None | Keep only rows matching these | None |
| labels | Sequence[str] \| None | Column (version) names (default: file stems). | None |

Returns:
| Type | Description |
|---|---|
| tuple[Figure, int] | |
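The heatmap cells are log2 ratios against the first snapshot, so +1 means "twice the reference" and -1 means "half". A minimal sketch of that arithmetic follows; fold_changes and the dict-per-snapshot input are assumptions for illustration, and representing ids missing from a later snapshot as None is a choice made here rather than documented behaviour.

```python
import math
from typing import Dict, List, Optional, Sequence


def fold_changes(
    columns: Sequence[Dict[str, float]],
) -> Dict[str, List[Optional[float]]]:
    """Hypothetical: id -> log2(value / reference) per snapshot, in order."""
    reference = columns[0]
    table = {}
    for key, ref_value in reference.items():
        row = []
        for column in columns:
            value = column.get(key)
            # An id absent from this snapshot leaves an empty cell.
            row.append(None if value is None else math.log2(value / ref_value))
        table[key] = row
    return table


columns = [
    {"parse": 2.0, "load": 8.0},  # reference (first snapshot)
    {"parse": 4.0, "load": 8.0},
    {"parse": 1.0},
]
table = fold_changes(columns)
```

The reference column is always 0.0, since log2(1) is 0.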
Sweeps — pytest_benchmem.sweep¶
See Cross-version sweeps for the narrative, the Venv object, and the
provision parameters.
sweep ¶
sweep(
versions: Sequence[str],
run: Callable[[Venv], None],
**provision_kwargs: object,
) -> list[str]
Provision a venv per version and call run(venv) in each.
run does whatever the consumer needs (invoke pytest / a memory command
with venv.python and cwd=venv.cwd). Returns the list of versions
that failed to provision.
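The contract described above (provision per version, call run(venv), report the versions that failed to provision) can be mimicked without creating real environments. Everything in this sketch is a stand-in: FakeVenv, fake_provision, sweep_sketch, and the paths are hypothetical, and only the python and cwd attribute names come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class FakeVenv:
    """Hypothetical stand-in for Venv; only python and cwd are documented."""

    version: str
    python: str
    cwd: str


def fake_provision(version: str) -> FakeVenv:
    """Pretend to build an environment; versions starting with 'bad' fail."""
    if version.startswith("bad"):
        raise RuntimeError(f"cannot provision {version}")
    return FakeVenv(version, f"/tmp/venvs/{version}/bin/python", "/tmp/work")


def sweep_sketch(
    versions: Sequence[str], run: Callable[[FakeVenv], None]
) -> List[str]:
    """Call run(venv) per provisioned version; return the ones that failed."""
    failed = []
    for version in versions:
        try:
            venv = fake_provision(version)
        except Exception:
            failed.append(version)  # provisioning failed: skip run()
            continue
        run(venv)
    return failed


seen = []
failed = sweep_sketch(["1.0", "bad-2.0", "3.0"], lambda venv: seen.append(venv.python))
```

In a real callback, run would typically launch a subprocess with venv.python as the interpreter and cwd=venv.cwd, as described above.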