BenchProctor blog

Engineering notes

How we build a SAST benchmark you can't game: methodology, scoring, coverage, and the occasional war story from generating millions of labeled test cases.

Jun 28, 2026 · scoring, methodology

Reading a SAST scorecard

A single SAST score can hide as much as it shows. Here is how to read recall, false positives, and the traps that make a good tool look bad.

Read post →

Jun 26, 2026 · methodology, rotation

The 2026.2 corpus: versioned and rotated

A frozen benchmark gets memorized. BenchProctor ships a fresh corpus every quarter, so a high score means a tool analyzes code well, not that it has seen the test before.

Read post →

Jun 24, 2026 · sarif, scoring, interoperability

Bring any scanner's SARIF and we'll find the CWE

The BenchProctor scorer is a single Python file that uses only the standard library. The catch most benchmarks trip on is that tools don't report CWEs the same way, so the scorer recovers the CWE from wherever your tool actually writes it, with no per-tool adapter, and grades every tool on the same footing.
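To make the idea concrete, here is a minimal standard-library sketch of recovering CWE numbers from a SARIF log. It is an illustration, not the BenchProctor scorer itself: the function names (`recover_cwes`, `cwes_per_result`), the search order, and the regular expression are all assumptions. The SARIF fields it inspects (`ruleId`, `taxa`, `properties`, and the rule's `relationships` and `name`) are standard SARIF 2.1.0 properties.

```python
import re

# Assumption: a CWE is written as "CWE-89", "cwe_089", "cwe 79", or similar.
CWE_RE = re.compile(r"cwe[-_ ]?(\d+)", re.IGNORECASE)


def _cwes_in(value):
    """Collect CWE numbers from every string nested inside value."""
    if isinstance(value, str):
        return {int(n) for n in CWE_RE.findall(value)}
    if isinstance(value, dict):
        return set().union(*(_cwes_in(v) for v in value.values()))
    if isinstance(value, list):
        return set().union(*(_cwes_in(v) for v in value))
    return set()


def recover_cwes(result, rules_by_id):
    """Return the CWEs for one SARIF result, checking likely places in order."""
    rule = rules_by_id.get(result.get("ruleId"), {})
    # Hypothetical priority: result-level fields first, then rule metadata.
    candidates = [
        result.get("ruleId"),
        result.get("taxa"),
        result.get("properties"),
        rule.get("properties"),
        rule.get("relationships"),
        rule.get("name"),
    ]
    for candidate in candidates:
        found = _cwes_in(candidate)
        if found:
            return found
    return set()


def cwes_per_result(sarif):
    """Map every result in a parsed SARIF log to its set of CWE numbers."""
    recovered = []
    for run in sarif.get("runs", []):
        driver = run.get("tool", {}).get("driver", {})
        rules_by_id = {r.get("id"): r for r in driver.get("rules", [])}
        for result in run.get("results", []):
            recovered.append(recover_cwes(result, rules_by_id))
    return recovered
```

Given a log parsed with `json.load`, `cwes_per_result(log)` returns one set per result. A result whose rule carries a tag such as `external/cwe/cwe-089` and a result whose `ruleId` is `CWE-79` both resolve without any tool-specific code.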

Read post →

Jun 22, 2026 · announcement, release

Java and Python are live: point a scanner at them, get a real number

BenchProctor's first two languages ship today: Java (Spring, Jakarta EE) and Python (Flask, Django, FastAPI). Each is standalone, covers every supported CWE, and is balanced 50/50. The answer key can't be read off a filename, and scoring works from any tool's SARIF.

Read post →

Jun 16, 2026 · methodology, sast, benchmarking

What makes a SAST test actually hard

A benchmark only means something if a pattern-matcher can't ace it. Here's what's inside a corpus designed to be hard: real framework idioms instead of toy snippets, sanitizers that are present but broken, taint that travels several steps, and a strict 50/50 split that makes 'flag everything' score zero.

Read post →

Jun 11, 2026 · methodology, correctness, benchmarking

A wrong answer key punishes the tools that get it right

A SAST benchmark with a mislabeled 'safe' file doesn't just measure wrong. It scores a correct finding as a false positive and penalizes the tool that was right. BenchProctor won't let a misplaced label ship.

Read post →

Jun 9, 2026 · ecosystem, benchmark

Proof, not vibes: the yardstick the whole stack answers to

SAST vendors grade their own homework. BenchProctor is the open, machine-verifiable benchmark that gives any tool a real number, and it's the proof layer the rest of the stack has to answer to.

Read post →

May 30, 2026 · release, roadmap

Java first: why we release one language at a time

BenchProctor's corpus spans nine languages, but we publish each only once it's verified production-ready. Java ships fully to the public before the end of June 2026. Here's why we're not dumping all nine at once.

Read post →

May 28, 2026 · scoring, methodology

How BenchProctor scores a SAST tool

The whole scoring model is a confusion matrix and one subtraction. Here's how true-positive and false-positive rates become a single number, why we average per category, and how the benchmark checks itself.
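As a rough illustration of the summary above, the sketch below assumes the "one subtraction" is true-positive rate minus false-positive rate, and that per-category scores are averaged with equal weight. The full post is the authority on the exact formula; the function names and the input shape here are hypothetical.

```python
from collections import defaultdict


def category_score(tp, fn, fp, tn):
    """Score one category as TPR - FPR (assumed formula)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr - fpr


def benchmark_score(results):
    """Average per-category scores.

    results: iterable of (category, is_vulnerable, was_flagged) tuples,
    one per test case. This input shape is an assumption for the sketch.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for category, is_vulnerable, was_flagged in results:
        if is_vulnerable:
            key = "tp" if was_flagged else "fn"
        else:
            key = "fp" if was_flagged else "tn"
        counts[category][key] += 1

    per_category = {c: category_score(**m) for c, m in counts.items()}
    if not per_category:
        return 0.0, {}
    overall = sum(per_category.values()) / len(per_category)
    return overall, per_category
```

Under these assumptions, a tool that flags every test case gets a true-positive rate of 1 and a false-positive rate of 1 in each category, so its score is 0.0. A tool that flags exactly the vulnerable cases scores 1.0.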

Read post →

May 24, 2026 · methodology, benchmarking

Why static SAST benchmarks rot, and what quarterly rotation fixes

A frozen benchmark measures memorization as much as analysis. Here's the failure mode, and how rotating the corpus on a seed keeps scores honest without breaking comparability.

Read post →

May 20, 2026 · announcement, methodology

Introducing BenchProctor: a SAST benchmark you can't game

A polyglot, anti-leakage, quarterly-rotated corpus for measuring how accurately a static analysis tool actually finds vulnerabilities, and how often it cries wolf.

Read post →