Introducing BenchProctor: a SAST benchmark you can't game

A SAST tool is only as trustworthy as its accuracy, and accuracy is meaningless without ground truth. You can run a scanner against your own codebase all day, but unless someone has already labeled every line as vulnerable or safe, you have no idea whether a clean report means clean code or a blind tool.

That is what a benchmark is for: a body of code where the answer is known in advance, so you can ask two precise questions of any tool. Does it find the real bug, and does it stay quiet on code that is actually fine? BenchProctor exists to answer both, for any tool, across nine languages, with a corpus built so the score can’t be inflated by memorization or naming tricks.

The benchmarks we had were measuring the wrong thing

Most public SAST benchmarks share the same three weaknesses:

They are hand-authored and frozen. A fixed set of human-written cases gets published once and never changes. Tools, and increasingly the models behind them, overfit to it. A high score starts to mean “has seen this corpus,” not “analyzes code well.”
They leak the answer in the filename. When a test lives at sqli/BenchmarkTest01729_true_positive.java, a scanner can score well by pattern-matching the path. You are no longer measuring analysis.
They cover one language, one shape. Single-file, Java-only suites with no sanitizers don’t resemble the findings that matter: taint that crosses files, services, and languages, sitting next to defenses that almost work.
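
The filename leak is easy to demonstrate. The sketch below is a toy, not a real scanner: it never opens a file and looks only at the path. The first path is the example quoted above; every other path, and both corpus dictionaries, are invented here for illustration.

```python
import re

# Hypothetical corpus listings: path -> ground truth (True = vulnerable).
# Only the first path comes from the text above; the rest are made up.
LEAKY_CORPUS = {
    "sqli/BenchmarkTest01729_true_positive.java": True,
    "sqli/BenchmarkTest01730_false_positive.java": False,
    "xss/BenchmarkTest00412_true_positive.java": True,
    "xss/BenchmarkTest00413_false_positive.java": False,
}

# The same labels behind names that carry no hint (also made up).
OPAQUE_CORPUS = {
    "cases/f3a9c1.java": True,
    "cases/07be42.java": False,
    "cases/c81d9e.java": True,
    "cases/5a60f7.java": False,
}


def path_only_scanner(path):
    """Report a finding based on the path alone, never the file contents."""
    return re.search(r"true_positive", path) is not None


def accuracy(corpus, scanner):
    """Fraction of files where the scanner's verdict matches the label."""
    correct = sum(1 for path, label in corpus.items() if scanner(path) == label)
    return correct / len(corpus)


if __name__ == "__main__":
    print("leaky names: ", accuracy(LEAKY_CORPUS, path_only_scanner))
    print("opaque names:", accuracy(OPAQUE_CORPUS, path_only_scanner))
```

On the leaky listing the path matcher is right every time without analyzing anything. On the opaque listing it flags nothing, so it is right only on the safe half.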

What BenchProctor does differently

Real taint flows, not hand-written snippets. Each vulnerability class is modeled as a taint flow: untrusted input enters at a source, moves through a propagator, and reaches a dangerous call, the sink. A sanitizer is what would make that flow safe. A vulnerable case is missing an effective sanitizer; its safe twin has one. Every case is idiomatic code constrained to a realistic flow, so a scanner never gets a free win from a nonsense pairing.
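
To make the source, propagator, sink, and sanitizer vocabulary concrete, here is a minimal sketch of a vulnerable case and its safe twin for SQL injection. It is not taken from the BenchProctor corpus: the function names, table, and data are assumptions made for this illustration, parameter binding stands in for the sanitizer, and the explanatory comments are exactly the kind of hint real corpus files omit.

```python
import sqlite3


def make_db():
    """Build a throwaway in-memory database (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def normalize(value):
    # Propagator: the value is transformed, but it is still attacker-controlled.
    return value.strip()


def lookup_vulnerable(conn, raw_name):
    name = normalize(raw_name)  # source -> propagator
    # No effective sanitizer: tainted text is concatenated into the query.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]  # sink


def lookup_safe(conn, raw_name):
    name = normalize(raw_name)  # same source and propagator
    # Defense: the value is bound as a parameter, so it is never parsed as SQL.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]  # same sink


if __name__ == "__main__":
    payload = "x' OR '1'='1"
    conn = make_db()
    print("vulnerable:", lookup_vulnerable(conn, payload))
    print("safe:      ", lookup_safe(conn, payload))
```

The two functions differ only in how the tainted value reaches the sink. A scanner has to follow the flow to tell them apart.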

Anti-leakage by construction. Emitted test files contain no comments, no CWE tags, no category names, and no hints in their identifiers. A file name tells you nothing about its category or its label. The only ground truth is a separate CSV answer key that the scanner never reads.

Polyglot. Nine languages (Python, Go, Java, JavaScript, TypeScript, PHP, Ruby, Bash, and Rust) across eighteen web frameworks, each rendered in idioms a developer in that ecosystem would actually write.

Rotated every quarter. The test code changes completely between quarters while every scoring-relevant invariant stays constant. You can’t pre-train against next quarter’s files, and last quarter’s score stays comparable. (More on that in a future post.)

Machine-verifiable. Every label ships with the code that justifies it, so you can answer “why is this a true positive?” from the case itself. The labels are verified before a release goes out, and a flawless tool would score a perfect mark against the answer key, which confirms that the labels and the scorer agree.

Running it takes three commands

BenchProctor is tool-agnostic. Anything that emits SARIF 2.1.0 can be scored.

# 1. run your scanner, export SARIF 2.1.0
your-tool scan ./corpus --format sarif -o results.sarif

# 2. score against the answer key
python score_sarif.py results.sarif corpus/expectedresults-*.csv

# 3. read your TPR, FPR, and Youden's J

The scorer is a single standard-library Python file. No dependencies, nothing to trust but source you can read in a sitting.
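
For readers new to the metrics: TPR is the share of vulnerable cases a tool flags, FPR is the share of safe cases it flags, and Youden's J is TPR minus FPR. A tool that flags everything and a tool that flags nothing both land at zero. The sketch below shows that arithmetic on top of the standard SARIF 2.1.0 result layout. It is not the code in score_sarif.py: the file-level matching, the in-memory answer key, and all function and file names are assumptions made for this illustration, and the real scorer and CSV layout may differ.

```python
def flagged_files(sarif):
    """Collect every artifact URI that appears in a SARIF 2.1.0 result location."""
    files = set()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            for location in result.get("locations", []):
                physical = location.get("physicalLocation", {})
                uri = physical.get("artifactLocation", {}).get("uri")
                if uri:
                    files.add(uri)
    return files


def score(flagged, answer_key):
    """Return (TPR, FPR, J) for a set of flagged files.

    answer_key maps file -> True if vulnerable, False if safe (assumed shape).
    """
    tp = sum(1 for f, vulnerable in answer_key.items() if vulnerable and f in flagged)
    fn = sum(1 for f, vulnerable in answer_key.items() if vulnerable and f not in flagged)
    fp = sum(1 for f, vulnerable in answer_key.items() if not vulnerable and f in flagged)
    tn = sum(1 for f, vulnerable in answer_key.items() if not vulnerable and f not in flagged)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr, tpr - fpr


def make_sarif(uris):
    """Build a minimal SARIF-shaped document for the demo (hypothetical tool output)."""
    results = [
        {
            "ruleId": "demo-rule",
            "locations": [{"physicalLocation": {"artifactLocation": {"uri": uri}}}],
        }
        for uri in uris
    ]
    return {"version": "2.1.0", "runs": [{"results": results}]}


# Hypothetical answer key: four vulnerable cases, four safe ones.
ANSWER_KEY = {
    "a.py": True, "b.py": True, "c.py": True, "d.py": True,
    "e.py": False, "f.py": False, "g.py": False, "h.py": False,
}

if __name__ == "__main__":
    sarif = make_sarif(["a.py", "b.py", "c.py", "e.py"])
    tpr, fpr, j = score(flagged_files(sarif), ANSWER_KEY)
    print(f"TPR={tpr:.2f} FPR={fpr:.2f} J={j:.2f}")
```

In this made-up run the tool finds three of four real bugs and raises one false alarm on four safe files, giving TPR 0.75, FPR 0.25, and J 0.50.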

Free, and staying that way

The corpus and the scorer are released under Apache 2.0. No signup, no account, no telemetry. Corpora are versioned and published quarterly. Grab the current release, score your tool, and tell us where the corpus is wrong: the labels are all in the open, receipts included.

This blog is where we’ll write up the methodology in detail: how rotation preserves comparability, how scoring resists gaming, and what a corpus full of broken sanitizers and cross-service taint actually exercises in a scanner. Next up: why static benchmarks rot, and what quarterly rotation fixes.