Introducing BenchProctor: a SAST benchmark you can't game

A polyglot, anti-leakage, quarterly-rotated corpus for measuring how accurately a static analysis tool actually finds vulnerabilities — and how often it cries wolf.

A SAST tool is only as trustworthy as its accuracy, and accuracy is meaningless without ground truth. You can run a scanner against your own codebase all day, but unless someone has already labeled every line as vulnerable or safe, you have no idea whether a clean report means clean code or a blind tool.

That is what a benchmark is for: a body of code where the answer is known in advance, so you can ask two precise questions of any tool. Does it find the real bug? And does it flag code that is actually fine? BenchProctor exists to answer both — for any tool, across nine languages, with a corpus designed so the score can’t be inflated by memorization or naming tricks.

The benchmarks we had were measuring the wrong thing

Most public SAST benchmarks share the same three weaknesses:

  • They are hand-authored and frozen. A fixed set of human-written cases gets published once and never changes. Tools — and increasingly, the models behind them — overfit to it. A high score starts to mean “has seen this corpus,” not “analyzes code well.”
  • They leak the answer in the filename. When a test lives at sqli/BenchmarkTest01729_true_positive.java, a scanner can score well by pattern-matching the path. You are no longer measuring analysis.
  • They cover one language, one shape. Single-file, Java-only suites with no sanitizers don’t resemble the findings that matter: taint that crosses files, services, and languages, sitting next to defenses that almost work.

What BenchProctor does differently

Combinatorial, not hand-written. Each vulnerability class is expressed as a taint flow over four axes — where untrusted input enters (the source), how it moves (the propagator), what would make it safe (the sanitizer), and the dangerous call it reaches (the sink). The corpus is assembled by combining those building blocks into concrete, idiomatic code. A vulnerable case omits an effective sanitizer; its safe twin applies one. The space is large by construction, and every combination is constrained to a realistic flow.
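To make the four-axis idea concrete, here is a minimal sketch of how such a space could be enumerated. The building-block names below are hypothetical placeholders, not the real corpus vocabulary, and the actual generator also renders idiomatic code and filters out unrealistic flows, which this sketch does not attempt.

```python
from itertools import product

# Hypothetical building blocks, for illustration only.
SOURCES = ["http_query_param", "http_header"]
PROPAGATORS = ["string_concat", "dict_roundtrip"]
SANITIZERS = [None, "parameterized_query"]  # None means no effective sanitizer
SINKS = ["sql_execute"]


def enumerate_cases():
    """Yield one case description per (source, propagator, sanitizer, sink)."""
    for source, propagator, sanitizer, sink in product(
        SOURCES, PROPAGATORS, SANITIZERS, SINKS
    ):
        yield {
            "source": source,
            "propagator": propagator,
            "sanitizer": sanitizer,
            "sink": sink,
            # A case is vulnerable exactly when no effective sanitizer is applied.
            "vulnerable": sanitizer is None,
        }


cases = list(enumerate_cases())
print(len(cases), "cases,", sum(c["vulnerable"] for c in cases), "vulnerable")
```

Each vulnerable case has a safe twin that differs only in the sanitizer axis, which is what lets a scorer separate "found the bug" from "flagged the pattern."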

Anti-leakage by construction. Emitted test files contain no comments, no CWE tags, no category names, and no hints in their identifiers. File IDs are shuffled, so a filename tells you nothing about a file’s category or its label. The only ground truth is a separate CSV answer key that the scanner never reads.
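As a rough illustration of that separation, the sketch below assigns shuffled numeric IDs to cases and writes the labels to a separate CSV. The file-name pattern, the category names, and the CSV columns are assumptions made for the example, not the real corpus layout.

```python
import csv
import io
import random

# Hypothetical cases; category names and CSV columns are illustrative.
cases = [
    {"category": "sqli", "vulnerable": True},
    {"category": "sqli", "vulnerable": False},
    {"category": "xss", "vulnerable": True},
    {"category": "xss", "vulnerable": False},
]


def assign_opaque_ids(cases, seed):
    """Return (file names, answer-key CSV text) with IDs unrelated to labels."""
    rng = random.Random(seed)
    ids = list(range(1, len(cases) + 1))
    rng.shuffle(ids)  # break any link between generation order and file ID

    names = []
    key = io.StringIO()
    writer = csv.writer(key)
    writer.writerow(["file", "category", "vulnerable"])
    for case, file_id in zip(cases, ids):
        name = f"case_{file_id:05d}.py"  # no category or label in the name
        names.append(name)
        writer.writerow([name, case["category"], str(case["vulnerable"]).lower()])
    return names, key.getvalue()


names, answer_key = assign_opaque_ids(cases, seed=2024)
```

The scanner only ever sees the files in `names`; the text in `answer_key` stays with the scorer.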

Polyglot. Nine languages — Python, Go, Java, JavaScript, TypeScript, PHP, Ruby, Bash, and Rust — across eighteen web frameworks, each rendered in idioms a developer in that ecosystem would actually write.

Rotated every quarter. Each release is generated from its own fixed seed. The seed determines which combinations are emitted, while every scoring-relevant invariant is held constant across releases. You can’t pre-train against next quarter’s files, and last quarter’s score stays comparable. (More on that mechanism in a future post.)
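The real mechanism is a topic for that later post. Purely as an illustration of the idea, one way to vary the files while keeping the release's shape fixed is seeded, stratified sampling: draw the same number of cases from each (category, label) group, with the seed deciding which ones. Everything in this sketch, including the pool, the strata, and the function names, is an assumption.

```python
import random
from collections import Counter

# Hypothetical candidate pool: 2 categories x 2 labels x 10 variants each.
pool = [
    {"category": cat, "vulnerable": vuln, "variant": i}
    for cat in ("sqli", "xss")
    for vuln in (True, False)
    for i in range(10)
]


def select_release(pool, per_stratum, seed):
    """Pick per_stratum cases from each (category, label) group, seeded."""
    rng = random.Random(seed)
    strata = {}
    for case in pool:
        strata.setdefault((case["category"], case["vulnerable"]), []).append(case)
    release = []
    for key in sorted(strata):  # fixed iteration order keeps runs reproducible
        release.extend(rng.sample(strata[key], per_stratum))
    return release


def shape(release):
    """Count cases per (category, label): the invariant held across releases."""
    return Counter((c["category"], c["vulnerable"]) for c in release)


q1 = select_release(pool, per_stratum=3, seed=1)
q2 = select_release(pool, per_stratum=3, seed=2)
print(shape(q1) == shape(q2))
```

The same seed always reproduces the same release, and any two seeds yield releases with identical per-group counts.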

Machine-verifiable. Every corpus ships a proof manifest: one record per file naming the exact source, propagator, sanitizer, sink, difficulty, the sink’s line number, and a SHA-256 of the file. You can answer “why is this a true positive?” from metadata alone, and a bundled self-test scores a perfect mark, confirming that the answer key and the scorer agree.
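A check against such a manifest could look like the sketch below. The record's key names and values are hypothetical (the post lists the fields but not the on-disk schema), and the file content is an in-memory stand-in so the example runs on its own.

```python
import hashlib

# Stand-in for the bytes of one corpus file.
file_bytes = b"print('placeholder corpus file')\n"

# Hypothetical manifest record; key names and values are assumptions.
record = {
    "file": "case_00042.py",
    "source": "http_query_param",
    "propagator": "string_concat",
    "sanitizer": None,
    "sink": "sql_execute",
    "difficulty": "medium",
    "sink_line": 17,
    "sha256": hashlib.sha256(file_bytes).hexdigest(),
}


def verify_record(record, file_bytes):
    """Return True if the file content matches the hash in its manifest record."""
    return hashlib.sha256(file_bytes).hexdigest() == record["sha256"]


print(verify_record(record, file_bytes))         # unmodified file
print(verify_record(record, file_bytes + b"#"))  # tampered file
```

A matching hash ties the label and its explanation to the exact bytes that were scanned, so a disputed finding can be traced to a specific source, sanitizer, and sink.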

Running it takes three steps

BenchProctor is tool-agnostic. Anything that emits SARIF 2.1.0 can be scored.

# 1. run your scanner, export SARIF 2.1.0
your-tool scan ./corpus --format sarif -o results.sarif

# 2. score against the answer key
python score_sarif.py results.sarif corpus/expectedresults-*.csv

# 3. read TPR, FPR, and Youden's J (TPR minus FPR)

The scorer is a single standard-library Python file — no dependencies, nothing to trust but source you can read in a sitting.
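For readers new to the three numbers, here is a minimal sketch of how they relate. It is not the code in score_sarif.py: it assumes a simplified per-file view in which a file counts as flagged if the tool reported anything in it, and the file names are made up.

```python
def score(expected, flagged):
    """Compute (TPR, FPR, Youden's J) from per-file labels.

    expected: dict mapping file name -> True if the file is vulnerable
    flagged:  set of file names the tool reported
    """
    tp = sum(1 for f, vuln in expected.items() if vuln and f in flagged)
    fn = sum(1 for f, vuln in expected.items() if vuln and f not in flagged)
    fp = sum(1 for f, vuln in expected.items() if not vuln and f in flagged)
    tn = sum(1 for f, vuln in expected.items() if not vuln and f not in flagged)

    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of real bugs found
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of safe files flagged
    return tpr, fpr, tpr - fpr                  # Youden's J = TPR - FPR


# Hypothetical answer key and tool output.
expected = {"a.py": True, "b.py": True, "c.py": False, "d.py": False}
flagged = {"a.py", "b.py", "c.py"}

print(score(expected, flagged))  # (1.0, 0.5, 0.5)
```

A tool that flags everything gets TPR 1.0 and FPR 1.0, so J is 0; a tool that flags nothing also gets J of 0. Only a tool that separates vulnerable files from their safe twins moves J toward 1.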

Free, and staying that way

The corpus and the scorer are released under Apache 2.0. No signup, no account, no telemetry. Corpora are versioned and published quarterly; grab the current release, score your tool, and tell us where it’s wrong — the labels are all in the open, receipts included.

This blog is where we’ll write up the methodology in detail: how rotation preserves comparability, how scoring resists gaming, and what a corpus full of broken sanitizers and cross-service taint actually exercises in a scanner. Next up: why static benchmarks rot, and what quarterly rotation fixes.