Java first: why we release one language at a time
BenchProctor's corpus spans nine languages, but we publish each only once it's verified production-ready. Java ships fully to the public before the end of June 2026 — here's why we're not dumping all nine at once.
BenchProctor’s corpus spans nine languages (Python, Go, Java, JavaScript, TypeScript, PHP, Ruby, Bash, and Rust), each with two or more real frameworks: Spring and Jakarta EE for Java; Express and Koa for JavaScript; Flask, Django, and FastAPI for Python. The generator produces all of them.
We are not releasing all nine at once. Here’s the honest reason.
“Supports” is not “production-ready”
A generator that emits Rust or Ruby is not the same thing as a corpus we’d stake a tool’s accuracy score on. The whole value of a benchmark is that its answer key is correct. If even a small fraction of the labels are wrong, every number you compute against it is quietly wrong too — and a benchmark that gives false confidence is worse than no benchmark at all.
So before a language goes public, it has to clear a hard bar:
- Compile-clean across the real toolchain for every framework, not just “looks like valid syntax.”
- Every label independently verified — the proof manifest names the exact source, propagator, sanitizer, and sink for each file, and the bundled self-test SARIF must score a perfect Youden’s J against the answer key. If the labels and the scorer ever disagree, that check fails first.
- Anti-leakage enforced per file — no comments, no CWE tags, no naming hints, shuffled IDs.
- Idioms a real developer would actually write, framework by framework — not toy snippets that no production codebase resembles.
Until a language clears that bar in public, we hold it back. We’d rather ship one rock-solid language than nine shaky ones.
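To make the anti-leakage item concrete, here is a minimal sketch of what a per-file check could look like. It is not BenchProctor's actual checker: the patterns, the hint-word list, and the function name `leak_findings` are all assumptions made for illustration, and the comment detection is deliberately naive (it would also flag `//` inside a string literal such as a URL).

```python
import re

# Hypothetical leak patterns, for illustration only.
CWE_TAG = re.compile(r"CWE-?\d+", re.IGNORECASE)
HINT_WORDS = re.compile(r"vulnerable|exploit|insecure", re.IGNORECASE)
JAVA_COMMENT = re.compile(r"//|/\*")  # naive: ignores string-literal context


def leak_findings(filename: str, source: str) -> list[str]:
    """Return the kinds of answer-key leakage found in one corpus file."""
    findings = []
    if JAVA_COMMENT.search(source):
        findings.append("comment")
    if CWE_TAG.search(filename) or CWE_TAG.search(source):
        findings.append("cwe-tag")
    if HINT_WORDS.search(filename) or HINT_WORDS.search(source):
        findings.append("naming-hint")
    return findings


# A neutral file name and a comment-free body pass cleanly.
print(leak_findings("File0431.java", "class A { int x = 1; }"))  # prints []
```

A file would be held back if this list came back non-empty. A real check would need a proper lexer for each language rather than regular expressions.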
Where each language stands
- Java — production-ready. Spring (Boot 4 / Framework 7 / Security 7) and Jakarta EE 11. This is the most thoroughly verified language in the corpus, and it launches fully to the public before the end of June 2026.
- Python — close behind. Flask, Django, and FastAPI. It’s the next language we expect to clear the bar.
- The rest — Go, Rust, PHP, Ruby, JavaScript, TypeScript, Bash — follow as each is verified to the same standard. They exist in the generator today; they ship publicly when we can stand behind every label.
Honesty is the product
It would be easy to publish a “nine-language benchmark” headline and let people discover the rough edges themselves. That’s not the deal. A benchmark earns trust by being right, and being right about Java first is worth more than being approximately right about everything.
When Java lands before the end of June 2026, you’ll be able to point any SAST tool that emits SARIF 2.1.0 at it, run the one-file scorer, and get a true-positive rate, a false-positive rate, and a Youden’s J you can recompute by hand. The other languages will arrive the same way: only when the answer key is one we’d defend line by line.
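For readers who want to see the arithmetic, Youden's J is simply the true-positive rate minus the false-positive rate. The sketch below is not the BenchProctor scorer. It assumes per-file scoring (a file counts as flagged if any SARIF result points at it) and a hypothetical answer-key shape that maps each file to `True` when it is vulnerable. The SARIF fields it reads (`runs`, `results`, `locations`, `physicalLocation.artifactLocation.uri`) are standard SARIF 2.1.0. The names `flagged_files` and `score` are illustrative.

```python
def flagged_files(sarif: dict) -> set[str]:
    """Collect every file URI that any result in a SARIF log points at."""
    flagged = set()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            for loc in result.get("locations", []):
                uri = (loc.get("physicalLocation", {})
                          .get("artifactLocation", {})
                          .get("uri"))
                if uri:
                    flagged.add(uri)
    return flagged


def score(answer_key: dict[str, bool], flagged: set[str]) -> tuple[float, float, float]:
    """Return (TPR, FPR, Youden's J); files outside the answer key are ignored."""
    tp = sum(1 for f, vuln in answer_key.items() if vuln and f in flagged)
    fn = sum(1 for f, vuln in answer_key.items() if vuln and f not in flagged)
    fp = sum(1 for f, vuln in answer_key.items() if not vuln and f in flagged)
    tn = sum(1 for f, vuln in answer_key.items() if not vuln and f not in flagged)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("answer key needs both vulnerable and safe files")
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr, tpr - fpr


# Hypothetical key: f01-f04 are vulnerable, f05-f08 are safe.
answer_key = {f"f{i:02d}.java": i <= 4 for i in range(1, 9)}
# A made-up tool run that finds three real issues and raises one false alarm.
tool_flags = ["f01.java", "f02.java", "f03.java", "f05.java"]
sarif = {"version": "2.1.0", "runs": [{"results": [
    {"locations": [{"physicalLocation": {"artifactLocation": {"uri": u}}}]}
    for u in tool_flags]}]}

tpr, fpr, j = score(answer_key, flagged_files(sarif))
print(tpr, fpr, j)  # prints 0.75 0.25 0.5
```

By hand: 3 of 4 vulnerable files flagged gives a TPR of 0.75, 1 of 4 safe files flagged gives an FPR of 0.25, so J = 0.75 - 0.25 = 0.5. A run that flags exactly the vulnerable files and nothing else scores J = 1.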