Java first: why we release one language at a time

BenchProctor’s corpus spans nine languages: Python, Go, Java, JavaScript, TypeScript, PHP, Ruby, Bash, and Rust. Each covers at least two real frameworks (for example, Spring and Jakarta EE for Java; Express and Koa for JavaScript; Flask, Django, and FastAPI for Python).

We are not releasing all nine at once. Here’s the honest reason.

“Supports” is not “production-ready”

A generator that emits Rust or Ruby is not the same thing as a corpus we’d stake a tool’s accuracy score on. The whole value of a benchmark is that its answer key is correct. If even a small fraction of the labels are wrong, every number you compute against it is quietly wrong too, and a benchmark that gives false confidence is worse than no benchmark at all.

So before a language goes public, it has to clear a hard bar:

Compile-clean across the real toolchain for every framework, not just “looks like valid syntax.”
Every label independently verified. Each file’s label comes with the code that justifies it, and a flawless tool has to score a perfect Youden’s J of 1.0 (every true positive found, no false positives) against the answer key. If the labels and the scorer ever disagree, that check fails first.
No leakage, per file. No comments, no CWE tags, no naming hints, and file names that give away neither category nor label.
Idioms a real developer would actually write, framework by framework, not toy snippets that no production codebase resembles.
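
To make the leakage criterion concrete, here is a minimal sketch of what a per-file check could look like. It is not BenchProctor’s actual checker: the function name, the regular expressions, and the list of hinting words are all assumptions chosen for illustration, and the comment detection is deliberately naive (a `//` inside a string literal such as a URL would trigger a false alarm).

```python
import re

# All patterns below are illustrative assumptions, not BenchProctor's rules.
COMMENT = re.compile(r"//|/\*")  # Java line-comment or block-comment opener
CWE_TAG = re.compile(r"CWE-?\d+", re.IGNORECASE)
HINT_WORDS = re.compile(r"vuln|safe|secure|sanitiz|bad|good", re.IGNORECASE)


def leaks(file_name: str, source: str) -> list[str]:
    """Return the reasons a file might give away its label (empty if none)."""
    reasons = []
    if COMMENT.search(source):
        reasons.append("comment marker in source")
    if CWE_TAG.search(source) or CWE_TAG.search(file_name):
        reasons.append("CWE tag")
    # Hint words are checked against the file name only; applying them to
    # the source would flag ordinary identifiers and framework imports.
    if HINT_WORDS.search(file_name):
        reasons.append("hinting word in file name")
    return reasons
```

A file passes only if the returned list is empty. A real check would need a proper lexer to tell comments from string literals and would also have to inspect identifiers inside the source, which this sketch leaves out.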

Until a language clears that bar, we hold it back. We’d rather ship one rock-solid language than nine shaky ones.

Where each language stands

Java is production-ready. Spring (Boot 4 / Framework 7 / Security 7) and Jakarta EE 11. This is the most thoroughly verified language in the corpus, and it launches fully to the public before the end of June 2026.
Python is close behind. Flask, Django, and FastAPI. It’s the next language we expect to clear the bar.
The rest (Go, Rust, PHP, Ruby, JavaScript, TypeScript, Bash) follow as each is verified to the same standard. They exist in the generator today; they ship publicly when we can stand behind every label.

Honesty is the product

It would be easy to publish a “nine-language benchmark” headline and let people discover the rough edges themselves. That’s not the deal. A benchmark earns trust by being right, and being right about Java first is worth more than being approximately right about everything.

When Java lands by the end of June 2026, you’ll be able to point any SAST tool that emits SARIF 2.1.0 at it, run the one-file scorer, and get a true-positive rate, a false-positive rate, and a Youden’s J you can recompute by hand. The other languages will arrive the same way, only once the answer key is one we’d defend line by line.
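
For readers who want to see what “recompute by hand” involves, the arithmetic is short: Youden’s J is the true-positive rate minus the false-positive rate. The sketch below is not the BenchProctor scorer. It assumes each file carries a binary label (vulnerable or safe), the class and function names are hypothetical, and the counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from comparing a tool's findings with the answer key."""
    tp: int  # vulnerable files the tool flagged
    fn: int  # vulnerable files the tool missed
    fp: int  # safe files the tool flagged
    tn: int  # safe files the tool left alone


def rates(c: Confusion) -> tuple[float, float, float]:
    """Return (TPR, FPR, Youden's J) for one confusion matrix."""
    positives = c.tp + c.fn
    negatives = c.fp + c.tn
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one vulnerable and one safe file")
    tpr = c.tp / positives
    fpr = c.fp / negatives
    return tpr, fpr, tpr - fpr


# Hypothetical counts, not real BenchProctor results.
tpr, fpr, j = rates(Confusion(tp=80, fn=20, fp=10, tn=90))
print(f"TPR={tpr:.2f} FPR={fpr:.2f} J={j:.2f}")  # TPR=0.80 FPR=0.10 J=0.70
```

A tool that matches the answer key exactly has no misses and no false alarms, so its TPR is 1, its FPR is 0, and its J is 1.0. A tool that flags files at random lands near J = 0.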