How BenchProctor scores a SAST tool
The whole scoring model is a confusion matrix and one subtraction. Here's how true-positive and false-positive rates become a single number, why we average per category, and how the benchmark checks itself.
There is no machine learning in BenchProctor’s scoring, no learned judge, no opaque aggregate. It is a confusion matrix and one subtraction. That is deliberate: a benchmark you can’t audit is just another tool you have to trust.
Four buckets
Every test case has a known label — vulnerable or safe — in the CSV answer key. After your scanner runs, each case lands in one of four buckets:
                  detected    ignored
    vulnerable    TP          FN
    safe          FP          TN
- TP (true positive): vulnerable, and the tool flagged it. Good.
- FN (false negative): vulnerable, and the tool missed it. A real bug shipped.
- FP (false positive): safe, and the tool flagged it anyway. Noise that erodes trust.
- TN (true negative): safe, and the tool stayed quiet. Good.
A case counts as detected if the tool produced at least one finding whose SARIF location points at that test file. Nothing about the category or the line has to match — just that the tool flagged the file at all. That keeps scoring fair across tools with different reporting granularity.
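The detection rule and the four buckets can be sketched in a few lines of Python. This is a minimal illustration, not the actual `score_sarif.py`: the function names, the in-memory answer key, and the choice to match test files by base name are all assumptions made for the example. The only thing taken from the SARIF format itself is the path `runs[].results[].locations[].physicalLocation.artifactLocation.uri`.

```python
import os


def flagged_files(sarif: dict) -> set[str]:
    """Collect the base name of every file that has at least one finding."""
    names = set()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            for loc in result.get("locations", []):
                uri = (
                    loc.get("physicalLocation", {})
                    .get("artifactLocation", {})
                    .get("uri")
                )
                if uri:
                    # Assumption: a test case is identified by its file's base name.
                    names.add(os.path.basename(uri))
    return names


def bucket(is_vulnerable: bool, detected: bool) -> str:
    """Map a (label, detected) pair to its confusion-matrix bucket."""
    if is_vulnerable:
        return "TP" if detected else "FN"
    return "FP" if detected else "TN"


def confusion(labels: dict[str, bool], flagged: set[str]) -> dict[str, int]:
    """Count how many test cases land in each bucket."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for name, is_vulnerable in labels.items():
        counts[bucket(is_vulnerable, name in flagged)] += 1
    return counts


# Hypothetical answer key (True = vulnerable) and hypothetical scanner output.
labels = {
    "case001.py": True,
    "case002.py": False,
    "case003.py": True,
    "case004.py": False,
}
sarif = {"runs": [{"results": [
    {"ruleId": "sqli", "locations": [
        {"physicalLocation": {"artifactLocation": {"uri": "corpus/case001.py"}}}]},
    {"ruleId": "xss", "locations": [
        {"physicalLocation": {"artifactLocation": {"uri": "corpus/case002.py"}}}]},
]}]}

counts = confusion(labels, flagged_files(sarif))
print(counts)  # {'TP': 1, 'FN': 1, 'FP': 1, 'TN': 1}
```

Note that `ruleId` and line information are never consulted: a finding anywhere in the file is enough to mark the case as detected.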
Two rates and one score
From the four buckets come two rates:
    TPR = TP / (TP + FN)    detection rate: of the real bugs, how many caught?
    FPR = FP / (FP + TN)    false-alarm rate: of the safe code, how much flagged?
The headline score is Youden’s J, the difference between them:
J = TPR − FPR
It runs from +1.0 to −1.0:
- +1.0 — every vulnerability caught, zero false alarms. Perfect.
- 0.0 — no better than a coin flip. A flag-everything tool lands here.
- −1.0 — inverted: flags the safe code, misses the real bugs.
A single number that rewards detection and penalizes noise in equal measure is exactly what you want from an accuracy metric. A tool that finds every bug but drowns you in false positives is not a good tool, and its J score says so.
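As a rough sketch, the two rates and the score are a few lines of arithmetic. The function names and the counts below are hypothetical, and returning 0.0 for an empty class is an assumption of this example rather than documented scorer behavior.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """Return (TPR, FPR) from the four bucket counts."""
    # Assumption: a class with no cases contributes a rate of 0.0.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def youden_j(tp: int, fn: int, fp: int, tn: int) -> float:
    """Youden's J: detection rate minus false-alarm rate."""
    tpr, fpr = rates(tp, fn, fp, tn)
    return tpr - fpr


# Hypothetical tools scored on 100 vulnerable and 100 safe cases.
precise = youden_j(tp=90, fn=10, fp=5, tn=95)   # misses a few bugs, rarely cries wolf
noisy = youden_j(tp=100, fn=0, fp=80, tn=20)    # finds every bug, flags most safe code

print(f"precise tool J = {precise:.2f}")  # 0.85
print(f"noisy tool   J = {noisy:.2f}")    # 0.20
```

The noisy tool has the higher detection rate, yet its score is far lower, which is the trade-off the metric is meant to expose.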
Why flag-everything scores zero
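Suppose a tool reports every single file as vulnerable. It catches all the true positives (TPR = 1.0) and flags all the safe files too (FPR = 1.0). Its score is 1.0 − 1.0 = 0.0, and the laziest strategy collapses. Because each rate is computed within its own class, that result does not depend on the mix of vulnerable and safe cases; what the 50/50 balance of the corpus adds is that both rates are measured on equally many cases, so neither rests on a thinner sample. Subtracting FPR is what turns “flag everything” from a cheat into a wash.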
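The arithmetic above can be checked on a toy corpus. The sketch below assumes a hypothetical 100-case answer key, half vulnerable and half safe, and scores three degenerate "tools" expressed simply as the set of files they flag.

```python
def j_for_flagged(labels: dict[str, bool], flagged: set[str]) -> float:
    """Youden's J for a tool described only by the set of files it flags."""
    vulnerable = [name for name, is_vuln in labels.items() if is_vuln]
    safe = [name for name, is_vuln in labels.items() if not is_vuln]
    tpr = sum(name in flagged for name in vulnerable) / len(vulnerable)
    fpr = sum(name in flagged for name in safe) / len(safe)
    return tpr - fpr


# Hypothetical balanced corpus: even-numbered cases are vulnerable.
labels = {f"case{i:03d}": (i % 2 == 0) for i in range(100)}

flag_everything = set(labels)
flag_nothing = set()
perfect = {name for name, is_vuln in labels.items() if is_vuln}

print(j_for_flagged(labels, flag_everything))  # 0.0
print(j_for_flagged(labels, flag_nothing))     # 0.0
print(j_for_flagged(labels, perfect))          # 1.0
```

Flagging nothing fares no better than flagging everything: both strategies ignore the code, and both score zero.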
Category-averaged is the headline
There are two honest ways to combine per-category results, and BenchProctor reports both:
- Category-averaged (macro). Compute TPR and FPR for each category independently, then average across categories. Every vulnerability class counts equally, so a tool can’t earn a great score by nailing one enormous category and ignoring a dozen small ones. This is the number we lead with.
- Flat aggregate. Pool every case together and compute one TPR and FPR. Useful for comparison, but it lets large categories dominate.
When the two diverge, that gap is itself a finding: it usually means a tool is strong on a few common classes and weak across the long tail.
    $ python score_sarif.py results.sarif corpus/expectedresults-2026.2.csv
    category-averaged    TPR 0.962    FPR 0.044    J 0.918
    flat aggregate       TPR 0.961    FPR 0.046    J 0.915
    weakest categories
      xxe                J 0.71
      open_redirect      J 0.68
      ssti               J 0.64
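To see how the two aggregations can pull apart, here is a small sketch with made-up per-category counts: one large category the tool handles well and two small ones it handles poorly. The category sizes, counts, and function names are illustrative assumptions, not corpus data.

```python
# Hypothetical per-category bucket counts: (TP, FN, FP, TN).
counts = {
    "sqli": (480, 20, 10, 490),  # large category, strong results
    "xxe": (6, 4, 1, 9),         # small category, weak results
    "ssti": (5, 5, 2, 8),        # small category, weak results
}


def rates(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """Return (TPR, FPR) for one set of bucket counts."""
    return tp / (tp + fn), fp / (fp + tn)


def category_averaged(counts: dict) -> tuple[float, float, float]:
    """Macro average: compute rates per category, then average them."""
    per_category = [rates(*c) for c in counts.values()]
    tpr = sum(t for t, _ in per_category) / len(per_category)
    fpr = sum(f for _, f in per_category) / len(per_category)
    return tpr, fpr, tpr - fpr


def flat_aggregate(counts: dict) -> tuple[float, float, float]:
    """Pool every case into one confusion matrix, then compute rates once."""
    tp, fn, fp, tn = (sum(column) for column in zip(*counts.values()))
    tpr, fpr = rates(tp, fn, fp, tn)
    return tpr, fpr, tpr - fpr


for name, (tpr, fpr, j) in [
    ("category-averaged", category_averaged(counts)),
    ("flat aggregate", flat_aggregate(counts)),
]:
    print(f"{name:<18} TPR {tpr:.3f}  FPR {fpr:.3f}  J {j:.3f}")
```

With these counts the flat aggregate comes out near J 0.92 while the category-averaged score is about 0.58: the large category hides the weak long tail in the pooled number, and the macro average brings it back into view.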
The benchmark checks itself
A scorer is only as trustworthy as the answer key it reads. Every corpus ships a self-test SARIF: a synthetic “perfect tool” output that flags exactly the vulnerable files and nothing else. Score it against the CSV and it must come back at TPR 1.0, FPR 0.0, J 1.0. If it ever doesn’t, the labels and the scorer have drifted out of sync, and that’s a bug in the benchmark, not your tool. The benchmark fails first, before it can mislead you.
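The self-test idea can be sketched as follows. The names, the in-memory answer key, and the representation of the shipped self-test output as a set of flagged files are all assumptions for illustration; the real check runs the shipped SARIF through the scorer.

```python
def score(labels: dict[str, bool], flagged: set[str]) -> tuple[float, float, float]:
    """Return (TPR, FPR, J) for a set of flagged files against the answer key."""
    vulnerable = [name for name, is_vuln in labels.items() if is_vuln]
    safe = [name for name, is_vuln in labels.items() if not is_vuln]
    tpr = sum(name in flagged for name in vulnerable) / len(vulnerable)
    fpr = sum(name in flagged for name in safe) / len(safe)
    return tpr, fpr, tpr - fpr


def self_test(labels: dict[str, bool], shipped_flags: set[str]) -> bool:
    """True only if the shipped 'perfect tool' output still scores perfectly."""
    return score(labels, shipped_flags) == (1.0, 0.0, 1.0)


# Hypothetical answer key (True = vulnerable).
labels = {"case001": True, "case002": False, "case003": True, "case004": False}

# The synthetic perfect output flags exactly the vulnerable cases.
shipped_flags = {name for name, is_vuln in labels.items() if is_vuln}
print(self_test(labels, shipped_flags))   # True

# Simulated drift: a label is changed without regenerating the self-test output.
drifted = dict(labels, case002=True)
print(self_test(drifted, shipped_flags))  # False
```

In the drifted case the perfect output misses the relabeled file, TPR drops to 2/3, and the check fails before any real scanner is scored against the inconsistent key.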
That’s the whole model. Run your scanner, hand the SARIF and the CSV to a one-file scorer, and read a number you can recompute by hand if you doubt it. No black boxes — just counts, two ratios, and a subtraction.