How BenchProctor scores a SAST tool

There is no machine learning in BenchProctor’s scoring, no learned judge, no opaque aggregate. It is a confusion matrix and one subtraction. That is deliberate: a benchmark you can’t audit is just another tool you have to trust.

Four buckets

Every test case has a known label, vulnerable or safe, in the CSV answer key. After your scanner runs, each case lands in one of four buckets:

                detected   ignored
 vulnerable        TP         FN
 safe              FP         TN

TP (true positive): vulnerable, and the tool flagged it. Good.
FN (false negative): vulnerable, and the tool missed it. A real bug shipped.
FP (false positive): safe, and the tool flagged it anyway. Noise that erodes trust.
TN (true negative): safe, and the tool stayed quiet. Good.

A case counts as detected if the tool produced at least one finding whose SARIF location points at that test file. Neither the reported category nor the line number has to match, only that the tool flagged the file at all. That keeps scoring fair across tools with different reporting granularity.
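
The detection rule can be sketched in a few lines of Python. This is a minimal illustration, not BenchProctor's actual scorer: the function names, the sample file names, and the choice to match cases by base name are all assumptions. The only part taken from the SARIF format itself is the path runs → results → locations → physicalLocation → artifactLocation → uri.

```python
from pathlib import PurePosixPath


def flagged_files(sarif: dict) -> set:
    """Collect the base name of every file that any SARIF result points at."""
    names = set()
    for run in sarif.get("runs", []):
        for result in run.get("results") or []:
            for location in result.get("locations") or []:
                uri = (
                    location.get("physicalLocation", {})
                    .get("artifactLocation", {})
                    .get("uri")
                )
                if uri:
                    # Assumption: the base name alone identifies a test case.
                    names.add(PurePosixPath(uri).name)
    return names


def bucket(is_vulnerable: bool, detected: bool) -> str:
    """Map a labelled case and a detection outcome to TP, FN, FP, or TN."""
    if is_vulnerable:
        return "TP" if detected else "FN"
    return "FP" if detected else "TN"


def finding(uri: str) -> dict:
    """Build a hypothetical SARIF result; the rule id is never inspected."""
    return {
        "ruleId": "any-rule",
        "locations": [{"physicalLocation": {"artifactLocation": {"uri": uri}}}],
    }


# Hypothetical data: two findings, four labelled cases (True = vulnerable).
sarif = {"runs": [{"results": [finding("src/case_001.py"), finding("src/case_003.py")]}]}
labels = {
    "case_001.py": True,
    "case_002.py": True,
    "case_003.py": False,
    "case_004.py": False,
}

flagged = flagged_files(sarif)
buckets = {name: bucket(vuln, name in flagged) for name, vuln in labels.items()}
print(buckets)
```

With this sample data each of the four cases lands in a different bucket: TP, FN, FP, and TN in that order.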

Two rates and one score

From the four buckets come two rates:

TPR = TP / (TP + FN)     detection rate    (of the real bugs, how many caught?)
FPR = FP / (FP + TN)     false-alarm rate  (of the safe code, how much flagged?)

The headline score is Youden’s J, the difference between them:

J = TPR − FPR

It runs from +1.0 to −1.0:

+1.0: every vulnerability caught, zero false alarms. Perfect.
0.0: no better than random guessing. A flag-everything tool lands here, and so does a tool that flags nothing.
−1.0: perfectly inverted, so it flags every safe case and misses every real bug.

A single number that rewards detection and penalizes noise in equal measure is exactly what you want from an accuracy metric. A tool that finds every bug but drowns you in false positives is not a good tool, and its J score says so.
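
As a rough sketch, the two rates and the score follow directly from the four bucket counts. The function name and the sample counts are hypothetical, and returning 0.0 for a rate whose class is empty is an assumption of this sketch, not documented BenchProctor behavior.

```python
def youden_j(tp: int, fn: int, fp: int, tn: int):
    """Return (TPR, FPR, J) computed from the four confusion-matrix counts."""
    # Assumption: a rate over an empty class is reported as 0.0.
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr, tpr - fpr


# Hypothetical counts: 50 vulnerable cases and 50 safe cases.
tpr, fpr, j = youden_j(tp=45, fn=5, fp=2, tn=48)
print(f"TPR {tpr:.3f}   FPR {fpr:.3f}   J {j:.3f}")
# prints: TPR 0.900   FPR 0.040   J 0.860
```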

Why flag-everything scores zero

The laziest strategy collapses. Suppose a tool reports every single file as vulnerable. It catches every vulnerable case (TPR = 1.0), and it flags every safe file too (FPR = 1.0). Its score is 1.0 − 1.0 = 0.0. Because each rate is computed within its own class, the subtraction is what turns “flag everything” from a cheat into a wash. The corpus’s 50/50 balance means the same tool would also manage only 50% on plain accuracy.
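
The arithmetic is easy to replay. The sketch below scores two degenerate strategies against a hypothetical balanced corpus; the corpus size of 1,000 cases and the helper name are assumptions made only for the illustration.

```python
def score(labels, predictions):
    """Return (TPR, FPR, J, accuracy) for parallel lists of booleans."""
    tp = sum(1 for l, p in zip(labels, predictions) if l and p)
    fn = sum(1 for l, p in zip(labels, predictions) if l and not p)
    fp = sum(1 for l, p in zip(labels, predictions) if not l and p)
    tn = sum(1 for l, p in zip(labels, predictions) if not l and not p)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    accuracy = (tp + tn) / len(labels)
    return tpr, fpr, tpr - fpr, accuracy


# Hypothetical balanced corpus: 500 vulnerable cases, 500 safe cases.
labels = [True] * 500 + [False] * 500

flag_everything = score(labels, [True] * len(labels))
flag_nothing = score(labels, [False] * len(labels))

print("flag everything:", flag_everything)  # (1.0, 1.0, 0.0, 0.5)
print("flag nothing:   ", flag_nothing)     # (0.0, 0.0, 0.0, 0.5)
```

Both strategies end at J = 0.0: one by maxing out both rates, the other by zeroing them.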

Category-averaged is the headline

There are two honest ways to combine per-category results, and BenchProctor reports both:

Category-averaged (macro). Compute TPR and FPR for each category independently, then average across categories. Every vulnerability class counts equally, so a tool can’t earn a great score by nailing one enormous category and ignoring a dozen small ones. This is the number we lead with.
Flat aggregate. Pool every case together and compute one TPR and FPR. Useful for comparison, but it lets large categories dominate.

When the two diverge, that gap is itself a finding: it usually means a tool is strong on a few common classes and weak across the long tail.
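
To see how the two numbers can diverge, here is a small sketch with made-up counts: one large category the tool handles well and one small category it handles badly. The category names and every count are hypothetical.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, J) from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr, tpr - fpr


def macro_and_flat(counts):
    """counts maps category -> (tp, fn, fp, tn). Returns (macro, flat)."""
    per_category = [rates(*c) for c in counts.values()]
    n = len(per_category)
    # Category-averaged: every category weighs the same, whatever its size.
    macro_tpr = sum(r[0] for r in per_category) / n
    macro_fpr = sum(r[1] for r in per_category) / n
    macro = (macro_tpr, macro_fpr, macro_tpr - macro_fpr)
    # Flat aggregate: pool the raw counts first, then compute one pair of rates.
    totals = [sum(column) for column in zip(*counts.values())]
    flat = rates(*totals)
    return macro, flat


# Hypothetical counts: (tp, fn, fp, tn) per category.
counts = {
    "large_category": (95, 5, 5, 95),
    "small_category": (2, 8, 0, 10),
}

macro, flat = macro_and_flat(counts)
print("category-averaged   TPR %.3f   FPR %.3f   J %.3f" % macro)
print("flat aggregate      TPR %.3f   FPR %.3f   J %.3f" % flat)
# category-averaged   TPR 0.575   FPR 0.025   J 0.550
# flat aggregate      TPR 0.882   FPR 0.045   J 0.836
```

In this invented case the flat aggregate looks healthy because the large category dominates the pool, while the category-averaged score exposes the weak small category.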

$ python score_sarif.py results.sarif corpus/expectedresults-2026.2.csv

  category-averaged   TPR 0.962   FPR 0.044   J 0.918
  flat aggregate      TPR 0.961   FPR 0.046   J 0.915

  weakest categories
    xxe                J 0.71
    open_redirect      J 0.68
    ssti               J 0.64

The benchmark checks itself

A scorer is only as trustworthy as the answer key it reads. Before a release goes out, the benchmark checks that a flawless tool would score TPR 1.0, FPR 0.0, J 1.0 against the answer key. If that check ever comes up short, the labels and the scorer have drifted out of sync, which is a bug in the benchmark, not in your tool. The benchmark fails first, before it can mislead you.
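
One way such a check could look is sketched below: build an oracle that flags exactly the cases the answer key marks vulnerable, score it, and refuse to proceed unless the result is perfect. The CSV column names (file, vulnerable), the true/false encoding, and all function names are assumptions; the real answer key and release check may differ.

```python
import csv
import io


def load_labels(csv_text: str) -> dict:
    """Parse a hypothetical answer key with 'file' and 'vulnerable' columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["file"]: row["vulnerable"].strip().lower() == "true" for row in reader}


def score(labels: dict, flagged: set):
    """Return (TPR, FPR, J) for a set of flagged file names."""
    tp = sum(1 for f, v in labels.items() if v and f in flagged)
    fn = sum(1 for f, v in labels.items() if v and f not in flagged)
    fp = sum(1 for f, v in labels.items() if not v and f in flagged)
    tn = sum(1 for f, v in labels.items() if not v and f not in flagged)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("answer key must contain both vulnerable and safe cases")
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr, tpr - fpr


def self_check(labels: dict):
    """Score a flawless oracle; anything short of perfect is a benchmark bug."""
    oracle = {f for f, v in labels.items() if v}
    result = score(labels, oracle)
    if result != (1.0, 0.0, 1.0):
        raise AssertionError(f"answer key and scorer disagree: {result}")
    return result


# Hypothetical answer key contents.
ANSWER_KEY = "file,vulnerable\ncase_001.py,true\ncase_002.py,false\ncase_003.py,true\n"
print(self_check(load_labels(ANSWER_KEY)))  # (1.0, 0.0, 1.0)
```

A check that runs purely in memory like this mostly guards against a degenerate key, such as one with no safe cases. A fuller version would presumably route the oracle through the same SARIF and CSV parsing path the scorer uses, so that drift in either parser shows up as well.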

That’s the whole model. Run your scanner, hand the SARIF and the CSV to a one-file scorer, and read a number you can recompute by hand if you doubt it. No black boxes, just counts, two ratios, and a subtraction.