Why frozen SAST benchmarks rot, and what quarterly rotation fixes
A frozen benchmark measures memorization as much as analysis. Here's the failure mode, and how rotating the corpus on a seed keeps scores honest without breaking comparability.
Publish a benchmark once and leave it untouched, and it starts decaying the moment anyone looks at it. Not because the code goes stale, but because the answers leak into the world around it.
The failure mode
A static corpus is a fixed set of files with fixed labels. Over time:
- Tool authors tune against it. Detection rules get shaped to the exact patterns in the suite, which is great for the suite and unremarkable everywhere else.
- The files end up in training data. Once a public benchmark is scraped, a model can recognize a case without analyzing it. The score then measures how well a dataset was memorized, not reasoning about code.
- “Improvements” become unfalsifiable. When the same 2,000 files are scored year after year, you can’t tell a genuinely better analyzer from one that has simply seen the answer key more times.
The result is score inflation that looks like progress. A number climbs, and nobody can say whether the tool got smarter or just more familiar.
What we actually want to hold fixed
The instinct is to keep generating new files. But if every release is a fresh random corpus, you lose the thing a benchmark is for: comparability. A score from this quarter has to mean roughly the same thing as a score from last quarter, or you can’t track regressions.
So the real requirement is subtle: change the code, keep the contract. The specific test files should be different every quarter — so nothing can be memorized — while everything that determines what the score means stays constant.
Rotation on a seed
BenchProctor generates each release from a rotation seed that is fixed for that release and changes from one quarter to the next. The seed selects which combinations of source, propagator, sanitizer, and sink get emitted, so the actual code changes completely between quarters. What does not change:
| Invariant | Held constant |
|---|---|
| CWE identity per category | Fixed — a category always tests the same weakness |
| Difficulty distribution | ≥20% trivial, ≥50% realistic, ≥20% hard, every release |
| True-positive / true-negative balance | 50 / 50 |
| Language & framework coverage | Unchanged across rotations |
Two properties fall out of this:
- Reproducible. The same seed regenerates the same corpus, byte for byte. A release is a deterministic artifact, not a roll of the dice — anyone can rebuild it and get an identical answer key.
- Unmemorizable but comparable. A new seed yields fresh variants drawn from the same pools. Nothing you trained on last quarter appears this quarter, yet a 2026.2 score and a 2026.3 score are measuring the same thing.
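To make the two properties concrete, here is a minimal sketch of seeded selection in Python. It is not BenchProctor's actual generator: the pool contents, the `generate_corpus` and `fingerprint` helpers, the seed values, and the rule that a case is safe exactly when a sanitizer is present are all assumptions made for illustration.

```python
import hashlib
import itertools
import json
import random

# Hypothetical pools, for illustration only; the real pools are not shown in this post.
SOURCES = ["http_param", "env_var", "file_read"]
PROPAGATORS = ["string_concat", "format_call", "dict_roundtrip"]
SANITIZERS = ["allowlist", "escape"]
SINKS = ["sql_query", "shell_exec", "html_render"]
COMBOS = list(itertools.product(SOURCES, PROPAGATORS, SINKS))


def generate_corpus(seed: int, size: int) -> list:
    """Deterministically pick `size` case recipes, half vulnerable and half safe."""
    half = size // 2
    if size % 2 or half > len(COMBOS):
        raise ValueError("size must be even and at most 2 * len(COMBOS)")
    rng = random.Random(seed)  # private generator, independent of global random state
    cases = []
    for vulnerable in (True, False):
        for source, propagator, sink in rng.sample(COMBOS, half):
            cases.append({
                "source": source,
                "propagator": propagator,
                # Simplifying assumption: a case is safe exactly when a sanitizer is present.
                "sanitizer": None if vulnerable else rng.choice(SANITIZERS),
                "sink": sink,
                "vulnerable": vulnerable,
            })
    rng.shuffle(cases)  # interleave safe and vulnerable cases
    return cases


def fingerprint(cases: list) -> str:
    """Hash the serialized corpus so two builds can be compared byte for byte."""
    payload = json.dumps(cases, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


q2 = generate_corpus(seed=20262, size=20)  # hypothetical seed values
q2_rebuilt = generate_corpus(seed=20262, size=20)
q3 = generate_corpus(seed=20263, size=20)

print(fingerprint(q2) == fingerprint(q2_rebuilt))  # same seed, same corpus
print(fingerprint(q2) == fingerprint(q3))          # new seed, different selection
print(sum(c["vulnerable"] for c in q2), len(q2))   # balance is built into the generator
```

The design choice worth noting is the private `random.Random(seed)` instance: nothing outside the seed influences the selection, which is what lets a rebuild be compared by hash. Python does not promise identical `sample` and `shuffle` output across interpreter versions, so a real pipeline would presumably pin the toolchain as well as the seed.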
Why balance matters more than it looks
Holding the corpus at 50% vulnerable and 50% safe isn’t cosmetic. It’s what makes the score resistant to the laziest possible gaming strategy: flag everything. On a balanced corpus, a tool that reports every file as vulnerable catches every true positive, and also raises a false alarm on every safe file. Its detection rate and false-alarm rate are both 100%, so they cancel and it scores roughly zero. A benchmark that is mostly vulnerable would reward that tool on raw accuracy or precision; a balanced one exposes it.
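The arithmetic can be checked with a short Python sketch. It assumes the score is the true-positive rate minus the false-positive rate (Youden's J, which the next post covers) and shows plain accuracy alongside it for contrast. The `score` helper and the two toy corpora are hypothetical, not BenchProctor's scoring code.

```python
def score(labels, predictions):
    """Confusion-matrix rates for boolean labels (True = vulnerable) and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    tpr = tp / (tp + fn)  # detection rate
    fpr = fp / (fp + tn)  # false-alarm rate
    return {
        "tpr": tpr,
        "fpr": fpr,
        "youden_j": tpr - fpr,
        "accuracy": (tp + tn) / len(pairs),
    }


def flag_everything(labels):
    """The lazy analyzer: report every file as vulnerable."""
    return [True] * len(labels)


balanced = [True] * 50 + [False] * 50  # 50 vulnerable, 50 safe
skewed = [True] * 90 + [False] * 10    # 90 vulnerable, 10 safe

balanced_result = score(balanced, flag_everything(balanced))
skewed_result = score(skewed, flag_everything(skewed))

print(balanced_result)
print(skewed_result)
```

On the balanced corpus the lazy tool gets an accuracy of 0.5, no better than a coin flip, and a J of 0.0. On the skewed corpus its accuracy rises to 0.9 although the tool has done no analysis at all, which is the reward a mostly vulnerable benchmark hands out.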
The point
A benchmark earns trust by being hard to game, and a frozen one is trivially gameable over a long enough timeline. Rotation closes that gap: the code is new each quarter, the contract is identical, and the artifact is reproducible from a seed. You get a moving target that still reads the same on the ruler.
Next: the scoring itself — confusion matrix, Youden’s J, and why we report a category-averaged number as the headline.