Why static SAST benchmarks rot, and what quarterly rotation fixes

Publish a benchmark once and leave it still, and it starts decaying the moment anyone looks at it. The code doesn’t go stale. The answers do, because they leak into the world around it.

The failure mode

A static corpus is a fixed set of files with fixed labels. Over time:

Tool authors tune against it. Detection rules get shaped to the exact patterns in the suite, which is great for the suite and unremarkable everywhere else.
The files end up in training data. Once a public benchmark is scraped, a model can recognize a case without analyzing it. The score measures recall of a dataset, not reasoning about code.
“Improvements” become unfalsifiable. When the same 2,000 files are scored year after year, you can’t tell a genuinely better analyzer from one that has simply seen the answer key more times.

The result is score inflation that looks like progress. A number climbs, and nobody can say whether the tool got smarter or just more familiar.

What we actually want to hold fixed

The instinct is to keep generating new files. But if every release is a fresh random corpus, you lose the thing a benchmark is for: comparability. A score from this quarter has to mean roughly the same thing as a score from last quarter, or you can’t track regressions.

So the real requirement is subtle: change the code, keep the contract. The specific test files should be different every quarter, so nothing can be memorized, while everything that determines what the score means stays constant.

What rotation holds fixed

Each quarter, the actual test code changes completely. What does not change is everything that decides what a score means:

Invariant	What stays constant
CWE identity per category	Fixed, so a category always tests the same weakness
Difficulty distribution	≥20% trivial, ≥50% realistic, ≥20% hard, every release
True-positive / true-negative balance	50 / 50
Language & framework coverage	Unchanged across rotations
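To make the table concrete, here is a minimal sketch of a release check that refuses any corpus breaking one of those invariants. The `Case` fields, the `check_release` name, and the exact messages are assumptions made for illustration, not the real tooling, and framework coverage is left out to keep it short.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One benchmark file (hypothetical shape, for illustration only)."""
    cwe: str          # e.g. "CWE-89"
    difficulty: str   # "trivial", "realistic", or "hard"
    vulnerable: bool  # True = should be flagged, False = should not
    language: str


# Lower bounds from the table, as whole percentages to avoid float rounding.
MIN_SHARE_PCT = {"trivial": 20, "realistic": 50, "hard": 20}


def check_release(cases, expected_cwes, expected_languages):
    """Return a list of invariant violations; an empty list means OK."""
    n = len(cases)
    if n == 0:
        return ["release is empty"]

    problems = []
    if {c.cwe for c in cases} != set(expected_cwes):
        problems.append("CWE categories changed")
    if {c.language for c in cases} != set(expected_languages):
        problems.append("language coverage changed")

    counts = Counter(c.difficulty for c in cases)
    for level, floor_pct in MIN_SHARE_PCT.items():
        # counts[level] / n >= floor_pct / 100, written with integers only.
        if counts[level] * 100 < floor_pct * n:
            problems.append(f"fewer than {floor_pct}% {level} cases")

    vulnerable = sum(1 for c in cases if c.vulnerable)
    if vulnerable * 2 != n:
        problems.append("vulnerable/safe split is not 50/50")
    return problems
```

A rotation pipeline along these lines would run the check on every candidate release and publish only when the list comes back empty.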

Two properties fall out of that split:

Deterministic, not a lucky draw. A release is a fixed artifact with a fixed answer key, not a random pile that happens to look right. Rebuild the same release from the same seed and you get the same corpus.
Unmemorizable but comparable. This quarter’s files are fresh, so nothing you trained on last quarter shows up, yet a 2026.2 score and a 2026.3 score are measuring the same thing.
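One way to get both properties at once is to derive all randomness from the release name. The sketch below assumes a pool of candidate case IDs to draw from; `build_release` and the pool itself are hypothetical, and the real pipeline may generate code rather than select it.

```python
import hashlib
import random


def build_release(release_id, candidate_pool, size):
    """Pick `size` case IDs, determined only by release_id and the pool."""
    # Derive the seed from the release name. The built-in hash() is salted
    # per process for strings, so it would not be stable between runs.
    digest = hashlib.sha256(release_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    # Sort first so the result does not depend on the order (or container
    # type) the pool happened to arrive in.
    return rng.sample(sorted(candidate_pool), size)
```

Calling `build_release("2026.2", pool, 500)` twice on the same Python version returns the same list, while a different release name seeds a different draw. Python does not promise that `random.sample` picks identically across interpreter versions, so a real build would also want to pin the interpreter or publish the resulting manifest.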

Why balance matters more than it looks

Holding the corpus at 50% vulnerable and 50% safe isn’t cosmetic. It’s what makes the score resistant to the laziest possible gaming strategy: flag everything. On a balanced corpus, a tool that reports every file as vulnerable catches every vulnerable file and raises a false alarm on every safe one. Its detection rate and false-alarm rate are both 100%, so they cancel and it scores zero. A benchmark that is mostly vulnerable would reward that tool; a balanced one exposes it.
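The arithmetic is easy to check by hand. The sketch below scores a flag-everything tool on a balanced and a skewed corpus, using plain accuracy and the difference between detection rate and false-alarm rate. The function names and the 90/10 skew are illustrative assumptions, not the benchmark's actual scorer.

```python
def rates(labels, predictions):
    """Return (true-positive rate, false-positive rate).

    Assumes the corpus has at least one vulnerable and one safe file.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def rate_difference(labels, predictions):
    """Detection rate minus false-alarm rate."""
    tpr, fpr = rates(labels, predictions)
    return tpr - fpr


def accuracy(labels, predictions):
    """Fraction of files labelled correctly."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def flag_everything(labels):
    """The lazy tool: report every file as vulnerable."""
    return [True] * len(labels)


# True = vulnerable, False = safe.
balanced = [True] * 50 + [False] * 50
skewed = [True] * 90 + [False] * 10  # hypothetical mostly-vulnerable corpus
```

On `skewed`, flag-everything reaches 0.9 accuracy; on `balanced` it drops to 0.5, no better than a coin flip. The rate difference is 0.0 on both, because the two rates are each computed within their own class and always cancel for this tool. So the balanced split matters most for any headline number that counts raw hits.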

The point

A benchmark earns trust by being hard to game, and a frozen one is trivially gameable over a long enough timeline. Rotation closes that gap: the code is new each quarter, the contract is identical, and the artifact is reproducible from a seed. You get a moving target that still reads the same on the ruler.

Next: the scoring itself. A confusion matrix, Youden’s J, and why we report a category-averaged number as the headline.