How do you tell if a backtest is overfit?

The warning signs are recognizable before any formal test: a Sharpe ratio that looks like a typo, performance that collapses when one parameter moves slightly, a handful of trades carrying the whole result, or everything depending on a single market. A one-market result is an illustration, not validation.

Statistical probes ask whether luck explains the result. The permutation test answers the sharpest version: could randomized entries on the same data have done this? Its p-value appears in every Pancake result, alongside a bootstrap confidence interval on returns and a Wilson confidence interval on win rate. Below 10 trades, Pancake suppresses headline metrics as insufficient_data rather than printing impressive noise.
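To make these three statistics concrete, here is a minimal standard-library sketch. It is not Pancake's implementation: the function names, the simplification that each trade holds for exactly one bar, and the percentile-bootstrap method are all assumptions chosen for illustration.

```python
import math
import random


def permutation_p_value(bar_returns, entry_bars, n_perm=2000, seed=0):
    """Share of random entry sets that match or beat the strategy's total return.

    Assumption: each trade earns the return of the single bar it enters on.
    """
    rng = random.Random(seed)
    observed = sum(bar_returns[i] for i in entry_bars)
    k = len(entry_bars)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(range(len(bar_returns)), k)
        if sum(bar_returns[i] for i in sample) >= observed:
            hits += 1
    # The +1 terms count the observed result as one of the permutations,
    # so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def bootstrap_mean_ci(trade_returns, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval on the mean trade return."""
    rng = random.Random(seed)
    n = len(trade_returns)
    means = sorted(
        sum(rng.choices(trade_returns, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate of wins / n (z=1.96 is about 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))
```

A low value from permutation_p_value means few random entry sets did as well as the strategy. The Wilson interval is preferred over the naive normal approximation at small trade counts because it stays inside [0, 1] and does not collapse to zero width when every trade wins.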

Perturbation probes ask whether the result is a peak or a plateau. run_sensitivity_analysis re-runs the strategy under neighboring assumptions; a real edge degrades gracefully as parameters move, while an overfit one falls off a cliff one tick from its tuned values.
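The peak-versus-plateau idea can be sketched in a few lines. This is not the signature or behavior of run_sensitivity_analysis, which the text does not specify; neighbor_scores, worst_retention, the 10% relative step, and the toy scoring functions are all hypothetical.

```python
def neighbor_scores(score_fn, params, rel_step=0.1):
    """Score the tuned parameters and each one-parameter neighbor at +/- rel_step."""
    base = score_fn(params)
    results = {}
    for name, value in params.items():
        for direction in (-1, 1):
            neighbor = dict(params)  # copy so the tuned values stay untouched
            neighbor[name] = value * (1 + direction * rel_step)
            results[(name, direction)] = score_fn(neighbor)
    return base, results


def worst_retention(base, results):
    """Worst neighbor score as a fraction of the tuned score (1.0 = flat plateau)."""
    if base <= 0:
        return 0.0
    return min(results.values()) / base


# Toy scoring functions standing in for a real backtest (illustrative only).
def plateau(p):
    # Score declines smoothly as lookback moves away from 20.
    return 1.0 - 0.001 * (p["lookback"] - 20) ** 2


def peak(p):
    # Score exists only at exactly lookback == 20.
    return 1.0 if p["lookback"] == 20 else 0.0


plateau_retention = worst_retention(*neighbor_scores(plateau, {"lookback": 20}))
peak_retention = worst_retention(*neighbor_scores(peak, {"lookback": 20}))
```

Here plateau_retention stays close to 1 while peak_retention drops to 0, which is the cliff the paragraph describes. In practice score_fn would re-run the backtest, and the step size would be chosen per parameter rather than as one global percentage.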

Time is the probe that cannot be gamed: a paper deployment runs the frozen, content-hashed version of the strategy against live data it could never have memorized, accruing an append-only forward record. Overfitting can never be proven absent, but these three probes make it expensive to hide, which is the practical standard.
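As a rough illustration of what "frozen, content-hashed" and "append-only" can mean in code, the sketch below hashes a strategy definition and refuses to extend a forward record once that definition changes. The content_hash function, the ForwardRecord class, and the dict-based strategy format are assumptions made for this example, not a description of how Pancake stores deployments.

```python
import hashlib
import json


def content_hash(strategy):
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(strategy, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ForwardRecord:
    """Append-only log tied to one frozen strategy hash (hypothetical design)."""

    def __init__(self, strategy):
        self.strategy_hash = content_hash(strategy)
        self._entries = []

    def append(self, strategy, timestamp, pnl):
        # Any edit to the strategy changes its hash and is rejected here.
        if content_hash(strategy) != self.strategy_hash:
            raise ValueError("strategy changed since deployment; start a new record")
        self._entries.append((timestamp, pnl))

    @property
    def entries(self):
        # Expose a read-only copy; there is no method to edit or delete entries.
        return tuple(self._entries)
```

The point of tying the record to the hash is that retuning a parameter after seeing live results produces a different strategy with an empty record, so the forward history cannot be carried over to a version that has seen the data.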