How do you tell if a backtest is overfit?

Probe it from three directions: statistics (permutation test, confidence intervals, sample size), perturbation (does the result survive small changes to its assumptions), and time (does it hold on data that arrived after it was built). Pancake runs all three: permutation p-values plus bootstrap and Wilson confidence intervals in every result, run_sensitivity_analysis for perturbation, and paper deployments for forward testing.

The warning signs are recognizable before any formal test: a Sharpe ratio that looks like a typo, performance that collapses when one parameter moves slightly, a handful of trades carrying the whole result, or everything depending on a single market — a one-market result is an illustration, not validation.
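One of these warning signs, a handful of trades carrying the whole result, is easy to quantify. The sketch below is only an illustration: the helper name top_k_share and the sample numbers are hypothetical and are not part of Pancake.

```python
def top_k_share(trade_pnls, k=3):
    """Fraction of total net PnL contributed by the k most profitable trades.

    Hypothetical helper for illustration. The value can exceed 1.0 when the
    remaining trades lose money on net.
    """
    total = sum(trade_pnls)
    if total <= 0:
        raise ValueError("share is only meaningful for a net-profitable result")
    best = sorted(trade_pnls, reverse=True)[:k]
    return sum(best) / total


# Made-up per-trade PnL figures: one large winner, the rest roughly cancel out.
pnls = [120.0, 4.0, -3.0, 5.0, -6.0, 2.0, 3.0, -5.0]
share = top_k_share(pnls, k=1)  # 1.0: one trade accounts for all of the net profit
```

A share close to (or above) 1.0 for a small k means that removing a few trades would erase the result, which is the pattern described above.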

Statistical probes ask whether luck explains the result. The permutation test answers the sharpest version: could randomized entries on the same data have done this? Its p-value appears in every Pancake result, alongside a bootstrap confidence interval on returns and a Wilson confidence interval on win rate. With fewer than 10 trades, Pancake suppresses headline metrics and reports insufficient_data rather than printing impressive noise.
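To make the statistical probes concrete, here is a minimal sketch of a random-entry permutation test, a Wilson interval, and a minimum-trade gate. It is not Pancake's implementation: the function names, the fixed holding period, the summed-return metric, and the shape of the returned dictionary are all assumptions made for illustration. Only the 10-trade threshold and the insufficient_data label come from the text above.

```python
import math
import random

MIN_TRADES = 10  # threshold stated in the text; everything else is assumed


def total_return(bar_returns, entries, hold):
    """Sum of bar returns over each trade's holding window (toy metric)."""
    return sum(sum(bar_returns[i:i + hold]) for i in entries)


def permutation_p_value(bar_returns, entries, hold, n_perm=2000, seed=0):
    """Share of random-entry runs that match or beat the observed result.

    Simplification: random windows may overlap each other.
    """
    rng = random.Random(seed)
    observed = total_return(bar_returns, entries, hold)
    valid_starts = range(len(bar_returns) - hold + 1)
    hits = 0
    for _ in range(n_perm):
        fake_entries = rng.sample(valid_starts, len(entries))
        if total_return(bar_returns, fake_entries, hold) >= observed:
            hits += 1
    # The +1 terms count the observed run itself, so p is never exactly 0.
    return (hits + 1) / (n_perm + 1)


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def win_rate_summary(wins, n_trades):
    """Suppress the headline number when the sample is too small."""
    if n_trades < MIN_TRADES:
        return {"status": "insufficient_data"}
    low, high = wilson_interval(wins, n_trades)
    return {"status": "ok", "win_rate": wins / n_trades, "ci": (low, high)}
```

The Wilson interval is used here instead of the simpler normal approximation because it stays inside [0, 1] and behaves sensibly for small trade counts, which is exactly where win rates are most misleading.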

Perturbation probes ask whether the result is a peak or a plateau. run_sensitivity_analysis re-runs the strategy under neighboring assumptions; a real edge degrades gracefully as parameters move, while an overfit one falls off a cliff one tick from its tuned values.
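The peak-versus-plateau idea can be sketched as a one-at-a-time parameter sweep. This is an illustration under stated assumptions, not the behavior or signature of run_sensitivity_analysis: the names sensitivity_sweep and worst_retention, the relative step sizes, and the two toy scoring functions are all hypothetical.

```python
def sensitivity_sweep(backtest, base_params, rel_steps=(-0.2, -0.1, 0.1, 0.2)):
    """Re-score the strategy with each parameter nudged on its own.

    backtest: callable taking a params dict and returning a single score.
    Returns the baseline score and a list of (param, step, score) rows.
    """
    baseline = backtest(base_params)
    rows = []
    for name, value in base_params.items():
        for step in rel_steps:
            trial = dict(base_params)
            trial[name] = value * (1 + step)
            rows.append((name, step, backtest(trial)))
    return baseline, rows


def worst_retention(baseline, rows):
    """Smallest fraction of the baseline score kept by any neighbor.

    Assumes a positive baseline score.
    """
    return min(score for _, _, score in rows) / baseline


# Toy scoring functions standing in for a real backtest (hypothetical).
def plateau(params):
    return 1.0 - 0.5 * abs(params["lookback"] - 20) / 20


def cliff(params):
    return 1.0 if params["lookback"] == 20 else 0.0


base = {"lookback": 20}
plateau_kept = worst_retention(*sensitivity_sweep(plateau, base))
cliff_kept = worst_retention(*sensitivity_sweep(cliff, base))
```

Here the plateau strategy keeps most of its score at every neighboring value, while the cliff strategy keeps none of it once the parameter moves off its tuned value.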

Time is the probe that cannot be gamed: a paper deployment runs the frozen, content-hashed version against live data it could never have memorized, accruing an append-only forward record. Overfitting can never be proven absent — but these three probes make it expensive to hide, which is the practical standard.
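The two properties named here, a content-hashed frozen version and an append-only record, can be illustrated with a short sketch. It assumes a SHA-256 hash over canonical JSON and a hash-chained log; the source does not say how Pancake implements either, so content_hash, ForwardRecord, and the field names are hypothetical.

```python
import hashlib
import json


def content_hash(obj):
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra spaces)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ForwardRecord:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self, strategy):
        # Freezing: the record is tied to one exact strategy definition.
        self.strategy_hash = content_hash(strategy)
        self._entries = []

    def append(self, observation):
        prev = self._entries[-1]["entry_hash"] if self._entries else self.strategy_hash
        body = {"prev": prev, "observation": observation}
        self._entries.append({**body, "entry_hash": content_hash(body)})

    def verify(self):
        """Return True only if no stored entry was altered or reordered."""
        prev = self.strategy_hash
        for entry in self._entries:
            body = {"prev": entry["prev"], "observation": entry["observation"]}
            if entry["prev"] != prev or entry["entry_hash"] != content_hash(body):
                return False
            prev = entry["entry_hash"]
        return True
```

Because every entry includes the hash of its predecessor, and the first entry includes the strategy hash, quietly re-tuning the strategy or rewriting an earlier day's result breaks the chain in this sketch.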
