Do AI-generated trading strategies work?

What agents are genuinely good at: searching a large rule space quickly, encoding hypotheses precisely, and iterating without fatigue. An agent can propose and test more strategy variants in an afternoon than a person can in a month. That is real leverage, and it cuts both ways: the same speed that tests more ideas also produces more results that look good only by chance.

The characteristic failure modes are evidential, not logical. Lookahead contamination: a feature quietly built from information that postdates the decision. Survivorship: evidence rows drawn only from markets that resolved cleanly. Cherry-picked windows: a test period chosen because the result looks good there. And the bluntest one: metrics asserted in the chat that no engine ever computed. We published a dissection of a 49-for-49, Sharpe-7.48 strategy whose flaw lay not in the math but entirely in the dataset.
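
As a rough illustration of the first failure mode, a lookahead check can be as simple as comparing two timestamps per evidence row. The sketch below is not any engine's actual implementation; the `FeatureRow` type, its field names, and the sample rows are all hypothetical and exist only to show the shape of the check.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FeatureRow:
    # All field names are hypothetical, chosen for illustration only.
    market_id: str
    feature_name: str
    available_at: datetime  # earliest time the underlying data could be known
    decided_at: datetime    # time the strategy committed to its decision


def find_lookahead(rows):
    """Return rows whose feature data postdates the decision it informed."""
    return [r for r in rows if r.available_at > r.decided_at]


# Toy evidence rows: the second feature only exists six hours after the decision.
rows = [
    FeatureRow("m1", "poll_avg", datetime(2024, 3, 1, 9), datetime(2024, 3, 1, 12)),
    FeatureRow("m2", "final_score", datetime(2024, 3, 2, 18), datetime(2024, 3, 2, 12)),
]

leaks = find_lookahead(rows)
for r in leaks:
    print(f"lookahead: {r.market_id}/{r.feature_name}")
```

A check like this only works if the availability timestamp is itself trustworthy, which is why it belongs with the evidence the engine accepts rather than the evidence it verifies.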

The fix is structural, not behavioral. An agent cannot grade its own homework, so the engine re-derives every number from declared inputs and emits a verification boundary: what was verified (structural invariants, all statistics), what was accepted as agent-supplied evidence (feature values, price sources), and what was not modeled. Inflated claims do not survive contact with this boundary; honest ones do.
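
To make the idea of re-deriving numbers concrete, here is a minimal sketch, assuming the declared inputs are a list of per-trade returns. The function names, the report layout, and the per-trade (unannualized) Sharpe definition are assumptions made for this example, not a description of the engine itself, and the numbers are toy values.

```python
import math
import statistics


def per_trade_sharpe(returns):
    """Mean over sample standard deviation of per-trade returns (unannualized)."""
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    spread = statistics.stdev(returns)
    if spread == 0:
        raise ValueError("zero variance: Sharpe is undefined")
    return statistics.mean(returns) / spread


def verify_claim(returns, claimed_win_rate, claimed_sharpe, not_modeled, tol=1e-6):
    """Recompute statistics from declared inputs and compare them to the claims."""
    win_rate = sum(r > 0 for r in returns) / len(returns)
    sharpe = per_trade_sharpe(returns)
    return {
        # Recomputed here, never copied from the agent's message.
        "verified": {"win_rate": win_rate, "sharpe": sharpe},
        "claims_match": {
            "win_rate": math.isclose(win_rate, claimed_win_rate, abs_tol=tol),
            "sharpe": math.isclose(sharpe, claimed_sharpe, abs_tol=tol),
        },
        # Taken on trust from the agent; listed so the reader knows it.
        "accepted_as_supplied": ["feature values", "price sources"],
        # Declared by the caller; the entry used below is a hypothetical example.
        "not_modeled": list(not_modeled),
    }


# Toy inputs: the agent claims a perfect record, the declared returns say otherwise.
report = verify_claim(
    returns=[0.02, -0.01, 0.03, 0.01],
    claimed_win_rate=1.0,
    claimed_sharpe=3.0,
    not_modeled=["transaction costs"],
)
print(report["claims_match"])
```

The point of the design is that the claimed values are only ever compared against, never used: every figure under `verified` comes from the declared returns alone.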

The pipeline that separates working strategies from merely plausible ones has three stages: a verified, reproducible backtest across many markets; a sensitivity pass to show the result does not hinge on one lucky parameter; then a forward paper run on data the strategy has never seen. AI-generated strategies earn trust exactly the way human ones do: with evidence.
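
The sensitivity pass in particular is easy to sketch. The example below assumes a backtest can be called as a function of a single parameter and returns one score; both toy backtests, the perturbation steps, and the "fraction of neighbors still positive" summary are illustrative choices, not a prescribed method.

```python
def sensitivity_pass(backtest, baseline, rel_steps=(-0.2, -0.1, 0.0, 0.1, 0.2)):
    """Re-run the backtest at parameter values perturbed around the baseline."""
    return [(baseline * (1 + step), backtest(baseline * (1 + step)))
            for step in rel_steps]


def positive_fraction(results):
    """Share of perturbed runs whose score stays above zero."""
    return sum(score > 0 for _, score in results) / len(results)


# Hypothetical stand-ins for real backtests, each mapping a parameter to a score.
def robust_backtest(param):
    # Score degrades smoothly as the parameter moves away from 10.
    return 1.0 - 0.1 * abs(param - 10)


def lucky_backtest(param):
    # Score is only good at exactly one parameter value.
    return 1.0 if param == 10 else -0.5


robust_share = positive_fraction(sensitivity_pass(robust_backtest, baseline=10))
lucky_share = positive_fraction(sensitivity_pass(lucky_backtest, baseline=10))
print(f"robust: {robust_share:.0%} of neighbors positive")
print(f"lucky:  {lucky_share:.0%} of neighbors positive")
```

A strategy whose score collapses as soon as the parameter moves is the "one lucky parameter" case; one that degrades gradually is at least a candidate for the forward paper run.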