Do AI-generated trading strategies work?

Some do, most do not, and without verification you cannot tell which is which. LLM agents are good at generating plausible strategy logic and bad at noticing when their own backtest evidence is contaminated. The structural fix: run every strategy through an engine that re-derives the math and states what it verified versus what the agent supplied.

What agents are genuinely good at: searching a large rule space quickly, encoding hypotheses precisely, and iterating without fatigue. An agent can propose and test more strategy variants in an afternoon than a person can in a month. That is real leverage, and it cuts both ways: the more variants are tested, the more likely it is that one of them looks good by chance.

The characteristic failure modes are evidential, not logical. Lookahead contamination: a feature quietly built from information that postdates the decision. Survivorship: evidence rows drawn only from markets that resolved cleanly. Cherry-picked windows: a test period chosen because the results look best there. And the bluntest one: metrics asserted in the chat that no engine ever computed. We published a dissection of a 49-for-49, Sharpe-7.48 strategy whose math was sound and whose flaw lay entirely in the dataset.
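Lookahead contamination is the easiest of these to check mechanically. The sketch below assumes each feature carries a timestamp for when its underlying information became available; the `Feature` type, the `lookahead_violations` helper, and the sample feature names are hypothetical illustrations, not part of any particular engine.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Feature:
    # Hypothetical record: a feature name plus the time its
    # underlying information first became available.
    name: str
    observed_at: datetime


def lookahead_violations(features, decided_at):
    """Return names of features whose information postdates the decision."""
    return [f.name for f in features if f.observed_at > decided_at]


# Illustrative data only: one feature known before the decision,
# one that could only have been known afterwards.
decided_at = datetime(2024, 3, 1, 12, 0)
features = [
    Feature("poll_average", datetime(2024, 3, 1, 9, 0)),
    Feature("final_vote_share", datetime(2024, 3, 5, 20, 0)),
]

violations = lookahead_violations(features, decided_at)
print(violations)
```

A check like this only catches contamination that is visible in the declared timestamps. If the timestamps themselves are wrong, the feature still counts as agent-supplied evidence rather than verified fact.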

The fix is structural, not behavioral. An agent cannot grade its own homework, so the engine re-derives every number from declared inputs and emits a verification boundary: what was verified (structural invariants, all statistics), what was accepted as agent-supplied evidence (feature values, price sources), and what was not modeled. Inflated claims do not survive contact with this; honest ones do.
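As a rough illustration of what such a boundary might look like, the sketch below recomputes a few statistics from declared per-trade returns and compares them with the agent's claims. Everything here is an assumption made for the example: the function names, the report keys, the per-trade Sharpe convention (zero risk-free rate, not annualized), and the placeholder entries under `not_modeled`.

```python
import statistics


def rederive(returns):
    """Recompute summary statistics from declared per-trade returns."""
    if not returns:
        raise ValueError("no returns declared")
    wins = sum(1 for r in returns if r > 0)
    spread = statistics.stdev(returns) if len(returns) > 1 else 0.0
    return {
        "n_trades": len(returns),
        "win_rate": wins / len(returns),
        # Assumed convention: per-trade Sharpe, zero risk-free rate,
        # not annualized. None when the spread is zero.
        "sharpe": statistics.mean(returns) / spread if spread > 0 else None,
    }


def verification_boundary(returns, claimed, tol=1e-9):
    """Compare agent claims with re-derived numbers and state the boundary."""
    derived = rederive(returns)
    mismatches = {
        name: (value, derived[name])
        for name, value in claimed.items()
        if name in derived
        and derived[name] is not None
        and abs(value - derived[name]) > tol
    }
    # Claims the engine never computed cannot be confirmed either way.
    unverifiable = sorted(name for name in claimed if name not in derived)
    return {
        "verified": derived,
        "agent_supplied": ["per-trade returns"],
        "not_modeled": ["fees", "slippage"],  # illustrative placeholders
        "mismatches": mismatches,
        "unverifiable_claims": unverifiable,
    }


# Illustrative inputs: the agent claims a perfect win rate and a
# drawdown figure that this sketch does not compute.
declared_returns = [0.02, -0.01, 0.03, 0.01]
claimed = {"n_trades": 4, "win_rate": 1.0, "max_drawdown": 0.0}

report = verification_boundary(declared_returns, claimed)
print(report["mismatches"])
print(report["unverifiable_claims"])
```

The design point is that the report never silently upgrades a claim. A number is re-derived and listed as verified, or flagged as a mismatch, or listed as something the engine could not check.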

The pipeline that separates working from plausible: a verified, reproducible backtest across many markets; a sensitivity pass to show the result does not hinge on one lucky parameter; then a forward paper run on data the strategy has never seen. AI-generated strategies earn trust exactly the way human ones do: with evidence.
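The sensitivity pass can be sketched as re-running the same backtest at neighboring parameter values and checking that the result keeps its sign. The code below is a minimal sketch under stated assumptions: `toy_backtest` is a deliberately simple momentum rule invented for the example, `sensitivity_pass` is a hypothetical helper, and the price series is a synthetic steady uptrend rather than market data.

```python
def toy_backtest(prices, lookback):
    """Toy momentum rule: hold for one step whenever the current price
    is above the price `lookback` steps earlier. Returns total P&L."""
    pnl = 0.0
    for t in range(lookback, len(prices) - 1):
        if prices[t] > prices[t - lookback]:
            pnl += prices[t + 1] - prices[t]
    return pnl


def sensitivity_pass(backtest, prices, chosen, neighbors):
    """Re-run the backtest at neighboring parameter values and report
    the share of neighbors whose P&L has the same sign as the baseline."""
    base = backtest(prices, chosen)
    results = {p: backtest(prices, p) for p in neighbors}
    same_sign = sum(1 for v in results.values() if v * base > 0)
    return base, results, same_sign / len(results)


# Synthetic steady uptrend, used only so the example is deterministic.
prices = list(range(100, 111))

base, results, share = sensitivity_pass(toy_backtest, prices, chosen=3, neighbors=[2, 4])
print(base, results, share)
```

A share well below 1.0 would suggest the headline result depends on one particular parameter value. What threshold counts as acceptable is a judgment call that this sketch does not settle.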
