Research · June 12, 2026 · 9 min read

We Plugged a State-of-the-Art AI Forecasting Model Into Our Strategy. It Failed.

Kronos is an open-source foundation model for financial candlesticks with impressive academic benchmarks. We tested whether it could improve Mercurio's entries. The answer was an unambiguous no — and the failure is instructive.

Every few months a new model promises to predict markets. The latest is Kronos, an open-source foundation model trained on 12 billion candlesticks from 45 exchanges, with a paper accepted at AAAI 2026. Its benchmarks are genuinely impressive: large improvements in ranking-correlation and volatility-forecast accuracy over prior models.

Impressive benchmarks are a reason to test something, not to trust it. So we ran the experiment properly — on our real strategy, with real data, measuring real money, not academic error metrics.

The hypothesis

The idea fit Mercurio's design rule perfectly: the technical strategy is the edge, and any model layer can only filter trades, never create them. So we asked a narrow question. When trend following generates a long entry, if we also ask Kronos to forecast the next 24 hourly candles, does its forecast tell us anything useful about whether that trade will work?

How we tested it

We ran the model in shadow mode first. It scored every trend-following entry it could over the two-year window (376 in all, after excluding a handful near the data's edge that lacked a full 24-hour forecast horizon) without changing any trading decision, and we logged each forecast next to what actually happened. That gives a clean measurement of predictive power before risking a single simulated dollar.
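For readers who want to picture a shadow log, here is a minimal sketch of one and of the two headline metrics computed from it. It is not Mercurio's code: the `ShadowRecord` fields and the toy numbers are invented for illustration, and Pearson correlation is an assumption, since this post does not say which correlation statistic was used.

```python
import math
from dataclasses import dataclass


@dataclass
class ShadowRecord:
    """One logged entry (hypothetical schema): forecast vs. what happened."""
    forecast_return: float  # model's predicted 24-hour return
    realized_return: float  # actual 24-hour return after the entry


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def direction_hit_rate(records):
    """Share of entries where forecast and outcome have the same sign."""
    hits = sum(
        1 for r in records
        if (r.forecast_return > 0) == (r.realized_return > 0)
    )
    return hits / len(records)


def evaluate(records):
    """Summarize predictive power without touching any trading decision."""
    forecasts = [r.forecast_return for r in records]
    outcomes = [r.realized_return for r in records]
    return {
        "correlation": pearson(forecasts, outcomes),
        "hit_rate": direction_hit_rate(records),
    }


# Toy log, invented for illustration only (not the real 376 entries).
log = [
    ShadowRecord(-0.10, 0.02),
    ShadowRecord(-0.08, -0.01),
    ShadowRecord(0.03, 0.01),
    ShadowRecord(-0.12, 0.04),
]
summary = evaluate(log)
```

The point of the design is that `evaluate` only reads the log. Nothing in it can change which trades were taken, so the measurement stays independent of the strategy.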

The results

The correlation between Kronos's forecast and the actual outcome was statistically indistinguishable from zero.

-0.05: forecast vs. outcome correlation (p = 0.29)

42.8%: direction hit rate (worse than a coin flip)

-10.5%: median 24-hour forecast, for stocks that actually rose 0.6%

Then we let it actually gate trades. Even the gentlest possible setting — vetoing only the 10% of entries it scored most bearishly — destroyed the strategy:

Configuration	2-year return	Sharpe
Baseline (no Kronos)	+53.1%	1.03
Kronos gate, gentle (block worst 10%)	+15.9%	0.45
Kronos gate, median (block worst 50%)	+9.4%	0.33
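The gate in the table above can be sketched as a simple rank-and-veto filter. This is an illustrative assumption about the mechanics, not the actual implementation: `apply_gate`, the entry fields, and the forecasts are all hypothetical. The sketch also ranks the whole sample in hindsight, whereas a live gate would have to set its cutoff from past forecasts only.

```python
def apply_gate(entries, block_fraction):
    """Split entries into (kept, vetoed), vetoing the most bearish fraction.

    `entries` is a list of dicts with a "forecast" key (hypothetical schema).
    """
    n_blocked = int(len(entries) * block_fraction)
    ranked = sorted(entries, key=lambda e: e["forecast"])  # most bearish first
    return ranked[n_blocked:], ranked[:n_blocked]


# Ten made-up entries; forecasts are invented for illustration.
forecasts = [-0.12, -0.03, -0.09, 0.01, -0.15, -0.07, -0.02, -0.11, 0.02, -0.05]
entries = [{"id": i, "forecast": f} for i, f in enumerate(forecasts)]

# "Gentle" setting: veto only the worst-scored 10% of entries.
kept, vetoed = apply_gate(entries, 0.10)

# "Median" setting: veto the worst-scored half.
kept_half, vetoed_half = apply_gate(entries, 0.50)
```

Note what the filter can and cannot do: it only removes entries the strategy already generated, which is consistent with the rule that a model layer may filter trades but never create them.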

Why it failed

The failure is mechanical and worth understanding. Kronos normalizes the price window it sees and tends to pull its forecast back toward that window's average. But trend-following entries are, by definition, stocks breaking out near the top of their recent range. So the model predicts 'down' on almost every breakout — it is structurally biased against exactly the trades our strategy is designed to take. It predicted a median 24-hour drop of 10.5% for stocks that, on average, rose. Gating on that signal didn't remove bad trades; it removed good ones and replaced them with worse-ranked entries.
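A toy example makes the bias concrete. The forecaster below is not Kronos and says nothing about its architecture; it is a deliberately crude stand-in that pulls the last price part of the way back to the window average, with a hypothetical `pull` parameter. The price series is invented.

```python
from statistics import mean, pstdev


def zscore_of_last(window):
    """Where the latest price sits relative to the normalized window."""
    return (window[-1] - mean(window)) / pstdev(window)


def mean_reverting_forecast(window, pull=0.5):
    """Toy forecaster: move the last price `pull` of the way to the window mean.

    Returns the implied forecast return relative to the last price.
    """
    last = window[-1]
    forecast_price = last + pull * (mean(window) - last)
    return forecast_price / last - 1.0


# A made-up breakout: the latest price is the highest in the window.
breakout = [100, 101, 100, 102, 101, 103, 104, 106, 109, 112]

position = zscore_of_last(breakout)           # positive: above the window mean
forecast = mean_reverting_forecast(breakout)  # negative: predicts a pullback
```

Any entry rule that fires near the top of the recent range puts the last price above the window mean, so a forecaster of this kind calls "down" on it regardless of what the stock does next.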

The real lesson

A model can be statistically excellent and economically worthless. A 5% improvement in next-candle error can still have zero tradeable edge once you account for the spread, the costs, and the specific trades your strategy actually takes. Statistical significance is not a profit signal.
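To see how a better forecast can still be worth nothing, consider a back-of-the-envelope calculation of per-trade edge net of costs. Every number here is invented for illustration, and `net_edge` is a hypothetical helper, not something taken from our backtester.

```python
def net_edge(hit_rate, avg_win, avg_loss, round_trip_cost):
    """Expected return per trade after costs (all inputs as fractions)."""
    gross = hit_rate * avg_win - (1.0 - hit_rate) * avg_loss
    return gross - round_trip_cost


# Invented numbers: 1% average win, 1% average loss, 0.1% round-trip cost.
before = net_edge(0.50, 0.010, 0.010, 0.001)  # baseline forecaster
after = net_edge(0.52, 0.010, 0.010, 0.001)   # slightly more accurate forecaster
```

In this toy case the second forecaster is measurably better, yet both expected values are below zero: the improvement is real and still too small to pay for the spread and costs.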

This is exactly why Mercurio has a backtesting discipline instead of an intuition. The Kronos integration felt like it should help: a sophisticated model adding a sanity check to our entries. Measured honestly, even the gentlest gate was a 37-percentage-point mistake, cutting the two-year return from +53.1% to +15.9%. We rejected it, wrote it down, and moved on. That negative result is worth more than a dozen plausible-sounding features we never tested.

You can read more about Kronos in its AAAI 2026 paper, and about how we validate everything in our backtesting deep dive.

Disclaimer. Figures are from historical simulation on paper capital. Nothing here is financial advice. Past performance does not guarantee future results.
