
BTS #9: Three Models, Zero Survivors

Graveyard of Good Ideas — Where hypotheses go to get humbled. This one brought three models to a gunfight with the German betting market. The market didn't even flinch.

We spent the last two weeks building a proper backtesting infrastructure for the Bundesliga. Not because we expected it to work on the first try — we're not that naive anymore — but because you can't know if something works until you test it against real data with real odds. The pitch was simple enough. Academic papers reported positive returns from football prediction models. One paper claimed 10% ROI using expected goals data. Another documented market inefficiencies across European leagues. We had 20 years of historical data, 12 seasons of shot-quality metrics, and enough computing power to test every reasonable combination of parameters. So we did.
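The backtest logic behind every number in this post is simple: bet a flat stake whenever the model's probability times the decimal odds implies positive expected value, then measure return on total stakes. A minimal sketch (the function name and input format are illustrative, not our actual pipeline):

```python
def backtest_flat(bets):
    """Flat-stake backtest over a list of (model_prob, decimal_odds, won) tuples.

    Illustrative sketch: bet 1 unit whenever the model sees value,
    i.e. model probability x decimal odds exceeds 1.
    """
    staked = profit = 0.0
    for p, odds, won in bets:
        if p * odds > 1.0:  # positive expected value under the model
            staked += 1.0
            profit += (odds - 1.0) if won else -1.0
    return profit / staked if staked else 0.0  # ROI per unit staked
```

The catch, of course, is that `model_prob` has to be better calibrated than the odds themselves, which is exactly what the rest of this post is about.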

The Models

Model 1: Dixon-Coles (goals only)

The classic. Published in 1997, still the baseline everyone compares against. It takes historical scores, estimates each team's attacking and defensive strength, and uses a Poisson distribution to predict outcomes. We gave it 17 seasons of out-of-sample testing. Result: -2.5% ROI across 7,205 bets. The model's probability estimates were consistently worse than the market's (Brier score 0.614 vs. the market's 0.585). Permutation test p-value: 0.88. Completely indistinguishable from chance.

Model 2: Expected Goals + Skellam Distribution

This is the approach that supposedly generated 10% returns. Instead of raw goals, it uses expected goals (xG), a metric that captures shot quality, not just whether the ball went in. The idea is that xG contains information the market hasn't fully absorbed. We tested 9 parameter combinations (3 window sizes × 3 home-advantage values). Best result: +2.5% ROI at the most aggressive threshold. Sounds promising until you check the p-value: 0.25. And the model's Brier score was actually worse than pure goal-based prediction (0.661 vs. 0.614). The model was making more money despite being less accurate: a classic sign of noise, not signal.

Model 3: Simplified xG (Wilkens-faithful)

Maybe our xG model was too complex. We stripped it back to the exact approach described in the paper: use rolling home and away xG averages directly as Poisson parameters. We tried windows from 6 to 38 matches. Result: negative ROI at every single window size. Range: -1.2% to -6.2%.
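Models 1 and 3 share the same final step: turn two Poisson goal rates into home/draw/away probabilities (Model 2 works on the goal difference via the Skellam distribution instead). A minimal sketch using independent Poissons and omitting the Dixon-Coles low-score correction; in practice the rates come from fitted attack/defence strengths or rolling xG averages, and the values below are illustrative:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals given rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

def outcome_probs(lam_home, lam_away, max_goals=10):
    """Home/draw/away probabilities from independent Poisson goal rates.

    Sketch only: real Dixon-Coles adds a dependence correction for
    low-scoring results (0-0, 1-0, 0-1, 1-1).
    """
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    return home, draw, away
```

With a home rate of 1.8 and an away rate of 1.2, the home-win probability comes out well above the away-win probability, as you'd expect.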

The Pattern

Here's what every model had in common: the market was better. Not by a hair — by a mile. Our best Brier score (0.614) was 5% worse than the market's (0.585). Across thousands of predictions, three different modeling approaches, and dozens of parameter settings, we couldn't produce probability estimates that matched what Pinnacle was already offering. This is exactly what happened with our NBA ELO model. And it matches what Angelini & De Angelis found in their 2019 study of European betting markets: the Bundesliga is efficient. The odds already reflect the information we were trying to exploit.
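For readers who want the scoring mechanics: the three-way Brier score sums squared errors over the home/draw/away outcomes (lower is better), and a permutation test asks how often randomized profits beat the observed mean. A sketch under assumptions (a sign-flip test is one common permutation scheme; our pipeline's exact test may differ):

```python
import random

def brier_3way(probs, outcome):
    """Multiclass Brier score for one match.

    probs = (p_home, p_draw, p_away); outcome is 0, 1, or 2.
    Lower is better; a perfect forecast scores 0.
    """
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))

def permutation_pvalue(profits, n_perm=10_000, seed=0):
    """P-value for mean bet profit > 0 via random sign flips.

    Illustrative: flips each profit's sign with probability 1/2 and
    counts how often the permuted mean matches or beats the observed one.
    """
    rng = random.Random(seed)
    observed = sum(profits) / len(profits)
    hits = 0
    for _ in range(n_perm):
        perm = sum(p if rng.random() < 0.5 else -p
                   for p in profits) / len(profits)
        if perm >= observed:
            hits += 1
    return hits / n_perm
```

A uniform (1/3, 1/3, 1/3) forecast scores 2/3 on every match; our 0.614 vs. the market's 0.585 is the gap that mattered.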

What About That 10% ROI Paper?

Here's the part nobody talks about. That paper reported ~10% ROI with a calibration layer (Isotonic Regression) that adjusted the model's raw probabilities. Without calibration? About 1% — which is exactly what we found. The calibration layer was responsible for 90% of the profit. We've seen this movie before. In our NBA work, the same calibration technique made things worse because we didn't have enough data. With 3,000+ football matches, the conditions are different. But the lesson stands: if your model only works with post-hoc probability adjustment, you probably don't have a real edge.
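The calibration layer itself is not exotic. Isotonic regression is Pool Adjacent Violators: sort matches by the model's raw probability, then fit the best non-decreasing step function to the observed outcomes. A pure-Python sketch (real pipelines would use, e.g., scikit-learn's IsotonicRegression, and must fit calibration on held-out data to avoid exactly the post-hoc overfitting described above):

```python
def isotonic_fit(ys):
    """Pool Adjacent Violators on outcomes sorted by raw model probability.

    ys: 0/1 outcomes, already ordered by the model's raw probability.
    Returns a non-decreasing fitted value per observation.
    """
    blocks = []  # each block: [mean, weight]
    for y in ys:
        blocks.append([y, 1.0])
        # merge backwards while monotonicity is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w])
    fitted = []
    for mean, w in blocks:
        fitted.extend([mean] * int(w))
    return fitted
```

When the outcomes already increase with the raw probabilities, the fit changes nothing; when they don't, violating neighbours get pooled into a shared average, which is where the "adjustment" (and the overfitting risk) lives.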

What Happens Next

The Bundesliga experiment is done. Three models tested, zero survived. But the infrastructure we built — the data pipelines, the backtesting engine, the team name mappings, the xG collection — all of that transfers directly to other leagues. Specifically, we're looking at leagues that academic research has flagged as potentially less efficient. Tomas pointed out early on that we were testing against one of the most efficient football markets in Europe. Elena had the design for a multi-league approach ready before we even ran the first backtest. The question isn't whether our models work. It's whether there exists a market where they work. Different question, different answer — possibly.


Disclaimer: This blog documents a research project. Nothing here constitutes financial or betting advice. Past results — especially negative ones — are presented for educational transparency.
