Behind The Scenes #2: The Bradley-Terry Experiment
☠ Graveyard of Good Ideas — This one stung a little more than usual.
After establishing that ELO loses to the market overall, we asked what felt like a clever question: sure, ELO is worse on average, but does it know anything the market doesn't?
Picture it like this. Your friend is terrible at restaurant recommendations overall. But every time they say "the carbonara here is amazing," the carbonara is actually amazing. You wouldn't trust their general judgment, but you'd trust the specific signal. That's what we were looking for — the carbonara.
The test: a Bradley-Terry residual model.
logit(P_win) = logit(P_market) + β × elo_residual

β > 0 means ELO sees something the market misses. β = 0 means noise. β < 0 means the market is right and ELO's "insights" are actively misleading.
We ran it on 617 games with precise moneyline odds:
| Metric | Value |
|---|---|
| β coefficient | +0.218 |
| p-value (Wald) | 0.181 |
| Permutation Test p | 0.184 |
| OOS Brier vs Market | +0.312 pp (worse) |
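The Brier score in that last row is just mean squared error on probabilities. A quick sketch with made-up numbers, not the real games:

```python
import numpy as np

def brier(p, y):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return float(np.mean((np.asarray(p) - np.asarray(y)) ** 2))

y      = np.array([1, 0, 1, 1, 0])          # outcomes (synthetic)
market = np.array([0.70, 0.35, 0.60, 0.80, 0.30])
blend  = np.array([0.72, 0.40, 0.55, 0.78, 0.33])  # market + ELO residual blend

print(f"market: {brier(market, y):.4f}")
print(f"blend:  {brier(blend, y):.4f}")
# a positive (blend - market) gap means the blend is worse out of sample
```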
Positive β. Not significant, but positive. This is the sports analytics equivalent of finding a $20 bill on the sidewalk and thinking "maybe I should start a metal detecting business." The rational response is "interesting but inconclusive." Our actual response was "GET MORE DATA."
So we expanded to 3,444 games using point spreads converted to probabilities. The conversion was legit: a probit model with σ = 13.0 points, validated at r = 0.996 against the moneyline subset. No methodological shortcuts.
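The probit conversion treats the final margin as roughly normal around the spread with σ = 13.0 points, so the win probability is one tail of that normal. A sketch, with the function name being ours:

```python
from math import erf, sqrt

SIGMA = 13.0  # stdev of NBA margin around the spread, per the validation above

def spread_to_prob(spread: float) -> float:
    """Home win probability from the home spread (negative = home favored)."""
    # margin ~ Normal(-spread, SIGMA); return P(margin > 0)
    return 0.5 * (1 + erf(-spread / (SIGMA * sqrt(2))))

print(spread_to_prob(-7.0))   # 7-point home favorite
print(spread_to_prob(0.0))    # pick'em -> 0.5
```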
With 5.6x more data, the picture became clearer:
| Metric | 617 Games | 3,444 Games |
|---|---|---|
| β coefficient | +0.218 | -0.126 |
| Permutation p | 0.184 | 0.256 |
| AIC best model | — | Baseline (no ELO) |
The sign flipped. Our promising +0.218 became -0.126. The $20 bill on the sidewalk was a receipt.
This is the small-sample trap doing what it does best. You find a pattern on limited data, you get excited, and with real statistical power it vanishes — or worse, reverses. If we'd stopped at 617 games and built a trading system around β = +0.218, we'd have been betting on a ghost. The only thing separating us from that mistake was stubbornness about sample sizes.
Three model variations, because when something dies you check for a pulse three times:
| Model | Description | Key p-value | AIC vs Baseline |
|---|---|---|---|
| Linear | logit(P) = logit(market) + β·δ | p > 0.10 | Worse |
| Quadratic | + β₂·δ² | Both p > 0.10 | Worse |
| Full | α + β₁·logit(market) + β₂·δ | ELO p > 0.10 | Worse |
All dead. The best model is the one that pretends ELO doesn't exist, which is becoming a recurring theme around here.
During the internal review, Raj argued for expanding the data (done), Tomas argued we should've quit two experiments ago (noted, as usual), and Sora proposed three creative alternatives that all died for the same reason: not enough data to test them properly. The creative ones always go out in the most frustrating way.
To be fair to ELO: it picks NBA winners about 63% of the time. That's genuinely decent. It's just that the market picks them better, and — crucially — ELO doesn't know anything the market doesn't already know. Our friend doesn't have a carbonara call. They're just... a worse version of Yelp.
Research direction shifts to structural features next. Schedule fatigue, travel, back-to-backs. Maybe the market misses something in the logistics, even if it nails the ratings. Or maybe we're about to add another resident to the graveyard.
Disclaimer: This content is for informational and educational purposes only. Nothing here constitutes financial or investment advice.