Behind The Scenes #5: The Overconfidence Detector
☠ Graveyard of Good Ideas — We knew this one was dead before we started. We tested it anyway.
Before running this experiment, the room was 90% certain it would fail. Tomas gave it a 5% chance of working. Raj didn't even look up from his coffee. Sora tried to defend it and couldn't get past the first sentence.
So why bother? Because "probably doesn't work" is an opinion, and "definitely doesn't work, here's the proof" is data. We've seen too many maybe-dead ideas resurrect themselves in late-night conversations to leave doors half-closed.
In our earlier analysis, we established that our model is worse than the market (Brier 0.2222 vs 0.2065, where lower is better). Known fact. But when we sliced a 577-game backtest by edge size, this pattern showed up:
| Edge Band | Bets | Win Rate vs Market | ROI |
|---|---|---|---|
| 0-5% | 104 | +6.8%pt | +11.6% |
| 5-10% | 174 | +2.6%pt | +4.2% |
| 10-15% | 145 | -1.1%pt | -8.2% |
| 15-20% | 79 | -0.6%pt | -5.8% |
| 20-30% | 69 | +10.9%pt | +43.9% |
A U-shape. Small disagreements with the market: money. Medium disagreements: disaster. Large disagreements: money again? The idea wrote itself — if we could just filter out the 10-20% "overconfidence zone," maybe a losing model becomes a winning one.
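For anyone who wants to try the same slicing on their own bet log, here is a minimal Python sketch of how a table like the one above could be built. The `Bet` record, its field names, and the definition of edge (model probability minus market-implied probability) are assumptions for illustration, not the actual backtest code.

```python
from dataclasses import dataclass


@dataclass
class Bet:
    edge: float    # assumed: model probability minus market-implied probability (0.07 = 7%)
    stake: float   # amount risked
    profit: float  # net result: positive on a win, -stake on a loss


# Band boundaries taken from the table; the upper bound of each band is exclusive.
BANDS = [(0.00, 0.05), (0.05, 0.10), (0.10, 0.15), (0.15, 0.20), (0.20, 0.30)]


def band_label(lo, hi):
    return f"{lo * 100:.0f}-{hi * 100:.0f}%"


def roi_by_band(bets):
    """Return {band label: (number of bets, ROI)} for every band that has bets."""
    out = {}
    for lo, hi in BANDS:
        in_band = [b for b in bets if lo <= b.edge < hi]
        staked = sum(b.stake for b in in_band)
        if staked > 0:
            roi = sum(b.profit for b in in_band) / staked
            out[band_label(lo, hi)] = (len(in_band), roi)
    return out
```

ROI here is total net profit divided by total stake within the band, which is one common convention. If the backtest used a different one, the numbers would shift accordingly.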
Elena drafted an architecture. Nate ran preliminary numbers. For about 45 minutes, the room was cautiously hopeful, which around here is the emotional equivalent of a parade.
Then we put it through proper testing. Four tests, and the idea had to pass all four.
Bootstrap confidence intervals
Is the 0-5% band's +11.6% ROI real or noise?
95% CI: [-11.0%, +35.2%]. That interval is so wide you could fit a career change inside it. Every band's CI included zero. Not one could prove it wasn't random.
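A percentile bootstrap is one way to get an interval like this. The sketch below resamples bets with replacement and reads the 2.5th and 97.5th percentiles off the resampled ROIs. The function name, the parameters, and the choice of the percentile method are assumptions; the real run may have used a different bootstrap variant.

```python
import random


def bootstrap_roi_ci(profits, stakes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ROI = total profit / total stake.

    profits and stakes are parallel lists, one entry per bet (stakes > 0).
    """
    rng = random.Random(seed)
    n = len(profits)
    rois = []
    for _ in range(n_boot):
        # Resample n bets with replacement and recompute ROI on the resample.
        idx = [rng.randrange(n) for _ in range(n)]
        staked = sum(stakes[i] for i in idx)
        rois.append(sum(profits[i] for i in idx) / staked)
    rois.sort()
    lo = rois[int((alpha / 2) * n_boot)]
    hi = rois[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With only around a hundred bets in a band, the standard error of the ROI alone is on the order of ten percentage points, so intervals that straddle zero are what you should expect unless the edge is very large.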
Time-split out-of-sample
Split the 577 games in half chronologically. Does the U-shape appear in both?
| Edge Band | 1st Half ROI | 2nd Half ROI | Consistent? |
|---|---|---|---|
| 0-5% | +10.4% | +12.3% | Yes |
| 5-10% | -10.3% | +16.8% | Nope |
| 10-15% | -0.5% | -14.1% | Yes |
| 15-20% | -40.3% | +50.7% | Nope |
| 20-30% | +31.7% | +81.2% | Yes |
The "bad zone" at 15-20% flipped from -40.3% to +50.7% between halves. The U-shape is a Rorschach test — you see what you want to see, and it changes depending on which half of the data you're looking at.
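The split itself is simple to sketch. The version below sorts bets by time, cuts the list in half, and reports each band's ROI in both halves plus whether the sign agrees. The tuple layout and the "same sign" definition of consistency are assumptions, although a same-sign rule does reproduce the Consistent? column in the table above.

```python
def split_half_roi(bets, bands):
    """bets: list of (timestamp, edge, stake, profit) tuples, in any order.
    bands: list of (low, high) edge bounds, upper bound exclusive.

    Returns {(low, high): (first-half ROI, second-half ROI, same sign?)}.
    A half with no bets in a band gets ROI None and counts as not consistent.
    """
    ordered = sorted(bets, key=lambda b: b[0])
    mid = len(ordered) // 2
    halves = (ordered[:mid], ordered[mid:])
    report = {}
    for lo, hi in bands:
        rois = []
        for half in halves:
            in_band = [b for b in half if lo <= b[1] < hi]
            staked = sum(b[2] for b in in_band)
            rois.append(sum(b[3] for b in in_band) / staked if staked else None)
        same_sign = None not in rois and (rois[0] > 0) == (rois[1] > 0)
        report[(lo, hi)] = (rois[0], rois[1], same_sign)
    return report
```

Splitting by time rather than at random matters here: a random split would let later games leak into the "first" half and make unstable patterns look more stable than they are.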
Permutation tests
| Edge Band | ROI | p-value |
|---|---|---|
| 0-5% | +11.6% | 0.387 |
| 5-10% | +4.2% | 0.644 |
| 10-15% | -8.2% | 0.925 |
| 15-20% | -5.8% | 0.796 |
| 20-30% | +43.9% | 0.019 |
Only the 20-30% band survived (p = 0.019), but it rests on just 69 bets, and the top 5 winners account for 78% of the profit. Five underdog parlays carrying an entire strategy is less "proven edge" and more "lucky streak with a good story."
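For readers unfamiliar with permutation tests, here is a rough sketch of one design that would produce p-values like those above. It asks: if the band label meant nothing, how often would a random group of the same size from the full backtest reach an ROI at least as high? The one-sided direction, the label-shuffling scheme, and all names are assumptions, not a description of the exact test that was run.

```python
import random


def permutation_p_value(profits, stakes, in_band, n_perm=10_000, seed=0):
    """One-sided permutation p-value for a band's ROI.

    profits, stakes: parallel lists covering every bet in the backtest.
    in_band: parallel list of booleans marking the bets in the band under test.
    """
    rng = random.Random(seed)
    n = len(profits)
    band_idx = [i for i, flag in enumerate(in_band) if flag]
    k = len(band_idx)

    def roi(idx):
        return sum(profits[i] for i in idx) / sum(stakes[i] for i in idx)

    observed = roi(band_idx)
    # Count random same-size groups that do at least as well as the real band.
    hits = sum(roi(rng.sample(range(n), k)) >= observed for _ in range(n_perm))
    # The +1 terms keep the estimate away from an impossible p-value of exactly 0.
    return (hits + 1) / (n_perm + 1)
```

A test like this only says the band's ROI is unusual relative to random groups. It says nothing about whether a handful of long-shot winners is doing all the work, which is why the concentration check matters separately.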
Final count
| Test | Result |
|---|---|
| Bootstrap CI | FAIL — includes zero |
| Time-Split OOS | FAIL — pattern doesn't reproduce |
| Permutation (0-5%) | FAIL — p = 0.39 |
| High-Edge Jackknife | PASS — stable |
One out of four. The U-shape is what happens when you slice data enough times — you'll always find a pattern. It just won't be there next time you look.
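The slicing problem can be put in numbers. The sketch below computes how likely it is that at least one of five bands clears p < 0.05 by chance alone, and applies a Bonferroni correction to the permutation p-values from the table above. It assumes the five tests are independent and that 0.05 is the significance threshold, neither of which is stated in the original analysis.

```python
def family_wise_error(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests passes by luck alone."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the overall false-positive rate near alpha."""
    return alpha / n_tests


# Permutation p-values copied from the table above.
p_values = {"0-5%": 0.387, "5-10%": 0.644, "10-15%": 0.925, "15-20%": 0.796, "20-30%": 0.019}

fwer = family_wise_error(len(p_values))
threshold = bonferroni_threshold(len(p_values))
survivors = [band for band, p in p_values.items() if p < threshold]
```

Under those assumptions, five bands give roughly a 23% chance of at least one false positive, and the corrected threshold is 0.01. The 20-30% band's 0.019 does not clear it, so `survivors` comes back empty.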
Full scoreboard of everything we've tried:
| Approach | Result |
|---|---|
| Elo model (base) | Brier worse than market |
| B2B fatigue correction | Made things worse |
| Parameter optimization | No improvement |
| Spread market entry | Negative ROI everywhere |
| Calibrator tuning | Data leak (oops) |
| Bradley-Terry residuals | Signal reversed at scale |
| GLM structural features | All non-significant |
| Player impact | Failed out-of-sample |
| Overconfidence detector | U-shape was a mirage |
Marcus, who argued against running this test, said afterward that he sleeps better knowing the door is locked instead of just closed. Tomas said "I told you so" with a level of satisfaction that should probably concern us. But even he admitted: knowing something doesn't work is different from assuming it doesn't work. One costs thirty minutes of compute. The other costs months of "but what if."
Every Elo-based angle has been tested. The research direction has to change now: different model, different sport, or both. We're still tracking live bets (37, hovering around break-even), and the CLV checkpoint is coming. But the flashlight we've been using is out of batteries, and the hallway is dark.
Disclaimer: This content is for informational and educational purposes only. Nothing here constitutes financial or investment advice.