Analysis

Behind The Scenes #5: The Overconfidence Detector

March 3, 2026 · 3 min read

☠ Graveyard of Good Ideas — We knew this one was dead before we started. We tested it anyway.

Before running this experiment, the room was 90% certain it would fail. Tomas gave it a 5% chance of working. Raj didn't even look up from his coffee. Sora tried to defend it and couldn't get past the first sentence.

So why bother? Because "probably doesn't work" is an opinion, and "definitely doesn't work, here's the proof" is data. We've seen too many maybe-dead ideas resurrect themselves in late-night conversations to leave doors half-closed.

In our earlier analysis, we established that our model is worse than the market (Brier score 0.2222 vs 0.2065; lower is better). Known fact. But when we sliced a 577-game backtest by edge size, this pattern showed up:

Edge Band	Bets	Win Rate vs Market	ROI
0-5%	104	+6.8%pt	+11.6%
5-10%	174	+2.6%pt	+4.2%
10-15%	145	-1.1%pt	-8.2%
15-20%	79	-0.6%pt	-5.8%
20-30%	69	+10.9%pt	+43.9%

A U-shape. Small disagreements with the market: money. Medium disagreements: disaster. Large disagreements: money again? The idea wrote itself — if we could just filter out the 10-20% "overconfidence zone," maybe a losing model becomes a winning one.
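For anyone who wants to reproduce the slicing, here is a minimal sketch of the banding step. It assumes "edge" means our model's probability minus the market-implied probability for the side we bet, and that every bet is a flat 1-unit stake. The `Bet` fields and function names are made up for illustration, not lifted from our pipeline.

```python
from dataclasses import dataclass

@dataclass
class Bet:
    model_prob: float    # model's win probability for the side we bet (assumed field)
    market_prob: float   # market-implied probability for the same side (assumed field)
    decimal_odds: float  # payout multiple per unit staked, stake included
    won: bool

# Band edges as probabilities, matching the table above.
BANDS = [(0.00, 0.05), (0.05, 0.10), (0.10, 0.15), (0.15, 0.20), (0.20, 0.30)]

def profit(bet: Bet) -> float:
    """Profit on a 1-unit flat stake (assumption: flat staking)."""
    return bet.decimal_odds - 1.0 if bet.won else -1.0

def roi_by_band(bets):
    """Return {(lo, hi): (n_bets, roi)}. Bets outside every band are skipped."""
    grouped = {band: [] for band in BANDS}
    for bet in bets:
        edge = bet.model_prob - bet.market_prob
        for lo, hi in BANDS:
            if lo <= edge < hi:
                grouped[(lo, hi)].append(profit(bet))
                break
    return {
        band: (len(p), sum(p) / len(p) if p else 0.0)
        for band, p in grouped.items()
    }
```

ROI here is just mean profit per unit staked, so a band with one win at decimal odds 2.0 and one loss comes out at exactly 0%.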

Elena drafted an architecture. Nate ran preliminary numbers. For about 45 minutes, the room was cautiously hopeful, which around here is the emotional equivalent of a parade.

Then we ran it through proper testing. Four tests, and an idea has to pass all four.

Bootstrap confidence intervals

Is the 0-5% band's +11.6% ROI real or noise?

95% CI: [-11.0%, +35.2%]. That interval is so wide you could fit a career change inside it. Every band's CI included zero. Not one could prove it wasn't random.
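A rough sketch of how an interval like that can be computed, assuming a plain percentile bootstrap over per-bet profits on flat 1-unit stakes. The post doesn't pin down the exact bootstrap variant, so treat the function name and defaults below as illustrative.

```python
import random

def bootstrap_roi_ci(profits, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ROI (mean profit per unit staked)."""
    rng = random.Random(seed)
    n = len(profits)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(profits, k=n)  # resample bets with replacement
        means.append(sum(sample) / n)
    means.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled ROIs.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With only about a hundred bets in a band and payouts that swing between -1 and the odds, the resampled ROIs spread out a lot, which is why the intervals end up that wide.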

Time-split out-of-sample

Split the 577 games in half chronologically. Does the U-shape appear in both?

Edge Band	1st Half ROI	2nd Half ROI	Consistent?
0-5%	+10.4%	+12.3%	Yes
5-10%	-10.3%	+16.8%	Nope
10-15%	-0.5%	-14.1%	Yes
15-20%	-40.3%	+50.7%	Nope
20-30%	+31.7%	+81.2%	Yes

The "bad zone" at 15-20% flipped from -40.3% to +50.7% between halves. The U-shape is a Rorschach test — you see what you want to see, and it changes depending on which half of the data you're looking at.
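The consistency check in that table boils down to a sign comparison between halves. A minimal sketch, assuming the bets are already sorted oldest first and tagged with a band label (it splits the list of bets at the midpoint, which may differ slightly from splitting by game; the names are illustrative):

```python
def roi(profits):
    """Mean profit per unit staked, or None for an empty band."""
    return sum(profits) / len(profits) if profits else None

def split_half_roi(bets_in_time_order):
    """bets_in_time_order: list of (band_label, profit), oldest first.

    Returns {band_label: (first_half_roi, second_half_roi, consistent)},
    where consistent means both halves have the same ROI sign.
    """
    mid = len(bets_in_time_order) // 2
    halves = (bets_in_time_order[:mid], bets_in_time_order[mid:])
    out = {}
    for label in sorted({band for band, _ in bets_in_time_order}):
        r1, r2 = (roi([p for b, p in half if b == label]) for half in halves)
        # No verdict if a band has no bets in one of the halves.
        consistent = None if r1 is None or r2 is None else (r1 > 0) == (r2 > 0)
        out[label] = (r1, r2, consistent)
    return out
```

Same sign in both halves is a low bar. A band can clear it and still be noise, but a band that fails it has very little left to stand on.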

Permutation tests

Edge Band	ROI	p-value
0-5%	+11.6%	0.387
5-10%	+4.2%	0.644
10-15%	-8.2%	0.925
15-20%	-5.8%	0.796
20-30%	+43.9%	0.019

Only the 20-30% band survived, but it rests on just 69 bets, and the top 5 winners account for 78% of the profit. Five underdog parlays carrying an entire strategy is less "proven edge" and more "lucky streak with a good story."
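For the curious, one way to set up a permutation test like this. The null hypothesis is an assumption on our part for this sketch: band membership has nothing to do with profit, so any random set of bets of the same size should do about as well. The function name and the one-sided framing are illustrative.

```python
import random

def permutation_p_value(band_profits, all_profits, n_perm=10_000, seed=0):
    """One-sided p-value: how often does a random same-size set of bets
    have a mean profit at least as high as the band's?"""
    rng = random.Random(seed)
    k = len(band_profits)
    observed = sum(band_profits) / k
    hits = 0
    for _ in range(n_perm):
        fake_band = rng.sample(all_profits, k)  # equivalent to shuffling band labels
        if sum(fake_band) / k >= observed:
            hits += 1
    # The +1 keeps the estimate from ever being exactly zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value only says the band is unusual compared to random subsets. It says nothing about whether five bets are doing all the work, which is why the concentration check matters.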

Final count

Test	Result
Bootstrap CI	FAIL — includes zero
Time-Split OOS	FAIL — pattern doesn't reproduce
Permutation (0-5%)	FAIL — p = 0.39
High-Edge Jackknife	PASS — stable

One out of four. The U-shape is what happens when you slice data enough times — you'll always find a pattern. It just won't be there next time you look.
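The "slice enough times" problem can be put in numbers. The sketch below takes the five p-values from the permutation table and applies a Bonferroni correction, which is one conservative option and not necessarily what our pipeline used. It also treats the five bands as independent tests, which is only roughly true.

```python
# p-values copied from the permutation table in this post
P_VALUES = {
    "0-5%": 0.387, "5-10%": 0.644, "10-15%": 0.925,
    "15-20%": 0.796, "20-30%": 0.019,
}

def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_survivors(p_values, alpha=0.05):
    """Bands whose p-value clears the stricter alpha / n_tests threshold."""
    threshold = alpha / len(p_values)
    return [band for band, p in p_values.items() if p < threshold]

print(round(familywise_error_rate(len(P_VALUES)), 3))  # 0.226
print(bonferroni_survivors(P_VALUES))                  # []
```

With five bands there is roughly a 23% chance that at least one looks significant at 0.05 by luck alone, and the 20-30% band's 0.019 does not clear the corrected 0.01 threshold.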

Full scoreboard of everything we've tried:

Approach	Result
ELO model (base)	Brier worse than market
B2B fatigue correction	Made things worse
Parameter optimization	No improvement
Spread market entry	Negative ROI everywhere
Calibrator tuning	Data leak (oops)
Bradley-Terry residuals	Signal reversed at scale
GLM structural features	All non-significant
Player impact	Failed out-of-sample
Overconfidence detector	U-shape was a mirage

Marcus, who argued against running this test, said afterward that he sleeps better knowing the door is locked instead of just closed. Tomas said "I told you so" with a level of satisfaction that should probably concern us. But even he admitted: knowing something doesn't work is different from assuming it doesn't work. One costs thirty minutes of compute. The other costs months of "but what if."

Every ELO-based angle has been tested. The research direction has to change now — different model, different sport, or both. We're still tracking live bets (37, hovering around break-even), and the CLV checkpoint is coming. But the flashlight we've been using is out of batteries, and the hallway is dark.

Disclaimer: This content is for informational and educational purposes only. Nothing here constitutes financial or investment advice.
