Evaluated on 100 resolved prediction markets from Polymarket and Metaculus. All results on held-out test data.
Calibration plot: predicted probability vs. actual outcome frequency. Points on the diagonal indicate perfect calibration; dot size reflects the sample count in each bin. ECE = 9.4%.
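
For readers who want to reproduce a number like the ECE above, here is a minimal sketch of a binned expected calibration error. The source does not state the binning scheme, so the ten equal-width bins and the function name `expected_calibration_error` are assumptions for illustration, not the evaluation's actual code.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Binned ECE: sample-weighted mean of |avg predicted prob - observed frequency|.

    probs    -- predicted probabilities of YES, each in [0, 1]
    outcomes -- resolved outcomes, 1 for YES and 0 for NO
    n_bins   -- number of equal-width bins (assumed; not given in the source)
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_prob = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(avg_prob - frequency)
    return ece


# Toy data: two bins, each off from its observed frequency by 0.1.
toy_ece = expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
```

Each non-empty bin corresponds to one dot on the calibration plot: its horizontal position is `avg_prob`, its vertical position is `frequency`, and its weight in the sum is the share of markets it contains.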
Model vs. consensus comparison (lower is better), restricted to the 58 test markets with consensus data.
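How each training iteration changed performance on the same 100-market test set. Note that SFT v2 has a slightly worse Brier score and accuracy than SFT v1 but a much lower log loss; temperature calibration then improves Brier score, ECE, and log loss while leaving accuracy unchanged.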
| Version | Brier Score | ECE | Log Loss | Accuracy | Notes |
|---|---|---|---|---|---|
| Zero-Shot (70B) | 0.287 | 22.1% | — | 49% | Llama 3.3 70B via ollama, no training |
| SFT v1 | 0.228 | 23.1% | 3.303 | 74% | Hard binary labels → overconfident |
| SFT v2 | 0.240 | 21.6% | 0.826 | 71% | Label smoothing fix |
| SFT v2 + Temp Cal | 0.197 | 9.4% | 0.582 | 71% | Temperature scaling (T=2.98) |
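
The last row applies temperature scaling with T=2.98. As a rough sketch of what that step might look like for a binary forecast, the code below divides the log-odds of a probability by T. The source does not say how T was fitted or exactly where it is applied, so applying it to the log-odds of the YES probability is an assumption, and the helper names (`temperature_scale`, `log_loss_single`) are hypothetical.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 to keep the log finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def temperature_scale(p, T):
    """Soften (T > 1) or sharpen (T < 1) a probability by dividing its log-odds by T."""
    return sigmoid(logit(p) / T)


def log_loss_single(p, y, eps=1e-15):
    """Negative log-likelihood of one binary outcome y (1 or 0) under probability p."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


T = 2.98  # value reported in the table
raw = 0.99  # an overconfident YES forecast (illustrative, not from the evaluation)
scaled = temperature_scale(raw, T)

# Penalty if the market resolves NO, before and after scaling.
loss_before = log_loss_single(raw, 0)
loss_after = log_loss_single(scaled, 0)
```

Because dividing log-odds by a positive T never moves a probability across 50%, thresholded accuracy is unaffected, which is consistent with the 71% reported both before and after calibration. What changes is the penalty for confident misses, which is where log loss and ECE are most sensitive.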
Brier Score across market categories (lower is better)
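
A per-category Brier score can be computed by grouping markets and averaging the squared error within each group. The sketch below assumes each record is a `(category, predicted probability of YES, outcome)` tuple; the sample records are illustrative values, not results from the evaluation.

```python
from collections import defaultdict


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def brier_by_category(records):
    """Group (category, prob, outcome) records and score each category separately."""
    grouped = defaultdict(list)
    for category, p, y in records:
        grouped[category].append((p, y))
    return {
        category: brier_score([p for p, _ in items], [y for _, y in items])
        for category, items in grouped.items()
    }


# Illustrative records only; category names mirror those used on this page.
records = [
    ("Politics", 0.8, 1),
    ("Politics", 0.7, 0),
    ("Price", 0.5, 1),
    ("Price", 0.6, 0),
]
scores = brier_by_category(records)
```

As a reference point, a forecaster that always answers 50% scores exactly 0.25, so category scores below that value beat an uninformed guess.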
Selected predictions from the 100-market test set. All markets have resolved.
| Market | Category | Model | Consensus | Outcome | Result |
|---|---|---|---|---|---|
| USB close price higher on Dec 5 vs Nov 24? | Price | 52% | 55% | YES | Correct |
| Will US government shut down before Oct 2, 2025? | Politics | 79% | 74% | YES | Correct |
| OpenAI file S-1 with SEC before Dec 15, 2025? | Other | 27% | 5% | NO | Correct |
| Will Australia retain the Ashes 2025-26? | Other | 73% | 99% | YES | Correct |
| Ukraine extend martial law beyond Nov 5, 2025? | Politics | 79% | 90% | YES | Correct |
| Metaculus: Will Nvidia stock close 2025 higher? | Meta | 73% | 60% | YES | Correct |
| Bill Ackman beat politician returns in 2025? | Other | 73% | 4% | NO | Wrong |
| BLDR close price higher Dec 20 vs Dec 8? | Price | 73% | 52% | NO | Wrong |
| UN General Assembly condemn US re Venezuela? | Politics | 79% | 55% | NO | Wrong |
| Arsenal vs Man City match end in draw? | Other | 27% | 28% | YES | Wrong |
Showing 10 of 100 test markets. Predictions are post-calibration probabilities.
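
The Result column in the sample table appears to follow from thresholding the model probability at 50%. The source does not state this rule, so it is an assumption; the sketch below checks that it reproduces all ten rows shown. The helper name `label` is hypothetical.

```python
# (model probability of YES, resolved outcome, Result as printed in the table)
rows = [
    (0.52, "YES", "Correct"),  # USB close price
    (0.79, "YES", "Correct"),  # US government shutdown
    (0.27, "NO", "Correct"),   # OpenAI S-1
    (0.73, "YES", "Correct"),  # Ashes 2025-26
    (0.79, "YES", "Correct"),  # Ukraine martial law
    (0.73, "YES", "Correct"),  # Nvidia stock
    (0.73, "NO", "Wrong"),     # Bill Ackman returns
    (0.73, "NO", "Wrong"),     # BLDR close price
    (0.79, "NO", "Wrong"),     # UN General Assembly
    (0.27, "YES", "Wrong"),    # Arsenal vs Man City draw
]


def label(prob_yes, outcome, threshold=0.5):
    """Call the market YES above the threshold, then compare with the resolution."""
    predicted = "YES" if prob_yes > threshold else "NO"
    return "Correct" if predicted == outcome else "Wrong"


recomputed = [label(p, y) for p, y, _ in rows]
matches_table = all(r == printed for r, (_, _, printed) in zip(recomputed, rows))
n_correct = recomputed.count("Correct")
print(matches_table, n_correct, len(rows))
```

Under that assumption the rule matches every printed Result, with 6 of the 10 displayed markets correct. Since these rows are a hand-picked selection, that ratio is not an estimate of the 71% accuracy reported for the full test set.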