calibration
Track record
Brier score by category and track: editorial (LLM-based reading) and markets (prediction-market-derived). Lower is better: 0.00 is perfect, 0.25 is what a constant 50% forecast (a coin flip) scores, and 1.00 is maximally wrong. Includes only resolved forecasts from active categories. Each forecast horizon (1 week / 2 weeks / 1 month) is graded independently; the 2-week slot is the canonical reviewed signal, and the others enroll automatically as their windows close.
Calibration backtest
For resolved forecasts in each probability bucket, how often did the event actually happen? A well-calibrated model has hit rates that match the bucket's mean probability: a 75–100% bucket should resolve "yes" ~75–100% of the time. Calibration error = hit rate − mean probability; closer to zero is better. A positive error means events happened more often than forecast (underconfident), and a negative error means they happened less often (overconfident).
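The backtest described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the four equal-width bands, the rule that a probability on a boundary goes to the higher band, and the names `BUCKETS`, `bucket_index`, and `calibration_table` are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed bucket edges: four equal-width probability bands.
BUCKETS = [(0.00, 0.25), (0.25, 0.50), (0.50, 0.75), (0.75, 1.00)]


def bucket_index(p):
    """Return the index of the band containing p.

    Assumption: a boundary value such as 0.25 goes to the higher band,
    and p == 1.0 is kept in the top band.
    """
    return min(int(p * 4), 3)


def calibration_table(forecasts):
    """Summarize resolved forecasts per probability band.

    forecasts: iterable of (p, o) pairs, where p is the forecast
    probability in [0, 1] and o is the outcome (1 = yes, 0 = no).
    """
    grouped = defaultdict(list)
    for p, o in forecasts:
        grouped[bucket_index(p)].append((p, o))

    rows = []
    for i in sorted(grouped):
        items = grouped[i]
        n = len(items)
        mean_p = sum(p for p, _ in items) / n
        hit_rate = sum(o for _, o in items) / n
        rows.append({
            "bucket": BUCKETS[i],
            "n": n,
            "mean_p": mean_p,
            "hit_rate": hit_rate,
            # Calibration error = hit rate - mean probability.
            "error": hit_rate - mean_p,
        })
    return rows
```

Empty bands are omitted from the result rather than reported with a hit rate of zero, since a band with no resolved forecasts has no meaningful calibration error.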
How this is computed
Every forecast call produces three independently graded horizons (1 week, 2 weeks, and 1 month), all anchored on the same category-specific event class. After each horizon's window closes, the operator adjudicates whether the event class actually occurred (yes/no), or marks it unresolvable if the event class was too vague to grade. The 2-week slot is the canonical reviewed signal; the 1-week and 1-month slots enroll automatically as their windows close and are graded the same way. Use the horizon toggle above to inspect calibration per slot.
Editorial forecasts come from a model reading the news pool. Each is scored 0–100, and the probability used for Brier is score / 100.
Markets forecasts come from prediction-market data — the probability used for Brier is the liquidity-weighted average of matched markets at the time the forecast was snapshotted (already in [0, 1]; no scale conversion).
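As a rough sketch of how the two tracks arrive at a probability, the snippet below converts an editorial score and averages matched market prices by liquidity. The function names and the `(price, liquidity)` pair layout are hypothetical; the source does not say how matched markets are represented or how liquidity is measured.

```python
def editorial_probability(score):
    """Editorial track: convert a 0-100 score to a probability in [0, 1]."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    return score / 100


def markets_probability(markets):
    """Markets track: liquidity-weighted average of matched market prices.

    markets: list of (price, liquidity) pairs captured at snapshot time,
    with price already in [0, 1]. This layout is an assumption.
    """
    total_liquidity = sum(liquidity for _, liquidity in markets)
    if total_liquidity <= 0:
        raise ValueError("matched markets must have positive total liquidity")
    weighted = sum(price * liquidity for price, liquidity in markets)
    return weighted / total_liquidity
```

For instance, two matched markets priced at 0.60 and 0.30 with liquidity 1000 and 3000 would average to 0.375, because the more liquid market pulls the result toward its price.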
For each resolved forecast, Brier = (p − o)² where p is the forecast probability and o is the realized outcome (1 if yes, 0 if no). Mean Brier averages across all resolved forecasts in a category × track.
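The formula above translates directly into code. The sketch below assumes resolved forecasts arrive as `(category, track, p, o)` tuples; that layout and the names `brier` and `mean_brier` are illustrative, not taken from the site.

```python
from collections import defaultdict


def brier(p, o):
    """Squared error between forecast probability p and outcome o (1 = yes, 0 = no)."""
    return (p - o) ** 2


def mean_brier(resolved):
    """Mean Brier score per (category, track).

    resolved: iterable of (category, track, p, o) tuples containing
    resolved forecasts only; unresolvable ones are excluded upstream.
    """
    scores = defaultdict(list)
    for category, track, p, o in resolved:
        scores[(category, track)].append(brier(p, o))
    return {key: sum(vals) / len(vals) for key, vals in scores.items()}
```

With a hypothetical "elections" category, an editorial forecast of 0.8 that resolved yes and one of 0.3 that resolved no score 0.04 and 0.09, for a mean of about 0.065. A single markets forecast of 0.5 scores 0.25 whichever way it resolves, which is why 0.25 is the coin-flip benchmark.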
Unresolvable counts are surfaced separately so an operator can't improve their score by skipping hard cases.