
The Eroteme Accuracy Record: Understanding AI Model Performance


Most prediction platforms hide their misses. We publish ours.

Every resolved prediction on Eroteme gets scored, categorized, and added to the public accuracy dashboard. Right calls and wrong calls sit side by side. No filtering. No selective reporting. No convenient amnesia about that time the model gave 82% confidence to the wrong outcome.

This post explains exactly how we measure AI performance, what the numbers mean, and how you can use them to make sharper betting decisions.

Why We Publish Wrong Predictions

Bill Benter built a horse racing model that earned over $1 billion. Billy Walters ran the most profitable sports betting operation in history. Both men shared one core belief: your edge comes from knowing where your model fails, not from pretending it doesn't.

Benter tracked every losing bet with the same rigor as every winner. Walters obsessed over the games his system mispriced. They treated errors as data, not embarrassments.

Eroteme operates the same way. Our ensemble AI system runs multiple models on every prediction. Sometimes those models disagree. Sometimes the consensus is wrong. We record all of it because accountability is the product.

If a platform only shows you wins, it is selling you confidence. We sell you information.

How We Measure Accuracy

Three metrics power the accuracy dashboard: hit rate, Brier score, and category breakdown. Each tells a different part of the story.

Hit Rate

The simplest measure. Of all resolved predictions, what percentage did the AI get right? As of this writing, the ensemble model sits at 68% overall accuracy across all categories.

That number alone is useful but incomplete. A model that only predicts heavy favorites could hit 70% and still lose money. Context matters.

Brier Score

The Brier score is the gold standard for evaluating probabilistic forecasts. It measures how close your confidence levels are to actual outcomes on a scale from 0 (perfect) to 1 (worst possible).

Here is what the numbers mean in practice:

  • 0.10 - 0.15: Excellent calibration. The model's confidence levels closely match reality.
  • 0.15 - 0.20: Strong performance. Competitive with top forecasting systems.
  • 0.20 - 0.25: Average. The model adds value but leaves edge on the table.
  • 0.25 - 0.35: Below average. Confidence levels need recalibration.
  • 0.35+: Poor. The model is overconfident on wrong calls or underconfident on right ones.

A Brier score of 0.15 means when the model says 80% chance, the event happens roughly 80% of the time. A score of 0.25 means those 80% calls are actually hitting closer to 65-70%. That gap is where money gets lost.
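The arithmetic behind these numbers is simple enough to sketch in a few lines. This is an illustrative calculation, not Eroteme's production scoring code:

```python
# Illustrative Brier score calculation for binary predictions.
# Each prediction pairs the model's stated probability with the
# observed outcome (1 = happened, 0 = did not).

def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and binary outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must align")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# A well-calibrated set of 80% calls: the event happens 4 times out of 5.
probs = [0.80, 0.80, 0.80, 0.80, 0.80]
results = [1, 1, 1, 1, 0]
print(round(brier_score(probs, results), 3))  # 0.16
```

Note that perfectly calibrated 80% calls land around 0.16 on their own, which is why a portfolio-wide score in the 0.15 range indicates strong calibration across a mix of confidence levels.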

The Eroteme ensemble currently holds a Brier score of 0.18 across all resolved predictions. Strong, not perfect. We are working to get it below 0.15 by Q3 2026.

Category Breakdowns: Where the AI Excels and Struggles

Not all prediction categories perform equally. The accuracy dashboard breaks results into distinct verticals so you can calibrate your trust accordingly.

Sports -- 71% hit rate, 0.16 Brier score. This is the model's strongest category. High data availability, well-defined outcomes, and decades of statistical baselines give the AI the most material to work with. NFL and NBA predictions outperform soccer and tennis.

Crypto -- 63% hit rate, 0.22 Brier score. Volatile by nature. The model handles directional calls (will BTC be above $X by date Y) better than magnitude predictions. Short-term crypto forecasts (24-72 hours) score worse than 7-30 day windows.

Politics -- 66% hit rate, 0.19 Brier score. Steady improvement over the last 90 days. The model reads polling data, prediction market prices, and sentiment signals. It performs best on elections with robust polling infrastructure and worst on legislative outcomes where vote counts shift overnight.

These numbers update in real time on the dashboard. What you see today will differ from what you see next month. That is the point.

Rolling Windows: 30, 90, and All-Time

A single accuracy number is a blunt instrument. Performance drifts. Markets change. The model adapts.

The accuracy dashboard shows three time windows for every metric:

30-day rolling tracks recent performance. If the model just had a bad stretch in crypto, this number reflects it immediately. Use this window to gauge current form.

90-day rolling smooths out noise. A two-week losing streak in one category does not dominate the picture. This is the most reliable window for evaluating whether model improvements are working.

All-time provides the full historical record. Every prediction since launch, scored and categorized. This number moves slowly. It is the baseline, not the signal.

Smart users compare 30-day performance against the 90-day and all-time averages. When the 30-day Brier score in a category spikes above the all-time average, the model is in a rough patch for that vertical. When it drops below, conditions favor the AI.
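That comparison can be expressed as a simple rule. The tolerance value below is a hypothetical assumption for illustration, not a threshold the dashboard uses:

```python
# Sketch of the 30-day vs. all-time comparison described above.
# The tolerance is an illustrative assumption, not a dashboard setting.

def form_signal(brier_30d, brier_all_time, tolerance=0.02):
    """Label current form by comparing recent calibration to the baseline."""
    if brier_30d > brier_all_time + tolerance:
        return "rough patch"   # recent calibration worse than historical baseline
    if brier_30d < brier_all_time - tolerance:
        return "favorable"     # model running better than its history
    return "in line"

print(form_signal(0.28, 0.22))  # rough patch (the crypto example below)
print(form_signal(0.16, 0.18))  # in line (within tolerance)
```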

The Ensemble Advantage

Eroteme does not run a single AI model. The platform operates an ensemble system that aggregates outputs from multiple providers. Each provider's accuracy gets tracked independently in a provider insights table on the dashboard.

Why does this matter? Because no single model dominates across all categories and timeframes. Provider A might crush sports predictions but stumble on political races. Provider B might handle crypto volatility better than anyone but miss NFL spreads.

The ensemble weighs each provider's output based on historical performance in the relevant category. The consensus prediction you see on Eroteme is the product of that weighted aggregation.

The provider insights table shows you exactly how each underlying model performed. You can see which providers agreed, which dissented, and how the final consensus was formed. When providers disagree sharply, the system flags that prediction as lower conviction.
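The aggregation and the disagreement flag can be sketched as follows. The provider names, weights, and spread threshold here are hypothetical, chosen only to show the mechanics:

```python
# Minimal sketch of weighted ensemble aggregation. Each provider reports
# a probability and carries a category-specific weight derived from
# historical performance. All names and values are hypothetical.

def ensemble_consensus(provider_probs, weights):
    """Weighted average of provider probabilities."""
    total = sum(weights.values())
    return sum(provider_probs[p] * w for p, w in weights.items()) / total

def is_low_conviction(provider_probs, spread_threshold=0.25):
    """Flag predictions where providers disagree sharply."""
    probs = list(provider_probs.values())
    return max(probs) - min(probs) > spread_threshold

probs = {"provider_a": 0.72, "provider_b": 0.68, "provider_c": 0.41}
weights = {"provider_a": 1.2, "provider_b": 1.0, "provider_c": 0.8}

print(round(ensemble_consensus(probs, weights), 3))
print(is_low_conviction(probs))  # True: the 0.72-to-0.41 spread exceeds 0.25
```

The design point is that disagreement is itself a signal: a wide spread between providers tells you the consensus number carries less conviction than the same number produced by unanimous models.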

How to Use the Dashboard for Smarter Bets

The accuracy record is not a trophy case. It is a tool. Here is how to use it.

Check category performance before betting. If the model's 30-day Brier score in crypto is 0.28 while the all-time sits at 0.22, the model is underperforming in that vertical right now. Size your positions accordingly or look for spots to bet against the AI.

Watch provider agreement levels. Predictions where 4 out of 5 providers agree carry different weight than 3-2 splits. The dashboard shows this. High-agreement predictions historically hit at 76%. Low-agreement predictions hit at 54%.

Track calibration, not just wins. A model that says 90% and wins 90% of the time is better than a model that says 90% and wins 80%. Both win a lot. Only one is correctly priced. The Brier score captures this distinction. If you want to understand how the full prediction pipeline works, read How Eroteme AI Predictions Work.
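The gap between those two models shows up directly in the score. A quick numeric illustration (the hit rates are hypothetical, not measured figures):

```python
# Two models both say "90%". One hits 90% of the time, the other 80%.
# The Brier score separates them even though both win a lot.

def brier_at(forecast, hit_rate, n=100):
    """Brier score for n identical forecasts with a given observed hit rate."""
    hits = int(n * hit_rate)
    scores = [(forecast - 1) ** 2] * hits + [(forecast - 0) ** 2] * (n - hits)
    return sum(scores) / n

print(round(brier_at(0.90, 0.90), 3))  # 0.09 -> calibrated: 90% calls hit 90%
print(round(brier_at(0.90, 0.80), 3))  # 0.17 -> overconfident: 90% calls hit 80%
```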

Compare time windows. Improving 30-day numbers against a stable 90-day baseline mean the model is getting better. Declining 30-day numbers mean caution.

The Accountability Standard

We built the accuracy dashboard because the prediction industry has a transparency problem. Tipsters screenshot wins and delete losses. Platforms advertise "AI-powered" without disclosing hit rates. Nobody shows you the Brier score.

Eroteme treats accuracy data the way a public company treats financial statements. You get the full picture, updated continuously, auditable by anyone. The numbers are the numbers.

This approach costs us something. Every wrong prediction sits in public view. Every bad week in a category is visible. But the alternative -- hiding performance data and asking users to trust us -- is not a business model. It is a confidence game.

The data is live. The record is public. The model keeps improving.

View the AI's latest prediction at eroteme.io

Tags: #Accuracy #AI Models #Transparency #Data #Brier Score


Ready to Bet With — or Against — the AI?

4 AI models analyse every market. One consensus prediction. Back the AI or fade it — P2P betting in USDC with no house edge.