Below are Frequently Asked Questions (and answers!) about scores. The general FAQ is here, and the medals FAQ is here.
A scoring rule is a mathematical function which, given a prediction and an outcome, gives a score in the form of a number.
A naive scoring rule could be: "your score equals the probability you gave to the correct outcome". So, for example, if you predict 80% and the question resolves Yes, your score would be 0.8 (and 0.2 if the question resolved No). At first glance this seems like a good scoring rule: forecasters who gave predictions closer to the truth get higher scores.
Unfortunately this scoring rule is not "proper", as we'll see in the next section.
Proper scoring rules have a very special property: the only way to optimize your score on average is to predict your sincere beliefs.
How do we know that the naive scoring rule from the previous section is not proper? An example should be illuminating: consider the question "Will I roll a 6 on this fair die?". Since the die is fair, your belief is "1/6" or about 17%. Now consider three possibilities: you could either predict your true belief (17%), predict something more extreme, like 5%, or predict something less extreme, like 30%. Here's a table of the scores you expect for each possible die roll:
outcome die roll | naive score of p=5% | naive score of p=17% | naive score of p=30% |
---|---|---|---|
1 | 0.95 | 0.83 | 0.7 |
2 | 0.95 | 0.83 | 0.7 |
3 | 0.95 | 0.83 | 0.7 |
4 | 0.95 | 0.83 | 0.7 |
5 | 0.95 | 0.83 | 0.7 |
6 | 0.05 | 0.17 | 0.3 |
average | 0.8 | 0.72 | 0.63 |
This means you get a better score on average if you predict 5% than if you predict 17%. In other words, this naive score incentivizes you to predict something other than the true probability. This is very bad!
Proper scoring rules do not have this problem: your score is best when you predict the true probability. The log score, which underpins all Metaculus scores, is a proper score (see What is the log score?). We can compare the scores you get in the previous example:
outcome die roll | log score of p=5% | log score of p=17% | log score of p=30% |
---|---|---|---|
1 | -0.05 | -0.19 | -0.36 |
2 | -0.05 | -0.19 | -0.36 |
3 | -0.05 | -0.19 | -0.36 |
4 | -0.05 | -0.19 | -0.36 |
5 | -0.05 | -0.19 | -0.36 |
6 | -3 | -1.77 | -1.2 |
average | -0.54 | -0.45 | -0.50 |
With the log score, you do get a higher (better) score if you predict the true probability of 17%.
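The comparison above can be reproduced with a short calculation. This is a minimal sketch for illustration: the function names are ours, and the results may differ from the tabulated values in the last digit because the tables average rounded entries.

```python
import math

def naive_score(p_outcome):
    # Naive rule: the score is the probability given to what happened.
    return p_outcome

def log_score(p_outcome):
    # Log rule: natural log of the probability given to what happened.
    return math.log(p_outcome)

def expected_score(score_fn, prediction, true_prob):
    """Average score for a binary event that occurs with probability true_prob."""
    return (true_prob * score_fn(prediction)
            + (1 - true_prob) * score_fn(1 - prediction))

true_prob = 1 / 6  # chance of rolling a 6 on a fair die
for prediction in (0.05, 1 / 6, 0.30):
    naive = expected_score(naive_score, prediction, true_prob)
    log = expected_score(log_score, prediction, true_prob)
    print(f"p={prediction:.2f}  naive={naive:.2f}  log={log:.2f}")
```

Under the naive rule the expected score keeps rising as the prediction moves toward 0%, while under the log rule it peaks at the true probability.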
The logarithmic scoring rule, or "log score" for short, is defined as:
$$\text{log score} = \ln(P(\text{outcome}))$$
Where $\ln$ is the natural logarithm and $P(\text{outcome})$ is the probability predicted for the outcome that actually happened. This log score applies to categorical predictions, where one of a (usually) small set of outcomes can happen. On Metaculus those are Binary and Multiple Choice questions. See the next section for the log scores of continuous questions.
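Higher scores are better: predicting 100% on the outcome that happened gives the maximum score of $\ln(1) = 0$, and the score falls toward $-\infty$ as that probability approaches 0%.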
This means that the log score is always negative (for Binary and Multiple Choice questions). This has proved unintuitive, which is one reason why Metaculus uses the Baseline and Peer scores, which are based on the log score but can be positive.
The log score is proper (see What is a proper scoring rule?). This means that to maximize your score you should predict your true beliefs (see Can I get better scores by predicting extreme values?).
One interesting property of the log score: it is much more punitive of extreme wrong predictions than it is rewarding of extreme right predictions. Consider the scores you get for predicting 99% or 99.9%:
99% Yes, 1% No | 99.9% Yes, 0.1% No | |
---|---|---|
Score if outcome = Yes | -0.01 | -0.001 |
Score if outcome = No | -4.6 | -6.9 |
Going from 99% to 99.9% only gives you a tiny advantage if you are correct (+0.009), but a huge penalty if you are wrong (-2.3). So be careful, and only use extreme probabilities when you're sure they're appropriate!
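The asymmetry is easy to check numerically. The snippet below is a small illustration (variable names are ours) of the gain and loss from moving a prediction from 99% to 99.9%.

```python
import math

def log_score(p_outcome):
    # Natural log of the probability given to the outcome that happened.
    return math.log(p_outcome)

# Question resolves Yes: the extra confidence earns very little.
gain_if_right = log_score(0.999) - log_score(0.99)
# Question resolves No: the probability left on No dropped from 1% to 0.1%.
loss_if_wrong = log_score(0.001) - log_score(0.01)

print(f"{gain_if_right:+.3f} {loss_if_wrong:+.3f}")  # prints "+0.009 -2.303"
```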
Since the domain of possible outcomes for continuous questions is (drum roll) continuous, any outcome has mathematically 0 chance of happening. Thankfully we can adapt the log score in the form:
$$\text{log score} = \ln(\text{pdf}(\text{outcome}))$$
Where $\ln$ is the natural logarithm and $\text{pdf}(\text{outcome})$ is the value of the predicted probability density function at the outcome. Note that on Metaculus, all pdfs have a uniform distribution of height 0.01 added to them. This prevents extreme log scores.
This is also a proper scoring rule, and behaves in somewhat similar ways to the log score described above. One difference is that, contrary to probabilities that are always between 0 and 1, pdf values can be greater than 1. This means that the continuous log score can be greater than 0: in theory it has no maximum value, but in practice Metaculus restricts how sharp pdfs can get (see the maximum scores tabulated below).
A "spot" score is a specific version of the given score type (e.g. "spot peer score") where the evaluation doesn't take prediction duration into account. For a spot score, only the prediction at a specified time is considered. Unless otherwise indicated, spot scores are evaluated at the same time the Community Prediction is revealed. Coverage is 100% if there is an active prediction at the time, and 0% if there is not. The math is the same as the given score type.
The Baseline score compares a prediction to a fixed "chance" baseline. If it is positive, the prediction was better than chance. If it is negative, it was worse than chance.
That "chance" baseline gives the same probability to all outcomes. For binary questions, this is a prediction of 50%. For an N-option multiple choice question it is a prediction of 1/N for every option. For continuous questions this is a uniform (flat) distribution.
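The Baseline score is derived from the log score, rescaled so that the "chance" baseline scores 0 and a perfect prediction on a binary or multiple choice question scores +100.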
Here are some notable values for the Baseline score:
Binary questions | Multiple Choice questions (8 options) | Continuous questions | |
---|---|---|---|
Best possible Baseline score on Metaculus | +99.9 | +99.9 | +183 |
Worst possible Baseline score on Metaculus | -897 | -232 | -230 |
Median Baseline empirical score | +17 | no data yet | +14 |
Average Baseline empirical score | +13 | no data yet | +13 |
Theoretically, binary scores can be infinitely negative, and continuous scores can be both infinitely positive and infinitely negative. In practice, Metaculus restricts binary predictions to be between 0.1% and 99.9%, and continuous pdfs to be between 0.01 and ~35, leading to the scores above. The empirical scores are based on all scores observed on all resolved Metaculus questions, as of November 2023.
Note that the above describes the Baseline score at a single point in time. Metaculus scores are time-averaged over the lifetime of the question, see Do all my predictions on a question count toward my score?.
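The exact rescaling is not spelled out above. One form that gives 0 to the chance baseline and reproduces the tabulated binary and multiple choice values is 100 × (ln(p) / ln(N) + 1), where N is the number of options. The sketch below assumes that form; `baseline_score` is an illustrative name, not Metaculus's implementation, and continuous questions are not covered.

```python
import math

def baseline_score(p_outcome, n_options=2):
    """Assumed rescaling of the log score: 0 at chance (1/N), +100 at certainty."""
    return 100 * (math.log(p_outcome) / math.log(n_options) + 1)

print(round(baseline_score(0.5)))        # chance prediction on a binary question
print(round(baseline_score(0.999), 1))   # most confident allowed prediction, correct
print(round(baseline_score(0.001)))      # most confident allowed prediction, wrong
print(round(baseline_score(0.001, 8)))   # same worst case with 8 options
```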
The Peer score compares a prediction to all the other predictions made on the same question. If it is positive, the prediction was (on average) better than others. If it is negative it was worse than others.
The Peer score is derived from the log score: it is the average difference between a prediction's log score, and the log scores of all other predictions on that question. Like the Baseline score, the Peer score is multiplied by 100.
One interesting property of the Peer score is that, on any given question, the sum of all participants' Peer scores is always 0. This is because each forecaster's score is their average difference with every other: when you add all the scores, all the differences cancel out and the result is 0. Here's a quick example: imagine a continuous question, with three forecasters having predicted:
Forecaster | log score | Peer score |
---|---|---|
Alex | ||
Bailey | ||
Cory | ||
sum |
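The zero-sum property can be checked with a few lines of code. Since the values in the table above are not shown, the sketch below uses hypothetical log scores chosen only for illustration, and `peer_scores` is our own helper name.

```python
def peer_scores(log_scores):
    """Peer score of each forecaster: 100 x the average difference between
    their log score and every other forecaster's log score."""
    n = len(log_scores)
    scores = {}
    for name, own in log_scores.items():
        others = [s for other, s in log_scores.items() if other != name]
        scores[name] = 100 * sum(own - s for s in others) / (n - 1)
    return scores

# Hypothetical log scores, not the values from the original example.
example = {"Alex": -1.0, "Bailey": 1.0, "Cory": 2.0}
result = peer_scores(example)
print(result)
print(sum(result.values()))
```

Every pairwise difference appears once with each sign, which is why the total comes out to 0.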
Here are some notable values for the Peer score:
Binary and Multiple Choice questions | Continuous questions | |
---|---|---|
Best possible Peer score on Metaculus | +996 | +408 |
Worst possible Peer score on Metaculus | -996 | -408 |
Median Peer empirical score | +2 | +3 |
Average Peer empirical score | 0* | 0* |
*The average Peer score is 0 by definition.
Theoretically, binary scores can be infinitely negative, and continuous scores can be both infinitely positive and infinitely negative. In practice, Metaculus restricts binary predictions to be between 0.1% and 99.9%, and continuous pdfs to be between 0.01 and ~35, leading to the scores above.
The "empirical scores" are based on all scores observed on all resolved Metaculus questions, as of November 2023.
Note that the above describes the Peer score at a single point in time. Metaculus scores are time-averaged over the lifetime of the question, see Do all my predictions on a question count toward my score?.
The Peer score measures whether a forecaster was on average better than other forecasters. It is the difference between the forecaster's log score and the average of all other forecasters' log scores. If you have a positive Peer score, it means your log score was better than the average of all other forecasters' log scores.
The Community Prediction is a time-weighted median of all forecasters on the question. Like most aggregates, it is better than most of the forecasters it feeds on: it is less noisy, less biased, and updates more often.
Since the Community Prediction is better than most forecasters, it follows that its score should be higher than the average score of all forecasters. And so its Peer score is positive.
Yes, all your predictions on a question count toward your score. Metaculus uses time-averaged scores, so each prediction counts in proportion to how long it was standing. An example goes a long way (we will use the Baseline score for simplicity, but the same logic applies to any score):
A binary question is open 5 days, then closes and resolves Yes. You start predicting on the second day, make these predictions, and get those scores:
Day 1 | Day 2 | Day 3 | Day 4 | Day 5 | Average | |
---|---|---|---|---|---|---|
Prediction | | 40% | 70% | | 80% | N/A |
Baseline score | 0 | -32 | +49 | +49 | +68 | +27 |
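Some things to note: before your first prediction (Day 1) your score is 0, and on Day 4 you are still scored on your standing Day 3 prediction of 70%, since predictions stay active until you update them.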
Lastly, note that scores are always averaged for every instant between the Open date and (scheduled) Close date of the question. If a question resolves early (i.e. before the scheduled close date), then scores are set to 0 between the resolution date and scheduled close date, and still count in the average. This ensures alignment of incentives, as explained in the section Why did I get a small score when I was right? below.
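The five-day example can be recomputed as follows. This is a sketch that uses whole days instead of instants and assumes the binary Baseline rescaling 100 × (ln(p) / ln(2) + 1), which matches the scores in the table; the names are illustrative.

```python
import math

def binary_baseline(p_outcome):
    # Assumed binary Baseline rescaling: 0 at 50%, approaching +100 at certainty.
    return 100 * (math.log(p_outcome) / math.log(2) + 1)

# Prediction standing on each of the 5 days (None = no prediction yet).
# The question resolved Yes, so each value is the probability on the outcome.
standing = [None, 0.40, 0.70, 0.70, 0.80]

# Days without a standing prediction score 0.
daily = [0.0 if p is None else binary_baseline(p) for p in standing]
average = sum(daily) / len(daily)
print([round(s) for s in daily], round(average))
```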
Metaculus uses proper scores (see What is a proper scoring rule?), so you cannot get a better score (on average) by making predictions more extreme than your beliefs. On any question, if you want to maximize your expected score, you should predict exactly what you believe.
Let's walk through a simple example using the Baseline score. Suppose you are considering predicting a binary question. After some thought, you conclude that the question has 80% chance to resolve Yes.
If you predict 80%, you will get a score of +68 if the question resolves Yes, and -132 if it resolves No. Since you think there is an 80% chance it resolves Yes, you expect on average a score of $80\% \times 68 + 20\% \times (-132) = +28$.
If you predict 90%, you will get a score of +85 if the question resolves Yes, and -232 if it resolves No. Since you think there is an 80% chance it resolves Yes, you expect on average a score of $80\% \times 85 + 20\% \times (-232) \approx +21$ (computed from the unrounded scores).
So by predicting a more extreme value, you actually lower the score you expect to get (on average!).
Here are some more values from the same example, tabulated:
Prediction | Score if Yes | Score if No | Expected score |
---|---|---|---|
70% | +49 | -74 | +24 |
80% | +68 | -132 | +28 |
90% | +85 | -232 | +21 |
99% | +99 | -564 | -34 |
The 99% prediction gets the highest score when the question resolves Yes, but it also gets the lowest score when it resolves No. This is why, on average, the strategy that maximizes your score is to predict what you believe. This is one of the reasons why looking at scores on individual questions is not very informative; only aggregates over many questions are interesting!
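The expected scores in this example can be recomputed, and the best prediction found by a simple search. The sketch assumes the binary Baseline rescaling 100 × (ln(p) / ln(2) + 1), which is consistent with the scores quoted above; function names are ours.

```python
import math

def binary_baseline(p_outcome):
    # Assumed binary Baseline rescaling of the log score.
    return 100 * (math.log(p_outcome) / math.log(2) + 1)

def expected_baseline(prediction, belief):
    """Expected score when you predict `prediction` but believe `belief`."""
    return (belief * binary_baseline(prediction)
            + (1 - belief) * binary_baseline(1 - prediction))

belief = 0.80
for prediction in (0.70, 0.80, 0.90, 0.99):
    print(f"{prediction:.0%}: {expected_baseline(prediction, belief):+.0f}")

# Search whole percentages from 1% to 99% for the highest expected score.
best = max(range(1, 100), key=lambda pct: expected_baseline(pct / 100, belief))
print(f"best prediction: {best}%")
```

The search lands on the forecaster's actual belief, which is what properness means in practice.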
To make sure incentives are aligned, Metaculus needs to ensure that our scores are proper. We also time-average scores.
This has a counter-intuitive consequence: when a question resolves before its intended close date, the times between resolution and close date need to count in the time-average, with scores of 0. We call this "score truncation".
An example is best: imagine the question "Will a new human land on the Moon before 2030?". It can either resolve Yes before 2030 (because someone landed on the Moon), or it can resolve No in 2030. If we did not truncate scores, you could game this question by predicting close to 100% in the beginning (since it can only resolve positive early), and lower later (since it can only resolve negative at the end).
Another way to think about this is that if a question lasts a year, then each day (or in fact each second) is scored as a separate question. To preserve properness, it is imperative that each day is weighted the same in the final average (or at least that the weights be decided in advance). From this perspective, not doing truncation is equivalent to retroactively giving much more weight to days before the question resolves, which is not proper.
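A rough sketch of score truncation, using whole days and hypothetical numbers: the days between an early resolution and the scheduled close date enter the average with a score of 0. The helper name `truncated_average` is ours.

```python
def truncated_average(daily_scores, scheduled_days):
    """Time-average over the full scheduled lifetime; days after an early
    resolution contribute a score of 0."""
    padding = [0.0] * (scheduled_days - len(daily_scores))
    return sum(daily_scores + padding) / scheduled_days

# Hypothetical: a question scheduled for 10 days resolves after 4 days,
# and the forecaster scored +68 on each of those 4 days.
print(truncated_average([68.0] * 4, scheduled_days=10))  # prints 27.2
```

This is why a correct forecast on a question that resolves early can still earn a small score: every day carries the same weight that was fixed in advance.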
The Relative score compares a prediction to the median of all other predictions on the same question. If it is positive, the prediction was (on average) better than the median. If it is negative it was worse than the median.
It is based on the log score, with the formula:
$$\text{Relative score} = \log_2(p) - \log_2(m)$$
Where $p$ is the prediction being scored and $m$ is the median of all other predictions on that question.
As of late 2023, the Relative score is in the process of being replaced by the Peer score, but it is still used for many open tournaments.
The Coverage measures for what proportion of a question's lifetime you had a prediction standing.
If you make your first prediction right when the question opens, your coverage will be 100%. If you make your first prediction one second before the question closes, your coverage will be very close to 0%.
The Coverage is used in tournaments, to incentivize early predictions.
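As a simple illustration, Coverage can be computed from three timestamps. The sketch assumes the prediction is never withdrawn once made, and the function and parameter names are ours.

```python
def coverage(open_time, close_time, first_prediction_time):
    """Fraction of the question's lifetime with a standing prediction.
    Times can be in any consistent unit, e.g. seconds since the epoch."""
    lifetime = close_time - open_time
    covered = close_time - max(first_prediction_time, open_time)
    return max(0.0, covered / lifetime)

day = 24 * 60 * 60  # seconds in a day
print(coverage(0, 100 * day, 0))             # predicted at open
print(coverage(0, 100 * day, 50 * day))      # predicted halfway through
print(coverage(0, 100 * day, 100 * day - 1)) # one second before close
```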
Metaculus points were used as the main score on Metaculus until late 2023.
You can still find the rankings based on points here.
They are a proper score, based on the log score. They are a mixture of a Baseline-like score and a Peer-like score, so they reward both beating an impartial baseline and beating other forecasters.
This scoring method was introduced in March 2024. It is based on the Peer scores described above.
Your rank in the tournament is determined by the sum of your Peer scores over all questions weighted by the question's weight in the tournament (you get 0 for any question you didn’t forecast). Questions that have weights other than 1.0 are indicated in the sidebar of the question detail page. Typically, a question weight is changed if it is determined to be highly correlated with other questions included in the same tournament, especially question groups.
The share of the prize pool you get is proportional to that same sum of Peer scores, squared. If the sum of your Peer scores is negative, you don’t get any prize.
For a tournament with a sufficiently large number of independent questions, this scoring method is effectively proper for the top quartile of forecasters. While there are small imperfections for forecasters near a 0 Peer score for which they might win a tiny bit of money by extremizing their forecasts, we believe this is an edge case that you can safely ignore. In short, you should predict your true belief on any question.
Taking the square of your Peer scores incentivizes forecasting every question and forecasting early. Don’t forget to Follow a tournament to be notified of new questions.
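The prize split described above can be sketched as follows. The forecaster names and scores are hypothetical, and `prize_shares` is an illustrative helper rather than Metaculus's code.

```python
def prize_shares(weighted_peer_sums):
    """Share of the prize pool per forecaster: proportional to the square of
    their weighted sum of Peer scores; negative sums earn nothing."""
    squares = {name: max(total, 0.0) ** 2
               for name, total in weighted_peer_sums.items()}
    pool = sum(squares.values())
    return {name: (sq / pool if pool else 0.0) for name, sq in squares.items()}

# Hypothetical weighted sums of Peer scores for three forecasters.
shares = prize_shares({"A": 300.0, "B": 100.0, "C": -50.0})
print(shares)
```

In this made-up example, A's sum is three times B's but A's share is nine times larger, which shows how squaring rewards accumulating score across many questions.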
This scoring method was superseded in March 2024 by the New Tournament Score described above. It is still in use for tournaments that concluded before March 2024 and for some tournaments that were in flight then.
Your tournament Score is the sum of your Relative scores over all questions in the tournament. If, on average, you were better than the Community Prediction, then it will be positive; otherwise, it will be negative.
Your tournament Coverage is the average of your coverage on each question. If you predicted all questions when they opened, your Coverage will be 100%. If you predicted all questions halfway through, or if you predicted half the questions when they opened, your Coverage will be 50%.
Your tournament Take is the exponential of your Score, times your Coverage:
$$\text{Take} = e^{\text{Score}} \times \text{Coverage}$$
Your Prize is how much money you earned in that tournament. It is proportional to your Take: your share of the prize pool is equal to your Take divided by the sum of all competing forecasters' Takes.
Your Rank is simply how high you were in the leaderboard, sorted by Prize.
The higher your Score and Coverage, the higher your Take will be. The higher your Take, the more Prize you'll receive, and the higher your Rank will be.
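The chain from Score and Coverage to Prize can be sketched in a few lines. The forecasters, scores, and prize pool below are hypothetical, and `prizes_from_takes` is our own name.

```python
import math

def prizes_from_takes(forecasters, prize_pool):
    """forecasters maps name -> (score, coverage); returns name -> prize.
    Take = exp(Score) * Coverage, and prizes are split in proportion to Take."""
    takes = {name: math.exp(score) * cov
             for name, (score, cov) in forecasters.items()}
    total = sum(takes.values())
    return {name: prize_pool * take / total for name, take in takes.items()}

# Hypothetical tournament: A has a higher Score and full Coverage.
prizes = prizes_from_takes({"A": (1.0, 1.0), "B": (0.0, 0.5)}, prize_pool=1000)
print({name: round(amount, 2) for name, amount in prizes.items()})
```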
The Community Prediction is on average much better than most forecasters. This means that you could get decent scores by just copying the Community Prediction at all times. To prevent this, many tournament questions have a significant period of time at the beginning when the Community Prediction is hidden. We call this time the Hidden Period.
To incentivize forecasting during the hidden period, questions sometimes are also set up so that the coverage you accrue during the Hidden Period counts more. For example, the Hidden Period could count for 50% of the question coverage, or even 100%. We call this percentage the Hidden Period Coverage Weight.
If the Hidden Period Coverage Weight is 50%, then if you don't forecast during the hidden period your coverage will be at most 50%, regardless of how long the hidden period lasted.
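One way to read this rule is as a weighted average of the coverage earned during the Hidden Period and the coverage earned afterwards. The formula below is our assumption, chosen to be consistent with the 50% example above; it is not a confirmed description of how Metaculus computes it.

```python
def weighted_coverage(hidden_covered, rest_covered, hidden_weight):
    """Assumed formula: hidden_covered and rest_covered are the fractions
    (0 to 1) of each period with a standing prediction; hidden_weight is
    the Hidden Period Coverage Weight."""
    return hidden_weight * hidden_covered + (1 - hidden_weight) * rest_covered

# No forecast during the Hidden Period, full coverage afterwards, 50% weight.
print(weighted_coverage(0.0, 1.0, hidden_weight=0.5))  # prints 0.5
```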