Published: Sep 2, 2025
Working paper

Assessing Near-Term Accuracy in the Existential Risk Persuasion Tournament

This report assesses the accuracy of short-term forecasts made during the Existential Risk Persuasion Tournament (XPT)—a 2022 study that convened 169 superforecasters and domain experts to make predictions on long-term risks including AI, climate change, nuclear war, and pandemics.
Simas Kučinskas1*, Josh Rosenberg1, Rebecca Ceppas de Castro1, Zach Jacobs1, Jordan Canedy1, Philip E. Tetlock1,2, Ezra Karger1,3
1 = Forecasting Research Institute
2 = Wharton School of the University of Pennsylvania
3 = Federal Reserve Bank of Chicago

*Corresponding author: simas@forecastingresearch.org

Abstract

In June–October 2022, we convened 169 people to participate in the “Existential Risk Persuasion Tournament” (XPT). The XPT participants included both superforecasters with proven forecasting track records and domain experts with subject-matter expertise. The tournament incentivized accurate forecasting and persuasive argumentation about long-term risks humanity may face, including risks from artificial intelligence (AI), climate change, nuclear war, and pandemics. This report analyzes respondents’ forecasting accuracy on 38 near-term questions that resolved by mid-2025. Key findings include: (a) there was overall performance parity between superforecasters and domain experts, with both groups underestimating AI progress and overestimating improvements in climate technology; (b) both superforecasters and domain experts substantially outperformed a baseline of educated members of the general public; (c) at the individual level, the median superforecaster and median domain expert performed statistically indistinguishably from simple extrapolation algorithms; (d) at the aggregate level, superforecasters and domain experts showed improved accuracy and some evidence of outperforming simple extrapolation algorithms; (e) there was no statistically significant correlation between near-term accuracy and long-term existential risk forecasts.

Acknowledgments

This research would not have been possible without the support of the Musk Foundation, Open Philanthropy, and the Long-Term Future Fund. We greatly appreciate the assistance and input of Sam Glover, Rory Svarc, and Bridget Williams throughout the project.

Disclaimers

The views expressed in this paper do not necessarily reflect those of the Federal Reserve Bank of Chicago or the Federal Reserve System.

Executive Summary

This report evaluates the accuracy of near-term forecasts made by domain experts and superforecasters in the Existential Risk Persuasion Tournament (XPT).1

Background

The XPT tournament took place in June–October 2022. The tournament convened 169 participants to generate probabilistic forecasts about humanity’s long-term future and potential global risks such as climate change, nuclear war, pandemics, and artificial intelligence (AI). Of these participants, 89 were superforecasters with track records of high accuracy on near-term questions, while 80 were domain experts. In addition, we sampled hundreds of public participants for comparison. The XPT represents the largest existential risk forecasting tournament to date, uniquely combining superforecasters and domain experts to predict humanity’s long-term risks.

The tournament included 59 forecasting questions set to resolve at dates ranging from mid-2024 to as late as 2100. These questions broke down into 172 subquestions over multiple forecasting horizons and, in some cases, across different countries. Out of these 172 subquestions, 38 have known outcomes (i.e., are “resolved”) as of mid-2025. We note that the XPT tournament concluded prior to the public release of ChatGPT in November 2022.

Key Findings on Accuracy

Performance parity between superforecasters and domain experts. The near-term questions revealed no meaningful accuracy differences between superforecasters and experts forecasting on questions within their domain of expertise. Both groups achieved nearly identical accuracy scores. The performance gap between the most- and least-accurate XPT participant groups spanned just 0.18 standard deviations, comparable to the difference between median and slightly above-median performance. These small differences were not statistically significant, indicating that neither a proven forecasting track record nor domain expertise provided a consistent edge for these near-term predictions.

Individual forecasters outperformed public participants but not simple algorithms. Both superforecasters and domain experts strongly outperformed a sample of educated public participants, who scored 1.82 standard deviations below the median XPT participant. However, individual forecasters’ performance was not statistically distinguishable from two simple algorithms: a “no-change” forecast and trend extrapolation. These simple algorithms performed well partly because many questions involved low-probability events (which did not occur) or slow-moving variables (where trends persisted).

Aggregate forecasts demonstrated the wisdom of crowds. Median aggregation of XPT participants’ forecasts achieved a substantial improvement over individual performance, increasing accuracy by roughly 1 standard deviation. These aggregated predictions showed weak but positive evidence of outperforming the “no-change” forecast, though not trend extrapolation. This finding reinforces the well-established principle that combining multiple forecasts improves accuracy.

Main Insights across Subject Areas

Despite the strong overall performance of aggregate forecasts, XPT participants systematically misjudged progress in specific domains.

Respondents underestimated AI progress, especially superforecasters. XPT participants significantly underestimated the pace of AI advancement across multiple benchmarks. For three standard AI benchmarks—MATH, MMLU, and QuALITY—domain experts assigned probabilities of 21.4%, 25.0%, and 43.5% respectively to the outcomes achieved by the end of 2024. Superforecasters were even more pessimistic, assigning only 9.3%, 7.2%, and 20.1% respectively. The International Mathematical Olympiad results proved particularly surprising: AI systems achieved gold-level performance in July 2025, an outcome to which domain experts assigned only an 8.6% probability and superforecasters a mere 2.3% probability. Overall, superforecasters assigned an average probability of just 9.7% to the observed outcomes across these four AI benchmarks, compared to 24.6% from domain experts.

Climate technology progress was overestimated. In contrast to AI, forecasters were overly optimistic about the development of green technology. In 2024, the cost of hydrogen produced using renewable electricity remained higher than anticipated at 7.50 USD/kg (median forecasts of 4.50 USD/kg by superforecasters and 3.50 USD/kg by domain experts), and direct air CO₂ capture technology captured only 0.01 MtCO₂/year (median forecasts of 0.32 MtCO₂/year by superforecasters and 0.60 MtCO₂/year by domain experts).

Implications for Long-Term Risks

No correlation between near-term accuracy and long-term existential risk forecasts. There was no statistically significant correlation between forecasters’ near-term accuracy and their forecasts of long-term risks. Ideally, we would use near-term forecasting ability to assess the reliability of forecasts about humanity’s long-term future. Unfortunately, in our XPT data, near-term forecasting accuracy did not consistently align with any particular position on long-term risks. Overall, near-term forecasting accuracy provides limited evidence at present for identifying who makes the most credible long-term risk forecasts.

Next Steps

Given the faster-than-expected progress on AI capabilities, it is more important than ever to understand the likely future trajectory and impact of AI. In our current and future work, we aim to shed more light on these questions. Our current projects on this front include a longitudinal panel of AI experts and a survey of economists on the expected economic impacts of AI. Through these systematic efforts to gather expert perspectives, we will provide empirically grounded insights that can inform policy and decision-making.

1. Introduction

The Existential Risk Persuasion Tournament (XPT)2 convened 169 participants from June to October 2022 to forecast questions about humanity’s long-term future and the impact of global risks such as climate change, nuclear war, pandemics, and artificial intelligence (AI). Of these 169 participants, 89 were experienced forecasters with a track record of high accuracy on near-term questions (“superforecasters”), and the other 80 were specialists working in domains related to global risks and humanity’s future (“experts”). Additionally, hundreds of public participants provided their answers to the same forecasting questions in 2023 and 2024.

We recruited superforecasters with assistance from Good Judgment, Inc. To find experts, we contacted organizations, academic departments, and research labs working on existential-risk-related issues; we also made several posts via social media and websites such as the Effective Altruism Forum. We received hundreds of expressions of interest and offered slots to the most qualified among the interested applicants. The final expert sample included 32 AI experts, 12 biorisk experts, 12 nuclear experts, 9 climate experts, and 15 “general” experts who study existential risks more broadly (referred to as “x-risk generalists”). Many in the expert pool were affiliated with the Effective Altruism (EA) community; 42% of experts participating in the XPT reported having attended an EA meetup in the past.

The median expert in the XPT forecast a 20% probability of global catastrophe—defined as a loss of at least 10% of the global population—and a 6% probability of human extinction by 2100. Superforecasters viewed the world as less risky, forecasting a 9% and 1% probability of global catastrophe and human extinction by 2100, respectively. This held across domains, though not uniformly: superforecasters and experts were much further apart on risk related to AI than on the risk of nuclear war.

Participants in the tournament forecast on questions set to resolve at various dates ranging from as early as mid-2024 to as late as 2100. The 59 forecasting questions in the XPT broke down into 172 subquestions. Of these, 32 questions (38 subquestions) have resolved as of the writing of this report. The resolved questions provide us with a unique opportunity to evaluate forecasting accuracy across different expertise groups, identify key surprises, and explore the relationship between near-term forecasting accuracy and predictions of long-term existential risks.

While we analyze all resolved questions in our dataset, our confidence in resolutions varies across questions (see Table A1.2 in the Appendix). Of the 32 resolved forecasting questions, 47% (15/32) have been definitively resolved based on authoritative data sources, while 53% (17/32) have been provisionally resolved based on available evidence or expert consultation. These provisional resolutions reflect two constraints. First, some questions require expert panels for adjudication (particularly biorisk questions lacking clear ground truth). Second, others await authoritative data publications like International Energy Agency (IEA) reports or labor statistics from the Organisation for Economic Co-operation and Development (OECD).

2. Forecasting Performance

2.1 Accuracy Metrics

We measure forecasting performance using two main accuracy metrics.

Our primary accuracy metric is the Accuracy Score. Accuracy Score evaluates forecasting performance using the original XPT scoring rules (log score for binary questions; S score for continuous questions). Accuracy Score is standardized to measure performance relative to the median XPT participant (i.e., the median across experts and superforecasters). For example, an Accuracy Score of 0.25 means a forecaster was 0.25 standard deviations more accurate than the median XPT participant. An Accuracy Score of 0.25 would place them roughly in the top 40% of all forecasters. Higher Accuracy Score values indicate better accuracy.
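The mapping from a standardized score to a percentile rank can be illustrated with a short sketch. It assumes that scores are approximately normally distributed across forecasters, which is an assumption made here for illustration and not something the report states; the function name `score_to_percentile` is hypothetical.

```python
from statistics import NormalDist


def score_to_percentile(accuracy_score: float) -> float:
    """Percentile rank implied by a standardized score.

    Assumes scores are roughly normal across forecasters (an
    illustrative assumption, not a claim made by the report).
    """
    return 100 * NormalDist().cdf(accuracy_score)


# Scores quoted in the report: 0.25 (example) and -1.82 (public participants).
for score in (0.25, -1.82):
    print(f"{score:+.2f} SD -> percentile {score_to_percentile(score):.1f}")
```

Under this assumption, a score of 0.25 falls near the 60th percentile (roughly the top 40%), and a score of -1.82 falls near the 3rd percentile. Percentiles computed from the empirical score distribution may differ slightly.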

Our secondary accuracy metric is Standardized Absolute Forecast Error (SAFE). SAFE measures how “surprising” the actual outcome was relative to forecasters’ expectations. For example, a SAFE of 1.0 means the outcome was 1 standard deviation from the forecast—corresponding to a moderate but not extreme surprise. Lower SAFE values indicate better accuracy.
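As a rough sketch of the idea behind SAFE, the snippet below divides the absolute forecast error by a predictive standard deviation and averages across questions. The exact construction used in the report is given in its Appendix; the function names and all numbers here are hypothetical.

```python
def safe(outcome: float, forecast: float, predictive_sd: float) -> float:
    """Absolute forecast error in units of the predictive standard deviation."""
    if predictive_sd <= 0:
        raise ValueError("predictive_sd must be positive")
    return abs(outcome - forecast) / predictive_sd


def mean_safe(records: list[tuple[float, float, float]]) -> float:
    """Average SAFE over (outcome, forecast, predictive_sd) records."""
    return sum(safe(y, f, s) for y, f, s in records) / len(records)


# Hypothetical questions: (outcome, forecast, predictive standard deviation).
questions = [(13.0, 10.0, 3.0), (0.0, 0.5, 1.0), (8.0, 2.0, 3.0)]
print(mean_safe(questions))
```

A value near 1.0 means outcomes were, on average, about one predictive standard deviation away from the forecast.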

Table 2.1 provides a summary and interpretation of our main accuracy metrics. Technical details are provided in the Appendix.

| Metric | Description | Interpretation | Use case |
| --- | --- | --- | --- |
| Accuracy Score (primary) | Average standardized score across all questions. | Higher values are better. Accuracy Score = 0.25 means a forecaster is 0.25 standard deviations more accurate than the median XPT participant, placing them roughly in the top 40% of all forecasters. | Used to measure relative performance. |
| Standardized Absolute Forecast Error, SAFE (secondary) | Average absolute forecast error in units of predictive standard deviations. | Lower values are better. SAFE = 1.0 means outcomes are on average one standard deviation from forecasters’ expectations (corresponding to 16th/84th percentile realizations). | Used to measure absolute performance; how “surprising” questions were to forecasters. |

Table 2.1: Primary and secondary accuracy metrics used to evaluate forecasting accuracy.

To provide an apples-to-apples comparison between the different groups of XPT participants, we calculate accuracy metrics at the individual forecaster level.

Box 1: Individual versus aggregate forecasts

When analyzing forecasting performance, it is important to distinguish between individual and aggregate forecasts. Individual forecasts represent each forecaster’s predictions, while aggregate forecasts combine predictions from multiple forecasters within a group.

For individual-level accuracy, we calculate metrics for each forecaster separately, and then take the median across all individuals in a group. For aggregate-level accuracy, we first combine forecasts via median aggregation, and then calculate the accuracy of that combined forecast.
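The two calculations differ only in the order of operations, which a small sketch can make concrete. Absolute error stands in for the report's actual scoring rules (so lower is better here), and the forecasts and outcome are hypothetical.

```python
from statistics import median


def abs_error(forecast: float, outcome: float) -> float:
    """Stand-in accuracy measure (lower is better); not the XPT scoring rule."""
    return abs(forecast - outcome)


def individual_level_error(forecasts: list[float], outcome: float) -> float:
    """Score each forecaster first, then take the median score."""
    return median(abs_error(f, outcome) for f in forecasts)


def aggregate_level_error(forecasts: list[float], outcome: float) -> float:
    """Combine forecasts by their median first, then score the combined forecast."""
    return abs_error(median(forecasts), outcome)


forecasts = [2.0, 4.0, 5.0, 7.0, 12.0]  # hypothetical individual forecasts
outcome = 5.5                           # hypothetical resolution value

print(individual_level_error(forecasts, outcome))  # median individual error
print(aggregate_level_error(forecasts, outcome))   # error of the median forecast
```

In this toy example the median forecast lands closer to the outcome than the typical individual does, which is the wisdom-of-the-crowd effect in miniature.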

To compare accuracy between different groups of XPT participants, we use individual-level metrics. This is important to ensure fair comparisons. The median subquestion in the XPT has 32 superforecaster predictions versus only 4 domain-expert predictions. Since aggregating more forecasters improves accuracy via a wisdom-of-the-crowd effect, comparing group aggregates would unfairly advantage superforecasters due to their greater sample size.

Outside of group comparisons, however, we primarily analyze aggregate forecasts. In particular, we use aggregate forecasts when examining substantive questions—such as whether forecasters correctly anticipated AI progress or developments in climate technology (see Section 3). The reason is that aggregation yields more accurate predictions. As a result, the aggregate forecasts produce the most reliable measure of the XPT participants’ collective judgment.

2.2 Relative Accuracy

2.2.1 Main Results

Figure 2.1 summarizes the overall forecasting performance.

The graph provides the Accuracy Score of the median XPT participant (i.e., individual-level accuracy) by subgroup:

  • Superforecaster: Forecasters with a proven track record of high accuracy on near-term forecasting questions;
  • Domain Expert: Subject-matter experts answering questions within their specific area of expertise;
  • Non-domain Expert: Subject-matter experts answering questions outside their primary area of expertise;
  • X-risk Generalist: Experts specializing in existential risks.

Note that the same expert may be classified differently across questions. For example, an AI expert is classified as a domain expert when forecasting progress on AI benchmarks but as a non-domain expert when predicting green hydrogen costs. The sample size and composition of the final dataset are provided in the Appendix (Table A3.1). Accuracy results at the group level are given in the Appendix (Table A3.3).

Figure 2.1: For each group (non-domain expert, domain expert, superforecaster, x-risk generalist), the error bars indicate the Accuracy Score of the median individual in that group. The whiskers provide 95% bootstrap confidence intervals.

Overall, performance differences between groups were small, with only a 0.18 standard-deviation gap between the top and bottom groups. For context, a difference of 0.18 in the Accuracy Score corresponds to a difference of approximately 8 percentiles—comparable to the difference between someone performing at the median (50th percentile) versus someone performing slightly above average (around the 58th percentile). Superforecasters and domain experts achieved an almost identical Accuracy Score. Intuitively, these results indicate that there was no consistent pattern in accuracy: for some questions, domain experts were more accurate; for others, superforecasters were closer to the truth. In the Appendix (Table A1.1), we provide a question-by-question table with superforecaster and domain expert predictions (group-level aggregates) and their forecast errors, highlighting the same pattern.

Consistent with the above finding, the performance differences between groups were not statistically significant, as we document below in Figure 2.3. Therefore, we cannot confidently conclude that superforecasters, domain experts, or other groups demonstrated meaningfully higher forecasting accuracy. This finding is consistent with previous research showing that superforecasters do not have a consistent edge over domain experts (or vice versa).3

Domain experts were slightly more accurate when predicting within their area of expertise. However, this difference was small in absolute value (a difference of 0.05 in the Accuracy Score) and not statistically significant. This finding suggests limited gains from specialized knowledge in this specific forecasting context.

2.2.2 Performance against Benchmarks

Next, we compare the quality of predictions made by XPT participants (experts and superforecasters) to two benchmarks:

  • Sample of public participants;
  • Simple prediction algorithms (see “Methods” in the Appendix for details):
    • Naive “no-change” forecast (predict no change);
    • Naive “extrapolation” forecast (extrapolate the current trend).
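The two simple algorithms listed above can be sketched as follows. The report's exact specification is in the "Methods" section of its Appendix; the least-squares linear trend used here for extrapolation, the function names, and the data series are all assumptions made for illustration.

```python
def no_change_forecast(history: list[float]) -> float:
    """Predict that the most recent observed value persists."""
    return history[-1]


def extrapolation_forecast(history: list[float], steps_ahead: int) -> float:
    """Extend a least-squares linear trend fitted to equally spaced observations.

    The linear form is an assumption; other trend models are possible.
    """
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history)) / sum(
        (x - x_mean) ** 2 for x in range(n)
    )
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + steps_ahead)


history = [10.0, 12.0, 14.0, 16.0]  # hypothetical annual observations
print(no_change_forecast(history))
print(extrapolation_forecast(history, steps_ahead=2))
```

For a variable that has never been nonzero (such as a count of weapon-use events), both rules predict zero, which is why they do well on low-probability questions.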

Since we did not elicit the full set of quantile predictions for the public-participant sample, only the 50th-percentile predictions are used for this benchmarking exercise.

Figure 2.2: For each group (domain expert, non-domain expert, x-risk generalist, superforecaster, public), the bars indicate the Accuracy Score of the median individual in that group. For the two prediction algorithms (no change and extrapolated), the bars directly indicate their performance. The whiskers indicate 95% bootstrap confidence intervals. Only 50th-percentile predictions are used in the construction of this graph. The y-axis is log-transformed, so visual distances may understate true differences.

Figure 2.2 provides the comparison with our benchmarks. We observe the following takeaways:

  • XPT participants outperformed public participants. The median public participant performed substantially worse than XPT forecasters, with an Accuracy Score of –1.82. This underperformance is large: a 1.82-point gap in the Accuracy Score corresponds to the difference between participants at the 50th and 3rd percentiles of the forecasting accuracy distribution. As shown below in Figure 2.3, this difference is weakly statistically significant (p < 0.10) when comparing public participants to the full XPT sample. Domain experts and non-domain experts showed stronger outperformance (p < 0.05), while superforecasters exhibited weaker outperformance (p < 0.10).
  • The median XPT participant did not outperform statistical benchmarks. The accuracy differences between individual XPT participants and statistical benchmarks were small and not statistically significant. In fact, the simple “no-change” benchmark (Accuracy Score of 0.03) slightly outperformed both the median XPT participant and the median superforecaster, highlighting the difficulty of beating naive statistical rules.

We note that certain features of the XPT tournament may have favored simple prediction algorithms. First, a substantial portion of subquestions (8/38) concerned low-probability events that did not occur during the resolution period. These included questions about biological and nuclear weapon use (for example, Q15–18 and Q31). For all these subquestions, the no-change prediction of zero matched the actual outcome perfectly. Second, several questions tracked slowly-evolving variables for which historical trends provide strong predictive power, such as labor force participation rates (Q38) and nuclear warhead counts (Q33). By contrast, in dynamic domains like AI, these simple algorithms performed substantially worse. As we document in the Appendix (Table A3.4), the no-change and extrapolation algorithms achieved SAFE scores of 1.89 and 1.35 respectively on AI questions—substantially worse than their full-sample values of 1.04 and 0.94.

Finally, we statistically test the relative performance of different forecasts, including aggregated group-level predictions. The results are provided in Figure 2.3.

Figure 2.3: Comparison of Accuracy Score differences across different forecasts; only 50th-percentile predictions are used to calculate the Accuracy Score. Bootstrap 95% confidence intervals appear in parentheses.

A key insight that emerges is that aggregated XPT forecasts were substantially more accurate and showed evidence of outperforming statistical benchmarks. Consistent with the forecasting literature,4 aggregated forecasts substantially outperformed individual forecasts. The aggregate of all XPT participants achieved an Accuracy Score of 0.97 when using all quantile predictions and 0.78 when only median forecasts were used—a large improvement over the median individual participant (see Table A3.3 in the Appendix). While the aggregated forecast outperformed both naive benchmarks by a large margin in absolute terms, statistical significance varied. The aggregated forecast showed weak statistical evidence of outperforming the “no-change” benchmark (p < 0.10) but did not statistically significantly outperform the “extrapolation” benchmark.

Due to the limited number of resolved questions, our statistical power to detect small accuracy differences between groups is constrained. However, Figure 2.3 shows that, at the individual level, the 95% confidence interval for the accuracy difference between superforecasters and domain experts is (-0.8, 0.3). Here, negative numbers indicate greater accuracy by domain experts. This finding allows us to rule out large performance differences: with 95% confidence, the true accuracy gap between these groups is less than 0.8 standard deviations in either direction.
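For readers unfamiliar with bootstrap intervals, the following is a minimal sketch of a percentile bootstrap for the difference in median scores between two groups. The report does not specify its resampling scheme in this section, so resampling forecasters independently within each group is an assumption, and the scores below are hypothetical.

```python
import random
from statistics import median


def bootstrap_median_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for median(a) - median(b).

    Resamples each group with replacement (an assumed scheme, not
    necessarily the one used in the report).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(median(resample_a) - median(resample_b))
    diffs.sort()
    lower = diffs[round((alpha / 2) * (n_boot - 1))]
    upper = diffs[round((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Hypothetical standardized accuracy scores for two groups of forecasters.
group_a = [0.1, 0.3, -0.2, 0.5, 0.0]
group_b = [0.2, 0.4, 0.1, -0.1]
print(bootstrap_median_diff_ci(group_a, group_b))
```

An interval that contains zero, like the one reported for superforecasters versus domain experts, means the data cannot distinguish the two groups; its width bounds how large a true difference could plausibly be.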

2.2.3 Robustness Tests and Other Analyses

A natural concern when evaluating forecasting performance is whether the results depend on the chosen accuracy metric. To address this concern, we examined forecasting performance using six different accuracy measures, including our primary Accuracy Score and alternative metrics like standardized absolute forecast error (SAFE), percentile accuracy, and mean standardized squared error; see Table A3.2 in the Appendix for the full results. Our core findings remain robust across all metrics (i.e., the differences between XPT participant groups remain small; XPT participants outperform public participants; individual XPT participants have similar accuracy to the two naive statistical benchmarks).

In the Appendix (Appendix 4: Forecaster Calibration), we also analyze forecaster calibration. Overall, we find that forecasters are overconfident at the individual level but well-calibrated when aggregated at the group level. At the individual level, forecasters are overconfident when predicting less likely tail events (i.e., they underestimate the probability of tail events). The fact that group-level forecasts are well-calibrated provides additional confidence when using predictive standard deviations to calculate the SAFE metric, as the group-level predictive standard deviations appear to accurately reflect the uncertainty present in the real world.
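One common way to check calibration, sketched here under stated assumptions, is to compare the nominal coverage of a forecast interval with the share of outcomes that actually fall inside it. The report's own calibration analysis is in its Appendix 4 and may differ; the interval level (a nominal 90% interval between hypothetical 5th and 95th percentile forecasts) and all numbers are illustrative.

```python
def interval_coverage(intervals, outcomes):
    """Share of outcomes that fall inside their forecast interval."""
    if len(intervals) != len(outcomes):
        raise ValueError("intervals and outcomes must have the same length")
    hits = sum(low <= y <= high for (low, high), y in zip(intervals, outcomes))
    return hits / len(outcomes)


# Hypothetical nominal 90% intervals (5th to 95th percentile) and outcomes.
intervals = [(1.0, 5.0), (10.0, 20.0), (0.2, 0.6), (100.0, 300.0)]
outcomes = [6.0, 15.0, 0.5, 350.0]

coverage = interval_coverage(intervals, outcomes)
print(f"nominal 0.90, empirical {coverage:.2f}")
```

Empirical coverage well below the nominal level indicates overconfidence: intervals are too narrow, so tail outcomes occur more often than the forecasts imply.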

Finally, we examined whether near-term forecasting accuracy correlates with intersubjective accuracy, i.e., participants’ ability to predict other forecasters’ predictions. While previous research has found that intersubjective accuracy often correlates with real-world forecasting performance,5 intersubjective accuracy was not correlated with near-term accuracy in our data (see Figure A3.1 in the Appendix). This null result may suggest that intersubjective accuracy is less informative in our specific empirical context. Alternatively, our sample of 38 resolved subquestions may be too small to reliably detect a meaningful relationship.

3. Key Surprises and Insights

We next examine areas in which aggregate forecasts—which demonstrated strong overall accuracy through a wisdom-of-the-crowd effect—most notably diverged from reality. We identify the “most surprising” questions based on standardized absolute forecast errors (SAFE) at the group level, revealing systematic patterns in what forecasters found difficult to predict (Tables 3.1 and 3.2). We first present the top-10 most surprising questions for each group, and then dive deeper into three key domains where forecasters’ expectations most diverged from actual outcomes: biological weapons (Section 3.1), climate technology (Section 3.2), and artificial intelligence (Section 3.3).

3.1 Biological Weapons

Both domain experts and superforecasters overestimated the number of countries with biological weapons programs by the end of 2024. The median forecast was 6.5 countries among domain experts and 5 countries among superforecasters, an overestimation by a factor of 2.5–3.3 relative to our projected resolution of 2 countries. For several specific countries (namely China, Iran, Syria, and Israel), both groups also overestimated the fraction of a panel of 100 biosecurity experts who would agree that the country has an active biological weapons program. Here, multiple countries had forecast errors with SAFE values exceeding 1, indicating moderate surprises. However, as discussed in more detail in “Ambiguous Resolutions” in the Appendix, it is difficult to unambiguously resolve this question, which could explain part of the apparent surprise.

| Question | Median Forecast | Resolution | SAFE | N |
| --- | --- | --- | --- | --- |
| 45. Maximum Compute Used in an AI Experiment | 100,000 | 578,703.7 | 1.92 | 33 |
| 49. Largest Number of Parameters in a Machine Learning Model | 100 trillion | 10 trillion | 1.71 | 31 |
| 30. Cost of Hydrogen | 4.5 USD/kg | 7.5 USD/kg | 1.70 | 32 |
| 40. “Massive Multitask Language Understanding” Benchmark | 77.75% | 88.7% | 1.59 | 32 |
| 20. Individual Countries with Biological Weapons Programs (China) | 70% | 30% | 1.51 | 26 |
| 21. Number of Countries with Biological Weapons Programs | 5 | 2 | 1.45 | 32 |
| 39. MATH Dataset Benchmark | 71% | 87.92% | 1.38 | 30 |
| 20. Individual Countries with Biological Weapons Programs (Iran) | 60% | 30% | 1.18 | 28 |
| 35. GPT Revenue (Hanson Wins Bet that GPT Revenue < $1B) | 53.5% | 0% | 1.07 | 32 |
| 20. Individual Countries with Biological Weapons Programs (Israel) | 40% | 10% | 1.01 | 27 |

Table 3.1: Most surprising questions, superforecasters (group-level forecast). The table provides the top-10 questions with the largest standardized absolute forecast errors (SAFE) for the group. N denotes the number of forecasters in the group.

| Question | Median Forecast | Resolution | SAFE | N |
| --- | --- | --- | --- | --- |
| 32. Total Nuclear Warheads | 9,949 | 12,331 | 2.93 | 1 |
| 49. Largest Number of Parameters in a Machine Learning Model | 150 trillion | 10 trillion | 2.74 | 7 |
| 30. Cost of Hydrogen | 3.5 USD/kg | 7.5 USD/kg | 2.27 | 2 |
| 21. Number of Countries with Biological Weapons Programs | 6.5 | 2 | 2.17 | 4 |
| 29. Annual Direct Air CO2 Capture | 0.6 Mt/year | 0.01 Mt/year | 1.52 | 7 |
| 20. Individual Countries with Biological Weapons Programs (Iran) | 61.5% | 30% | 1.24 | 4 |
| 38. Labor Force Participation Rate in OECD | 77.2% | 79.86% | 1.22 | 4 |
| 20. Individual Countries with Biological Weapons Programs (Syria) | 52.5% | 25% | 1.10 | 4 |
| 35. GPT Revenue (Hanson Wins Bet that GPT Revenue < $1B) | 45% | 0% | 0.90 | 6 |
| 20. Individual Countries with Biological Weapons Programs (China) | 51% | 30% | 0.79 | 3 |

Table 3.2: Most surprising questions, domain experts (group-level forecast). The table provides the top-10 questions with the largest standardized absolute forecast errors (SAFE) for the group. N denotes the number of forecasters in the group.

3.2 Climate Technology

Forecasters were overly optimistic about progress in climate technology. Both groups expected a more substantial decrease in the cost of hydrogen produced using renewable electricity: superforecasters expected the cost of hydrogen production to decrease to 4.5 USD/kg in 2024, while domain experts predicted an even greater decline to 3.5 USD/kg. By contrast, we currently project a resolution of 7.5 USD/kg for the question. The SAFE values for this question are in the range of 1.70–2.27, suggesting large surprises. Similarly, XPT participants anticipated greater advances in carbon removal. For total direct air capture and storage, domain experts and superforecasters predicted 0.6 and 0.32 MtCO₂/year in 2024, respectively, while we currently project just 0.01 MtCO₂/year.

3.3 Artificial Intelligence

Both domain experts and superforecasters misjudged the pace and direction of AI progress. Both groups predicted lower values for the maximum compute used in an AI experiment by the end of 2024, with superforecasters underestimating the actual maximum by a factor of five. At the same time, both domain experts and superforecasters overestimated the size of the largest machine learning models by the end of 2024 (1.00E+14 parameters and 4.00E+14 parameters respectively), projecting parameter counts ten times higher than provisionally resolved (1.00E+13 parameters). However, as we note in the Appendix (Section A1.2), this overestimation likely has to do with incorrect base rate information provided to participants during the XPT tournament.

XPT participants systematically underestimated AI progress on multiple benchmarks, with superforecasters exhibiting larger underestimation. Figure 3.1 shows the probability XPT participants assigned to observed outcomes on various AI benchmarks, calculated using an estimated density function (see “Methods” in the Appendix). GPT-4 Turbo achieved 87.92% on the MATH Dataset Benchmark in April 2024; domain experts and superforecasters had assigned a 21.4% and a 9.3% probability, respectively, to reaching this level by June 30, 2024. Both GPT-4o and Claude 3.5 Sonnet achieved 88.7% on MMLU by mid-2024, an outcome to which domain experts and superforecasters had assigned a 25.0% and a 7.2% probability, respectively, for the June 30, 2024 resolution date. RAPTOR + GPT-4 scored 69.3 on QuALITY’s hard subset in June 2023, a full year before the resolution date, yet domain experts and superforecasters had assigned only a 43.5% and a 20.1% probability, respectively, to this achievement by June 30, 2024. Across these three benchmarks, superforecasters assigned probabilities 12–23 percentage points below those of domain experts.

Figure 3.1: Superforecasters’ and domain experts’ predicted probabilities of observed progress on AI benchmarks. Probabilities were calculated based on the estimated probability density functions (see Appendix 5) and the observed resolution values. Appendix 2 provides the methodological details on the density function estimation.

Among the most surprising developments was the performance of AI systems on the International Mathematical Olympiad (IMO). While not officially an AI benchmark, the IMO in recent years has “become an aspirational challenge for AI systems as a test of their advanced mathematical problem-solving and reasoning capabilities.”6 Domain experts and superforecasters did not expect an AI system to win an IMO gold medal until after 2030. In July 2025, both Google DeepMind and OpenAI reported that their models achieved gold-level performance in the IMO 2025 competition, five years earlier than the median expert prediction and 10 years earlier than the median superforecaster prediction.7 Domain experts and superforecasters had assigned only an 8.6% and a 2.3% probability, respectively, to this achievement occurring in or before 2025.

We note that the XPT concluded prior to the public release of ChatGPT at the end of 2022, which marked the beginning of an intense phase of AI investment and capability acceleration. While domain experts were better calibrated to trends in AI progress than superforecasters, at times even their judgment failed to anticipate the speed of advancement. These results align closely with previous reports about how experts were surprised by progress in language models in 2022 and 2023,8 particularly as it related to the MMLU, MATH, and the International Mathematical Olympiad.

4. Long-Term Risk Implications

A key goal of the original XPT was to obtain forecasts for long-term risks facing humanity. XPT participants forecast two types of risks: catastrophic risks (the probability of more than 10% of the global population dying within a five-year period) and extinction risks (the probability of human extinction or a reduction of the global population below 5,000). The tournament assessed these risks across multiple domains: genetically engineered and naturally occurring pathogens, artificial intelligence, nuclear weapons, non-anthropogenic causes (such as asteroids or supervolcanoes), and overall risk from all causes combined.

A natural question is whether more accurate near-term forecasters made systematically different long-term risk predictions. Figure 4.1 suggests that there is no meaningful relationship between near-term accuracy and long-term risk forecasts. Across accuracy quartiles (from least accurate in quartile 1 to most accurate in quartile 4), median risk estimates remain fairly flat for all risk categories. The correlation coefficients between accuracy and long-term risk forecasts all cluster around zero, ranging from -0.08 to 0.14, and none is statistically significant.

In the Appendix (Figure A3.2), we examine how long-term risk forecasts relate to near-term accuracy in our sample of public participants. An advantage of using this sample is that most public participants provided a forecast on nearly every question, which largely eliminates issues surrounding self-selection into questions. In particular, the median public participant answered 36 out of the 38 resolved subquestions. For the public participants, unlike the main XPT sample, we observe a statistically significant negative correlation (i.e., more accurate public forecasters predicted lower risks).

Overall, our findings challenge the hope that near-term accuracy can reliably identify forecasters with more credible long-term risk predictions. These results are consistent with the analysis from the original XPT report, which found that the near-term forecasts of the “AI-concerned” group (the third of participants with the highest forecast of AI extinction risk by 2100) and the “AI-skeptic” group (the third of participants with the lowest forecast of AI extinction risk by 2100) were in strong agreement (see Table 26 in Appendix 4). The same was also true for superforecasters and domain experts (see Table 28 in Appendix 4).

Figure 4.1: XPT participants’ forecasts on catastrophic and extinction risks by 2100. “Catastrophic risk” is defined as the probability of 10% or more of humans dying within a five-year period (except for pathogen risks, which use a 1% threshold). “Extinction risk” is defined as the probability of human extinction or a reduction of the global population below 5,000. Participants are divided into quartiles based on their near-term accuracy, from least (1) to most (4) accurate. Error bars represent 95% bootstrap confidence intervals for the median risk forecast within each quartile. Only forecasters with at least 10 resolved near-term forecasts are included. Labels show the Spearman rank correlation between individual-level accuracy and long-term risk forecasts as well as the corresponding p-value.
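The quantities described in the caption of Figure 4.1 (accuracy quartiles, bootstrap confidence intervals for the median, and a Spearman rank correlation) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for the report. The function names, the convention that a higher score means higher accuracy, the percentile bootstrap, and the example data are all assumptions made for illustration. The p-value shown in the figure labels is omitted here.

```python
import random
import statistics


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def accuracy_quartiles(scores):
    """Quartile 1 (least accurate) to 4 (most accurate).

    Assumption: a higher score means a more accurate forecaster.
    """
    n = len(scores)
    return [min(4, int((r - 1) * 4 / n) + 1) for r in average_ranks(scores)]


def bootstrap_median_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo_idx = round((1 - level) / 2 * n_boot)
    hi_idx = round((1 + level) / 2 * n_boot) - 1
    return medians[lo_idx], medians[hi_idx]


# Hypothetical data for eight forecasters (NOT from the XPT):
accuracy = [0.2, 0.5, 0.1, 0.9, 0.7, 0.4, 0.8, 0.3]  # higher = more accurate
risk_pct = [5.0, 1.0, 10.0, 2.0, 0.5, 3.0, 6.0, 1.5]  # long-term risk forecast, %

print("Spearman correlation:", round(spearman(accuracy, risk_pct), 3))
print("Accuracy quartiles:", accuracy_quartiles(accuracy))
print("95% bootstrap CI for the median risk forecast:", bootstrap_median_ci(risk_pct))
```

In practice the confidence interval would be computed separately within each accuracy quartile, and only for forecasters with at least 10 resolved near-term forecasts, as described in the caption.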

5. Conclusions

This report provides the first empirical assessment of forecasting accuracy in the Existential Risk Persuasion Tournament (XPT). We conclude by discussing the limitations of this work and highlighting next steps.

5.1 Limitations

Some methodological limitations should be considered when interpreting our results:

  • Limited statistical power. With only 38 resolved subquestions—further subdivided across different domains—our ability to detect statistically significant differences between forecaster groups is constrained. Most observed accuracy differences between groups did not reach statistical significance, limiting any conclusions about relative expertise.
  • Limited implications for long-term risks. This analysis covers only questions resolved by mid-2025. Despite observing, for example, faster-than-expected AI progress, this short timeframe provides a limited basis for updating beliefs on long-term existential risks.
  • Non-representative expert sample. The XPT relied on a non-representative expert sample with a 34% attrition rate by the end of the tournament. (See Appendix 1 in the original XPT report.) The experts who participated may not accurately represent the broader expert communities in their respective fields.
  • Post-hoc benchmark definition. Simple algorithmic benchmarks (no change, extrapolation) were developed after data collection rather than defined a priori. This post-hoc approach may introduce hindsight bias and make tournament participants appear less accurate than they actually were.
  • Ambiguous resolutions. Although 38 subquestions have resolved, our confidence in each resolution varies from question to question. Many questions have been definitively resolved (i.e., according to the criteria specified in the original XPT report), but others have provisional resolutions that may change in the future. For more details on potentially ambiguous resolutions, see Appendix 1.

5.2 Looking Forward to 2030

While the questions resolved by mid-2025 have provided valuable initial insights, we are looking forward to the next wave of questions set to resolve in 2030. These questions will offer deeper insights into potential existential risks:

  • AI development and impact. Given the faster-than-expected progress on AI benchmarks, we are interested in tracking whether this acceleration continues in the coming years. Question #51 asks whether Nick Bostrom affirms the existence of AGI by 2030; superforecasters assigned this just a 1% probability, compared to domain experts’ 9%. Another key milestone is Question #44 (“Date of first publicly known advanced AI”), for which superforecasters predicted 2060 while domain experts predicted 2046. Beyond technical advancements, we will assess broader economic impacts through forecasts on US computer R&D spending (Question #37), labor force participation in OECD countries (Question #38), and the percentage of US GDP from software and information services (Question #36).
  • Climate trajectory and technology. Critical climate questions with 2030 resolution dates include global surface temperature change (Question #25), where superforecasters predicted 1.47°C warming versus domain experts’ 1.4°C estimate. We will also assess progress on climate technologies through questions about green hydrogen production costs (Question #30), direct air carbon capture (Question #29), and electricity share from solar and wind energy (Question #28). These resolutions will be particularly telling given the overestimation of climate technology development observed so far.
  • Global risk forecasts. While most existential risk forecasts for 2030 were very low, we will track several important risk predictions that resolve by this date. For public health emergencies, both superforecasters and domain experts predicted approximately two declarations of a public health emergency of international concern (PHEIC) with at least 10,000 deaths by 2030 (Question #22). We will also monitor forecasts about nuclear weapon use causing significant casualties (Question #31). As these and other 2030 questions resolve, they will also enable us to answer crucial meta-questions: What is the relationship between near- and medium-term (five to eight years) forecasting accuracy? Do forecasters with high medium-term accuracy make systematically different predictions on long-term existential risks?

5.3 Next Steps

Building on the insights from this initial analysis, we plan to take the following next steps:

  • Develop specialized AI insights. Given the faster-than-expected progress on AI benchmarks, researchers at the Forecasting Research Institute are in the process of launching multiple dedicated projects to better understand the likely future trajectories and impacts of AI. These projects include establishing a longitudinal panel of AI experts and conducting a survey of economists on AI’s potential economic and labor market effects.
  • Track future resolutions. We will continue tracking the resolution of questions posed in the XPT. We may also re-engage the original XPT participants and gather data on how their forecasts have changed in light of recent AI advances and other developments.

Notes

  1. Karger, Ezra, Josh Rosenberg, Zach Jacobs, et al. “Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament.” FRI Working Paper #1. Forecasting Research Institute, 2023. https://forecastingresearch.org/research/existential-risk-persuasion-tournament.
  2. Karger, Ezra, Josh Rosenberg, Zach Jacobs, et al. “Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament.” FRI Working Paper #1. Forecasting Research Institute, 2023. https://forecastingresearch.org/research/existential-risk-persuasion-tournament.
  3. Leech, Gavin, and Misha Yagudin. “Can Policymakers Trust Forecasters?” Institute for Progress, March 7, 2023. https://ifp.org/can-policymakers-trust-forecasters/.
  4. Clemen, Robert T. “Combining Forecasts: A Review and Annotated Bibliography.” International Journal of Forecasting 5, no. 4 (1989): 559–583. https://doi.org/10.1016/0169-2070(89)90012-5.
  5. Karger, Ezra, Joshua Monrad, Barbara Mellers, and Philip Tetlock. “Reciprocal Scoring: A Method for Forecasting Unanswerable Questions.” SSRN Working Paper, October 31, 2021. https://doi.org/10.2139/ssrn.3954498.
  6. Google DeepMind. “Advanced Version of Gemini with Deep Think Officially Achieves Gold-Medal Standard at the International Mathematical Olympiad.” Google DeepMind Blog, July 21, 2025. https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/.
  7. Google DeepMind. “Advanced Version of Gemini with Deep Think Officially Achieves Gold-Medal Standard at the International Mathematical Olympiad.” Google DeepMind Blog, July 21, 2025. https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/; OpenAI. “We achieved gold medal-level performance 🥇on the 2025 International Mathematical Olympiad with a general-purpose reasoning LLM!” X (formerly Twitter), July 19, 2025. https://x.com/OpenAI/status/1946594928945148246.
  8. Cotra, Ajeya, and Kelsey Piper. “Language Models Surprised Us.” Planned Obsolescence (blog), August 2023. https://www.planned-obsolescence.org/language-models-surprised-us/.