{"id":812,"date":"2025-09-02T12:00:00","date_gmt":"2025-09-02T12:00:00","guid":{"rendered":"https:\/\/forecastingresearch.org\/?post_type=research&#038;p=812"},"modified":"2026-05-04T17:42:48","modified_gmt":"2026-05-04T17:42:48","slug":"near-term-xpt-accuracy","status":"publish","type":"research","link":"https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy","title":{"rendered":"Assessing Near-Term Accuracy in the Existential Risk Persuasion Tournament"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\" id=\"abstract\">Abstract<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In June\u2013October 2022, we convened 169 people to participate in the \u201cExistential Risk Persuasion Tournament\u201d (<a href=\"https:\/\/forecastingresearch.org\/research\/existential-risk-persuasion-tournament\" id=\"876\" target=\"_blank\" rel=\"noreferrer noopener\">XPT<\/a>). The XPT participants included both superforecasters with proven forecasting track records and domain experts with subject-matter expertise. The tournament incentivized accurate forecasting and persuasive argumentation about long-term risks humanity may face, including risks from artificial intelligence (AI), climate change, nuclear war, and pandemics. This report analyzes respondents\u2019 forecasting accuracy on 38 near-term questions that resolved by mid-2025. 
Key findings include: (a) there was overall performance parity between superforecasters and domain experts, with both groups underestimating AI progress and overestimating improvements in climate technology; (b) both superforecasters and domain experts substantially outperformed a baseline of educated members of the general public; (c) at the individual level, the median superforecaster and median domain expert performed statistically indistinguishably from simple extrapolation algorithms; (d) at the aggregate level, superforecasters and domain experts showed improved accuracy and some evidence of outperforming simple extrapolation algorithms; (e) there was no statistically significant correlation between near-term accuracy and long-term existential risk forecasts.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"btn orange\" href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">View the&nbsp;full PDF report <svg width=\"7\" height=\"9\" viewBox=\"0 0 7 9\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n  <path d=\"M0.000156283 8.60806L4.22416 4.33606V4.24006L0.000156283 6.10352e-05H1.80816L6.06416 4.28806L1.80816 8.60806H0.000156283Z\" fill=\"#102B23\"\/>\n<\/svg>\n<svg width=\"8\" height=\"10\" viewBox=\"0 0 8 10\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n  <path d=\"M0.601719 8.85794L4.82572 4.58594V4.48994L0.601719 0.249939H2.40972L6.66572 4.53794L2.40972 8.85794H0.601719Z\" fill=\"#102B23\"\/>\n<\/svg><\/a><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group is-vertical is-layout-flex wp-container-core-group-is-layout-4fc3f8e1 wp-block-group-is-layout-flex\">\n<details class=\"wp-block-details is-layout-flow wp-block-details-is-layout-flow\"><summary>Acknowledgments<\/summary>\n<p class=\"wp-block-paragraph\">This research would not have been possible without the support of the\nMusk 
Foundation, Open Philanthropy, and the Long-Term Future Fund. We\ngreatly appreciate the assistance and input of Sam Glover, Rory Svarc,\nand Bridget Williams throughout the project.<\/p>\n<\/details>\n\n\n\n<details class=\"wp-block-details is-layout-flow wp-block-details-is-layout-flow\"><summary>Disclaimers<\/summary>\n<p class=\"wp-block-paragraph\">The views expressed in this paper do not necessarily reflect those of\nthe Federal Reserve Bank of Chicago or the Federal Reserve System.<\/p>\n<\/details>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"executive-summary\">Executive Summary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This report evaluates the <strong>accuracy of near-term forecasts<\/strong> made by domain experts and superforecasters in the Existential Risk Persuasion Tournament (<a href=\"https:\/\/forecastingresearch.org\/research\/existential-risk-persuasion-tournament\" id=\"876\" target=\"_blank\" rel=\"noreferrer noopener\">XPT<\/a>).<sup data-fn=\"b74fb551-a400-43f3-87ca-68d3140be3e5\" class=\"fn\"><a href=\"#b74fb551-a400-43f3-87ca-68d3140be3e5\" id=\"b74fb551-a400-43f3-87ca-68d3140be3e5-link\">1<\/a><\/sup><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"background\">Background<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The XPT tournament took place in June\u2013October 2022. The tournament\nconvened 169 participants to generate probabilistic forecasts about\nhumanity\u2019s long-term future and potential global risks such as climate\nchange, nuclear war, pandemics, and artificial intelligence (AI). Of\nthese participants, 89 were superforecasters with track records of high\naccuracy on near-term questions, while 80 were domain experts. In\naddition, we sampled hundreds of public participants for comparison. 
The\nXPT represents the largest existential risk forecasting tournament to\ndate, uniquely combining superforecasters and domain experts to predict\nhumanity\u2019s long-term risks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The tournament included 59 forecasting questions set to resolve at\ndates ranging from mid-2024 to as late as 2100. These questions broke\ndown into 172 subquestions over multiple forecasting horizons and, in\nsome cases, across different countries. Out of these 172 subquestions,\n38 have known outcomes (i.e., are \u201cresolved\u201d) as of mid-2025. We note\nthat the XPT tournament concluded prior to the public release of ChatGPT\nin November 2022.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"key-findings-on-accuracy\">Key Findings on Accuracy<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Performance parity between superforecasters and domain\nexperts.<\/strong> The near-term questions revealed no meaningful\naccuracy differences between superforecasters and experts forecasting on\nquestions within their domain of expertise. Both groups achieved nearly\nidentical accuracy scores. The performance gap between the most- and\nleast-accurate XPT participant groups spanned just 0.18 standard\ndeviations, comparable to the difference between median and slightly\nabove-median performance. 
These small differences were not statistically\nsignificant, indicating that neither a proven forecasting track record\nnor domain expertise provided a consistent edge for these near-term\npredictions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Individual forecasters outperformed public participants but\nnot simple algorithms.<\/strong> Both superforecasters and domain experts\nstrongly outperformed a sample of educated public participants, who\nscored 1.82 standard deviations below the median XPT participant.\nHowever, individual forecasters\u2019 performance was not statistically\ndistinguishable from two simple algorithms: a \u201cno-change\u201d forecast and\ntrend extrapolation. These simple algorithms performed well partly\nbecause many questions involved low-probability events (which did not\noccur) or slow-moving variables (where trends persisted).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Aggregate forecasts demonstrated the wisdom of\ncrowds.<\/strong> Median aggregation of XPT participants\u2019 forecasts\nachieved a substantial improvement over individual performance,\nincreasing accuracy by roughly 1 standard deviation. These aggregated\npredictions showed weak but positive evidence of outperforming the\n\u201cno-change\u201d forecast, though not trend extrapolation. This finding\nreinforces the well-established principle that combining multiple\nforecasts improves accuracy.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"main-insights-across-subject-areas\">Main Insights across Subject\nAreas<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Despite the strong overall performance of aggregate forecasts, XPT\nparticipants systematically misjudged progress in specific domains.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Respondents underestimated AI progress, especially\nsuperforecasters.<\/strong> XPT participants significantly underestimated\nthe pace of AI advancement across multiple benchmarks. 
For three\nstandard AI benchmarks\u2014MATH, MMLU, and QuALITY\u2014domain experts assigned\nprobabilities of 21.4%, 25.0%, and 43.5% respectively to the outcomes\nachieved by the end of 2024. Superforecasters were even more\npessimistic, assigning only 9.3%, 7.2%, and 20.1% respectively. The\nInternational Mathematical Olympiad results proved particularly\nsurprising: AI systems achieved gold-level performance in July 2025, an\noutcome to which domain experts assigned only an 8.6% probability and\nsuperforecasters a mere 2.3% probability. Overall, superforecasters\nassigned an average probability of just 9.7% to the observed outcomes\nacross these four AI benchmarks, compared to 24.6% from domain\nexperts.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Climate technology progress was overestimated.<\/strong> In\ncontrast to AI, forecasters were overly optimistic about the development\nof green technology. In 2024, the cost of hydrogen produced using\nrenewable electricity remained higher than anticipated at $7.50 USD\/kg\n(median forecasts of $4.50 by superforecasters and $3.50 USD\/kg by\ndomain experts), and direct air CO\u2082 capture technology captured only\n0.01 MtCO\u2082\/year (median forecasts of 0.32 by superforecasters and 0.60\nMtCO\u2082\/year by domain experts).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"implications-for-long-term-risks\">Implications for Long-Term\nRisks<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>No correlation between near-term accuracy and long-term\nexistential risk forecasts.<\/strong> There was no statistically\nsignificant correlation between forecasters\u2019 near-term accuracy and\ntheir forecasts of long-term risks. Ideally, we would use near-term\nforecasting ability to assess the reliability of forecasts about\nhumanity\u2019s long-term future. Unfortunately, in our XPT data, near-term\nforecasting accuracy did not consistently align with any particular\nposition on long-term risks. 
Overall, near-term forecasting accuracy\nprovides limited evidence at present for identifying who makes the most\ncredible long-term risk forecasts.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"next-steps\">Next Steps<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Given the faster-than-expected progress on AI capabilities, it is\nmore important than ever to understand the likely future trajectory and\nimpact of AI. In our current and future work, we aim to shed more light\non these questions. Our current projects on this front include a\nlongitudinal panel of AI experts and a survey of economists on the\nexpected economic impacts of AI. Through these systematic efforts to\ngather expert perspectives, we will provide empirically grounded\ninsights that can inform policy and decision-making.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"introduction\">1. Introduction<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The Existential Risk Persuasion Tournament (XPT)<sup data-fn=\"432ddbe7-e811-4731-9a04-d548b6479c81\" class=\"fn\"><a href=\"#432ddbe7-e811-4731-9a04-d548b6479c81\" id=\"432ddbe7-e811-4731-9a04-d548b6479c81-link\">2<\/a><\/sup> convened 169 participants from June to October 2022 to forecast questions about humanity\u2019s long-term future and the impact of global risks such as climate change, nuclear war, pandemics, and artificial intelligence (AI). Of these 169 participants, 89 were experienced forecasters with a track record of high accuracy on near-term questions (\u201csuperforecasters\u201d), and the other 80 were specialists working in domains related to global risks and humanity\u2019s future (\u201cexperts\u201d). Additionally, hundreds of public participants provided their answers to the same forecasting questions in 2023 and 2024.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We recruited superforecasters with assistance from <a href=\"https:\/\/goodjudgment.com\/\"><u>Good Judgment, Inc<\/u><\/a>. 
To find\nexperts, we contacted organizations, academic departments, and research\nlabs working on existential-risk-related issues; we also made several\nposts via social media and websites such as the Effective Altruism\nForum. We received hundreds of expressions of interest and offered slots\nto the most qualified among the interested applicants. The final expert\nsample included 32 AI experts, 12 biorisk experts, 12 nuclear experts, 9\nclimate experts, and 15 \u201cgeneral\u201d experts who study existential risks\nmore broadly (referred to as \u201cx-risk generalists\u201d). Many in the expert\npool were affiliated with the Effective Altruism (EA) community; 42% of\nexperts participating in the XPT reported having attended an EA meetup\nin the past.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The median expert in the XPT forecast a 20% probability of global\ncatastrophe\u2014defined as a loss of at least 10% of the global\npopulation\u2014and a 6% probability of human extinction by 2100.\nSuperforecasters viewed the world as less risky, forecasting a 9% and 1%\nprobability of global catastrophe and human extinction by 2100,\nrespectively. This held across domains, though not uniformly:\nsuperforecasters and experts were much further apart on risk related to\nAI than on the risk of nuclear war.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Participants in the tournament forecast on questions set to resolve\nat various dates ranging from as early as mid-2024 to as late as 2100.\nThe 59 forecasting questions in the XPT broke down into 172\nsubquestions. Of these, 32 questions (38 subquestions) have resolved as\nof the writing of this report. 
The resolved questions provide us with a\nunique opportunity to evaluate forecasting accuracy across different\nexpertise groups, identify key surprises, and explore the relationship\nbetween near-term forecasting accuracy and predictions of long-term\nexistential risks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">While we analyze all resolved questions in our dataset, our confidence in resolutions varies across questions (see <a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=22\" target=\"_blank\" rel=\"noreferrer noopener\">Table A1.2<\/a> in the Appendix). Of the 32 resolved forecasting questions, 47% (15\/32) have been definitively resolved based on authoritative data sources, while 53% (17\/32) have been provisionally resolved based on available evidence or expert consultation. These provisional resolutions reflect two constraints. First, some questions require expert panels for adjudication (particularly biorisk questions lacking clear ground truth). Second, others await authoritative data publications like International Energy Agency (IEA) reports or labor statistics from the Organisation for Economic Co-operation and Development (OECD).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"forecasting-performance\">2. Forecasting Performance<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"accuracy-metrics\">2.1 Accuracy Metrics<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">We measure forecasting performance using two main accuracy\nmetrics.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Our primary accuracy metric is the <strong>Accuracy Score<\/strong>.\n<em>Accuracy Score<\/em> evaluates forecasting performance using the\noriginal XPT scoring rules (log score for binary questions; S score for\ncontinuous questions). <em>Accuracy Score<\/em> is standardized to\nmeasure performance relative to the median XPT participant (i.e., the\nmedian across experts and superforecasters). 
For example, an\n<em>Accuracy Score<\/em> of 0.25 means a forecaster was 0.25 standard\ndeviations more accurate than the median XPT participant. An\n<em>Accuracy Score<\/em> of 0.25 would place them roughly in the top 40%\nof all forecasters. Higher <em>Accuracy Score<\/em> values indicate\nbetter accuracy.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Our secondary accuracy metric is <strong>Standardized Absolute\nForecast Error<\/strong> (<em>SAFE<\/em>). <em>SAFE<\/em> measures how\n\u201csurprising\u201d the actual outcome was relative to forecasters\u2019\nexpectations. For example, a <em>SAFE<\/em> of 1.0 means the outcome was\n1 standard deviation from the forecast\u2014corresponding to a moderate but\nnot extreme surprise. Lower <em>SAFE<\/em> values indicate better\naccuracy.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Table 2.1 provides a summary and interpretation of our main accuracy metrics. Technical details are provided in the <a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=34\" target=\"_blank\" rel=\"noreferrer noopener\">Appendix<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><div class=\"table-wrapper\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Metric<\/strong><\/td><td><strong>Description<\/strong><\/td><td><strong>Interpretation<\/strong><\/td><td><strong>Use case<\/strong><\/td><\/tr><tr><td><strong>Accuracy Score<\/strong>\n<em>(primary)<\/em><\/td><td><p>Average standardized score across all questions.<\/p><br><p>Higher values are better.<\/p><\/td><td><em>Accuracy Score<\/em> = 0.25 means a forecaster is 0.25 standard deviations more accurate than the median XPT participant, placing them roughly in the top 40% of all forecasters.<\/td><td>Used to measure relative performance.<\/td><\/tr><tr><td><strong>Standardized Absolute Forecast\nError, <em>SAFE<\/em><\/strong> <em>(secondary)<\/em><\/td><td><p>Average absolute forecast error in units of predictive standard deviations.<\/p><br><p>Lower 
values are better.<\/p><\/td><td><em>SAFE<\/em> = 1.0 means outcomes are on average one standard deviation from forecasters\u2019 expectations (corresponding to 16<sup>th<\/sup>\/84<sup>th<\/sup> percentile realizations).<\/td><td>Used to measure absolute performance; how\n\u201csurprising\u201d questions were to forecasters.<\/td><\/tr><\/tbody><\/table><\/div><figcaption class=\"wp-element-caption\"><strong>Table 2.1:<\/strong> Primary and secondary accuracy metrics used to evaluate forecasting accuracy.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">To provide an apples-to-apples comparison between the different\ngroups of XPT participants, <strong>we calculate accuracy metrics at the\nindividual forecaster level.<\/strong><\/p>\n\n\n\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\">\n<h4 class=\"wp-block-heading\" id=\"box-1-individual-versus-aggregate-forecasts\">Box 1: Individual\nversus aggregate forecasts<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">When analyzing forecasting performance, it is important to\ndistinguish between individual and aggregate forecasts. <em>Individual\nforecasts<\/em> represent each forecaster\u2019s predictions, while\n<em>aggregate forecasts<\/em> combine predictions from multiple\nforecasters within a group.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For individual-level accuracy, we calculate metrics for each\nforecaster separately, and then take the median across all individuals\nin a group. For aggregate-level accuracy, we first combine forecasts via\nmedian aggregation, and then calculate the accuracy of that combined\nforecast.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To compare accuracy between different groups of XPT participants, we\nuse individual-level metrics. This is important to ensure fair\ncomparisons. 
The median subquestion in the XPT has 32 superforecaster\npredictions versus only 4 domain-expert predictions. Since aggregating\nmore forecasters improves accuracy via a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Wisdom_of_the_crowd\"><u>wisdom-of-the-crowd\neffect<\/u><\/a>, comparing group aggregates would unfairly advantage\nsuperforecasters due to their greater sample size.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Outside of group comparisons, however, we primarily analyze aggregate forecasts. In particular, we use aggregate forecasts when examining substantive questions\u2014such as whether forecasters correctly anticipated AI progress or developments in climate technology (see <a href=\"#key-surprises-and-insights\">Section 3<\/a>). The reason is that aggregation yields more accurate predictions. As a result, the aggregate forecasts produce the most reliable measure of the XPT participants\u2019 collective judgment.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"relative-accuracy\">2.2 Relative Accuracy<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"main-results\">2.2.1 Main Results<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Figure 2.1 summarizes the overall forecasting performance.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The graph provides the <em>Accuracy Score<\/em> of the median XPT\nparticipant (i.e., individual-level accuracy) by subgroup:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Superforecaster<\/em>: Forecasters with a proven track record\nof high accuracy on near-term forecasting questions;<\/li>\n\n\n\n<li><em>Domain Expert<\/em>: Subject-matter experts answering\nquestions within their specific area of expertise;<\/li>\n\n\n\n<li><em>Non-domain Expert<\/em>: Subject-matter experts answering\nquestions outside their primary area of expertise;<\/li>\n\n\n\n<li><em>X-risk Generalist<\/em>: Experts specializing in existential\nrisks.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Note that the same expert 
may be classified differently across questions. For example, an AI expert is classified as a domain expert when forecasting progress on AI benchmarks but as a non-domain expert when predicting green hydrogen costs. The sample size and composition of the final dataset is provided in the Appendix (<a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=37\" target=\"_blank\" rel=\"noreferrer noopener\">Table A3.1<\/a>). Accuracy results at the group level are given in the Appendix (<a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=39\" target=\"_blank\" rel=\"noreferrer noopener\">Table A3.3<\/a>).<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\" id=\"accuracy-score\"><img loading=\"lazy\" decoding=\"async\" width=\"2048\" height=\"819\" src=\"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-1.png\" alt=\"\" class=\"wp-image-816\" srcset=\"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-1.png 2048w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-1-350x140.png 350w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-1-700x280.png 700w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-1-768x307.png 768w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-1-1536x614.png 1536w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-1-2000x800.png 2000w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-1-1200x480.png 1200w, 
https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-1-150x60.png 150w\" sizes=\"auto, (max-width: 2048px) 100vw, 2048px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 2.1:<\/strong> For each group (non-domain expert,\ndomain expert, superforecaster, x-risk generalist), the error bars\nindicate the Accuracy Score of the median individual in that group. The\nwhiskers provide 95% bootstrap confidence intervals.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Overall, <strong>performance differences between groups were small<\/strong>, with only a 0.18 standard-deviation gap between the top and bottom groups. For context, a difference of 0.18 in the <em>Accuracy Score<\/em> corresponds to a difference of approximately 8 percentiles\u2014comparable to the difference between someone performing at the median (50<sup>th<\/sup> percentile) versus someone performing slightly above average (around the 58<sup>th<\/sup> percentile). Superforecasters and domain experts achieved an almost identical <em>Accuracy Score<\/em>. Intuitively, these results indicate that there was no consistent pattern in accuracy: for some questions, domain experts were more accurate; for others, superforecasters were closer to the truth. In the Appendix (<a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=20\" target=\"_blank\" rel=\"noreferrer noopener\">Table A1.1<\/a>), we provide a question-by-question table with superforecaster and domain expert predictions (group-level aggregates) and their forecast errors, highlighting the same pattern.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Consistent with the above finding, the <strong>performance differences between groups were not statistically significant<\/strong>, as we document below in Figure 2.3. 
Therefore, we cannot confidently conclude that superforecasters, domain experts, or other groups demonstrated meaningfully higher forecasting accuracy. This finding is consistent with <a href=\"https:\/\/ifp.org\/can-policymakers-trust-forecasters\/\" id=\"https:\/\/ifp.org\/can-policymakers-trust-forecasters\/\" target=\"_blank\" rel=\"noreferrer noopener\">previous research<\/a> showing that superforecasters do not have a consistent edge over domain experts (or vice versa).<sup data-fn=\"29474279-fe20-4eb4-bedd-f607b5805c27\" class=\"fn\"><a href=\"#29474279-fe20-4eb4-bedd-f607b5805c27\" id=\"29474279-fe20-4eb4-bedd-f607b5805c27-link\">3<\/a><\/sup><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Domain experts were slightly more accurate when predicting within\ntheir area of expertise. However, this difference was small in absolute\nvalue (a difference of 0.05 in the <em>Accuracy Score<\/em>) and not\nstatistically significant. This finding suggests limited gains from\nspecialized knowledge in this specific forecasting context.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"performance-against-benchmarks\">2.2.2 Performance against\nBenchmarks<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Next, we compare the quality of predictions made by XPT participants\n(experts and superforecasters) to two benchmarks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Sample of public participants;<\/li>\n\n\n\n<li>Simple prediction algorithms (see \u201c<a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=31\" target=\"_blank\" rel=\"noreferrer noopener\">Methods<\/a>\u201d in the Appendix for details):\n<ul class=\"wp-block-list\">\n<li>Naive \u201cno-change\u201d forecast (predict no change);<\/li>\n\n\n\n<li>Naive \u201cextrapolation\u201d forecast (extrapolate the current\ntrend).<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Since we did not elicit the full set of quantile predictions for the\npublic-participant sample, only 
the 50<sup>th<\/sup>-percentile\npredictions are used for this benchmarking exercise.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2048\" height=\"585\" src=\"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-2.png\" alt=\"\" class=\"wp-image-817\" srcset=\"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-2.png 2048w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-2-350x100.png 350w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-2-700x200.png 700w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-2-768x219.png 768w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-2-1536x439.png 1536w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-2-2000x571.png 2000w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-2-1200x343.png 1200w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-2-150x43.png 150w\" sizes=\"auto, (max-width: 2048px) 100vw, 2048px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 2.2:<\/strong> For each group (domain expert,\nnon-domain expert, x-risk generalist, superforecaster, public), the bars\nindicate the Accuracy Score of the median individual in that group. For\nthe two prediction algorithms (no change and extrapolated), the bars\ndirectly indicate their performance. The whiskers indicate 95% bootstrap\nconfidence intervals. 
Only 50<sup>th<\/sup>-percentile predictions are\nused in the construction of this graph. The y-axis is log-transformed,\nso visual distances may understate true differences.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Figure 2.2 provides the comparison with our benchmarks. We observe\nthe following takeaways:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>XPT participants outperformed public\nparticipants.<\/strong> The median public participant performed\nsubstantially worse than XPT forecasters, with an <em>Accuracy\nScore<\/em> of \u20131.82. This underperformance is large: a 1.82-point gap in\nthe <em>Accuracy Score<\/em> corresponds to the difference between\nparticipants at the 50<sup>th<\/sup> and 3<sup>rd<\/sup> percentiles of\nthe forecasting accuracy distribution. As shown below in Figure 2.3,\nthis difference is weakly statistically significant (<em>p<\/em> &lt;\n0.10) when comparing public participants to the full XPT sample. Domain\nexperts and non-domain experts showed stronger outperformance\n(<em>p<\/em> &lt; 0.05), while superforecasters exhibited weaker\noutperformance (<em>p<\/em> &lt; 0.10).<\/li>\n\n\n\n<li><strong>The median XPT participant did not outperform statistical\nbenchmarks.<\/strong> The accuracy differences between individual XPT\nparticipants and statistical benchmarks were small and not statistically\nsignificant. In fact, the simple \u201cno-change\u201d benchmark (<em>Accuracy\nScore<\/em> of 0.03) slightly outperformed both the median XPT\nparticipant and the median superforecaster, highlighting the difficulty\nof beating naive statistical rules.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">We note that certain features of the XPT tournament may have favored simple prediction algorithms. First, a substantial portion of subquestions (8\/38) concerned low-probability events that did not occur during the resolution period. 
These included questions about biological and nuclear weapon use (for example, Q15\u201318 and Q31). For all these subquestions, the no-change prediction of zero matched the actual outcome perfectly. Second, several questions tracked slowly evolving variables for which historical trends provide strong predictive power, such as labor force participation rates (Q38) and nuclear warhead counts (Q33). By contrast, in dynamic domains like AI, these simple algorithms performed substantially worse. As we document in the Appendix (<a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=40\" target=\"_blank\" rel=\"noreferrer noopener\">Table A3.4<\/a>), the no-change and extrapolation algorithms achieved <em>SAFE<\/em> scores of 1.89 and 1.35 respectively on AI questions\u2014substantially worse than their full-sample values of 1.04 and 0.94.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, we statistically test the relative performance of different\nforecasts, including aggregated group-level predictions. 
The results are\nprovided in Figure 2.3.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2048\" height=\"1638\" src=\"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-3.png\" alt=\"\" class=\"wp-image-818\" srcset=\"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-3.png 2048w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-3-350x280.png 350w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-3-700x560.png 700w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-3-768x614.png 768w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-3-1536x1229.png 1536w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-3-2000x1600.png 2000w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-3-1200x960.png 1200w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-02-3-150x120.png 150w\" sizes=\"auto, (max-width: 2048px) 100vw, 2048px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 2.3:<\/strong> Comparison of Accuracy Score differences\nacross different forecasts; only 50<sup>th<\/sup>-percentile predictions\nare used to calculate the Accuracy Score. Bootstrap 95% confidence\nintervals appear in parentheses.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A key insight that emerges is that <strong>aggregated XPT forecasts were substantially more accurate and showed evidence of outperforming statistical benchmarks<\/strong>. 
Consistent with the forecasting literature,<sup data-fn=\"0dc7ab8f-55a4-41f9-a26b-b1f679ef9aba\" class=\"fn\"><a href=\"#0dc7ab8f-55a4-41f9-a26b-b1f679ef9aba\" id=\"0dc7ab8f-55a4-41f9-a26b-b1f679ef9aba-link\">4<\/a><\/sup> aggregated forecasts substantially outperformed individual forecasts. The aggregate of all XPT participants achieved an <em>Accuracy Score<\/em> of 0.97 when using all quantile predictions and 0.78 when only median forecasts were used\u2014a large improvement over the median individual participant (see <a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=39\" target=\"_blank\" rel=\"noreferrer noopener\">Table A3.3<\/a> in the Appendix). While the aggregated forecast outperformed both naive benchmarks by a large margin in absolute terms, statistical significance varied. The aggregated forecast showed weak statistical evidence of outperforming the \u201cno-change\u201d benchmark (<em>p<\/em> &lt; 0.10), but its advantage over the \u201cextrapolation\u201d benchmark was not statistically significant.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Due to the limited number of resolved questions, our statistical\npower to detect small accuracy differences between groups is\nconstrained. However, Figure 2.3 shows that, at the individual level,\nthe 95% confidence interval for the accuracy difference between\nsuperforecasters and domain experts is (-0.8, 0.3). Here, negative\nnumbers indicate greater accuracy by domain experts. This finding allows\nus to rule out large performance differences: with 95% confidence, the\ntrue accuracy gap between these groups is less than 0.8 standard\ndeviations in either direction.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"robustness-tests-and-other-analyses\">2.2.3 Robustness Tests and\nOther Analyses<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">A natural concern when evaluating forecasting performance is whether the results depend on the chosen accuracy metric. 
To address this concern, we examined forecasting performance using six different accuracy measures, including our primary <em>Accuracy Score<\/em> and alternative metrics like standardized absolute forecast error (<em>SAFE<\/em>), percentile accuracy, and mean standardized squared error; see <a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=38\" target=\"_blank\" rel=\"noreferrer noopener\">Table A3.2<\/a> in the Appendix for the full results. Our <strong>core findings remain robust across all metrics<\/strong> (i.e., the differences between XPT participant groups remain small; XPT participants outperform public participants; individual XPT participants have similar accuracy to the two naive statistical benchmarks).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the Appendix (<a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=44\" target=\"_blank\" rel=\"noreferrer noopener\">Appendix 4: Forecaster Calibration<\/a>), we also analyze forecaster calibration. Overall, we find that <strong>forecasters are overconfident at the individual level but well-calibrated when aggregated at the group level<\/strong>. At the individual level, forecasters are overconfident when predicting less likely tail events (i.e., they underestimate the probability of tail events). The fact that group-level forecasts are well-calibrated provides additional confidence when using predictive standard deviations to calculate the SAFE metric, as the group-level predictive standard deviations appear to accurately reflect the uncertainty present in the real world.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, we examined whether near-term forecasting accuracy correlates with <em>intersubjective accuracy<\/em>, i.e., participants\u2019 ability to predict other forecasters\u2019 predictions. 
While previous research has found that intersubjective accuracy often correlates with real-world forecasting performance,<sup data-fn=\"3c07f1eb-5415-496b-a0b6-e6e41bdab770\" class=\"fn\"><a href=\"#3c07f1eb-5415-496b-a0b6-e6e41bdab770\" id=\"3c07f1eb-5415-496b-a0b6-e6e41bdab770-link\">5<\/a><\/sup> <strong>intersubjective accuracy was not correlated with near-term accuracy<\/strong> in our data (see <a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=42\" target=\"_blank\" rel=\"noreferrer noopener\">Figure A3.1<\/a> in the Appendix). This null result may suggest that intersubjective accuracy is less informative in our specific empirical context. Alternatively, our sample of 38 resolved subquestions may be too small to reliably detect a meaningful relationship.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"key-surprises-and-insights\">3. Key Surprises and Insights<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We next examine areas in which aggregate forecasts\u2014which demonstrated strong overall accuracy through a wisdom-of-the-crowd effect\u2014most notably diverged from reality. We identify the \u201cmost surprising\u201d questions based on standardized absolute forecast errors (<em>SAFE<\/em>) at the group level, revealing systematic patterns in what forecasters found difficult to predict (Tables 3.1 and 3.2). 
We first present the top-10 most surprising questions for each group, and then dive deeper into three key domains where forecasters&#8217; expectations most diverged from actual outcomes: biological weapons (<a href=\"#biological-weapons\" id=\"#biological-weapons\">Section 3.1<\/a>), climate technology (<a href=\"#climate-technology\">Section 3.2<\/a>), and artificial intelligence (<a href=\"#artificial-intelligence\">Section 3.3<\/a>).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"biological-weapons\">3.1 Biological Weapons<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Both domain experts and superforecasters overestimated the number of countries with biological weapons programs by the end of 2024.<\/strong> Experts predicted an average of 6.5 countries, while superforecasters predicted 5 countries, an overestimation by a factor of 2.5\u20133.3 relative to our projected resolution of 2 countries. For several specific countries (i.e., China, Iran, Syria, and Israel), both groups also overestimated the fraction of a panel of 100 biosecurity experts who would agree that the country has an active biological weapons program. Here, multiple countries had forecast errors with <em>SAFE<\/em> values exceeding 1, indicating moderate surprises. However, as discussed in more detail in \u201c<a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=27\" target=\"_blank\" rel=\"noreferrer noopener\">Ambiguous Resolutions<\/a>\u201d in the Appendix, it is difficult to unambiguously resolve this question, which could explain part of the apparent surprise.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><div class=\"table-wrapper\"><table class=\"has-fixed-layout\"><tbody><tr><td>Question<\/td><td>Median Forecast<\/td><td>Resolution<\/td><td>SAFE<\/td><td>N<\/td><\/tr><tr><td>45. Maximum Compute Used in an AI Experiment<\/td><td>100,000<\/td><td>578,703.7<\/td><td>1.92<\/td><td>33<\/td><\/tr><tr><td>49. 
Largest Number of Parameters in a Machine Learning Model<\/td><td>100 trillion<\/td><td>10 trillion<\/td><td>1.71<\/td><td>31<\/td><\/tr><tr><td>30. Cost of Hydrogen<\/td><td>4.5 USD\/kg<\/td><td>7.5 USD\/kg<\/td><td>1.70<\/td><td>32<\/td><\/tr><tr><td>40. &#8220;Massive Multitask Language Understanding&#8221; Benchmark<\/td><td>77.75%<\/td><td>88.7%<\/td><td>1.59<\/td><td>32<\/td><\/tr><tr><td>20. Individual Countries with Biological Weapons Programs\n(China)<\/td><td>70%<\/td><td>30%<\/td><td>1.51<\/td><td>26<\/td><\/tr><tr><td>21. Number of Countries with Biological Weapons Programs<\/td><td>5<\/td><td>2<\/td><td>1.45<\/td><td>32<\/td><\/tr><tr><td>39. MATH Dataset Benchmark<\/td><td>71%<\/td><td>87.92%<\/td><td>1.38<\/td><td>30<\/td><\/tr><tr><td>20. Individual Countries with Biological Weapons Programs\n(Iran)<\/td><td>60%<\/td><td>30%<\/td><td>1.18<\/td><td>28<\/td><\/tr><tr><td>35. GPT Revenue (Hanson Wins Bet that GPT Revenue &lt; $1B)<\/td><td>53.5%<\/td><td>0%<\/td><td>1.07<\/td><td>32<\/td><\/tr><tr><td>20. Individual Countries with Biological Weapons Programs\n(Israel)<\/td><td>40%<\/td><td>10%<\/td><td>1.01<\/td><td>27<\/td><\/tr><\/tbody><\/table><\/div><figcaption class=\"wp-element-caption\"><strong>Table 3.1:<\/strong> Most surprising questions, superforecasters (group-level forecast). The table provides the top-10 questions with the largest standardized absolute forecast errors (SAFE) for the group. N denotes the number of forecasters in the group.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-table\"><div class=\"table-wrapper\"><table class=\"has-fixed-layout\"><tbody><tr><td>Question<\/td><td>Median Forecast<\/td><td>Resolution<\/td><td>SAFE<\/td><td>N<\/td><\/tr><tr><td>32. Total Nuclear Warheads<\/td><td>9,949<\/td><td>12,331<\/td><td>2.93<\/td><td>1<\/td><\/tr><tr><td>49. Largest Number of Parameters in a Machine Learning Model<\/td><td>150 trillion<\/td><td>10 trillion<\/td><td>2.74<\/td><td>7<\/td><\/tr><tr><td>30. 
Cost of Hydrogen<\/td><td>3.5 USD\/kg<\/td><td>7.5 USD\/kg<\/td><td>2.27<\/td><td>2<\/td><\/tr><tr><td>21. Number of Countries with Biological Weapons Programs<\/td><td>6.5<\/td><td>2<\/td><td>2.17<\/td><td>4<\/td><\/tr><tr><td>29. Annual Direct Air CO2 Capture<\/td><td>0.6 Mt\/year<\/td><td>0.01 Mt\/year<\/td><td>1.52<\/td><td>7<\/td><\/tr><tr><td>20. Individual Countries with Biological Weapons Programs\n(Iran)<\/td><td>61.5%<\/td><td>30%<\/td><td>1.24<\/td><td>4<\/td><\/tr><tr><td>38. Labor Force Participation Rate in OECD<\/td><td>77.2%<\/td><td>79.86%<\/td><td>1.22<\/td><td>4<\/td><\/tr><tr><td>20. Individual Countries with Biological Weapons Programs\n(Syria)<\/td><td>52.5%<\/td><td>25%<\/td><td>1.10<\/td><td>4<\/td><\/tr><tr><td>35. GPT Revenue (Hanson Wins Bet that GPT Revenue &lt; $1B)<\/td><td>45%<\/td><td>0%<\/td><td>0.90<\/td><td>6<\/td><\/tr><tr><td>20. Individual Countries with Biological Weapons Programs\n(China)<\/td><td>51%<\/td><td>30%<\/td><td>0.79<\/td><td>3<\/td><\/tr><\/tbody><\/table><\/div><figcaption class=\"wp-element-caption\"><strong>Table 3.2:<\/strong> Most surprising questions, domain experts (group-level forecast). The table provides the top-10 questions with the largest standardized absolute forecast errors (SAFE) for the group. N denotes the number of forecasters in the group.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"climate-technology\">3.2 Climate Technology<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Forecasters were overly optimistic about progress in climate\ntechnology<\/strong>. Both groups expected a more substantial decrease in\nthe cost of hydrogen produced using renewable electricity:\nsuperforecasters expected the cost of hydrogen production to decrease to\n4.5 USD\/kg in 2024, while domain experts predicted an even greater\ndecline to 3.5 USD\/kg. By contrast, we currently project a resolution of\n7.5 USD\/kg for the question. 
The <em>SAFE<\/em> values for this question\nare in the range of 1.70\u20132.27, suggesting large surprises. Similarly,\nXPT participants anticipated greater advances in carbon removal. For\ntotal direct air capture and storage, domain experts and\nsuperforecasters predicted 0.6 and 0.32 MtCO\u2082\/year in 2024,\nrespectively, while we currently project just 0.01 MtCO\u2082\/year.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"artificial-intelligence\">3.3 Artificial Intelligence<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Both domain experts and superforecasters misjudged the pace and direction of AI progress.<\/strong> Both groups predicted lower values for the maximum compute used in an AI experiment by the end of 2024, with superforecasters underestimating the actual maximum by a factor of five. At the same time, both domain experts and superforecasters overestimated the size of the largest machine learning models by the end of 2024 (1.00E+14 parameters and 4.00E+14 parameters respectively), projecting parameter counts ten times higher than provisionally resolved (1.00E+13 parameters). However, as we note in the Appendix (<a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=22\" target=\"_blank\" rel=\"noreferrer noopener\">Section A1.2<\/a>), this overestimation likely has to do with incorrect base rate information provided to participants during the XPT tournament.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">XPT participants systematically underestimated AI progress on multiple benchmarks, with superforecasters exhibiting larger underestimation. Figure 3.1 shows the probability XPT participants assigned to observed outcomes on various AI benchmarks, calculated using an estimated density function (see \u201c<a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=31\" target=\"_blank\" rel=\"noreferrer noopener\">Methods<\/a>\u201d in the Appendix). 
GPT-4 Turbo achieved 87.82% on the MATH Dataset Benchmark in April 2024; domain experts and superforecasters had assigned a 21.4% and a 9.3% probability, respectively, to reaching this level by June 30, 2024. Both GPT-4o and Claude 3.5 Sonnet achieved 88.7% on MMLU by mid-2024, an outcome to which domain experts and superforecasters had assigned a 25.0% and a 7.2% probability, respectively, for the June 30, 2024 resolution date. RAPTOR + GPT-4 scored 69.3 on QuALITY&#8217;s hard subset in June 2023\u2014a full year before the resolution date\u2014yet domain experts and superforecasters had assigned only a 43.5% and a 20.1% probability, respectively, to this achievement by June 30, 2024. Across these three benchmarks, superforecasters assigned probabilities 12\u201323 percentage points below those of domain experts.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2048\" height=\"1024\" src=\"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-03-1.png\" alt=\"\" class=\"wp-image-819\" srcset=\"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-03-1.png 2048w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-03-1-350x175.png 350w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-03-1-700x350.png 700w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-03-1-768x384.png 768w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-03-1-1536x768.png 1536w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-03-1-2000x1000.png 2000w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-03-1-1200x600.png 1200w, 
https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-03-1-150x75.png 150w\" sizes=\"auto, (max-width: 2048px) 100vw, 2048px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 3.1:<\/strong> Superforecasters\u2019 and domain experts\u2019 predicted probabilities of observed progress on AI benchmarks. Probabilities were calculated based on the estimated probability density functions (see <a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=46\" target=\"_blank\" rel=\"noreferrer noopener\">Appendix 5<\/a>) and the observed resolution values. <a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=31\" target=\"_blank\" rel=\"noreferrer noopener\">Appendix 2<\/a> provides the methodological details on the density function estimation.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Among the most surprising developments was the performance of AI systems on the International Mathematical Olympiad (IMO). While not officially an AI benchmark, the IMO in recent years has \u201cbecome an aspirational challenge for AI systems as a test of their advanced mathematical problem-solving and reasoning capabilities.\u201d<sup data-fn=\"1cc0c597-1d54-468f-a1d6-ab72167f5257\" class=\"fn\"><a href=\"#1cc0c597-1d54-468f-a1d6-ab72167f5257\" id=\"1cc0c597-1d54-468f-a1d6-ab72167f5257-link\">6<\/a><\/sup> Domain experts and superforecasters did not expect an AI system to win a gold medal at the IMO until after 2030. 
In July 2025, both Google DeepMind and OpenAI reported that their models achieved gold-level performance in the IMO 2025 competition\u2014five years earlier than the median expert prediction and 10 years earlier than the median superforecaster prediction.<sup data-fn=\"eee4537a-752a-44af-b634-1fc8d4b8a6c2\" class=\"fn\"><a href=\"#eee4537a-752a-44af-b634-1fc8d4b8a6c2\" id=\"eee4537a-752a-44af-b634-1fc8d4b8a6c2-link\">7<\/a><\/sup> Domain experts and superforecasters had assigned only an 8.6% and a 2.3% probability, respectively, to this achievement occurring in or before 2025.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We note that the XPT tournament concluded prior to the public release of ChatGPT at the end of 2022, which marked the beginning of an intense phase of AI investment and capability acceleration. While domain experts were more calibrated to trends in AI progress than superforecasters, at times even their judgment failed to anticipate the speed of advancement. These results align closely with previous reports about how experts were surprised by progress in language models in 2022 and 2023,<sup data-fn=\"68ce7432-717c-40c6-948f-e75b48d752e3\" class=\"fn\"><a href=\"#68ce7432-717c-40c6-948f-e75b48d752e3\" id=\"68ce7432-717c-40c6-948f-e75b48d752e3-link\">8<\/a><\/sup> particularly on MMLU, MATH, and the International Mathematical Olympiad.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"long-term-risk-implications\">4. Long-Term Risk Implications<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">A key goal of the original XPT tournament was to obtain forecasts for\nlong-term risks facing humanity. XPT participants forecast two types of\nrisks: <em>catastrophic risks<\/em> (the probability of more than 10% of\nthe global population dying within a five-year period) and\n<em>extinction risks<\/em> (the probability of human extinction or a\nreduction of the global population below 5,000). 
The tournament assessed\nthese risks across multiple domains: genetically-engineered and\nnaturally-occurring pathogens, artificial intelligence, nuclear weapons,\nnon-anthropogenic causes (such as asteroids or supervolcanoes), and\noverall risk from all causes combined.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A natural question is whether more accurate near-term forecasters\nmade systematically different long-term risk predictions. Figure 4.1\nsuggests that there is <strong>no meaningful relationship between\nnear-term accuracy and long-term risk forecasts<\/strong>. Across\naccuracy quartiles (from least accurate in quartile 1 to most accurate\nin quartile 4), median risk estimates remain fairly flat for all risk\ncategories, and there is no statistically significant correlation\nbetween accuracy and long-term risk forecasts. The correlation\ncoefficients all cluster around zero, ranging from -0.08 to 0.14, and\nthey are not statistically significant.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the Appendix (<a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=43\" target=\"_blank\" rel=\"noreferrer noopener\">Figure A3.2<\/a>), we examine how long-term risk forecasts relate to near-term accuracy in our sample of public participants. An advantage of using this sample is that most public participants provided a forecast on every question, eliminating issues surrounding self-selection into questions. In particular, the median public participant answered 36 out of the 38 resolved subquestions. For the public participants, unlike the main XPT sample, we observe a statistically significant <em>negative<\/em> correlation (i.e., the most accurate public forecasters predicted lower risks).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Overall, <strong>our findings challenge the hope that near-term accuracy can reliably identify forecasters with more credible long-term risk predictions<\/strong>. 
These results are consistent with the analysis from the original XPT report. The original XPT report found that, for \u201cAI-concerned\u201d (the third of participants with the highest forecast of AI extinction risk by 2100) and \u201cAI-skeptic\u201d (the third of participants with the lowest forecast of AI extinction risk by 2100) groups, their near-term forecasts were in strong agreement (see Table 26 in <a href=\"https:\/\/static1.squarespace.com\/static\/635693acf15a3e2a14a56a4a\/t\/64f0a7838ccbf43b6b5ee40c\/1693493128111\/XPT.pdf#page=88\" target=\"_blank\" rel=\"noreferrer noopener\">Appendix 4<\/a>). The same was also true for superforecasters and domain experts (see Table 28 in <a href=\"https:\/\/static1.squarespace.com\/static\/635693acf15a3e2a14a56a4a\/t\/64f0a7838ccbf43b6b5ee40c\/1693493128111\/XPT.pdf#page=88\" target=\"_blank\" rel=\"noreferrer noopener\">Appendix 4<\/a>).<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1536\" height=\"2048\" src=\"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-04-1.png\" alt=\"\" class=\"wp-image-820\" srcset=\"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-04-1.png 1536w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-04-1-350x467.png 350w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-04-1-700x933.png 700w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-04-1-768x1024.png 768w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-04-1-1152x1536.png 1152w, https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-04-1-1200x1600.png 1200w, 
https:\/\/forecastingresearch.org\/wp-content\/uploads\/2026\/03\/paper_2025-09-02_near-term-xpt-accuracy_fig-04-1-150x200.png 150w\" sizes=\"auto, (max-width: 1536px) 100vw, 1536px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 4.1:<\/strong> XPT participants\u2019 forecasts on\ncatastrophic and extinction risks by 2100. \u201cCatastrophic risk\u201d is\ndefined as the probability of 10% or more of humans dying within a\nfive-year period (except for pathogen risks, which use a 1% threshold).\n\u201cExtinction risk\u201d is defined as the probability of human extinction or a\nreduction of the global population below 5,000. Participants are divided\ninto quartiles based on their near-term accuracy, from least (1) to most\n(4) accurate. Error bars represent 95% bootstrap confidence intervals\nfor the median risk forecast within each quartile. Only forecasters with\nat least 10 resolved near-term forecasts are included. Labels show the\nSpearman rank correlation between individual-level accuracy and\nlong-term risk forecasts as well as the corresponding p-value.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"conclusions\">5. Conclusions<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This report provides the first empirical assessment of forecasting\naccuracy in the Existential Risk Persuasion Tournament (XPT). We\nconclude by discussing the limitations of this work and highlighting\nnext steps.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"limitations\">5.1 Limitations<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Some methodological limitations should be considered when\ninterpreting our results:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Limited statistical power.<\/strong> With only 38 resolved\nsubquestions\u2014further subdivided across different domains\u2014our ability to\ndetect statistically significant differences between forecaster groups\nis constrained. 
Most observed accuracy differences between groups did\nnot reach statistical significance, limiting any conclusions about\nrelative expertise.<\/li>\n\n\n\n<li><strong>Limited implications for long-term risks.<\/strong> This\nanalysis covers only questions resolved by mid-2025. Although we observed,\nfor example, faster-than-expected AI progress, this short timeframe\nprovides a limited basis for updating beliefs on long-term existential\nrisks.<\/li>\n\n\n\n<li><strong>Non-representative expert sample.<\/strong> The XPT relied on a non-representative expert sample with a 34% attrition rate by the end of the tournament. (See Appendix 1 in the original <a href=\"https:\/\/forecastingresearch.org\/research\/existential-risk-persuasion-tournament\" target=\"_blank\" rel=\"noreferrer noopener\"><u>XPT report<\/u><\/a>.) The experts who participated may not accurately represent the broader expert communities in their respective fields.<\/li>\n\n\n\n<li><strong>Post-hoc benchmark definition.<\/strong> Simple\nalgorithmic benchmarks (no change, extrapolation) were developed after\ndata collection rather than defined a priori. This post-hoc approach may\nintroduce hindsight bias and make tournament participants appear less\naccurate than they actually were.<\/li>\n\n\n\n<li><strong>Ambiguous resolutions.<\/strong> Although 38 subquestions have resolved, our confidence in each resolution varies from question to question. While many questions have been definitively resolved (i.e., according to the criteria specified in the original XPT report), others have provisional resolutions that may change in the future. 
For more details on potentially ambiguous resolutions, see <a href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=20\" target=\"_blank\" rel=\"noreferrer noopener\">Appendix 1<\/a>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"looking-forward-to-2030\">5.2 Looking Forward to 2030<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">While the questions resolved by mid-2025 have provided valuable\ninitial insights, we are looking forward to the next wave of questions\nset to resolve in 2030. These questions will offer deeper insights into\npotential existential risks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>AI development and impact.<\/strong> Given the\nfaster-than-expected progress on AI benchmarks, we are interested to\ntrack how this acceleration continues in the coming years. Question #51\nasks whether Nick Bostrom affirms the existence of AGI by 2030, where\nsuperforecasters estimated just a 1% probability compared to domain\nexperts\u2019 9%. Another key milestone is Question #44 (\u201cDate of first\npublicly known advanced AI\u201d). For this question, superforecasters\npredicted 2060 while domain experts predicted 2046. Beyond technical\nadvancements, we will assess broader economic impacts through forecasts\non US computer R&amp;D spending (Question #37), labor force\nparticipation in OECD countries (Question #38), and the percentage of US\nGDP from software and information services (Question #36).<\/li>\n\n\n\n<li><strong>Climate trajectory and technology.<\/strong> Critical\nclimate questions with 2030 resolution dates include global surface\ntemperature change (Question #25), where superforecasters predicted\n1.47\u00b0C warming versus domain experts\u2019 1.4\u00b0C estimate. We will also\nassess progress on climate technologies through questions about green\nhydrogen production costs (Question #30), direct air carbon capture\n(Question #29), and electricity share from solar and wind energy\n(Question #28). 
These resolutions will be particularly telling given the\ncurrent overestimation of climate technology development.<\/li>\n\n\n\n<li><strong>Global risk forecasts.<\/strong> While most existential\nrisk forecasts for 2030 were very low, we will track several important\nrisk predictions that resolve by this date. For public health\nemergencies, both superforecasters and domain experts predicted\napproximately 2 declarations of a public health emergency of\ninternational concern (PHEIC) with at least 10,000 deaths by 2030\n(Question #22). We will also monitor forecasts about nuclear weapon use\ncausing significant casualties (Question #31). As these and other 2030\nquestions resolve, they will also enable us to answer crucial\nmeta-questions: What is the relationship between near- and medium-term\n(five to eight years) forecasting accuracy? Do forecasters with high\nmedium-term accuracy make systematically different predictions on\nlong-term existential risks?<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"next-steps-1\">5.3 Next Steps<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Building on the insights from this initial analysis, we plan to take\nthe following next steps:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Develop specialized AI insights.<\/strong> Given the\nfaster-than-expected progress on AI benchmarks, researchers at the\nForecasting Research Institute are in the process of launching multiple\ndedicated projects to better understand the likely future trajectories\nand impacts of AI. These projects include establishing a longitudinal\npanel of AI experts and conducting a survey of economists on AI\u2019s\npotential economic and labor market effects.<\/li>\n\n\n\n<li><strong>Track future resolutions.<\/strong> We will continue\ntracking the resolution of questions posed in the XPT. 
We may also\nre-engage the original XPT participants and gather data on how their\nforecasts have changed in light of recent AI advances and other\ndevelopments.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Notes<\/h2>\n\n\n<ol class=\"wp-block-footnotes\"><li id=\"b74fb551-a400-43f3-87ca-68d3140be3e5\">Karger, Ezra, Josh Rosenberg, Zach Jacobs, et al. &#8220;Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament.&#8221; FRI Working Paper #1. Forecasting Research Institute, 2023. <a href=\"https:\/\/forecastingresearch.org\/research\/existential-risk-persuasion-tournament\" id=\"876\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/forecastingresearch.org\/research\/existential-risk-persuasion-tournament<\/a>. <a href=\"#b74fb551-a400-43f3-87ca-68d3140be3e5-link\" aria-label=\"Jump to footnote reference 1\">\u21a9\ufe0e<\/a><\/li><li id=\"432ddbe7-e811-4731-9a04-d548b6479c81\">Karger, Ezra, Josh Rosenberg, Zach Jacobs, et al. &#8220;Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament.&#8221; FRI Working Paper #1. Forecasting Research Institute, 2023. <a href=\"https:\/\/forecastingresearch.org\/research\/existential-risk-persuasion-tournament\" id=\"https:\/\/forecastingresearch.org\/research\/xpt\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/forecastingresearch.org\/research\/existential-risk-persuasion-tournament<\/a>. <a href=\"#432ddbe7-e811-4731-9a04-d548b6479c81-link\" aria-label=\"Jump to footnote reference 2\">\u21a9\ufe0e<\/a><\/li><li id=\"29474279-fe20-4eb4-bedd-f607b5805c27\">Leech, Gavin, and Misha Yagudin. &#8220;Can Policymakers Trust Forecasters?&#8221; Institute for Progress, March 7, 2023. <a href=\"https:\/\/ifp.org\/can-policymakers-trust-forecasters\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/ifp.org\/can-policymakers-trust-forecasters\/<\/a>. 
<a href=\"#29474279-fe20-4eb4-bedd-f607b5805c27-link\" aria-label=\"Jump to footnote reference 3\">\u21a9\ufe0e<\/a><\/li><li id=\"0dc7ab8f-55a4-41f9-a26b-b1f679ef9aba\">Clemen, Robert T. &#8220;Combining Forecasts: A Review and Annotated Bibliography.&#8221; <em>International Journal of Forecasting<\/em> 5, no. 4 (1989): 559\u2013583. <a href=\"https:\/\/doi.org\/10.1016\/0169-2070(89)90012-5\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/doi.org\/10.1016\/0169-2070(89)90012-5<\/a>. <a href=\"#0dc7ab8f-55a4-41f9-a26b-b1f679ef9aba-link\" aria-label=\"Jump to footnote reference 4\">\u21a9\ufe0e<\/a><\/li><li id=\"3c07f1eb-5415-496b-a0b6-e6e41bdab770\">Karger, Ezra, Joshua Monrad, Barbara Mellers, and Philip Tetlock. &#8220;Reciprocal Scoring: A Method for Forecasting Unanswerable Questions.&#8221; SSRN Working Paper, October 31, 2021. <a href=\"https:\/\/doi.org\/10.2139\/ssrn.3954498\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/doi.org\/10.2139\/ssrn.3954498<\/a>. <a href=\"#3c07f1eb-5415-496b-a0b6-e6e41bdab770-link\" aria-label=\"Jump to footnote reference 5\">\u21a9\ufe0e<\/a><\/li><li id=\"1cc0c597-1d54-468f-a1d6-ab72167f5257\">Google DeepMind. &#8220;Advanced Version of Gemini with Deep Think Officially Achieves Gold-Medal Standard at the International Mathematical Olympiad.&#8221; Google DeepMind Blog, July 2025. 
<a href=\"https:\/\/deepmind.google\/discover\/blog\/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad\/#:~:text=become%20an%20aspirational%20challenge%20for%20AI%20systems%20as%20a%20test%20of%20their%20advanced%20mathematical%20problem%2Dsolving%20and%20reasoning%20capabilities.\" id=\"https:\/\/deepmind.google\/discover\/blog\/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad\/#:~:text=become%20an%20aspirational%20challenge%20for%20AI%20systems%20as%20a%20test%20of%20their%20advanced%20mathematical%20problem%2Dsolving%20and%20reasoning%20capabilities.\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/deepmind.google\/blog\/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad\/<\/a>. <a href=\"#1cc0c597-1d54-468f-a1d6-ab72167f5257-link\" aria-label=\"Jump to footnote reference 6\">\u21a9\ufe0e<\/a><\/li><li id=\"eee4537a-752a-44af-b634-1fc8d4b8a6c2\">Google DeepMind. &#8220;Advanced Version of Gemini with Deep Think Officially Achieves Gold-Medal Standard at the International Mathematical Olympiad.&#8221; Google DeepMind Blog, July 21, 2025. <a href=\"https:\/\/deepmind.google\/discover\/blog\/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/deepmind.google\/discover\/blog\/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad\/<\/a>; OpenAI. &#8220;We achieved gold medal-level performance \ud83e\udd47on the 2025 International Mathematical Olympiad with a general-purpose reasoning LLM!&#8221; X (formerly Twitter), July 19, 2025. 
<a href=\"https:\/\/x.com\/OpenAI\/status\/1946594928945148246\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/x.com\/OpenAI\/status\/1946594928945148246<\/a>. <a href=\"#eee4537a-752a-44af-b634-1fc8d4b8a6c2-link\" aria-label=\"Jump to footnote reference 7\">\u21a9\ufe0e<\/a><\/li><li id=\"68ce7432-717c-40c6-948f-e75b48d752e3\">Cotra, Ajeya, and Kelsey Piper. &#8220;Language Models Surprised Us.&#8221; <em>Planned Obsolescence<\/em> (blog), August 2023. <a href=\"https:\/\/www.planned-obsolescence.org\/language-models-surprised-us\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.planned-obsolescence.org\/language-models-surprised-us\/<\/a>. <a href=\"#68ce7432-717c-40c6-948f-e75b48d752e3-link\" aria-label=\"Jump to footnote reference 8\">\u21a9\ufe0e<\/a><\/li><\/ol>\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"btn orange\" href=\"https:\/\/forecastingresearch.org\/pdf\/near-term-xpt-accuracy.pdf#page=20\" target=\"_blank\" rel=\"noreferrer noopener\">The&nbsp;Appendix is available in the&nbsp;full PDF report <svg width=\"7\" height=\"9\" viewBox=\"0 0 7 9\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n  <path d=\"M0.000156283 8.60806L4.22416 4.33606V4.24006L0.000156283 6.10352e-05H1.80816L6.06416 4.28806L1.80816 8.60806H0.000156283Z\" fill=\"#102B23\"\/>\n<\/svg>\n<svg width=\"8\" height=\"10\" viewBox=\"0 0 8 10\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n  <path d=\"M0.601719 8.85794L4.82572 4.58594V4.48994L0.601719 0.249939H2.40972L6.66572 4.53794L2.40972 8.85794H0.601719Z\" fill=\"#102B23\"\/>\n<\/svg><\/a><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"This report assesses the accuracy of short-term forecasts made during the Existential Risk Persuasion Tournament (XPT)\u2014a 2022 study that convened 169 superforecasters and domain experts to make predictions on long-term risks including AI, climate change, 
nuclear war, and pandemics.","protected":false},"featured_media":869,"template":"","meta":{"footnotes":"[{\"content\":\"Karger, Ezra, Josh Rosenberg, Zach Jacobs, et al. \\\"Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament.\\\" FRI Working Paper #1. Forecasting Research Institute, 2023. <a href=\\\"https:\/\/forecastingresearch.org\/research\/existential-risk-persuasion-tournament\\\" type=\\\"research\\\" id=\\\"876\\\" target=\\\"_blank\\\" rel=\\\"noreferrer noopener\\\">https:\/\/forecastingresearch.org\/research\/existential-risk-persuasion-tournament<\/a>.\",\"id\":\"b74fb551-a400-43f3-87ca-68d3140be3e5\"},{\"content\":\"Karger, Ezra, Josh Rosenberg, Zach Jacobs, et al. \\\"Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament.\\\" FRI Working Paper #1. Forecasting Research Institute, 2023. <a href=\\\"https:\/\/forecastingresearch.org\/research\/existential-risk-persuasion-tournament\\\" id=\\\"https:\/\/forecastingresearch.org\/research\/xpt\\\" target=\\\"_blank\\\" rel=\\\"noreferrer noopener\\\">https:\/\/forecastingresearch.org\/research\/existential-risk-persuasion-tournament<\/a>.\",\"id\":\"432ddbe7-e811-4731-9a04-d548b6479c81\"},{\"id\":\"29474279-fe20-4eb4-bedd-f607b5805c27\",\"content\":\"Leech, Gavin, and Misha Yagudin. \\\"Can Policymakers Trust Forecasters?\\\" Institute for Progress, March 7, 2023. <a href=\\\"https:\/\/ifp.org\/can-policymakers-trust-forecasters\/\\\" target=\\\"_blank\\\" rel=\\\"noreferrer noopener\\\">https:\/\/ifp.org\/can-policymakers-trust-forecasters\/<\/a>.\"},{\"id\":\"0dc7ab8f-55a4-41f9-a26b-b1f679ef9aba\",\"content\":\"Clemen, Robert T. \\\"Combining Forecasts: A Review and Annotated Bibliography.\\\" <em>International Journal of Forecasting<\/em> 5, no. 4 (1989): 559\u2013583. 
<a href=\\\"https:\/\/doi.org\/10.1016\/0169-2070(89)90012-5\\\" target=\\\"_blank\\\" rel=\\\"noreferrer noopener\\\">https:\/\/doi.org\/10.1016\/0169-2070(89)90012-5<\/a>.\"},{\"id\":\"3c07f1eb-5415-496b-a0b6-e6e41bdab770\",\"content\":\"Karger, Ezra, Joshua Monrad, Barbara Mellers, and Philip Tetlock. \\\"Reciprocal Scoring: A Method for Forecasting Unanswerable Questions.\\\" SSRN Working Paper, October 31, 2021. <a href=\\\"https:\/\/doi.org\/10.2139\/ssrn.3954498\\\" target=\\\"_blank\\\" rel=\\\"noreferrer noopener\\\">https:\/\/doi.org\/10.2139\/ssrn.3954498<\/a>.\"},{\"id\":\"1cc0c597-1d54-468f-a1d6-ab72167f5257\",\"content\":\"Google DeepMind. \\\"Advanced Version of Gemini with Deep Think Officially Achieves Gold-Medal Standard at the International Mathematical Olympiad.\\\" Google DeepMind Blog, July 2025. <a href=\\\"https:\/\/deepmind.google\/discover\/blog\/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad\/#:~:text=become%20an%20aspirational%20challenge%20for%20AI%20systems%20as%20a%20test%20of%20their%20advanced%20mathematical%20problem%2Dsolving%20and%20reasoning%20capabilities.\\\" id=\\\"https:\/\/deepmind.google\/discover\/blog\/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad\/#:~:text=become%20an%20aspirational%20challenge%20for%20AI%20systems%20as%20a%20test%20of%20their%20advanced%20mathematical%20problem%2Dsolving%20and%20reasoning%20capabilities.\\\" target=\\\"_blank\\\" rel=\\\"noreferrer noopener\\\">https:\/\/deepmind.google\/blog\/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad\/<\/a>.\"},{\"id\":\"eee4537a-752a-44af-b634-1fc8d4b8a6c2\",\"content\":\"Google DeepMind. 
\\\"Advanced Version of Gemini with Deep Think Officially Achieves Gold-Medal Standard at the International Mathematical Olympiad.\\\" Google DeepMind Blog, July 21, 2025. <a href=\\\"https:\/\/deepmind.google\/discover\/blog\/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad\/\\\" target=\\\"_blank\\\" rel=\\\"noreferrer noopener\\\">https:\/\/deepmind.google\/discover\/blog\/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad\/<\/a>; OpenAI. \\\"We achieved gold medal-level performance \ud83e\udd47on the 2025 International Mathematical Olympiad with a general-purpose reasoning LLM!\\\" X (formerly Twitter), July 19, 2025. <a href=\\\"https:\/\/x.com\/OpenAI\/status\/1946594928945148246\\\" target=\\\"_blank\\\" rel=\\\"noreferrer noopener\\\">https:\/\/x.com\/OpenAI\/status\/1946594928945148246<\/a>.\"},{\"id\":\"68ce7432-717c-40c6-948f-e75b48d752e3\",\"content\":\"Cotra, Ajeya, and Kelsey Piper. \\\"Language Models Surprised Us.\\\" <em>Planned Obsolescence<\/em> (blog), August 2023. 
<a href=\\\"https:\/\/www.planned-obsolescence.org\/language-models-surprised-us\/\\\" target=\\\"_blank\\\" rel=\\\"noreferrer noopener\\\">https:\/\/www.planned-obsolescence.org\/language-models-surprised-us\/<\/a>.\"}]"},"research_type":[4],"class_list":["post-812","research","type-research","status-publish","has-post-thumbnail","hentry","research_type-working-paper"],"acf":[],"yoast_head":"<title>Assessing Near-Term Accuracy in the Existential Risk Persuasion Tournament &#8211; Forecasting Research Institute<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Assessing Near-Term Accuracy in the Existential Risk Persuasion Tournament &#8211; Forecasting Research Institute\" \/>\n<meta property=\"og:description\" content=\"This report assesses the accuracy of short-term forecasts made during the Existential Risk Persuasion Tournament (XPT)\u2014a 2022 study that convened 169 superforecasters and domain experts to make predictions on long-term risks including AI, climate change, nuclear war, and pandemics.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy\" \/>\n<meta property=\"og:site_name\" content=\"Forecasting Research Institute\" \/>\n<meta property=\"article:modified_time\" content=\"2026-05-04T17:42:48+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2025\/09\/FRI-illustration-library-13.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1376\" \/>\n\t<meta property=\"og:image:height\" content=\"864\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"twitter:card\" 
content=\"summary_large_image\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/forecastingresearch.org\\\/research\\\/near-term-xpt-accuracy\",\"url\":\"https:\\\/\\\/forecastingresearch.org\\\/research\\\/near-term-xpt-accuracy\",\"name\":\"Assessing Near-Term Accuracy in the Existential Risk Persuasion Tournament &#8211; Forecasting Research Institute\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/forecastingresearch.org\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/forecastingresearch.org\\\/research\\\/near-term-xpt-accuracy#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/forecastingresearch.org\\\/research\\\/near-term-xpt-accuracy#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/forecastingresearch.org\\\/wp-content\\\/uploads\\\/2025\\\/09\\\/FRI-illustration-library-13.jpg\",\"datePublished\":\"2025-09-02T12:00:00+00:00\",\"dateModified\":\"2026-05-04T17:42:48+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/forecastingresearch.org\\\/research\\\/near-term-xpt-accuracy#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/forecastingresearch.org\\\/research\\\/near-term-xpt-accuracy\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/forecastingresearch.org\\\/research\\\/near-term-xpt-accuracy#primaryimage\",\"url\":\"https:\\\/\\\/forecastingresearch.org\\\/wp-content\\\/uploads\\\/2025\\\/09\\\/FRI-illustration-library-13.jpg\",\"contentUrl\":\"https:\\\/\\\/forecastingresearch.org\\\/wp-content\\\/uploads\\\/2025\\\/09\\\/FRI-illustration-library-13.jpg\",\"width\":1376,\"height\":864},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/forecastingresearch.org\\\/research\\\/near-term-xpt-accuracy#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/forecastingresea
rch.org\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Assessing Near-Term Accuracy in the Existential Risk Persuasion Tournament\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/forecastingresearch.org\\\/#website\",\"url\":\"https:\\\/\\\/forecastingresearch.org\\\/\",\"name\":\"Forecasting Research Institute\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/forecastingresearch.org\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"}]}<\/script>","yoast_head_json":{"title":"Assessing Near-Term Accuracy in the Existential Risk Persuasion Tournament &#8211; Forecasting Research Institute","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy","og_locale":"en_US","og_type":"article","og_title":"Assessing Near-Term Accuracy in the Existential Risk Persuasion Tournament &#8211; Forecasting Research Institute","og_description":"This report assesses the accuracy of short-term forecasts made during the Existential Risk Persuasion Tournament (XPT)\u2014a 2022 study that convened 169 superforecasters and domain experts to make predictions on long-term risks including AI, climate change, nuclear war, and pandemics.","og_url":"https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy","og_site_name":"Forecasting Research 
Institute","article_modified_time":"2026-05-04T17:42:48+00:00","og_image":[{"width":1376,"height":864,"url":"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2025\/09\/FRI-illustration-library-13.jpg","type":"image\/jpeg"}],"twitter_card":"summary_large_image","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy","url":"https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy","name":"Assessing Near-Term Accuracy in the Existential Risk Persuasion Tournament &#8211; Forecasting Research Institute","isPartOf":{"@id":"https:\/\/forecastingresearch.org\/#website"},"primaryImageOfPage":{"@id":"https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy#primaryimage"},"image":{"@id":"https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy#primaryimage"},"thumbnailUrl":"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2025\/09\/FRI-illustration-library-13.jpg","datePublished":"2025-09-02T12:00:00+00:00","dateModified":"2026-05-04T17:42:48+00:00","breadcrumb":{"@id":"https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy#primaryimage","url":"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2025\/09\/FRI-illustration-library-13.jpg","contentUrl":"https:\/\/forecastingresearch.org\/wp-content\/uploads\/2025\/09\/FRI-illustration-library-13.jpg","width":1376,"height":864},{"@type":"BreadcrumbList","@id":"https:\/\/forecastingresearch.org\/research\/near-term-xpt-accuracy#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/forecastingresearch.org\/"},{"@type":"ListItem","position":2,"name":"Assessing 
Near-Term Accuracy in the Existential Risk Persuasion Tournament"}]},{"@type":"WebSite","@id":"https:\/\/forecastingresearch.org\/#website","url":"https:\/\/forecastingresearch.org\/","name":"Forecasting Research Institute","description":"","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/forecastingresearch.org\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"}]}},"_links":{"self":[{"href":"https:\/\/forecastingresearch.org\/api\/wp\/v2\/research\/812","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/forecastingresearch.org\/api\/wp\/v2\/research"}],"about":[{"href":"https:\/\/forecastingresearch.org\/api\/wp\/v2\/types\/research"}],"version-history":[{"count":69,"href":"https:\/\/forecastingresearch.org\/api\/wp\/v2\/research\/812\/revisions"}],"predecessor-version":[{"id":2171,"href":"https:\/\/forecastingresearch.org\/api\/wp\/v2\/research\/812\/revisions\/2171"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/forecastingresearch.org\/api\/wp\/v2\/media\/869"}],"wp:attachment":[{"href":"https:\/\/forecastingresearch.org\/api\/wp\/v2\/media?parent=812"}],"wp:term":[{"taxonomy":"research_type","embeddable":true,"href":"https:\/\/forecastingresearch.org\/api\/wp\/v2\/research_type?post=812"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}