Published: Nov 10, 2025

Working Paper #5

The Longitudinal Expert AI Panel: Understanding Expert Views on AI Capabilities, Adoption, and Impact

The Longitudinal Expert AI Panel (LEAP) is a three-year project tracking the views of leading computer scientists, industry professionals, policy researchers, and economists on the trajectory of artificial intelligence. In this paper, we introduce LEAP and summarize findings from the first three waves.

Connacher Murphy¹, Josh Rosenberg¹, Jordan Canedy¹, Zach Jacobs¹, Nadja Flechner¹, Rhiannon Britt¹, Alexa Pan¹, Charlie Rogers-Smith¹, Dan Mayland¹, Cathy Buffington¹, Simas Kučinskas¹, Amanda Coston², Hannah Kerner³, Emma Pierson², Reihaneh Rabbany⁴, Matthew Salganik⁵, Robert Seamans⁶, Yu Su⁷, Florian Tramèr⁸, Tatsunori Hashimoto⁹, Arvind Narayanan⁵, Philip E. Tetlock¹⁰, Ezra Karger*¹¹

1 Forecasting Research Institute
2 University of California, Berkeley
3 Arizona State University
4 McGill University
5 Princeton University
6 New York University
7 The Ohio State University
8 ETH Zürich
9 Stanford University
10 University of Pennsylvania
11 Federal Reserve Bank of Chicago

*Corresponding author: Ezra Karger, ezra.karger@chi.frb.org

Abstract

Public debates about AI revolve around bold claims and counterclaims, but rarely culminate in precise, falsifiable forecasts of AI capabilities, adoption, and impact. The Longitudinal Expert AI Panel (LEAP) improves this signal-to-noise ratio by gathering such forecasts, monthly, from a carefully chosen panel of 339 experts spanning industry, academia, and policy. The median expert foresees that by 2030 AI will be responsible for 7% of U.S. electricity usage, assist in 18% of work hours in the U.S., and provide daily companionship for 15% of adults—roughly 7x, 4x, and 2.5x current levels, respectively. The median expert also gives a 60% chance that AI systems solve or substantially assist in solving a Millennium Prize Problem by 2040, which would be a major achievement in mathematics. There is substantial within-individual uncertainty and between-individual disagreement among experts, each accounting for roughly half of the total variation in expert forecasts across all questions. Nevertheless, the vast majority of LEAP forecasts fall far short of the warnings from AI lab leaders about imminent artificial superintelligence. We analyze 1.7 million words of participant rationales to provide a complementary qualitative overview of the key mechanisms underpinning fast and slow forecasts of AI progress.

Acknowledgments

We thank Victoria Schmidt, Morgane Bascle, and Jonah Black for research assistance.

Disclaimers

The views expressed in this paper do not necessarily represent the views of the Federal Reserve Bank of Chicago or the Federal Reserve System. This research was funded with support from Open Philanthropy and Craig Falls.

Executive Summary

Despite the clashing narratives around AI, there is little work systematically mapping the full spectrum of views among experts (computer scientists, economists, technologists) and the general public. What do these groups believe about AI’s future capabilities, adoption, and effects? And why do they believe what they do? The Longitudinal Expert AI Panel (LEAP) fills this gap with monthly surveys tracking the quantified beliefs of experts, historically accurate forecasters (“superforecasters”),¹ and the general public.

Since we launched LEAP in June 2025, we have completed three survey waves focused on (1) high-level predictions about AI progress; (2) the application of AI to scientific discovery; and (3) the adoption and social impact of AI. Experts provided thoughtful engagement, spending a median of 44 minutes per survey and writing over 460,000 words of rationales explaining their beliefs, across all surveys and questions.² In this paper, we share results from these first three waves along with a detailed methodological description of the project as a whole.

Across the first three waves of LEAP, five patterns stand out:

1. Experts expect sizable near‑term societal effects from AI by 2040.
2. Substantial disagreement and uncertainty underlie expert forecasts.
3. The median expert expects much slower progress than prominent leaders of frontier AI labs.
4. Experts anticipate faster progress than the public on most outcomes.
5. Experts and superforecasters mostly agree. Where they disagree, experts tend to expect more AI progress. Also, there are no systematic differences between the predictions of different types of experts: computer scientists, economists, industry professionals, and policy professionals.

First, experts expect sizable near‑term societal effects: by 2030, the median expert predicts that 18%³ of all U.S. work hours will be assisted by generative AI, up from 4.1% in November 2024 (Bick et al. 2025);⁴ AI training and deployment will consume 7% of U.S. electricity;⁵,⁶ autonomous vehicles will provide 20% of U.S. ride-hailing trips; and annual global private investment in AI (as reported by Our World in Data)⁷ will reach $260 billion, up from the $130 billion reported total for this series in 2024. The median expert predicts that by 2030, 15% of adults will self-report using AI for companionship, emotional support, social interaction, or simulated relationships at least once daily, up from 6% today. By 2040, the median forecast doubles to 30% of adults. Experts expect substantial improvements in the ability of AI systems to complete difficult math questions: the median expert believes that AI systems will achieve performance of 75% on the FrontierMath benchmark⁸ by 2030, and 23% of experts expect saturation of the benchmark.⁹ The median expert also believes it is more likely than not (a 60% chance) that AI solves or substantially assists in solving a Millennium Prize Problem by 2040.¹⁰ We summarize these expert forecasts in the figure below.

**Figure:** Median expert forecasts for various questions. We display the 10th, 25th, 50th, 75th, and 90th percentiles of the median forecasts given by experts at each date. For example, if 25% of experts give a median forecast of $10 billion or less, the 25th percentile series in the graph will lie at $10 billion; these series are *not* confidence intervals. Where available, we include a historical baseline in light blue.
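To make the construction of these percentile series concrete, the following is a minimal sketch of the calculation for a single question and date. The function name `percentile_bands`, the five example forecasts, and the unweighted treatment are all assumptions made for illustration; the paper reports weighted summary statistics by default.

```python
from statistics import quantiles

def percentile_bands(median_forecasts, levels=(10, 25, 50, 75, 90)):
    """Percentiles, taken across experts, of each expert's median forecast."""
    # quantiles(n=100) returns 99 cut points; index p - 1 holds the p-th percentile.
    cuts = quantiles(median_forecasts, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in levels}

# Hypothetical median forecasts (in $ billions) from five experts for one date.
example = [5, 10, 15, 20, 25]
bands = percentile_bands(example)
print(bands)
```

Each band describes where experts' central estimates fall relative to one another, which is why the series measure disagreement across experts rather than any single expert's confidence interval.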

Even experts with lower forecasts of AI progress, capabilities, and adoption still expect substantial impacts relative to current levels. Recall that, by 2030, the median expert predicts that 18% of all U.S. work hours will be assisted by generative AI, up from 4.1% in November 2024 (Bick et al. 2025);¹¹ however, respondents were shown a historical baseline value of 2%, based on an earlier version of the cited paper. We also asked forecasters for their 25th percentile forecast,¹² and the median expert's 25th percentile forecast on this question was 9%, still more than a fourfold increase from the historical baseline level provided to forecasters at the time of the survey. In other words, the median expert gives a 75% chance that at least 9% of work hours will be assisted by generative AI in 2030.

Second, substantial disagreement and uncertainty underlie expert forecasts. The top quartile of experts estimate that the majority of revenue from newly approved U.S. drugs in 2040 will be from AI-discovered drugs, but the bottom quartile of experts predicts that less than 10% of new drug sales will be from AI-discovered drugs.¹³ On the question of whether AI will independently solve or substantially assist in solving a Millennium Prize Problem by 2040, a quarter of experts think it is quite likely (>81% chance), while another quarter of experts believe it is unlikely (<30% chance). On many questions, individual experts also express substantial uncertainty in their own forecasts. We construct a composite measure of the total variation in expert beliefs in the figure below (for two questions), taking into account both the within-forecaster uncertainty and the between-forecaster disagreement. We discuss the methodology in greater detail in Uncertainty and Disagreement, where we describe the underlying assumptions that we rely on to construct these figures. These “pooled distributions” represent the full distribution of outcomes predicted by the average expert. We find that, across all forecasting questions where we allow forecasters to express their uncertainty, within-forecaster uncertainty explains 49% of the total variation in forecasts, while 51% of variation is explained by between-forecaster disagreement.

**Figure:** Pooled distributions for expert forecasts on *Work Hours Assisted by Generative AI* (top panels) and *FrontierMath scores* (bottom panels). These pooled distributions combine within-expert uncertainty and between-expert disagreement. Densities are normalized to the same peak for comparability. See Uncertainty and Disagreement for details.
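One way to build intuition for the split between within-forecaster uncertainty and between-forecaster disagreement is the law of total variance applied to an equal-weight mixture of individual forecast distributions. The sketch below assumes each expert's forecast can be summarized by a mean and a variance; the function name `decompose_variation` and the inputs are hypothetical, and the procedure the paper actually uses is the one described in Uncertainty and Disagreement, which may differ.

```python
from statistics import fmean, pvariance

def decompose_variation(means, variances):
    """Split the variance of an equal-weight pooled distribution into two parts.

    Law of total variance: Var(pooled) = E[Var within expert] + Var(expert means).
    """
    within = fmean(variances)   # average of each expert's own forecast variance
    between = pvariance(means)  # spread of the experts' central estimates
    total = within + between
    return {
        "within_share": within / total,
        "between_share": between / total,
        "total_variance": total,
    }

# Hypothetical summary of three experts' forecast distributions.
expert_means = [10.0, 20.0, 30.0]
expert_variances = [50.0, 60.0, 70.0]
decomposition = decompose_variation(expert_means, expert_variances)
print(decomposition)
```

Under this reading, a within share near one half means that a typical expert's own uncertainty contributes about as much to the pooled spread as the differences between experts' central estimates.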

Third, the median expert expects much slower progress than prominent leaders of frontier AI labs. These lab leaders predict human-level or superhuman AI by 2026–2029, while most of our expert panel rejects these shorter timelines. We ask respondents to forecast whether the average LEAP panelist will say, in 2030, that we are closest to a “slow-,” “moderate-,” or “rapid-” AI progress scenario. We define these scenarios in detail in the appendix (see Appendix E.I. 4. General AI Progress). The average expert thinks that 23% of LEAP panelists in 2030 will say the world most closely mirrors a “rapid” AI progress scenario that, of our three scenarios, most closely matches some of the lab leaders’ claims. On the other hand, the average expert believes that 28% of panelists will indicate that progress plateaued at close-to-current levels, with fewer improvements in capabilities relative to today (a “slow” AI progress scenario).¹⁴

Fourth, experts anticipate faster progress than the public¹⁵ on most outcomes (e.g., the accuracy AI systems will achieve on the FrontierMath benchmark by 2030: median expert 75% vs. median member of the public 50%, p < 0.001; generative AI work assistance by 2030: median expert 18% vs. median member of the public 10%, p < 0.001). Experts predict a 32% chance that AI will be at least as impactful as a “technology of the millennium”—like the printing press or the Industrial Revolution—whereas the public gives this only a 22% chance. Across forecasts that exhibit a clear relationship with AI capabilities and impacts, a randomly selected expert is 16% more likely than a randomly selected member of the public to predict faster progress than would be expected by random chance.

We summarize some of the differences in aggregate forecasts in the figure below.

**Figure:** Differences between the expert and public median 50th percentile forecasts for several questions where the unit is a percentage. Points indicate the median of each group’s 50th percentile forecasts. We apply transformations to create valenced forecasts, where values closer to the left indicate slower progress and values to the right indicate faster progress.

Fifth, highly accurate forecasters (superforecasters) and experts are largely aligned, with superforecasters expecting slightly less progress overall. Differences across expert subgroups—those specializing in computer science or economics, those working in industry, and those working in the policy and think tank space—are small and rarely statistically significant. We summarize some of the differences in aggregate forecasts in the two figures below. The first compares experts and superforecasters, while the second groups by category of expertise.

**Figure:** Differences between the expert and superforecaster median 50th percentile forecasts for several questions where the unit is a percentage. Points indicate the median of each group’s 50th percentile forecasts. We apply transformations to create valenced forecasts, where values closer to the left indicate slower progress and values to the right indicate faster progress.

**Figure:** Expert category median 50th percentile forecasts for several questions where the unit is a percentage. Points indicate the median of 50th percentile forecasts for each category. We apply transformations to create valenced forecasts, where values closer to the left indicate slower progress and values to the right indicate faster progress.

Introduction

As artificial intelligence (AI) reshapes culture,¹⁶ science,¹⁷ labor markets,¹⁸ and the aggregate economy,¹⁹ experts debate its value, risks, and how quickly it will integrate into everyday life. Leaders of AI companies forecast transformative AI systems that cure all diseases,²⁰ replace whole classes of jobs,²¹ and supercharge GDP growth by the 2030s.²² Skeptics see small gains at best, with AI’s impact amounting to little more than a modest boost in productivity—if anything at all (Acemoglu 2024).

Despite these clashing narratives, there is little work systematically mapping the full spectrum of views among computer scientists, economists, technologists in the private sector, and the public. What do these groups believe about AI’s future capabilities, adoption, and effects? Why do they believe what they do, and what mechanisms support those beliefs? Prior surveys offer opinions, but rarely comprehensively quantify those opinions, hampering policy guidance and evaluation. We fill this gap with the Longitudinal Expert AI Panel (LEAP), a monthly survey tracking the probabilistic forecasts of experts, historically accurate forecasters (“superforecasters”),²³ and the public.

Policymakers and stakeholders routinely consult experts to ground their decision-making in coherent, informed beliefs, especially when faced with new technologies and high levels of uncertainty. LEAP is designed to facilitate this process by cutting through the anecdote and speculation that currently dominates AI discourse. We specifically target prominent experts whom policymakers would be most inclined to consult regarding the progression of AI capabilities and its technological impact. LEAP expert invitees (and participants) include top-cited AI and ML scientists, prominent economists, key technical staff at frontier AI companies, and influential policy experts from a broad range of NGOs. Our goal is to provide a clear summary of expert views, analyzed both across and within these key groups. To address concerns that our respondents might be a biased or selective sample, we reweight our data to account for different response propensities within our expert sampling frame, providing a representative summary of our expert frame’s beliefs.²⁴ The Panel Construction section contains more details. This reweighting leaves our headline conclusions unchanged (see Sensitivity of Results to Reweighting).

We evaluate LEAP respondents’ forecasts against detailed, pre-specified resolution criteria—avoiding the problem of ambiguous resolution criteria that introduce noise in forecasts by permitting both different interpretations of a question by forecasters, and disagreement about how a question is later resolved. In contrast, clear resolution criteria put the debate in common terms and enable policymakers to understand the range of opinion. LEAP’s resolution criteria are also used to measure accuracy—and tie respondents’ compensation to this accuracy—to encourage participants to provide high-quality forecasts. Lastly, we can use this accuracy measure to identify the most accurate forecasters and explore how their views compare to other participants.²⁵

LEAP captures not just forecasts, but participants’ rationales: 1.7 million words of detailed explanations across the first three survey waves. Over 600,000 of these words come from our expert and superforecaster samples, with the remainder coming from public respondents. We use this data to identify key sources of disagreement and to analyze why participants express significant uncertainty about the future effects of AI.

Since we launched LEAP in June 2025, we have completed three survey waves focused on (1) high-level predictions about AI progress; (2) the application of AI to scientific discovery; and (3) widespread adoption and societal impact. In this paper, we share results from these first three waves along with a detailed methodological description of the project as a whole.

In what follows, we first present aggregate forecasts from experts. Aggregation exploits the “wisdom of the crowd” phenomenon, in which the accuracy of aggregated predictions exceeds the accuracy of a large majority of their constituent parts. This practice is supported by work in numerous fields, such as prediction markets (Bassamboo, Cui, and Moreno 2015), political forecasting (Sjöberg 2009, Murr 2011), and more (Hueffer et al. 2023, Adjodah et al. 2021). Aggregated predictions of many individuals tend to be remarkably accurate (Da and Huang 2020; Lichtendahl Jr, Grushka-Cockayne, and Pfeifer 2013; Surowiecki 2004).²⁶ While aggregate forecasts are helpful, it is important to understand the extent of disagreement between experts and where individuals have more and less uncertainty in their forecasts. We discuss our approach to answering these questions below in the Uncertainty and Disagreement section.

We will continue to field new LEAP surveys every month for at least three years. Beginning in year two, we will revisit questions from previous waves to assess how expert views are changing, both in aggregate and within individuals. When we report results, participants can compare their own forecasts to those of other participants. As forecasts resolve, participants will be able to see how the accuracy of their past forecasts compares to that of other participants and adjust their future forecasts accordingly. We will also explore the relationship between short-term accuracy and long-run forecasts about AI capabilities, effects, and diffusion.

In this report, we discuss LEAP in the context of prior AI forecasting surveys (Connection to Prior Work), detail the procedures we use to build our panel (Panel Construction), describe the surveys and forecasting questions (Monthly Surveys and Forecasting Questions), share preliminary results (Results), outline plans for future work (Next Steps), and conclude (Conclusion).

Connection to Prior Work

While prior work on AI progress has taken a variety of forms, we focus here on studies that, like LEAP, use surveys of the AI expert community and the general public to measure the range of opinions and expectations on the progress of AI, its adoption, and the associated societal impacts. This type of work faces five challenges. First, it is difficult to specify a comprehensive sampling frame, and nonresponse bias is challenging to avoid. Both of these hurdles can distort aggregate estimates. Second, views on AI could change over time. Third, ambiguous resolution criteria introduce noise in forecasts and limit our ability to identify the most accurate forecasters, whom policymakers may want to rely on in the future. Fourth, while we might care about long-term outcomes, it is challenging to assess the credibility of long-term forecasts on short-run decision timelines. Fifth, participants often devote limited bandwidth to surveys, and, even when thoughtfully engaged, might not report their true beliefs on forecasting questions, instead using the survey to express their preferences. While no survey can fully overcome these limitations, we designed LEAP to thoughtfully engage with and mitigate each of these five challenges.

Challenge 1. Narrow frames and selective response can distort estimates of AI progress

A survey can fail to represent aggregate views in two ways. First, the sampling frame might fail to capture the population of interest.²⁷ Second, survey respondents might systematically differ from nonrespondents. Both problems affect AI forecasting surveys.

Most surveys of AI experts that predate LEAP target participants in top AI/ML or computer science conference proceedings (Muller and Bostrom 2014; Walsh 2017; Grace et al. 2018; Zhang et al. 2021; Stein-Perlman 2022; Zhang et al. 2022; Grace et al. 2024).²⁸ LEAP seeks to broaden the range of opinions as well as the expertise of participants and thus defines four target populations of experts: computer scientists researching topics in AI, economists studying the economic impacts of AI, AI industry professionals, and AI policy professionals. LEAP is the first survey to comprehensively measure the beliefs of each of these groups on AI-related questions so that we can compare each group’s forecasts and accuracy in the future. The Sampling section details our sampling procedures.

LEAP additionally includes members of the general public²⁹ and forecasters with a demonstrated track record of uniquely high accuracy on forecasting tasks.³⁰

Publicly available demographic data on AI experts is often collected and used to stratify results in these studies, but has not historically been used to reweight results to match a target expert population (Grace et al. 2018; Zhang et al. 2022). Many studies find associations between collected demographics and outlook,³¹ suggesting that results may be sensitive to nonresponse bias. LEAP directly addresses issues of nonresponse bias by using common survey weighting methods, discussed in greater detail in Reweighting.

Challenge 2. Views on AI can change over time

One-time surveys reflect respondent views at a given time, so they cannot track the evolution of individual or group opinions over time. As a panel survey that will continue for many years,³² LEAP enables this type of tracking over time.

Challenge 3. Ambiguous forecasting questions complicate ex post evaluation and comparison between forecasters

When forecasting questions lack detailed resolution criteria, assessing accuracy and comparing forecasts is fraught. If a question is ambiguous, forecasters may have different interpretations of how the question will be resolved, so those with disparate views can each claim to have superior accuracy. In contrast, forecasting questions with prespecified resolution criteria permit evaluation of forecasting accuracy after events resolve, which proves crucial for the subsequent two challenges.

Early surveys of AI experts focused on measuring the range of opinions on AI (Muller and Bostrom 2014) or the relationship between those opinions and other forecasts (Walsh 2017) rather than defining resolvable forecasting questions. Research that followed set a goal of generating useful, more resolvable, forecasts. Grace et al. (2018) ask experts to forecast on AI progress milestones. Stein-Perlman and Grace (2022) and Grace et al. (2024) use slightly modified versions of the earlier questions. Zhang et al. (2022) modify many of the AI milestone forecasts in Grace et al. (2018). LEAP builds on these recent efforts and attempts to create clear resolution criteria for all questions, establishing a longitudinal panel of beliefs from the same set of experts to understand how opinions change over time and to better quantify uncertainty and disagreement across a wider set of experts. O’Donovan et al. (2025) and McClain et al. (2025) present forward-looking questions but do not ask respondents for forecasts; Pew asks respondents their opinions on the impact of AI over the next twenty years but does not generate measurable outcomes (McClain et al. 2025). LEAP forecasting questions use precise and specific language and are designed with measurable outcomes; survey respondents are provided with the resolution criteria at the time of the forecast, a feature that enables us to address the remaining two challenges. Monthly Surveys and Forecasting Questions discusses this feature in greater detail.³³

Challenge 4. Long-term forecasts are difficult to evaluate in the near-term

When forecasters disagree about the long-run, how can policymakers evaluate those forecasts on short-run timelines? One potential path forward is to evaluate the short-run accuracy of forecasters, relying on the most accurate forecasters over the short-run to better understand long-run outcomes. Resolving this challenge first requires clear resolution criteria, discussed in the preceding challenge. However, it also requires forecasts on both short- and long-run questions.

When using fixed-year format questions, earlier work used 10-, 20-, and 50-year time frames for predictions (Muller and Bostrom 2014; Grace et al. 2018; Zhang et al. 2019; Zhang et al. 2022; Stein-Perlman and Grace 2022; Grace et al. 2024). LEAP includes questions with a wide array of resolution dates, including both near- (as early as the end of 2025) and long-term resolution, allowing us to explore whether short-term accuracy correlates with longer-run beliefs.³⁴

Challenge 5. Forecasting is time-intensive and cognitively taxing, and truthfulness is rarely incentivized

Expert and public participants have limited time to complete surveys, and long surveys risk dissuading potential participants from completion. Three features of LEAP counteract this challenge. First, we provide historical baselines and relevant background information for each question (see Appendix E.I. Survey Questions: Wave 1 for representative examples). Second, the survey instrument contains interactive interfaces—which integrate historical baseline data where available—to facilitate the forecasting process (see Appendix B.V. Survey instrument). Third, much like other surveys, LEAP compensates forecasters for their participation. In the first three waves, the median expert respondent spent 44 minutes on each survey, while the median member of the public and superforecaster spent 29 and 90 minutes, respectively. In contrast, the American Trends Panel (ATP) from Pew Research, a popular public opinion survey, targets ten to fifteen minutes per survey (Pew Research Center 2024).

Thoughtful engagement need not translate into forecasters reporting their true beliefs. If forecasters’ rewards are not positively related to the quality of their forecasts, they lack the incentive to provide accurate forecasts. Much past work relies on traditional participation payments that are independent of forecasting accuracy (Grace et al. 2018; Grace et al. 2024). Surveys of the general public (Zhang 2019; McClain et al. 2025) tend to provide similar recruitment incentives. In contrast, LEAP’s detailed resolution criteria on timely forecasting questions allow us to incentivize participants to provide accurate forecasts by tying rewards to proper scoring rules (Brier 1950; Jose and Winkler 2009; see Appendix B.III. Scoring). Past work demonstrates that proper scoring rules yield more accurate forecasts than unincentivized forecasts (Karger et al. 2021).
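The incentive property of a proper scoring rule can be illustrated with the Brier score for a binary question: a forecaster who wants to minimize their expected score does best by reporting their true belief. The sketch below is only an illustration of that property; the function names are hypothetical, and LEAP's actual scoring procedure is the one described in Appendix B.III. Scoring.

```python
def brier_score(forecast, outcome):
    """Squared error between a probability forecast and a 0/1 outcome (lower is better)."""
    return (forecast - outcome) ** 2

def expected_brier(report, belief):
    """Expected score of reporting `report` when the event occurs with probability `belief`."""
    return belief * brier_score(report, 1) + (1 - belief) * brier_score(report, 0)

def best_report(belief, grid_size=101):
    """Search a grid of possible reports for the one with the lowest expected score."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda report: expected_brier(report, belief))

# A forecaster who believes the event has a 60% chance should report 60%.
true_belief = 0.6
print(best_report(true_belief))
print(expected_brier(0.9, true_belief) > expected_brier(0.6, true_belief))
```

Overstating confidence (reporting 90% while believing 60%) raises the expected penalty, which is the sense in which accuracy-based rewards discourage using a survey to express preferences rather than beliefs.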

Panel Construction

In this section, we discuss our procedures for sampling and reweighting and summarize the characteristics of our panel.

Sampling

We target prominent experts whom policymakers, business and nonprofit leaders, and other stakeholders would be most inclined to consult regarding the progression of AI capabilities and its technological impact. Our complete panel consists of experts, forecasters with a demonstrated track record of high, differential accuracy on forecasting tasks (“superforecasters”), and members of the general public.

Expert Sampling

We target four expert communities. First, we include computer scientists researching topics in AI by including top-cited authors, stratified by age, and the authors of the top-rated papers at leading AI and ML conferences. Second, we identify leading economists, both across fields and within the subfield of economics focused on the economic effects of AI and new technology. We include top-cited authors of papers on AI and technology, members of the U.S. Economic Experts Panel (Clark Center 2025), and attendees of economics conferences on AI. Third, we include industry professionals, identified via their contributions to frontier models or employment at AI-related companies with extensive fundraising. Fourth, we identify institutions leading the discussion on AI development, policy, and impacts and invite research staff.

We sample from two other sources and sort them into one of the four communities above. First, we invited the honorees from TIME’s 100 Most Influential People in AI in 2023 and 2024 (Barker Bonomo and Javed 2024). Second, we allowed invited respondents to recommend other qualified candidates for the survey, yielding 172 additional invitees. In order to filter this group, we require that an individual:

meets the requirements of another sampling category;
has over 1,000 academic citations; or
has over 300 academic citations if a PhD student or postgraduate researcher.

These requirements excluded only 7 of the recommended candidates that ultimately enrolled. After exclusion, the referred group has 11.6 years of experience on average and 75% have a postgraduate degree. Like other expert sample expansions, referred contacts are not included in frame targets, but their responses do receive positive weights through the reweighting process.

Within each community, we build our initial frame by identifying potential respondents from a number of sources, described in greater detail in Appendix A. Panel Construction and Sampling. We largely create non-probability samples composed of all respondents who meet criteria for inclusion in the frame, but we randomly sample from some sources that yielded a large frame (e.g., industry and policy professionals).

To reach sufficient respondent counts, we expand beyond our initial frame in a number of categories; this results in our “full frame,” shown in Table 1. To correct for the consequent change in sample composition, we use our initial frame to define reweighting targets (see Reweighting for further details on reweighting and Appendix A. Panel Construction and Sampling for a detailed specification of our initial frame). While individuals identified in these late expansions do not contribute to our targets for weighting, they do receive positive weights in our results.

Expert Type	Initial Frame Count	Full Frame Count	Participant Count
Computer Science	454	719	61
Economics	391	773	66
Industry	561	1,640	57
Policy	367	881	96

Table 1: Counts by expert category in our initial frame (target population), full frame (the initial frame plus expanded sampling), and participant pool (individuals who have completed at least 1 survey).

When an individual enrolls in LEAP, we collect information on their primary affiliation, which we use to reassign their category. For example, a respondent might have been sampled through a well-cited computer science publication, but they currently work at a leading AI lab. In such cases, we use their current affiliation, as self-reported in the enrollment survey, as their assigned category. However, we use the initial, rather than updated, categorization to define our weighting targets, as we intend to measure the characteristics of the broader population reflected by this sampling group, rather than the identity of particular individuals. Additionally, we only have this updated affiliation information for enrollees and not experts in the broader sampling frame who do not enroll. This section will accordingly use these initial classifications. All sample statistics outside of this subsection instead use this final classification, rather than a classification based on how we sampled an individual.

Superforecaster Sampling

We include a sample of superforecasters, sourced through Good Judgment Inc., a company that maintains and adds to a list of highly accurate forecasters, many of whom were among the top 2% of forecasters by accuracy in IARPA’s ACE tournament (Good Judgment Inc. n.d.). These superforecasters have a demonstrated track record of providing the most accurate forecasts across a wide array of topics.³⁵ We invited 67 superforecasters through this search.

Public Sampling

We include in our sample highly-engaged participants³⁶ from past FRI research projects on the CloudResearch Connect platform. We initially invited approximately 2,600 individuals. To ensure representativeness across underrepresented populations in our initial sample, we also reach out to additional respondents who are either (1) over the age of 50 and identify as Republican; or (2) have a high school degree (or equivalent) as their highest level of completed education.

Reweighting

Individuals with certain viewpoints might be disproportionately likely to respond to our survey conditional on receiving an invite, skewing our results toward the viewpoints of those most likely to respond. To address concerns about nonresponse bias, we use raking, a standard approach in the public polling field, to adjust aggregate statistics to be representative of the sampling frames.³⁷

These adjustments do not substantially change any of our results. The section Sensitivity of Results to Reweighting shows the impact of weighting on our aggregate results. We default to reporting weighted summary statistics for any results in this paper, but any tests or statistics related to differences in distributions (Mann–Whitney U tests and Cliff’s δ) are currently unweighted.

For the expert sample, we use our initial invite pool to generate benchmarks for reweighting. Appendix A. Panel Construction and Sampling provides more information on the composition and selection of this initial invite pool. We reweight on years of relevant experience, age, affiliation with Effective Altruism,³⁸ gender, continent, education, and affiliation with top AI labs.³⁹ We also equally weight participants based on their category of expertise.⁴⁰ Respondents from expanded sampling and referred contacts do not contribute to our frame targets for weighting, but they do receive positive weights in our results. The target populations for each reweighting category are displayed in Appendix A.VIII. Reweighting.
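As a rough illustration of how raking works, the sketch below implements iterative proportional fitting in plain Python. It is not the LEAP weighting code: the function names (`rake`, `weighted_share`), the ten toy respondents, and the restriction to two binary variables are assumptions made for illustration. The target shares are taken from the Target column of Table 2 (14% affiliated with Effective Altruism, 18% affiliated with a top AI lab).

```python
from collections import defaultdict


def rake(rows, targets, max_iter=200, tol=1e-10):
    """Rake unit weights so weighted category shares match target margins.

    rows: list of dicts mapping variable name -> category (one per respondent).
    targets: dict mapping variable name -> {category: target share}.
    Returns weights that sum to len(rows).
    """
    weights = [1.0] * len(rows)
    for _ in range(max_iter):
        largest_adjustment = 0.0
        for var, shares in targets.items():
            # Weighted total currently sitting in each category of this variable.
            totals = defaultdict(float)
            for row, w in zip(rows, weights):
                totals[row[var]] += w
            grand_total = sum(totals.values())
            # Scale each respondent so this variable's margin hits its target.
            for i, row in enumerate(rows):
                factor = shares[row[var]] * grand_total / totals[row[var]]
                weights[i] *= factor
                largest_adjustment = max(largest_adjustment, abs(factor - 1.0))
        if largest_adjustment < tol:  # every margin already matches its target
            break
    return weights


def weighted_share(rows, weights, var, category):
    """Weighted share of respondents whose `var` equals `category`."""
    hit = sum(w for row, w in zip(rows, weights) if row[var] == category)
    return hit / sum(weights)


# Hypothetical respondents: 30% EA-affiliated and 20% at a top lab before weighting.
rows = (
    [{"ea": "yes", "top_lab": "yes"}] * 1
    + [{"ea": "yes", "top_lab": "no"}] * 2
    + [{"ea": "no", "top_lab": "yes"}] * 1
    + [{"ea": "no", "top_lab": "no"}] * 6
)
targets = {
    "ea": {"yes": 0.14, "no": 0.86},
    "top_lab": {"yes": 0.18, "no": 0.82},
}
weights = rake(rows, targets)
print(round(weighted_share(rows, weights, "ea", "yes"), 3))       # close to the 0.14 target
print(round(weighted_share(rows, weights, "top_lab", "yes"), 3))  # close to the 0.18 target
```

Each pass adjusts one margin at a time, which can disturb margins adjusted earlier; repeating the passes until the adjustment factors settle at 1 brings all margins to their targets simultaneously. Raking only matches marginal shares, so it leaves the joint distribution across variables to be determined by the respondents themselves.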

We do not reweight our superforecaster sample.

For the public sample, we reweight on age, gender, race/ethnicity, household income, educational attainment, and political party identification. We target U.S. population demographics.⁴¹ After our initial public invites, we conducted two targeted recruitment waves in cells with low response counts in our initial sample. First, we recruited individuals who were over age 50 and identified as Republicans. Second, we recruited individuals with a high school degree (or equivalent) as their highest level of completed education.

We assessed our sample for differences in experience, prominence (measured by citations), ideology (measured by affiliation with Effective Altruism), and top-lab affiliation. First, there are similar levels of experience in the sampling frame and actual respondents; 50% of respondents have more than 10 years of experience in their field, contrasted with 56% of invitees. Second, there is no clear difference in academic or research prominence for the invitees versus respondents; for example, top-cited computer scientist respondents averaged 120,000 citations, similar to the 148,000 in our invited pool. Third, however, there is a clear difference in ideological affiliation; we found that 28% of respondents had ties to Effective Altruism, in contrast with 14% of invitees. But, when we downweighted this group of respondents to match the invitee proportion during reweighting, we saw no meaningful change in aggregate results. Lastly, there is also a difference in leading lab membership. 18% of invitees were employed at one of the top-20 AI labs (as defined above), but only 8% of our respondents belong to this group.

Table 2 presents a comparison of several key demographic and professional features between the target sampling frame and the Wave 1 respondent population before and after weighting. The table illustrates how our reweighting process adjusted the respondent sample to more closely reflect the target population’s composition.

Characteristic	Target	Responded (unweighted)	Responded (weighted)
Average age	40 years	37 years	40 years
Affiliated with Effective Altruism	14%	28%	12%
Men	77%	78%	78%
Lives in North America	60%	67%	62%
Average years of experience	14 years	12 years	13 years
Has a postgraduate degree	75%	77%	79%
Affiliated with “Industry” category	25%⁴²	23%	25%
Affiliated with top AI lab	18%	8%	16%

Table 2: Comparisons between characteristics of the sampling frame population and the population of unweighted/weighted Wave 1 respondents.

Respondent Characteristics

339 experts provided complete forecasts for at least one survey. We report respondent counts by domain of expertise in Table 3.

Expert type	Number of respondents
Computer scientists	76
AI Industry employees	76
Economists	68
AI policy experts	119
Total	339

Table 3: Number of respondents per expert category. Respondents completed at least one of the first three survey waves.

To better characterize these experts, our respondent sample includes:

Top computer scientists: 41 of our 76 computer science experts (54%) are professors, and 30 of these 41 (73%) are from top-20 institutions (Berger 2025). Twenty-three (30%) had top-rated (top-40 or better) papers at NeurIPS or ICLR in recent years, and eight others (11%) are PhD students or postdocs who are highly cited according to our criteria.⁴³ Ten (13%) are among the 200 top-cited authors in AI (OpenAlex n.d.). This category also includes researchers at academic and non-academic research institutions. Our CS respondents have a median of 7,100 citations (for the 95% of panelists for whom data is available).
AI industry experts: 20 of our 76 industry respondents (26%) work for one of five leading AI companies: OpenAI, Anthropic, Google DeepMind, Meta, and Nvidia. Twenty-one of the remaining industry respondents (28% of the total) work for a top AI company (top-20 model providers by training compute, as measured by Epoch AI 2024b), were identified as contributors to top-15 LLMs according to training compute or performance on Chatbot Arena in our sampling procedure (Epoch AI 2024b; LMArena 2024), or work for one of the top-30 AI-related companies, as measured by total funds raised (Crunchbase 2025). The remaining respondents were recategorized from our CS literature sampling pools, referral sampling, or other categories. Our industry respondents have a median of 9,100 citations (for the 59% of panelists for whom data is available).
Top AI economists: 54 of our 68 economist respondents (79%) are professors, and 30 (44%) are from top-50 economics institutions (RePEc 2025).⁴⁴ Our economist respondents have a median of 2,200 citations (for the 96% of panelists for whom data is available).
Policy and think tank group: Of our 119 AI policy respondents, 75 (60%) work for the following most-represented organizations (unordered): Brookings, RAND, Epoch AI, Federation of American Scientists, Center for Security and Emerging Technology, AI Now, Carnegie Endowment, Foundation for American Innovation, Stanford’s Institute for Human-Centered Artificial Intelligence and related groups, GovAI, Institute for AI Policy and Strategy, Future of Life Institute, Institute for Law & AI, Center for a New American Security, Data & Society Research Institute, Abundance Institute, and the Centre for International Governance Innovation.
TIME 100: Our panel includes 12 honorees from TIME’s 100 Most Influential People in AI in 2023 and 2024 (Bajekal 2023; Barker Bonomo and Javed 2024). TIME 100 honorees are categorized by their expertise and distributed among the above categories.

In addition to domain experts, respondents included:

60 highly accurate forecasters (“superforecasters”), based on performance in prior geopolitical forecasting tournaments.
1,400 members of the public, largely consisting of especially engaged participants in previous research, reweighted to be nationally representative of the U.S.

We report the count of respondents in each wave in Figure 1 below. The drop in expert completions from Wave 1 to Wave 2 was much larger (23%) than from Wave 2 to 3 (4%).⁴⁵

**Figure 1.** The number of survey completions per participant group, across the first three survey waves of LEAP.

In the first three waves, the median expert respondent spent 44 minutes on each survey,⁴⁶ while the median member of the public and superforecaster spent 29 and 90 minutes, respectively. Respondents provided 1.7 million words of rationales across three waves. Over 600,000 of these words come from our superforecaster and expert samples, with the remainder from public respondents.

Monthly Surveys and Forecasting Questions

We conduct surveys approximately every month, each consisting of 5–6 forecasting questions. We expect each survey to take experts approximately 30–40 minutes to complete, and respondents are informed of this time estimate. Respondents receive a standardized payment for each survey they complete.⁴⁷ In addition to quantitative forecasts, we collect rationales for each forecasting question, in the form of plain text. We use these rationale data to understand the underlying reasons for forecaster responses.

LEAP includes forecasting questions across five categories:

AI inputs: drivers of AI progress like investment, electricity consumption, and other AI R&D inputs, such as talent.
AI capabilities: measures of AI progress such as benchmark performance on difficult and consequential tasks.
AI adoption: drivers of AI impact such as the prevalence of AI applications and the intensity of AI use in economically meaningful and high-stakes contexts.
AI impacts: downstream societal effects, such as AI incidents and labor market disruption.
AI scenarios: bundles of outcomes that represent different trajectories for the technology. For an example, see the General AI Progress question from Wave 1.

We source forecasting questions and topics from academic papers, technical and policy reports, prediction platforms, public writings by leading AI figures, our past research output, our academic advisory board, and suggestions from survey respondents. We then create resolvable forecasting questions from these various input sources. You can view example questions with associated resolution criteria in Appendix E. Survey Questions. The resolution criteria are intended to reduce noise in forecast collection by minimizing the space for different question interpretations among forecasters, and to permit accuracy assessment.

We ask three types of forecasting questions:

Probabilistic: We ask participants to assign a probability to a binary or discrete event. For example, “Will AI solve or substantially assist in solving a Millennium Prize Problem in mathematics by the following resolution dates?” (Wave 2)⁴⁸
Quantile: We ask participants to forecast quantiles of a continuous outcome (typically the 25^th, 50^th, and 75^th percentiles). For example, “What will be the highest percentage accuracy achieved by an AI model on FrontierMath, by the following resolution dates?” (Wave 1)⁴⁹
Point Estimate: We ask participants for a point estimate of a continuous outcome. For example, “By the end of 2030, what percent of LEAP expert panelists will agree that each of the following is a serious cognitive limitation of state-of-the-art AI systems?” (Wave 2; list omitted for brevity but available in Appendix E.II. Survey Questions: Wave 2).

Scoring forecasts first requires us to resolve the values being forecasted. We then apply scoring rules to award accuracy prizes, which incentivize truthful reporting. Respondents are provided with detailed resolution criteria and relevant historical baseline data in order to inform their forecasts. We resolve questions using either ground truth data or data generated from LEAP itself. We discuss these methods in greater detail in Appendix B. Monthly Surveys and Forecasting Questions.
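This section does not state which scoring rules LEAP uses, so the sketch below only illustrates how a scoring rule can reward truthful reporting. It assumes the Brier score for probabilistic questions and the pinball (quantile) loss for quantile questions, two standard choices that may differ from the rules described in Appendix B. All function names are illustrative.

```python
def brier_score(prob, outcome):
    """Squared error of a probability forecast; outcome is 1 if the event occurred, else 0."""
    return (prob - outcome) ** 2


def pinball_loss(forecast, realized, tau):
    """Loss for a forecast of the tau-th quantile once the value is realized."""
    diff = realized - forecast
    return max(tau * diff, (tau - 1) * diff)


def expected_brier(reported, true_prob):
    """Average Brier score when the event truly occurs with probability true_prob."""
    return true_prob * brier_score(reported, 1) + (1 - true_prob) * brier_score(reported, 0)


# Assumed belief of 60%: search a grid of possible reports for the lowest expected loss.
grid = [i / 100 for i in range(101)]
best_report = min(grid, key=lambda p: expected_brier(p, 0.60))
print(best_report)  # the report that minimizes expected loss

# A 25th percentile forecast of 20 when the realized value turns out to be 30.
print(pinball_loss(forecast=20, realized=30, tau=0.25))
```

Under a rule of this kind, a forecaster who believes an event has a 60% chance of occurring cannot improve their expected score by reporting anything other than 60%, which is the sense in which accuracy prizes based on such rules encourage honest forecasts.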

We score rationales with a combination of human and LLM judges, and provide prizes to the highest quality rationales in each survey wave. We do not publicly share our scoring criteria to prevent gaming by participants.

Results

Uncertainty and Disagreement

On many questions, we ask respondents to express their uncertainty in the form of quantile forecasts. We describe quantile forecasts in greater detail in Monthly Surveys and Forecasting Questions. This approach allows us to quantify two sources of variation: between-expert disagreement (i.e., how much experts disagree with each other) and within-expert uncertainty (i.e., how wide each expert’s predicted range is). To measure total variation, we generate a “pooled” distribution of respondent beliefs, representing the full variation in expert views.⁵⁰ We discuss this procedure in greater detail in Appendix C. Pooled Distribution Estimation. We plot these pooled distributions for several questions in Figure 4, and we then decompose the variation in expert views into the share due to within-forecaster uncertainty and the share due to between-forecaster disagreement. For example, for the share of work hours forecast to be assisted by generative AI, the spread of the pooled distribution grows over time: the standard deviation is 5.7 percentage points in 2025, 13 in 2027, and 22 in 2030, and within-forecaster uncertainty explains between 53% and 65% of the variation across the three time horizons.

While this approach to understanding uncertainty and disagreement is used across questions and groups of forecasters through the rest of the paper, we outline it with expert responses to one example question here. By 2030, the median expert predicts that 18% of all U.S. work hours will be assisted by generative AI. We will discuss this finding in more detail below. The 18% median forecast should be interpreted alongside information about variation in beliefs. To capture the full extent of expert uncertainty and disagreement, we construct an aggregate distribution of expert beliefs by combining each expert’s distributional forecasts on the question into a pooled (mixture) distribution. The 25^th–75^th percentile range of this pooled distribution is (7.3%, 34.6%). This range represents the total variation of beliefs in the expert pool. We use the law of total variance to decompose the variance of this pooled distribution into between-forecaster disagreement and within-forecaster uncertainty. On this question, 53% of the variation in forecasts is due to within-forecaster uncertainty and 47% is due to between-forecaster disagreement. Appendix C. Pooled Distribution Estimation contains more details on the method we use for this analysis.
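The decomposition described above can be sketched in a few lines. The example assumes, purely for illustration, that each forecaster's belief is a normal distribution centered on their median with a spread implied by their interquartile range; the parametric family actually used is specified in Appendix C and may differ. The three forecasts are hypothetical, and forecasters are weighted equally rather than with survey weights.

```python
from statistics import NormalDist, fmean, pvariance

Z75 = NormalDist().inv_cdf(0.75)  # 75th percentile of a standard normal, about 0.674


def fit_normal(q25, q50, q75):
    """Assumed fit: mean at the stated median, sd from the interquartile range."""
    return float(q50), (q75 - q25) / (2 * Z75)


def decompose(forecasts):
    """Split the variance of an equal-weight mixture into within and between parts."""
    fits = [fit_normal(*f) for f in forecasts]
    means = [mean for mean, _ in fits]
    within = fmean([sd ** 2 for _, sd in fits])  # E[Var(X | forecaster)]
    between = pvariance(means)                   # Var(E[X | forecaster])
    total = within + between                     # law of total variance
    return {
        "total_variance": total,
        "within_share": within / total,
        "between_share": between / total,
    }


# Hypothetical (25th, 50th, 75th) percentile forecasts, in percent of work hours.
forecasts = [(5, 10, 20), (10, 18, 30), (20, 35, 50)]
result = decompose(forecasts)
print({name: round(value, 3) for name, value in result.items()})
```

The within share rises when individual forecasters give wide ranges, and the between share rises when their central estimates are far apart, so the two shares separate "each expert is unsure" from "experts disagree with one another."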

Because fitting a distribution to a respondent’s forecasts relies on a parametric assumption, especially about the tails of individual distributions, we also present two other measures of variation. First, we present a basic measure of the amount of between-forecaster disagreement in the forecast: the interquartile range of experts’ 50^th percentile forecasts was 9%–30%, indicating substantial variation among experts in their predictions of the median outcome. In other words, a quarter of all experts believe that 9% of work hours (or less) will be assisted by generative AI, and a quarter believe it is likely to be above 30%.

Second, we present a measure of individuals’ uncertainty based on the quantile forecasts we elicited. In addition to the median, we elicited 25^th and 75^th percentile forecasts from each forecaster on this question. The median 25^th percentile forecast across all experts was 9% and the median 75^th percentile forecast was 28%. This 9%–28% range cannot be interpreted as a typical confidence interval, but it can indicate the degree of uncertainty forecasters had about the outcome of interest.⁵¹
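Both non-parametric measures can be computed directly from the elicited quantiles, without fitting any distribution. The sketch below uses five hypothetical forecasters, and the quantile convention (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption rather than the convention used in the paper.

```python
from statistics import median, quantiles

# Hypothetical (25th, 50th, 75th) percentile forecasts from five forecasters, in percent.
forecasts = [(4, 8, 15), (9, 18, 28), (10, 20, 35), (15, 30, 45), (6, 12, 25)]

# Measure 1: between-forecaster disagreement, the interquartile range of the medians.
medians = [q50 for _, q50, _ in forecasts]
lower_quartile, _, upper_quartile = quantiles(medians, n=4, method="inclusive")

# Measure 2: typical individual uncertainty, the median of each elicited quantile.
median_q25 = median(q25 for q25, _, _ in forecasts)
median_q75 = median(q75 for _, _, q75 in forecasts)

print(f"IQR of median forecasts: {lower_quartile}-{upper_quartile}")
print(f"Median 25th and 75th percentile forecasts: {median_q25}-{median_q75}")
```

The first range describes how far apart forecasters' central estimates are; the second describes how wide a typical forecaster's own range is. Neither depends on assumptions about the tails of individual distributions.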

In the footnotes for each point estimate, we report three measures of uncertainty as follows: we first summarize the total variation, pooling all expert beliefs and discussing the variation in that distribution. We then report the variation in median forecasts. If we only collected a central estimate, we do not report additional statistics. If we did collect other quantile forecasts, we report the median forecasts across experts of those quantiles. For instance, on the work-hours question above, we would report the median 25^th (and 75^th) percentile forecasts for each quantity of interest.

Key Insights

We draw five insights from the results for Waves 1, 2, and 3. First, most experts expect sizable near‑term societal effects from AI. Second, substantial disagreement underlies these forecasts. Third, the median expert expects much less progress than prominent leaders of frontier AI labs. Fourth, experts anticipate faster progress than the public on most outcomes. Fifth, highly accurate forecasters (superforecasters) and experts are largely aligned, with superforecasters expecting slightly less progress overall. Differences across expert subgroups (CS, economics, industry, policy) are small and rarely statistically significant, and reweighting our expert sample to match a carefully constructed expert sampling frame leaves our headline conclusions unchanged.

1. Experts expect sizable societal effects from AI by 2040.

In particular, the median expert expects substantial impacts on the ability of AI systems to solve difficult math problems, the use of AI for companionship and work, electricity usage from AI, and investment in AI. Even the lower end of the expert belief distribution still implies substantial impacts of AI:

Work: The median expert forecast is that 18% of work hours will be assisted by generative AI in 2030,⁵² up from approximately 4.1% in November 2024 (Bick et al. 2025), over a 4x increase.⁵³ The bottom quartile of experts gives a forecast of 9%, while the top quartile gives a forecast of 30%. The median expert gives a 25% chance the value is 9% or lower (still over a 2x increase), and a 25% chance it exceeds 28%.⁵⁴
Private AI investment: The median expert predicts that Our World in Data will report $260 billion of global private AI investment by 2030, up from the $130 billion baseline for the series in 2024.⁵⁵ The median expert gives a 25% chance that investment will be at or below $175 billion, over a third higher than the baseline value, and another 25% chance that investment matches or exceeds $400 billion, just over 3x larger than the baseline level.
Electricity usage: The median expert predicts that 7% of U.S. electricity consumption will be used for training and deploying AI systems in 2030, and close to double that (12%) in 2040. For context, 7% is 1.5x today’s entire data-center load, 13% is all of Texas’ electricity use, 23% is almost all of the industrial sector’s electricity use, and 40% accounts for all residential electricity use. Even experts expecting less electricity consumption give substantial median forecasts: the bottom quartile of experts still predicts values of 5% in 2030 and 8% in 2040.
Math research: 23% of experts predict that the FrontierMath benchmark will be saturated by the end of 2030,⁵⁶ meaning that AI can autonomously solve a set of math problems that resemble those a math PhD student might spend several days completing. The bottom quarter of experts expect 60% or less of these problems to be solved in the same timeframe, substantially more than the 19% baseline at the time of the survey. By 2040, experts predict it is more likely than not (60%) that AI will solve or substantially assist in solving one of the Millennium Prize Problems, a set comprising some of the most difficult unsolved problems in mathematics.
Companionship: The median expert predicts that by 2030, 15% of adults will self-report using AI for companionship, emotional support, social interaction, or simulated relationships at least once daily, up from 6% today. By 2040, that number doubles to 30% of adults.
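The multiples quoted in the list above follow from dividing each forecast by its baseline. The short check below reproduces that arithmetic using only figures stated in the text; the helper name `multiple` is illustrative.

```python
def multiple(baseline, forecast):
    """How many times larger the forecast is than the baseline."""
    return forecast / baseline


# (baseline, forecast) pairs taken from the bullets above.
comparisons = {
    "Work hours assisted, median 2030 forecast (%)": (4.1, 18),
    "Work hours assisted, median 25th percentile (%)": (4.1, 9),
    "Private AI investment, median forecast ($bn)": (130, 260),
    "Private AI investment, median 25th percentile ($bn)": (130, 175),
    "Private AI investment, median 75th percentile ($bn)": (130, 400),
    "Daily AI companionship, median 2030 forecast (%)": (6, 15),
}

for label, (baseline, forecast) in comparisons.items():
    print(f"{label}: {multiple(baseline, forecast):.2f}x baseline")
```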

To assess the broader scope of AI’s impacts, we asked experts to assess “slow” versus “moderate” or “fast” scenarios for AI progress, and how AI will compare to other historically significant developments such as the internet, electricity, and the Industrial Revolution. We found:

Speed of AI progress: By 2030, the average expert thinks that 23%⁵⁷ of LEAP panelists will say the state of AI most closely mirrors a “rapid” progress scenario, which we described as: AI writes Pulitzer Prize-worthy novels, collapses years-long research into days and weeks, outcompetes any human software engineer, and independently develops new cures for cancer.⁵⁸ Conversely, they give a 28% chance of a slow-progress scenario, in which AI is a widely useful assisting technology but falls short of transformative impact.
Societal impact: By 2040, the median expert predicts that the impact of AI will be comparable to a “technology of the century,” akin to electricity or automobiles. Experts also give a 32% chance that AI will be at least as impactful as a “technology of the millennium,” such as the printing press or the Industrial Revolution, and a 16% chance that AI is no more impactful than a “technology of the year” like the VCR.⁵⁹

The median expert predicts 2%⁶⁰ growth in white-collar jobs between January 2025 and December 2030. This is significantly slower than a recent linear trend, which would predict 6.8% growth. However, we did not collect forecasts on the causal effect of AI on white-collar employment.⁶¹ While some experts expect AI to cause white-collar job loss (see Occupational Employment Index), this question does not allow for a clear understanding of that causal relationship.

We summarize the expert forecasts for these various indicators in Figures 2 and 3.

**Figure 2:** Median expert forecasts for various questions. We display the 10^th, 25^th, 50^th, 75^th, and 90^th percentiles of the median forecasts given by experts at each date. For example, if 25% of experts give a median forecast of 5% or less, the 25^th percentile series in the graph will lie at 5%; these series are not confidence intervals. Where available, we include a historical baseline.

**Figure 3:** Average expert forecasts on the Technological Richter Scale question.

Experts lend insight into the aggregate results above with their rationales. For example, when considering the societal impact of AI (as defined by a technological Richter scale),⁶² one expert is well aligned with the median in that they assign the greatest probability to AI being the Technology of the Century, but they also argue:

…if progress can continue at its present rate, Technology of the Millennium is a possibility. Given the level of investment and the scramble for talent, Technology of the Decade is assured. We are on a long runway stretching back 50 years and have finally achieved liftoff… AI could conceivably rival the printing press at giving [the] everyman a level of intelligence where it once provided [the] everyman with information. The industrial revolution greatly increased the material productivity of society; AI could provide the same boost for both material and service products by trading electrical energy for intellect.

Echoing that sentiment, an expert who forecasted a slightly above-the-median impact writes, “While there are parallel examples in the rise of agriculture and industrial production, particularly in terms of general-purpose innovation (steam, fossil fuels, electricity, etc.), AI is unique because it both augments human intelligence and will eventually surpass it.”

2. Experts disagree and express substantial uncertainty about the trajectory of AI.

While the median expert predicts substantial AI progress, and a sizable fraction of experts predict fast progress, experts disagree widely. Notably, the top quartile of experts gives a median forecast that 50% of newly approved drug sales in the U.S. in 2040 will be from AI-discovered drugs, compared to a median forecast of just 10% for the bottom quartile of experts.⁶³ Further, the top quartile of experts gives a forecast of at least 81% that AI will solve or substantially assist in solving a Millennium Prize Problem by 2040, compared to a forecast of just 30% from the bottom quartile of experts.⁶⁴ We use our pooled distributions to express the relative importance of within-forecaster uncertainty and between-forecaster disagreement. We find that, across all forecasting questions that allow forecasters to express their uncertainty, within-forecaster uncertainty explains 49% of the total variation in forecasts, compared to the 51% explained by between-forecaster disagreement.

In Figure 4 below, we plot the pooled distributions for expert forecasts on the share of work hours assisted by generative AI and FrontierMath scores by the ends of 2025, 2027, and 2030.

**Figure 4:** Pooled distributions for expert forecasts on Work Hours Assisted by Generative AI (top panels) and FrontierMath scores (bottom panels). These pooled distributions combine within-expert uncertainty and between-expert disagreement. Densities are normalized to the same peak for comparability. See Uncertainty and Disagreement for details.

We report summary statistics for the expert pooled distributions for select questions in Table 4 below. You can find tables for all other relevant questions on the LEAP website.⁶⁵

Question	Resolution Year	25^th Pctl.	50^th Pctl.	75^th Pctl.	Within Share	Between Share
AV Trips	2027	2.7	7.1	20.1	0.53	0.47
AV Trips	2030	6.9	19.2	46.2	0.53	0.47
Drug Discovery	2027	1.1	2.5	6.8	0.41	0.59
Drug Discovery	2030	2.4	6.0	15.7	0.41	0.59
Drug Discovery	2040	8.4	22.8	53.8	0.48	0.52

Table 4: Summary of expert pooled distributions for U.S. Ride-Hailing Trips by Autonomous Vehicles (AV Trips) and AI-discovered share of Drug Sales (Drug Discovery). “Within Share” and “Between Share” denote the proportion of variance from within-forecaster uncertainty and between-forecaster disagreement, respectively.

The degree of within-forecaster uncertainty revealed above is also well-represented in the rationales. As one expert writes, “Predicting when such breakthroughs come is notoriously difficult, hence broad confidence intervals,” reflecting a common sentiment. Another expresses a more expansive view:

My forecasts reflect the ambivalence I feel between competing narratives—one being that AI’s potential to change the course of human history has been overhyped, and that we’ll run out of decent training data, and so on; the other being that we’re in a delusional, pre-Copernican state where we humans still cling to a ‘we’re the center of the universe’ notion that intelligence is something unique to our species, or at least has to be rooted in biological entities, even as it becomes blindingly self-evident that this is not true.

The sharp between-forecaster disagreement documented by the forecasts is also revealed in the rationales, particularly through their juxtaposition. For example, when considering pace-of-AI-progress scenarios, one expert writes, “ChatGPT was first publicly released in late 2022. I don’t believe what we witnessed over these past 2.5 years would justify expecting rapid progress over the next 5,” while another offers, “I think the last three years of progress have been qualitatively immense, so the next five years seem like they could lead to highly autonomous systems capable of very impressive things.” In other instances, the contrast is starker: “I believe that there is wide consensus on the rapid evolution of AI,” writes one expert. “Rapid progress scenario is unhinged,” writes another.

3. The median expert expects significantly less AI progress than leaders of frontier AI companies.

Leaders of frontier AI companies have made aggressive predictions about AI progress. Dario Amodei, co-founder and CEO of Anthropic, predicts:

January 2025: “By 2026 or 2027, we will have AI systems that are broadly better than almost all humans at almost all things.” (World Economic Forum 2025)
March 2025: Anthropic also claimed in a response to the U.S. Office of Science and Technology Policy that it anticipates that by 2027 AI systems will exist that equal the intellectual capabilities of “Nobel Prize winners across most disciplines—including biology, computer science, mathematics, and engineering.” (D’Souza 2025)
May 2025: Amodei has stated that AI could increase overall unemployment to 10-20% in the next one to five years, a prediction highlighted by Barack Obama. (Allen and VandeHei 2025; Obama 2025)

Sam Altman of OpenAI states that:

January 2025: “I think AGI will probably get developed during [Donald Trump’s second presidential] term, and getting that right seems really important.” (Tyrangiel 2025)

Elon Musk, leader of xAI and Tesla, writes:

December 2024: “It is increasingly likely that AI will superset [sic] the intelligence of any single human by the end of 2025 and maybe all humans by 2027/2028. Probability that AI exceeds the intelligence of all humans combined by 2030 is ~100%.” (Musk 2024)
August 2025: When a user posted “By 2030, all jobs will be replaced by AI and robots,” Musk responded: “Your estimates are about right.” (Musk 2025)

Demis Hassabis, CEO and co-founder of Google DeepMind, predicts:

August 2025: “We’ll have something that we could sort of reasonably call AGI, that exhibits all the cognitive capabilities humans have, maybe in the next five to 10 years, possibly the lower end of that.” (Rose 2025)
August 2025: “It’s going to be 10 times bigger than the Industrial Revolution, and maybe 10 times faster.” (Rose 2025)

While we cannot directly compare these claims to LEAP questions, we offer clear evidence based on LEAP forecasts that the median expert expects substantially smaller effects of AI than is expected by frontier AI company leaders:

General capabilities: Lab leaders predict human-level or superhuman AI by 2026-2029, while our expert panel indicates longer timelines for superhuman capabilities. By 2030, the average expert thinks that 23% of LEAP panelists will say the state of AI most closely mirrors a “rapid” AI progress scenario that matches some of these claims.⁶⁶,⁶⁷
White-collar jobs: The median expert predicts 2% growth in white-collar employment by 2030 (compared to a 6.8% trend extrapolation).⁶⁸ This contrasts with Elon Musk’s suggestion that all jobs might be replaced by 2030.⁶⁹ Relatedly, Dario Amodei predicts 10-20% overall unemployment within the next five years.
Millennium Prize Problems: The median expert gives a 60% chance that AI will substantially assist in solving a Millennium Prize Problem by 2040⁷⁰ (and 20% by 2030).⁷¹ Amodei’s prediction of general “Nobel Prize winner” level capabilities by 2026-2027 could imply a much more aggressive timeline, but the implications of Amodei’s predictions are somewhat unclear.⁷²

The rationales help explain why the median expert forecasted slower progress than AI company leaders predict. Some reference specific claims made by company leaders: “Anthropic CEO’s forecast of 90% of coding in the USA done by AI ‘within six months’ has been a fantastic dud,” wrote one expert, and another, “Musk has been talking about autonomous driving for ages, and it’s always been worse in the end than he said.”

Others offer broader arguments as to why they believe the timelines predicted by the leaders of frontier AI companies are unlikely to materialize. Below are two examples of these arguments:

Radical change in major systems just takes longer than 4-5 years. I also think that [in] many of these domains, even unexpectedly fast advancement in AIs will not easily translate to improvements for quite some time because of unexpected barriers. That is, at least until we have strong artificial general intelligence (AGI), which we will not by 2030. To paraphrase an old saying, every job looks easy for those not actually doing it.

The force of its impact will likely be slowed by bottlenecks in areas AI hasn’t yet conquered. An important concept is that an economic bottleneck grows in significance when productivity elsewhere increases. For instance, if global shipping dramatically increases, a bottleneck in the Suez or Panama Canal becomes much more costly. There likely exist thousands (millions?) of potential bottlenecks in the economy which will only become legible as other processes are sped up by orders of magnitude.

4. Experts predict much faster AI progress than the general public.

Of 68 total forecasts⁷³ (across 14 questions with multiple time horizons and quantiles) with a clear valence of AI capabilities,⁷⁴ the general public holds views about AI progress, capabilities, and diffusion that are statistically indistinguishable⁷⁵ from those of experts in 9% of cases, predicts less progress⁷⁶ than experts in a large majority (71%) of cases, and predicts more progress in 21% of cases. Where experts and the public disagree, the public predicts less progress over three times as often as more progress. Across these forecasts that exhibit a clear valence of AI capabilities, a randomly selected expert is 16% more likely than a randomly selected member of the public to predict faster progress than would be expected by random chance. Note that all Mann–Whitney U tests and Cliff’s δ⁷⁷ values are calculated on unweighted data; we plan to integrate weights into all such analyses in future work. We summarize some of the differences in aggregate forecasts in Figure 5, and Figure 6 plots pooled distributions for experts against those for the public. The questions selected in Figure 5 are the progress-valenced questions that easily map onto a percentage scale.

**Figure 5:** Differences between the expert and public median 50th percentile forecasts for several questions where the unit is a percentage. Points indicate the median of each group’s 50th percentile forecasts. We apply transformations to create valenced forecasts, where values closer to the left indicate slower progress and values to the right indicate faster progress.

**Figure 6:** Pooled distributions for expert and public forecasts on various questions across years. These pooled distributions combine within-expert uncertainty and between-expert disagreement. Densities are normalized to the same peak for comparability. See Uncertainty and Disagreement for details.

We report statistical comparisons between experts and the public for select questions in Table 5 below. You can find tables for all other relevant questions on the LEAP website.⁷⁸

Question	Resolution Year	Percentile	p-value	Cliff’s δ
AV Trips	2027	25	0.017	0.085
	2027	50	<0.001	0.16
	2027	75	<0.001	0.23
	2030	25	<0.001	0.12
	2030	50	<0.001	0.26
	2030	75	<0.001	0.35
Drug Discovery	2027	25	<0.001	-0.26
	2027	50	<0.001	-0.19
	2027	75	0.018	-0.093
	2030	25	<0.001	-0.14
	2030	50	0.28	-0.043
	2030	75	0.15	0.056
	2040	25	<0.01	0.12
	2040	50	<0.001	0.28
	2040	75	<0.001	0.37

Table 5: Statistical comparisons for experts and the public for U.S. Ride-Hailing Trips by Autonomous Vehicles (AV Trips) and AI-discovered share of Drug Sales (Drug Discovery). We report p-values from Mann–Whitney U tests and Cliff’s δ values. A positive value indicates that expert forecasts tend to be larger than public forecasts.
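As a minimal sketch of the sign convention in Table 5, Cliff’s δ can be read as the share of expert-public forecast pairs in which the expert forecast is larger, minus the share in which it is smaller. The forecast values and the function names `cliffs_delta` and `mann_whitney_u` below are hypothetical illustrations, not LEAP data or code, and the sketch is unweighted.

```python
from itertools import product

def cliffs_delta(experts, public):
    """Cliff's delta: P(expert > public) - P(expert < public) over all cross-group pairs."""
    greater = sum(1 for e, p in product(experts, public) if e > p)
    less = sum(1 for e, p in product(experts, public) if e < p)
    return (greater - less) / (len(experts) * len(public))

def mann_whitney_u(experts, public):
    """Mann-Whitney U statistic for the first group; tied pairs count one half."""
    return sum(1.0 if e > p else 0.5 if e == p else 0.0
               for e, p in product(experts, public))

# Hypothetical 50th-percentile forecasts (percent of trips); NOT LEAP data.
experts = [25, 20, 15, 30, 10]
public = [12, 10, 20, 8]

delta = cliffs_delta(experts, public)
u = mann_whitney_u(experts, public)

# The two statistics are linked by delta = 2U / (m * n) - 1.
assert abs(delta - (2 * u / (len(experts) * len(public)) - 1)) < 1e-12
print(f"Cliff's delta = {delta:+.2f}")  # positive: expert forecasts tend to be larger
```

Swapping the argument order flips the sign, which is why a negative value in the table indicates that public forecasts tend to be larger.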

We summarize some of the major differences below:

Societal impact: On average, experts give a 63% chance that AI will be at least as impactful as a “technology of the century”—like electricity or automobiles—whereas the public gives this a 43% chance. Further, experts give a 32% chance that it will be at least as impactful as a “technology of the millennium” (akin to the printing press or the Industrial Revolution), while the public gives this a 22% chance.
Autonomous vehicles: The public predicts about half as much autonomous vehicle progress as experts by 2030, as suggested by each group’s 50th percentile forecasts. The median expert in our sample predicts that usage of autonomous vehicles will grow dramatically, from a baseline of 0.27% of all U.S. rideshare trips in Q4 2024 to 20% by the end of 2030.⁷⁹ In comparison, the general public predicts 12%⁸⁰ (p < 0.001, Cliff’s δ = 0.26).
Generative AI use: The public predicts about half as much generative AI use in 2030. Experts predict that 18% of U.S. work hours will be assisted by generative AI in 2030, whereas the general public predicts 10% (p < 0.001, Cliff’s δ = 0.29).
Mathematics: 23% of experts predict that FrontierMath⁸¹ will be saturated by the end of 2030 in the median scenario,⁸² meaning that AI can autonomously solve a typical math problem that a math PhD student might spend multiple days completing. Only 6% of the public predict the same, about 3x less.
Diffusion into science: Experts predict a roughly 10x increase (from 3% to roughly 30%) in AI-engaged papers across Physics, Materials Science, and Medicine between 2022 and 2030.⁸³ The general public predicts two-thirds as much diffusion into science: that roughly 20% of papers in these fields will be AI-engaged.⁸⁴
Drug discovery: By 2040, the median expert predicts that 25% of sales from newly approved U.S. drugs will be from AI-discovered drugs, compared to 15% for the public (p < 0.001, Cliff’s δ = 0.28). The median expert also thinks there’s a 25% chance that AI-discovered drugs will account for more than 43% of recent drug sales, whereas the general public predicts there’s a 25% chance of a share greater than 23%, about half the expert forecast. In contrast, the public expects a larger share from AI-discovered drugs in the short run, predicting that AI-discovered drugs will account for 2.5% of recent drug sales in 2027; experts predict 1.6% of 2027 sales will be attributed to AI-discovered drugs (p < 0.001, Cliff’s δ = -0.19).⁸⁵
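To make the scale of the autonomous vehicle forecasts above concrete, the short calculation below converts each group’s 2030 forecast into an implied compound annual growth multiple. It is an illustrative sketch: treating Q4 2024 to the end of 2030 as exactly six years is an assumption, and the function name `annual_multiple` is hypothetical.

```python
def annual_multiple(baseline_share, forecast_share, years):
    """Constant yearly growth multiple that takes baseline_share to forecast_share."""
    return (forecast_share / baseline_share) ** (1 / years)

BASELINE = 0.27   # % of U.S. rideshare trips in Q4 2024 (from the text)
YEARS = 6         # assumption: Q4 2024 to end of 2030 treated as six years

expert_rate = annual_multiple(BASELINE, 20.0, YEARS)  # median expert forecast: 20%
public_rate = annual_multiple(BASELINE, 12.0, YEARS)  # median public forecast: 12%

print(f"expert-implied growth: {expert_rate:.2f}x per year")
print(f"public-implied growth: {public_rate:.2f}x per year")
```

Under these assumptions the median expert forecast corresponds to the autonomous vehicle share roughly doubling each year, while the public forecast implies somewhat slower compounding.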

Contrary to this overall pattern, the public assigns more weight to the “Rapid Progress” scenario in the General AI Progress question: the average member of the public expects that 26% of LEAP panelists will select the rapid scenario in 2030 (95% confidence interval [25.5%, 26.4%]),⁸⁶ compared to 23% for experts (95% confidence interval [22.1%, 23.7%]).⁸⁷

To assess the extent to which low effort or lower comprehension among the public could drive these results, we compare members of our public sample with high forecasting accuracy in other studies to those with low accuracy. We do not find that either group systematically expects more or less progress than the other. Public Accuracy Stratification details this analysis. Additionally, while within-forecaster uncertainty explains 49% of the total variation in expert forecasts, this component explains just 37% of the variation in public forecasts. As forecasting questions resolve, we will compare the calibration (and accuracy) of expert and public forecasts.
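One generic way to split total variation into a within-forecaster and a between-forecaster share is the law of total variance, sketched below. This is an assumption for illustration only: the paper’s own procedure is described in its Uncertainty and Disagreement section, and the forecaster means and variances here are hypothetical, as is the function name `variance_shares`.

```python
from statistics import fmean, pvariance

def variance_shares(forecasters):
    """Split total variance into within- and between-forecaster shares.

    Each forecaster is a (mean, variance) pair summarizing their own
    distribution; forecasters are weighted equally (law of total variance).
    """
    means = [m for m, _ in forecasters]
    variances = [v for _, v in forecasters]
    within = fmean(variances)      # average personal uncertainty
    between = pvariance(means)     # disagreement across central estimates
    total = within + between
    return within / total, between / total

# Hypothetical panel: (mean forecast, variance of own distribution); NOT LEAP data.
panel = [(10.0, 16.0), (20.0, 25.0), (30.0, 9.0), (18.0, 36.0)]
within_share, between_share = variance_shares(panel)
print(f"within: {within_share:.0%}, between: {between_share:.0%}")
```

A larger within share means forecasters are individually uncertain; a larger between share means they are individually confident but disagree with one another.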

5. There are few differences in prediction between superforecasters and experts, but, where there is disagreement, experts tend to expect more AI progress. We don’t see systematic differences between the beliefs of computer scientists, economists, industry professionals, and policy professionals.

There are no discernible differences between forecasts from different groups of experts. Across all pairwise comparisons of expert categories for each of the questions with a clear AI progress valence, only 32 out of 408 combinations (7.8%) show statistically significant differences (at a 5% threshold), similar to what you would expect from chance. This means that computer scientists, economists, industry professionals, and policy professionals largely predict similar futures as groups, despite there being significant disagreement about AI among experts. This raises questions about popular narratives that economists tend to be skeptical of AI progress and that employees of AI companies tend to be more optimistic about fast gains in AI capabilities. In other words, while we do see widespread disagreement among experts about the future of AI systems, capabilities, and diffusion, we fail to find evidence that this disagreement is explained by the domain in which experts work. As LEAP continues, we plan to study what factors most drive expert disagreement. However, the groups used for these comparisons are subsets of our expert sample, so these comparisons are necessarily less powered.
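The benchmark behind the comparison to chance can be written out as simple arithmetic: at a 5% threshold, each test has a 5% nominal chance of a false positive when no true difference exists. The sketch below only computes the observed share and that nominal expectation. Because the same experts contribute to many of the 408 comparisons, the tests are not independent, so the nominal count is a rough reference point rather than a formal test.

```python
N_COMPARISONS = 408   # pairwise expert-category comparisons (from the text)
N_SIGNIFICANT = 32    # comparisons significant at the 5% threshold (from the text)
ALPHA = 0.05          # significance threshold (from the text)

observed_share = N_SIGNIFICANT / N_COMPARISONS
expected_count = ALPHA * N_COMPARISONS   # nominal false positives if no group differed

print(f"observed share significant: {observed_share:.1%}")
print(f"nominal expected count at alpha={ALPHA}: {expected_count:.1f}")
```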

Superforecasters and expert groups predict similar futures. Superforecasters are statistically indistinguishable from experts in 69% of valenced forecasts, predict less progress than experts in 26% of forecasts, and more progress in 4% of forecasts. A randomly selected expert is 9.8% more likely than a randomly selected superforecaster to predict faster progress than would be expected by random chance.

Where superforecasters and experts disagree, superforecasters usually (86% of such cases) predict less progress. Further, some of these disagreements are quite large. For example, the median expert predicts that use of autonomous vehicles will grow dramatically—from 0.27% of all U.S. rideshare trips in 2024 to 20% by the end of 2030,⁸⁸ whereas the median superforecaster predicts less than half that, 8%⁸⁹ (p < 0.001). A randomly selected superforecaster predicts, in the median scenario for 2030, less AV penetration than a randomly selected expert 37% more often than would be expected by random chance. Superforecasters also predict less societal impact from AI and less AI-driven electricity use. Drug discovery is the only setting where superforecasters are more optimistic than experts: By 2040, experts predict that 25% of sales from recently approved U.S. drugs will be from AI-discovered drugs. Superforecasters predict 45%, almost double (p < 0.01). Here, a randomly selected superforecaster predicts a higher share, in the median scenario, than a randomly selected expert 23% more often than would be expected by chance.

This pattern is consistent with the follow-up to our Existential Risk Persuasion Tournament, where a small sample of experts was more optimistic than around 80 superforecasters about AI progress and capabilities in a 2022 (pre-ChatGPT) survey, although both experts and superforecasters significantly underestimated AI progress by 2025 (Kučinskas et al. 2025; Karger et al. 2025). We summarize some of the differences in aggregate forecasts in Figure 7, and Figure 8 plots pooled distributions for experts against superforecasters. In Figure 9, we compare aggregate forecasts of the various expert groups.

**Figure 7:** Differences between the expert and superforecaster median 50th percentile forecasts for several questions where the unit is a percentage. Points indicate the median of each group’s 50th percentile forecasts. We apply transformations to create valenced forecasts, where values closer to the left indicate slower progress and values to the right indicate faster progress.

**Figure 8:** Pooled distributions for expert and superforecaster forecasts on various questions across years. These pooled distributions combine within-expert uncertainty and between-expert disagreement. Densities are normalized to the same peak for comparability. See Uncertainty and Disagreement for details.

**Figure 9:** Expert category median 50th percentile forecasts for several questions where the unit is a percentage. Points indicate the median of 50th percentile forecasts for each category. We apply transformations to create valenced forecasts, where values closer to the left indicate slower progress and values to the right indicate faster progress.

We report statistical comparisons between experts and superforecasters for select questions in Table 6 below. You can find tables for all other relevant questions on the LEAP website.⁹⁰

Question	Resolution Year	Percentile	p-value	Cliff’s δ
AV Trips	2027	25	<0.001	0.34
	2027	50	<0.001	0.45
	2027	75	<0.001	0.44
	2030	25	<0.001	0.31
	2030	50	<0.001	0.37
	2030	75	<0.001	0.32
Drug Discovery	2027	25	0.58	0.042
	2027	50	0.084	0.14
	2027	75	0.09	0.14
	2030	25	0.44	0.064
	2030	50	0.66	0.037
	2030	75	0.81	0.021
	2040	25	0.014	-0.2
	2040	50	<0.01	-0.23
	2040	75	<0.01	-0.22

Table 6: Statistical comparisons for experts and superforecasters for U.S. Ride-Hailing Trips by Autonomous Vehicles (AV Trips) and AI-discovered share of Drug Sales (Drug Discovery). We report p-values from Mann–Whitney U tests and Cliff’s δ values. A positive value indicates that expert forecasts tend to be larger than superforecaster forecasts.

Highlights from Question-by-Question Analyses

To provide a window into some of the details available in the question-level analysis, below we highlight some key reasoning offered by experts in their written rationales. We present these highlights for a handful of questions, but more detail is available in Appendix F. Question-by-Question Results.

FrontierMath:⁹¹ Experts who forecast a high degree of progress on this benchmark often point to recent trends. One argues, “We’ve seen jumps of around 5 points⁹² on this benchmark every couple of months so far and these jumps will only accelerate as scores approach 50% (benchmark scores tend to be roughly sigmoid-shaped over time).” Many also emphasize inference scaling potential, with one observing that “the current top scorer is a reasoning model, and the reasoning model paradigm is relatively new; this suggests that rapid improvements are likely as the paradigm evolves,” and another opining that “with very large amounts of inference compute, it’s possible that o3 or o4-mini could already get well over 30% today.” Beyond advances in technical capabilities, a significant minority of high-forecast respondents point to the likelihood of sustained investment in math-related AI capabilities due to its prestige and R&D value. Writes one: “Math is highly relevant to many R&D domains, so progress in math has been, and is highly likely to continue to be, a focus for leading AI companies.” Among low-forecast respondents, a common sentiment is that “the fastest . . . progress is behind us and we are now approaching the flat/end point portion of the S-Curve of advancement.” Many also express skepticism that inference scaling will be enough to overcome fundamental architectural limitations.

Autonomous vehicles:⁹³ A common consideration among experts who forecast that a high percentage of U.S. ride-hailing trips will eventually be provided by Level 4 autonomous vehicles was that Level 4 technology has a proven track record. One writes, “Historically, when a technology finally gets to be used in the wild, it improves very rapidly.” Another highlights Waymo’s exponential expansion, noting that “Waymo is currently more-than-doubling every year,” and that this could result in a data flywheel where “broader deployment will generate more data, which in turn enhances safety—creating a positive feedback loop.” Conversely, low-forecast respondents often emphasize historical overpromising and argue the current technology is difficult to scale “because it requires lots of case-by-case optimization for a particular region (down to individual intersections),” and “progress in Phoenix or Miami does not generalize easily to New York, Boston, or Chicago.”

White-collar jobs:⁹⁴ Experts who forecast that the percentage of white-collar jobs will be higher in the future relative to 2025 often cite historical precedent when arguing that white-collar employment is more likely to adapt than collapse: “Historically, human labor patterns have experienced quite radical transformations over time, even in established sectors. Emerging technologies, rather than sucking people out of the labor market of white-collar work, are more likely to make them work differently and lead to new white-collar roles that can capitalize on this transition.” Low-forecast respondents tend to view AI’s speed and cognitive focus as a key differentiator from prior periods of transition: “White collar workers…work on symbolic tasks, generate language, and make decisions based on analysis of data, all tasks to which LLMs are well suited,” writes one. Another points to evidence of layoffs already occurring, particularly in the tech sector: “From Intel to Microsoft, many top executives and management staff were laid off to make room for other investments at the organization. Google laid off 10% of its managerial staff last December.” Still another adds, “software engineering is already being hugely impacted and this is only accelerating.”

Speed of AI progress:⁹⁵ Many experts who forecast a high likelihood that AI progress will be rapid mention the consistent pace of capability improvements, with one noting, “historically AI system development has followed a steep scaling curve and increases in model size, data and compute have led to rapid capability gains.” Another points out that, “METR [Model Evaluation and Threat Research] results imply a roughly 4 to 10x improvement in time horizon every year, which means that we’ll have systems capable of doing weeks or months of work by the late 2020s.”⁹⁶ Many also note the potential for a recursive self-improvement feedback loop, in which AI improves itself, to significantly accelerate growth. Slow-progress forecasters often highlight physical constraints, pointing to the need for better training data and the cost of compute—especially as it relates to energy needs: “I expect energy to be the chief bottleneck to AI progress such that it will be a rate-limiter for progress in general.” Another common sentiment shared by slow-progress forecasters is that “for any of the moderate and rapid progress criteria to be met there would need to be a massive paradigm shift in AI technology,” and that such a shift in the underlying LLM architecture is unlikely to materialize in the near term.

Technological Richter Scale:⁹⁷ Intended to be analogous to the measurement of earthquake magnitudes, the technological Richter scale instead attempts to measure the impact of technologies. Experts who predict AI will have an exceptionally high impact tend to see it as uniquely positioned to fundamentally restructure society, with one going so far as to suggest it could “challenge capitalism” and “has the potential to replace human labor in most fields [and] might force our societies to shift to a new economic model.” This transformative potential is believed by many to stem from AI’s dual nature in that “it both augments human intelligence and will eventually surpass it.” High-impact respondents also point to the quick pace of AI deployment and emphasize the likelihood of rapid AI progress: “People fundamentally don’t think in exponentials. 2040 is a LONG time away, technologically. And AI will modify AI, at which point its improvement will go even more second-order.” Low-impact respondents often question the sustainability of recent growth patterns, with one noting that “upper levels, and in particular 10, require sustained exponential growth; this is unlikely to materialize given that natural growth (e.g., of bacteria) follow a sigmoid shape.” Another points to regulations and physical bottlenecks as constraints and challenges whether AI will deliver tangible benefits sufficient to drive transformation: “The average citizen will not have much benefit to buy from AI. Improved games or art? Cheaper manufactured goods? A robot to clean your house? How does AI deliver things that humans want, like better, cheaper healthcare?”

Millennium Prize:⁹⁸ The most common reason given by experts who think the likelihood is high that AI will solve (or substantially assist in solving) one of the notoriously difficult Millennium Prize math problems is that in January of 2025 the CEO of Google’s DeepMind claimed that, in partnership with a team of mathematicians, they’re “close to solving” one of the problems (later identified as Navier-Stokes) within “a year or year and a half” (Ansede 2025). Many also point to AI’s gold medals in the International Mathematical Olympiad (Luong and Lockhart 2025) and progress on FrontierMath as evidence of rapid capability growth in mathematical reasoning. Low-forecast respondents generally don’t mention the DeepMind claim; a few do, but dismiss corporate pronouncements. One instead cites the president of the Clay Mathematics Institute (the organization responsible for awarding the Millennium Prize) who in June of 2025 claimed, “We’re very far away from AI being able to say anything serious about any of those problems” (Heaven 2025). Regarding purported progress, one mathematician notes: “The Math Olympiad is targeted toward gifted high school students spending an afternoon on a problem solvable with known techniques…[whereas] Millennium Prize problems can consume entire careers without a solution.” Multiple low-forecast respondents also note FrontierMath Tier 4 (which poses much harder problems than Tiers 1-3) has <10% solve rates.

Diffusion into science:⁹⁹ When considering the percentage of publications in the fields of physics, materials science, and medicine that will be ‘AI-engaged’, experts predicting high engagement commonly extrapolate recent exponential growth, noting the 2018-2022 data shows engagement roughly tripling across all three fields, and emphasize this baseline predates ChatGPT: “What we see in the baseline data is just the beginning, resulting from the application of foundational AI models but largely without the generative AI models exploding on the scene.” Many cite domain-specific breakthroughs (AlphaFold in protein folding, GNoME and MatterGen in materials discovery, AI-driven imaging and diagnostics in medicine) as evidence AI can deliver transformative results that will drive rapid diffusion. Low-forecast respondents instead tend to focus on interpretability and reliability issues that could slow diffusion: “AI is a black box that hallucinates,” writes one. Others add that diffusion may be slow due to, “natural resistance to change from the existing body of researchers in these fields,” or because, in the case of medical studies, AI isn’t useful: “A significant portion of papers are observational, often reporting causal effects. There isn’t much room for AI in these sorts of papers, as current statistical methods are more reliable and bias-free.”

Drug discovery:¹⁰⁰ Several experts who forecast that AI will accelerate drug discovery-to-market timelines point to faster design-make-test-analyze loops and potentially AI-enabled pharmacodynamic simulations that could streamline clinical trials. Others note AI-discovered drugs already demonstrate significantly higher Phase I success rates, and that extrapolating from current growth trends suggests “these will constitute a majority of new clinical trial submissions.” Some also point out that discovery-to-market timelines can be shortened significantly during times of crisis via EUAs (emergency use authorizations). Low-forecast respondents commonly emphasize regulatory realities that may limit AI’s impact on approval timelines. One writes: “Given that the median time it takes to get through the FDA approval process is over 10 years, and no AI-discovered drugs appear to have started Phase III trials yet,¹⁰¹ 2027 is likely too soon for many, if any, new AI drugs to be approved.” Although some low-forecast respondents acknowledge improved Phase I results, several make the point that “the turnaround time between Phase I and approval will not speed up substantially for AI-invented drugs,” because “early entrants sped through Phase I but then quickly reverted to the mean in Phase II.”

Electricity:¹⁰² When assessing the percent of U.S. electricity consumption that will be used for training and deploying AI systems, high-forecast respondents often emphasize that the massive capital expenditure plans already announced by competing AI companies signal the type of unprecedented infrastructure investment that could result in “an explosion of energy usage.” Others note that the geopolitical competition for AI supremacy may trigger an energy arms race, with one forecaster warning, “China is also investing massive amounts in datacenters,” leading to the possibility that “we enter an arms race that is mostly determined by who can pump the most electricity into AI.” Low-forecast respondents tend to focus more on potentially formidable constraints, particularly when considering “the material and political investments necessary to get significant growth—physical data centers, chips, permitting, water for cooling, transmission lines,” and suggest these constraints could push infrastructure development offshore: “Major developers will possibly respond by increasingly outsourcing the physical infrastructure of data processing to locales outside of the US—there’s no particular reason why models need to be trained inside of U.S. borders where the economic and political expenses are potentially much higher.”

Private investment in AI:¹⁰³ Among experts who predict high future levels of global private investment in AI, most view the current level of investment as fundamentally justified and advance arguments along the lines of, “AI adoption is still in its early stages across many industries,” and “the strong rebound to ~$130 billion in 2024 is critical. It occurred despite higher interest rates, signaling powerful, non-speculative belief in the transformative potential of generative AI.” In contrast, frequently expressed sentiments among low-forecast respondents include doubts that productivity gains “will materialize quickly enough to justify high levels of investment,” and that this could lead to the bursting of an AI bubble, with one expert noting, “both Deutsche Bank and Bain & Co. have just warned that the current AI boom is not sustainable” (Edwards 2025) and another likening the current situation to, “the dot com bubble in 2000.”

AI companions:¹⁰⁴ Many experts who believe that a high proportion of U.S. adults will eventually use AI for companionship cite loneliness as a key driver. One points out that “the U.S. Surgeon General declar[ed] loneliness an epidemic in 2023, with about half of U.S. adults experiencing measurable levels of loneliness” (Office of the Surgeon General 2023). Others predict that as AI capabilities advance, AI companions will become more “sophisticated, emotionally intelligent, and capable of forming deeper connections with users,” and that this will facilitate their integration into existing platforms and devices, driving use and normalization to the point where “ambient access through devices turns companionship into a series of micro-interactions throughout the day.” Low-forecast respondents tend to believe humans are likely to have a strong preference for human companionship, with one arguing that “most people would find [AI] companionship unfulfilling, perhaps even viewing reliance on it as a kind of failure.” Others point to lower saturation limits than the number of people who experience frequent loneliness, with one stating: “About a quarter of U.S. adults go to therapy.¹⁰⁵ If that’s the market size, then I expect AI to eventually saturate [at] that.”

Example of a Question-Level Analysis: Millennium Prize Problem

For each question, we conduct an analysis like the Millennium Prize example below, and present them in Appendix F. Question-by-Question Results. Each analysis summarizes the question and background information, summarizes the results, and analyzes rationales to uncover the core differences in view between low and high forecasts. In the first wave alone, experts and superforecasters wrote over 600,000 words supporting their beliefs. Analyzing these rationales alongside predictions provides significantly more context on why experts believe what they believe, and the drivers of disagreement, than the forecast alone.

Question. Will AI solve or substantially assist in solving a Millennium Prize Problem in mathematics by 2027, 2030, and 2040?

Background. The seven Millennium Prize Problems¹⁰⁶ were chosen by the founding Scientific Advisory Board of the Clay Mathematics Institute (CMI) of Cambridge, Massachusetts to be the most significant and difficult mathematics problems unsolved by 2000.

Historical Baseline. As of July 2025, only one of the seven problems has been solved (Clay Mathematics Institute, 2025).

For full question background and resolution details, see Appendix E.II. 1. Millennium Prize.

**Figure 10:** In this question, participants make 50th percentile forecasts for various resolution dates. This figure shows the 10th, 25th, 50th, 75th, and 90th percentiles of these 50th percentile forecasts, split by participant group. The 25th expert percentile for Dec 2027 is the value below which 25% of experts’ median forecasts fall.

Results. Experts estimate a 10% chance that AI will solve or substantially assist in solving a Millennium Prize Problem by 2027,¹⁰⁷ up to 20% by 2030,¹⁰⁸ and 60% by 2040.¹⁰⁹ All categories of experts, superforecasters, and the public largely predict similarly across timescales. However, there is wide disagreement between experts: the top quartile of experts think there’s at least an even (50%) chance of AI assistance by 2030, whereas the bottom quartile of experts think there’s only a 10% chance. The disagreement by 2040 is even larger: the interquartile range for expert medians is 30%–81%, while the top decile of experts think there’s a 95% chance and the bottom decile of experts think there’s only a 15% chance.
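The summary statistics reported here (deciles, quartiles, and the median of individual 50th percentile forecasts) can be sketched as follows. The ten forecasts are hypothetical, the function name `summarize_medians` is illustrative, and the inclusive interpolation method is an assumption; the paper does not state which percentile convention it uses.

```python
from statistics import quantiles

def summarize_medians(median_forecasts):
    """Return the 10th, 25th, 50th, 75th, and 90th percentiles of a group's
    individual 50th-percentile forecasts."""
    # n=20 yields cut points at 5%, 10%, ..., 95%; cuts[i] is the (i + 1) * 5 percentile.
    cuts = quantiles(median_forecasts, n=20, method="inclusive")
    return {p: cuts[p // 5 - 1] for p in (10, 25, 50, 75, 90)}

# Hypothetical 50th-percentile forecasts (probability in %) from ten experts; NOT LEAP data.
expert_medians = [5, 10, 15, 20, 30, 50, 60, 70, 85, 95]

summary = summarize_medians(expert_medians)
print(f"median of medians: {summary[50]}%")
print(f"interquartile range: {summary[25]}% to {summary[75]}%")
```

The interquartile range of these individual medians is what the text uses to describe disagreement between experts, as distinct from the uncertainty within any one expert’s forecast.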

For full results tables, see here.

The rationales experts wrote to explain their forecasts lend considerable insight into their core areas of disagreement, in particular:

DeepMind/Navier-Stokes: High-forecast respondents frequently cite the DeepMind CEO’s January 2025 statement that, in partnership with a team of mathematicians, they’re “close to solving” one of the problems (later identified as Navier-Stokes) within “a year or year and a half” (Ansede 2025). This is treated as strong concrete evidence. Low-forecast respondents generally either don’t mention this or dismiss corporate pronouncements. One expert cites the Clay Institute president’s June 2025 claim: “We’re very far away from AI being able to say anything serious about any of those problems” (Heaven 2025).
Benchmarks: Many high-forecast respondents point to International Mathematical Olympiad gold medals and FrontierMath progress as evidence of rapid capability growth in mathematical reasoning that will likely continue (Luong and Lockhart 2025). Low-forecast respondents tend to argue these are fundamentally different challenges. One mathematician notes: “The Math Olympiad is targeted toward gifted high school students spending an afternoon on a problem solvable with known techniques…[whereas] Millennium Prize problems can consume entire careers without a solution.” Multiple forecasters note FrontierMath Tier 4 (which poses much harder problems than Tiers 1-3) has <10% solve rates.
The nature of Millennium problems: High-forecast respondents commonly emphasize that math is verifiable, has clear structure, and that some problems (Navier-Stokes, Birch–Swinnerton-Dyer) may be suited to AI-assisted numerical exploration or pattern recognition. Low-forecast respondents often express doubts that Millennium Problems are solvable with the current AI paradigm, emphasizing doing so requires “deep conceptual breakthroughs,” “developing new concepts and mathematical rules,” and “truly out of the box thinking.” One domain expert writes: “The current generation of AI does not seem to be able to do this sort of creative mathematical work at all. It can apply known techniques and get novel results, but these results would be very easy for top working mathematicians.”
Base rates and timelines: High-forecast respondents mostly don’t engage with base rates, or they argue that AI changes the game fundamentally. By contrast, many low-forecast respondents emphasize that only one of the seven problems has been solved in the 25 years since the prize was announced, and that some have remained unsolved for more than a century. They also highlight Millennium Prize rules: upon the publication of a solution, a minimum of two years must pass before a prize can be awarded, to allow time for adequate verification. (In the case of the one prize that was awarded, the gap between the publication of the solution and the awarding of the prize was over seven years.) This, many low-forecast respondents point out, renders the 2027/2030 dates almost impossible regardless of technical progress.
“Substantially assist” interpretation: High-forecast respondents tend toward a broad interpretation—any meaningful acceleration of human-AI hybrid research counts, whereas low-forecast respondents tend toward restrictive interpretation. One notes the resolution criteria require contribution “likely not producible without AI,” which is a higher bar.
Architecture sufficiency: Most high-forecast respondents believe incremental improvements over current LLM capabilities will be sufficient, especially when paired with specialized tools (Lean, AlphaProof) and human collaboration. Low-forecast respondents frequently argue the current LLM paradigm fundamentally cannot do this. Multiple forecasters say we need “entirely new architectures” (neurosymbolic systems were mentioned several times) or that a “pattern matching paradigm doesn’t extend to the deep creativity required.”
Difficulty of achieving superhuman performance: Although rarely discussed by high-forecast respondents, a few low-forecast respondents expressed doubts that this could be achieved, with one writing, “Training a model to do math at the level of human experts might be a qualitatively different ML problem from training a model to do math surpassing expert capabilities. RL training requires creating problems with reward functions…We haven’t achieved that with reasoning post-training yet.”

High-forecast rationale examples:

“I guess the elephant in the room is that DeepMind says they are close: The so-called Navier-Stokes Operation, underway for three years with a team of 20 people, has so far been carried out with complete discretion, although the chief of Google DeepMind, Demis Hassabis let slip in a January interview that they are ‘close to solving a Millennium Prize Problem’ without mentioning which one. ‘We’ll see that in the next year or year and a half.’”¹¹⁰

“Some of the problems, like the Riemann Hypothesis or the Birch and Swinnerton-Dyer Conjecture, are especially well-suited to AI-supported exploration. They bear a kind of family resemblance to the Four-Color Theorem in their relationship to computer-assisted mathematics. The Four-Color Theorem was famously solved through a hybrid of human conceptual framing and extensive computer verification. As Donald MacKenzie details in his socio-history of that episode [reference below¹¹¹], much of the intellectual labor wasn’t in the computation itself but in formalizing the problem in a way that machines could meaningfully engage with it and in managing the institutional consequences of proof-by-machine.”

“My optimism that AI could achieve high-level original mathematics is revised upward significantly since the Bubeck announcement about GPT-5 a few weeks ago regarding the first confirmed example of novel mathematical reasoning generated by a LLM.”¹¹²

Low-forecast rationale examples:

“I have domain expertise here as a mathematician. The current generation of AI does not seem to be able to do this sort of creative mathematical work at all. It can apply known techniques, and get novel results, but these results would be very easy for top working mathematicians. The kind of pattern matching paradigm we have seen so far apparently doesn’t extend at all to deep creativity required.”

“If the millennium problems all require new insights absent from the training data, then current LLM technology is simply not up to the task: we will need instead new AI paradigms that are better at creating non-combinatorial insights (i.e., insights that do not originate from the recombination of patterns already learned by the AI). This will take time: it is not only the time to develop these new AI techniques, but also the time for the humans now riding the wave of machine learning and LLMs to accept that it might be worth their time to look into alternative approaches (more so after extensive efforts to trivialize these approaches as a loss of time). It is the second factor which I think will be the true time bottleneck and could push the resolution of this question further in time.”

“Given the progress on Tier 3 FrontierMath problems, a Millennium Problem seems well away, notwithstanding bullish predictions from corporate spokespeople with vested interests.”

“I would put 60% as some hard limit on whether any of the conjectures can be solved at all.”

“I don’t think there are many economic incentives to develop those kinds of systems. Millennium problems are very, very hard – much harder than most directly economically useful tasks. They require developing new mathematical theories and techniques to even approach them. As far as I know, current top AI models lack this ability, and I don’t see an easy way for them to obtain such an ability (nor are there many economic incentives for building such abilities into them).”

Sensitivity of Results to Reweighting

As described in the Reweighting section, we use a standard approach in the public polling field, raking, to weight aggregate statistics to be representative of the sampling frames. We perform a sensitivity analysis to understand the impact of weighting on aggregate forecasts. We compare the median aggregated forecast to the weighted median aggregated forecast, where positive values of differences indicate that weighting participant responses increases the value of a forecast.
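To make the raking step concrete, the following is a minimal sketch of iterative proportional fitting in plain Python. It is an illustration under stated assumptions, not the weighting code used for LEAP: the function name `rake`, the variables `sector` and `region`, and the target shares are all hypothetical.

```python
from collections import defaultdict


def rake(rows, targets, max_iter=100, tol=1e-9):
    """Iterative proportional fitting (raking) over categorical margins.

    rows:    one dict per respondent, e.g. {"sector": "academia", "region": "US"}
    targets: {variable: {category: target share}}, shares sum to 1 per variable
    Returns one weight per respondent, normalized to have mean 1.
    """
    weights = [1.0] * len(rows)
    for _ in range(max_iter):
        max_change = 0.0
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted total in each category of this variable.
            current = defaultdict(float)
            for row, w in zip(rows, weights):
                current[row[var]] += w
            # Rescale so this variable's weighted margin matches its target.
            for i, row in enumerate(rows):
                factor = shares[row[var]] * total / current[row[var]]
                max_change = max(max_change, abs(factor - 1.0))
                weights[i] *= factor
        if max_change < tol:
            break
    mean_weight = sum(weights) / len(weights)
    return [w / mean_weight for w in weights]


# Hypothetical respondents and frame margins, for illustration only.
rows = [
    {"sector": "academia", "region": "US"},
    {"sector": "academia", "region": "non-US"},
    {"sector": "industry", "region": "US"},
    {"sector": "industry", "region": "US"},
]
targets = {
    "sector": {"academia": 0.5, "industry": 0.5},
    "region": {"US": 0.6, "non-US": 0.4},
}
print(rake(rows, targets))
```

Each pass adjusts the weights to match one margin at a time, which is why the procedure needs only marginal targets from the sampling frame rather than the full joint distribution.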

We begin by expressing the difference between the weighted median and unweighted median at the question level as a proportion of the forecast dispersion. This difference is divided by the standard deviation of the unweighted forecasts, within each question-participant group, to standardize the impact across questions with different units. To examine the full distribution of reweighting effects, we summarize these standardized differences with several statistics, including their 25th percentile, median, and 75th percentile. For example, a 25th percentile value of -0.07 indicates that across all questions, 25% of the standardized reweighting effects fall below -0.07. Table 7 below shows a summary of these differences by survey for experts and the public.

Survey	Participant	Min	p25	Median	Mean	p75	Max
Wave 1: Headliners	Expert	-0.24	-0.07	0	-0.03	0.01	0.2
Wave 1: Headliners	Public	-0.2	0	0	0	0	0.12
Wave 2: AI for science	Expert	-0.38	0	0	-0.02	0	0.19
Wave 2: AI for science	Public	-0.18	0	0.01	0.03	0.06	0.15
Wave 3: Broad Adoption of AI	Expert	-0.23	-0.09	0	-0.03	0	0.28
Wave 3: Broad Adoption of AI	Public	-0.21	0	0	-0.01	0	0.08

Table 7: Summary of standardized reweighting effects. The difference between weighted and unweighted medians, expressed in standard deviation units, is summarized across surveys and participant groups.

These values show how much reweighting shifts the forecast relative to the dispersion of forecasts. For example, a value of 0.1 means the weighted forecast is 0.1 standard deviations higher than the unweighted forecast. This table shows that the median effect is essentially zero for both participant types and across all surveys. Reweighting has a marginally larger effect on expert participants than on members of the public. The histogram below shows the effect of reweighting on all the forecasts in more detail.

**Figure 11:** Histogram showing standardized difference between weighted and unweighted aggregate median.
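The standardized reweighting effect summarized in this section can be sketched for a single question as follows. This is an illustrative sketch, not the paper's code: the function names are hypothetical, and two conventions are assumptions, namely the weighted-median rule (the smallest value at which cumulative weight reaches half the total) and the use of the sample standard deviation.

```python
import statistics


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]


def standardized_reweighting_effect(forecasts, weights):
    """(weighted median - unweighted median) / SD of the unweighted forecasts."""
    # Use the same median convention for both terms so equal weights give 0.
    unweighted = weighted_median(forecasts, [1.0] * len(forecasts))
    weighted = weighted_median(forecasts, weights)
    spread = statistics.stdev(forecasts)  # assumption: sample standard deviation
    return (weighted - unweighted) / spread


# Hypothetical forecasts for one question, with one heavily weighted respondent.
forecasts = [10, 20, 30, 40, 50]
weights = [1, 1, 1, 1, 6]
print(standardized_reweighting_effect(forecasts, weights))
```

A positive value means weighting raises the aggregate forecast, matching the sign convention in Table 7. Repeating the calculation for every question and summarizing the results yields the rows of that table.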

Public Accuracy Stratification

Given that we select for experience in AI in our expert sample but not in our public sample, public forecasts could be distorted by either a lack of comprehension or effort. To investigate this concern, we partition our sample by forecasting accuracy on out-of-sample, prior questions. For the 832 participants we can match to a prior forecasting record,¹¹³ we calculate accuracy scores based on performance on 24 forecasting questions asking about near-term (<6 months) events in an earlier research project.¹¹⁴ We score these forecasts using an S-score, with a lower score indicating better performance.

Public participants are split into two accuracy groups: “high-accuracy,” representing the most accurate 50% of public participants, and “low-accuracy,” representing the least accurate 50% of public participants. We do not adjust weights after partitioning the public sample.
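The partition described above amounts to a median split on a prior accuracy score. The sketch below shows one way to express it. The participant identifiers and scores are made up, the function names are hypothetical, and the rule that ties at the cutoff fall into the high-accuracy group is an assumption rather than something the paper specifies.

```python
import statistics


def split_by_accuracy(scores):
    """Split participants at the median score, where lower scores are better.

    scores: {participant_id: accuracy score}
    Returns (high_accuracy_ids, low_accuracy_ids) as sets.
    """
    cutoff = statistics.median(scores.values())
    # Assumption: participants exactly at the cutoff count as high-accuracy.
    high = {pid for pid, score in scores.items() if score <= cutoff}
    low = set(scores) - high
    return high, low


def group_median(forecasts, group):
    """Median forecast among the participants in one accuracy group."""
    return statistics.median(forecasts[pid] for pid in group)


# Hypothetical prior scores and forecasts on one question.
scores = {"a": 0.1, "b": 0.4, "c": 0.2, "d": 0.9}
forecasts = {"a": 12.0, "b": 20.0, "c": 15.0, "d": 30.0}
high, low = split_by_accuracy(scores)
print(group_median(forecasts, high), group_median(forecasts, low))
```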

Of 68 total forecasts with a clear valence of AI capabilities, the results are mixed. The high-accuracy group holds views about AI progress, capabilities, and diffusion that are statistically indistinguishable from the low-accuracy group in half of all cases. They predict less progress in 28% of cases and more in the remaining 22% of forecasts. We summarize some of the differences in aggregate forecasts in Figure 12.

**Figure 12:** Differences between the high- and low-accuracy public median 50th percentile forecasts for several questions where the unit is a percentage. Points indicate the median of each group’s 50th percentile forecasts. We apply transformations to create valenced forecasts, where values closer to the left indicate slower progress and values to the right indicate faster progress.

We present some additional results in Appendix D. Public Accuracy Stratification. We plan to explore the drivers of these disagreements in future work.

Next Steps

This paper reports results from the first three monthly waves of LEAP and describes the project methodology in detail. We will elicit forecasts roughly each month for the next three years on timely topics related to the development, capabilities, adoption, and impact of AI. Both the set of forecasting questions and the space of possible analyses will expand over time: additional questions grow the number of datapoints we can test and learn from, and progressive question resolution enables a continued sharpening of accuracy measurement. We plan to release reports on each new wave of LEAP soon after we complete data collection. Additionally, we will periodically release more detailed reports and papers conducting more extensive and cross-wave analysis. We will discuss some of those analyses below.

Future waves of LEAP will be focused on forecasts related to topics including security and geopolitics, robotics, labor and automation, incidents and harms, and AI safety. For example, we tentatively plan to ask forecasters to predict how much AI will improve the productivity of software engineers, how AI will affect the productivity of workers, and how the use of AI may cause harm. We welcome reader suggestions for LEAP questions or wave themes via outreach to our project team.¹¹⁵ We now describe planned follow-up work for LEAP.

Accuracy Assessment

As questions resolve, we will assess the accuracy of forecasts to identify particularly accurate individual forecasters within the expert, superforecaster, and public groups, and to assess the relative accuracy of the different expert subgroups within our sample. We will also present forecasters with information on their own past performance, assessing how this feedback translates into accuracy on new forecasting questions. Kučinskas et al. (2025) perform a similar accuracy analysis, retrospectively analyzing a forecasting study, first introduced in Karger et al. (2023), that focused on multi-year forecasts about AI-, nuclear-, climate-, and biotechnology-related progress and risks. They find no correlation between short-term accuracy and long-term beliefs in that study context. LEAP is, in many ways, a follow-up to that project, improving on several choices the authors made: LEAP requires that forecasters answer all (or almost all) questions, elicits one-time forecasts, and does not provide a team structure or room for deliberation, since debate in that survey did not generally resolve disagreements (Karger et al. 2025). We will evaluate these findings with significantly more precision in LEAP as questions resolve in 2027 and 2030.

Forecast Updating

We plan to re-survey the LEAP sample on many of our questions to track how respondents’ views evolve over time. For example, one year from now, how will respondents update their forecasts of whether there will be an AI-reliant solution to a Millennium Prize Problem? How will progress (or a lack of progress) for AI systems on key benchmarks change respondents’ views of longer-run effects of AI?

Schools of Thought Analysis and Crux-Finding

A “school of thought” is a cluster of similar responses to a set of forecasting questions. In future work, we will use standard clustering techniques to search for similar groups of forecasters across questions, and we will complement this work with analysis of qualitative information (rationales) for these various schools of thought. Are we able to distinguish consistent differences in sets of forecasts among subsets of our sample—for example, fast versus slow AI progress groups? The search for schools of thought in our forecasting data takes a related but opposite approach to our prior work using adversarial collaborations to identify cruxes for differences of opinion about the likelihood of harms from AI, in which we search for individuals from distinct schools-of-thought and then ask them to forecast on questions about AI on which they disagreed (Rosenberg et al. 2024). Both approaches can help us map beliefs about AI.
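As an illustration of what searching for schools of thought could look like, the sketch below clusters forecasters with k-means, treating each forecaster as a vector of forecasts across questions. The paper says only that standard clustering techniques will be used, so the choice of k-means, the number of clusters, and the example data are all assumptions made here for illustration.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means. points is a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each forecaster to the nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical forecasters, each with forecasts on two questions (as fractions).
forecasters = [
    (0.10, 0.10), (0.15, 0.20), (0.20, 0.10),  # slower-progress views
    (0.80, 0.90), (0.90, 0.85), (0.85, 0.80),  # faster-progress views
]
labels, centers = kmeans(forecasters, k=2)
print(labels)
```

In practice, forecasts on different scales would need to be standardized before clustering so that no single question dominates the distance calculation.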

Relatedly, are we able to identify “cruxes”—i.e., strongly differential forecasts between schools of thought on near-term questions that enable faster assessment of which school is more likely to be accurate in the long term? This crux-finding effort builds on our earlier work (McCaslin et al. 2024; Rosenberg et al. 2024; Rosenberg et al. 2025), which finds that disagreement on the likelihood of extreme outcomes from AI is not easily resolved by debate, but we can identify nearer-term cruxes that could increase consensus.
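One simple way to screen for candidate cruxes, given two schools of thought, is to rank near-term questions by how far apart the schools' median forecasts are relative to the overall spread. The sketch below is a deliberately simplified heuristic with hypothetical names and data. It is not the crux-finding method used in the cited work.

```python
import statistics


def rank_cruxes(school_a, school_b):
    """Rank questions by the standardized gap between two schools' medians.

    school_a, school_b: {question: list of forecasts from that school}
    Returns (question, gap) pairs, largest gap first.
    """
    gaps = {}
    for question in school_a.keys() & school_b.keys():
        pooled = school_a[question] + school_b[question]
        spread = statistics.pstdev(pooled)
        if spread == 0:
            continue  # everyone agrees exactly, so this cannot be a crux
        gap = abs(
            statistics.median(school_a[question])
            - statistics.median(school_b[question])
        )
        gaps[question] = gap / spread
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Hypothetical near-term questions and forecasts.
slow_school = {"q1": [10, 12, 11], "q2": [50, 55, 60]}
fast_school = {"q1": [30, 32, 31], "q2": [52, 54, 58]}
print(rank_cruxes(slow_school, fast_school))
```

Questions at the top of the ranking are those where the two schools disagree most sharply, so their resolution would be most informative about which school is tracking reality better.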

Elicitation Experiments

We have already tested the impact of providing various defaults in the interactive forecasting interfaces (see Appendix B.V. Survey instrument) and the ordering of options within questions; we plan to present these results in a future report. Additionally, we plan to experiment with question wording, question ordering, and more.

Expanded Use of Rationale Data

We are exploring scalable and privacy-preserving methods for directly displaying a subset of rationales for high and low forecasts on particular questions, as part of our forecast explorer.¹¹⁶ We will provide detailed rationale analyses (similar to the examples above) for all LEAP questions, available in our monthly reports. We are also awarding prizes to respondents for rationale quality, and we will highlight publicly some particularly high-quality rationales.

Public Engagement

We may experiment with broader public engagement on AI forecasting. For example, we may enable anyone to make their own forecasts on LEAP questions, and provide them with a report on where their forecasts fit among different schools of thought.

Conclusion

Policymakers, nonprofit and business leaders, and other stakeholders routinely consult experts to inform their decisions, especially when faced with new technologies and high levels of uncertainty. While public discussion of AI and expert anecdotes are widespread, structured quantitative evidence on expert beliefs is lacking, impeding effective decision making. With the launch of LEAP, we fill an important gap by both measuring the full range of expert opinions on AI capability developments and their impact, and by capturing the underlying reasoning and evidence that supports these beliefs.

We have completed three survey waves focused on (1) high-level predictions about AI progress; (2) the application of AI to scientific discovery; and (3) widespread adoption and social impact. The first three rounds of LEAP reveal five key findings.

First, collectively experts expect sizable societal effects from AI by 2040, even if effects materialize more slowly than expected.

Second, and in contrast with the first takeaway, considerable disagreement across experts, and uncertainty within individual experts, underlies these predictions of progress. This dynamic likely arises from multiple sources: the inherent difficulty of forecasting emerging technologies, sharp disagreements between competing schools of thought, and the fundamental uncertainty surrounding AI development and its impact. Forecasting emerging technologies is inherently difficult. For example, historical predictions about fusion power have consistently proven overoptimistic (Takeda et al. 2023). In our own prior work, both domain experts and superforecasters substantially underestimated AI progress (Kučinskas et al. 2025). Nevertheless, aggregate forecasts remain informative and offer the potential to cut through the noise of disagreement, as wisdom-of-the-crowd effects have proven robust across domains. As the project progresses and forecasting questions resolve, LEAP will evaluate the performance of aggregate and individual forecasts.

Third, expert predictions diverge substantially from the timelines articulated by frontier lab executives, with our median expert anticipating considerably slower progress. LEAP provides an expert view free from the potential distorting effects of financial incentives that may influence public statements from industry leaders. While the historical record will be the ultimate yardstick for these predictions, LEAP helps us understand what a broader swath of experts expect from AI.

Fourth, experts generally forecast faster AI progress than the public across most outcomes, and LEAP will continue to track the evolution of expert and public opinion, especially as the technology begins to be more front-of-mind for the public.

Lastly, we observe consistency in predictions between superforecasters and experts; in the instances where their views diverge, experts tend to predict somewhat faster AI progress. Importantly, we also find substantial consistency across our four categories of experts: computer scientists, economists, industry professionals, and policy professionals. The forthcoming resolution of near-term predictions will reveal whether specialized domain knowledge or general forecasting skill proves more valuable for predicting AI trajectories—a question with significant implications for weighing different sources of expertise in technology policy decisions.

Although we designed LEAP to overcome the major challenges that confront AI forecasting efforts, there are still some clear limitations of this work.

First, it is difficult to generate a sample frame that is representative of any key group of experts, and nonresponse bias is difficult to avoid, potentially biasing results. We construct comprehensive sampling frames of experts to minimize coverage bias; however, it is possible that there are experts who hold views different from those found in our frames. We reweight our data based on frame demographics to reduce nonresponse bias, but our set of target variables might not capture all the variation in opinion. These sources of bias may affect the representativeness of our results.

In addition, while we have taken great care in constructing clear, specific, and resolvable questions, some questions contain inherent ambiguity, and for others, the discontinuation or change of a data source may preclude resolution. It is also possible that, going forward, attrition will affect our ability to measure how views on AI progress and diffusion change over time; indeed, leading research organizations often experience high attrition and low response rates (NORC AmeriSpeak 2024). We will monitor attrition from LEAP to ensure sufficient sample sizes in future waves as well as to understand if attrition may be biasing our results.

Finally, participants typically have limited time for surveys. LEAP addresses this through three strategies: providing historical context and background for each question, offering interactive interfaces with baseline data to streamline forecasting, and providing significant compensation to participants for their time. These measures contribute to considerable effort by participants: the median expert took 44 minutes per survey, the median member of the public 29 minutes, and the median superforecaster 90 minutes. However, sharing background and baseline information among all participants reduces their independence and may dilute the wisdom-of-the-crowd effect, creating correlated forecasts or echo chambers. Additionally, it remains possible that some participants will not put effort into reporting their true beliefs on each question, speeding through the survey because of time constraints or lack of interest. Unlike previous studies that used fixed payments regardless of accuracy, LEAP employs proper scoring rules that link compensation to forecast quality based on clear resolution criteria, in an effort to reduce this risk.
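To illustrate how a proper scoring rule can reward honest quantile forecasts, the sketch below uses the quantile (pinball) loss, a standard proper scoring rule for quantiles. This section does not state which rule LEAP uses or how scores map to payments, so the choice of pinball loss, the function names, and the example numbers are assumptions.

```python
def pinball_loss(forecast, outcome, quantile):
    """Quantile (pinball) loss for one quantile forecast. Lower is better."""
    if outcome >= forecast:
        return quantile * (outcome - forecast)
    return (1.0 - quantile) * (forecast - outcome)


def score_quantile_forecasts(forecasts, outcome):
    """Mean pinball loss across quantile levels.

    forecasts: {quantile level: forecast value}, e.g. {0.25: 5, 0.5: 7, 0.75: 10}
    """
    losses = [pinball_loss(value, outcome, q) for q, value in forecasts.items()]
    return sum(losses) / len(losses)


# Hypothetical 25th, 50th, and 75th percentile forecasts and a resolved value.
print(score_quantile_forecasts({0.25: 5.0, 0.5: 7.0, 0.75: 10.0}, outcome=8.0))
```

Because a forecaster minimizes expected pinball loss by reporting their true quantile, tying compensation to a rule of this kind removes the incentive to shade forecasts or to answer carelessly.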

LEAP will continue to explore important questions regarding the future of AI. Public, high-profile proclamations about the technology are not necessarily representative of expert opinion, and we will search for agreement and disagreement among experts, the general public, and professional forecasters. As LEAP forecasting questions begin to resolve as early as the end of 2025, we will assess how short-run accuracy on AI-related questions correlates with long-run AI-related beliefs as we try to bring clarity to the many current high-stakes debates about AI.

Notes

Forecasters are denoted “superforecasters” if they (1) were in the top 2% of the accuracy distribution in a given year of the Intelligence Advanced Research Projects Activity (IARPA) Aggregative Contingent Estimation (ACE) tournament (IARPA ACE Program n.d.; Mellers et al. 2014) or (2) were highly accurate performers on Good Judgment Open, an online continuous geopolitical forecasting tournament. Good Judgment Inc., which runs Good Judgment Open, then adds these top forecasters to the “superforecaster” pool. Most superforecasters come from the first selection criterion. Mellers et al. (2015) find persistent performance of these superforecasters across several years of geopolitical forecasting. ↩︎
Expert rationales averaged 92 words, with 25% of expert rationales exceeding 100 words and 8% exceeding 200 words. Among superforecasters, 47% of rationales exceeded 100 words and 23% exceeded 200 words. ↩︎
If not otherwise stated, we report values from the 50th percentile forecasts given by each expert. We elaborate on the use of quantile forecasting in Monthly Surveys and Forecasting Questions. ↩︎
Respondents were shown a historical baseline value of 2%, based on an earlier version of the cited paper. A new version of the working paper estimates a range of 1.6% to 6.6%. We select the midpoint, 4.1%, as the historical baseline value. ↩︎
The median expert expects electricity consumption used for AI to rise to 12% by 2040. ↩︎
See Appendix E.II., 4. Electricity Consumption for information on the baseline estimate of 1.0%. ↩︎
Regarding their private AI investment indicator, Our World in Data (2025) notes: 1. “The data likely underestimates total global AI investment, as it only captures certain types of private equity transactions, excluding other significant channels and categories of AI-related spending;” 2. “The source does not fully disclose its methodology and what’s included or excluded. This means it may not fully capture important areas of AI investment, such as those from publicly traded companies, corporate internal R&D, government funding, public sector initiatives, data center infrastructure, hardware production, semiconductor manufacturing, and expenses for research and talent.” More details on what is likely excluded can be found at Our World in Data (2025). ↩︎
The FrontierMath benchmark consists of math problems that resemble those a math PhD student might spend several days solving. ↩︎
This estimate of 23% reflects the fraction of experts whose median forecast is that AI systems will achieve performance of at least 90% (which we call saturation) on Tiers 1–3 of FrontierMath. We take the average of the proportions calculated under weak and strict inequality. ↩︎
The Millennium Prize Problems are seven mathematical problems identified by the Clay Mathematics Institute in 2000 as the most important unsolved questions in mathematics (Clay Mathematics Institute n.d.). Only one has been solved to date. ↩︎
Respondents were shown a historical baseline value of 2%, based on an earlier version of the cited paper. A new version of the working paper estimates a range of 1.6% to 6.6%. We select the midpoint, 4.1%, as the historical baseline value. ↩︎
A forecaster expects the realized outcome to be below their 25th percentile forecast in just 25% of cases, compared to 50% of cases for a median forecast. We additionally ask for a 75th percentile forecast whenever we collect a 25th percentile forecast. ↩︎
Forecasters are asked what percentage of sales revenue from recently FDA-approved drugs will come from those discovered using AI methods available after 2022. See Appendix E.II. 3. Drug Discovery. ↩︎
Note that for two categorical questions about overall AI scenarios, we report averages instead of medians. Since respondents assign probabilities that sum to 100%, we use average aggregation to maintain this property. ↩︎
To assess the extent to which low-effort or relatively lower comprehension from the public could drive these results, we compare members of our public with high levels of forecasting accuracy in other studies to those with low accuracy. We do not find that one group systematically expects more or less progress. Public Accuracy Stratification details this analysis. ↩︎
While surveys show that fewer than half of Americans report using AI products (NORC 2025), nearly all (99%) actually use AI-enabled tools like navigation apps, streaming services, and social media weekly (Maese 2025); this gap reveals that AI has become ubiquitous in some applications, but is sometimes invisible to users. ↩︎
AI is already accelerating scientific discoveries across a wide range of disciplines such as medicine and materials science (Dai et al. 2025; Kay 2025; Russell et al. 2023; Stanford Medicine News Center 2025; Sundermier 2024). While some of these claims may be overblown or overstated to encourage media attention, AI is certainly affecting many aspects of science and research. ↩︎
Public figures have predicted extensive job destruction; for example, Jamie Dimon said at Fortune’s Most Powerful Women summit: “[AI] will eliminate jobs. People should stop sticking their head in the sand” (Gerut 2025). Nevertheless, overall employment effects of AI remain small (Chandar 2025; Gimbel et al. 2025; Eckhardt and Goldschlag 2025). The impact of rapid technological change on jobs is ambiguous in the short run and neutral in the longer run under standard economic models (e.g., Aghion, Jones, and Jones 2018; Agrawal, Gans, and Goldfarb 2019). Some recent evidence suggests that AI may be contributing to significant declines in entry-level hiring for workers in especially AI-exposed occupations (Brynjolfsson, Chandar, and Chen 2025), with those workers experiencing increased wages. This could suggest a negative supply shock, rather than a negative demand shock from substitution to AI. The ambiguity of technological change arises because automating some human processes often augments others (Agrawal, Gans, and Goldfarb 2023). ↩︎
Jason Furman, in an interview with Ross Douthat, estimates that 92% of the increase in demand in the U.S. in the first half of 2025 comes from categories related to AI investment and services (information processing equipment and software). Accounting for equilibrium effects, Furman posits that roughly half of GDP growth is from the AI boom (Douthat 2025). This view is echoed by Karen Dynan, an economics professor at Harvard University, who argues, “in a mechanical sense it’s fair to say that AI has been the main driver of U.S. GDP growth this year” (Curran and Niquette 2025). ↩︎
Demis Hassabis: “I think one day maybe we can cure all disease with the help of AI.” (Hassabis 2025). ↩︎
“There Will Be Very Hard Parts like Whole Classes of Jobs Going Away” (Altman 2025). ↩︎
“[A] dream scenario—perhaps a goal to aim for—would be 20% annual GDP growth rate in the developing world” (Amodei 2024). ↩︎
Forecasters are denoted “superforecasters” if they (1) were in the top 2% of the accuracy distribution in a given year of the IARPA ACE tournament (IARPA ACE Program n.d.; Mellers et al. 2014) or (2) were highly accurate performers on Good Judgment Open, an online continuous geopolitical forecasting tournament. Good Judgment Inc., which runs Good Judgment Open, then adds these top forecasters to the “superforecaster” pool. Most superforecasters come from the first selection criterion. Mellers et al. (2015) find persistent performance of these superforecasters across several years of geopolitical forecasting. ↩︎
“Representative” is a contested, and sometimes fraught concept in social science (Chasalow and Levy 2021). Here, we adopt a narrow definition for our expert sample. We specify an expert sampling frame that we believe closely tracks groups of experts that policymakers are most inclined to consult about AI. We then apply standard survey reweighting methods to combat nonresponse bias, reweighting our respondent sample to match the observable characteristics in our initial sampling frame. For our public sample, we use known characteristics of the U.S. population as our reweighting targets. See Reweighting for more details and a discussion of limitations of this approach. ↩︎
The Monthly Surveys and Forecasting Questions section contains further detail on resolution criteria. ↩︎
As typically specified, the “wisdom of the crowd” phenomenon appeals to independent and unbiased judgments, which lead the aggregate to outperform randomly selected components of the aggregate (Davis-Stober et al. 2014). An extensive literature develops improved aggregation approaches (Baron et al. 2014; Himmelstein, Budescu, and Han 2023; Himmelstein, Budescu, and Ho 2023), but a key theme of that work is that a simple median performs quite well as an aggregation mechanism across contexts. ↩︎
LEAP identifies distinct target populations (experts, superforecasters, and the general public) whose views we wish to study. We carefully construct sampling frames, or the part of the target populations that have a chance of being sampled, using multiple data sources to maximize the coverage of these populations. See Sampling for more information on how sampling frames are constructed. ↩︎
Top citations in AI publications have also been used to target experts (Muller and Bostrom 2014), as well as a broader search of the literature using targeted publication classification codes (O’Donovan et al. 2025). ↩︎
Participation from the general public in these types of surveys relies on well-established panels (McClain et al. 2025), online opinion polling platforms (Zhang and Dafoe 2019), or the identification of non-experts with a demonstrated interest in AI (Walsh 2017). ↩︎
These individuals, labeled “superforecasters,” are distinguished by their ability to sustain high-accuracy forecasts and “avoid regression to the mean” across multiple prediction instances (Mellers et al. 2015). This was demonstrated by findings from three consecutive years of geopolitical forecasting tournaments conducted by the Good Judgment Project under the U.S. Intelligence Advanced Research Projects Activity (IARPA). ↩︎
Muller and Bostrom (2014) find experts working in theoretical AI were more likely to respond and more likely to be concerned about the negative effects of AI. Grace et al. (2018) find respondents to have less time in the field and lower citation indexes which in turn was associated with more optimistic views on the timing of human-level machine intelligence (HLMI). O’Donovan et al. (2025) find an association between views on AI safety and governance and respondent views on the inevitability of HLMI, as well as an association between their categorization of respondents as AI optimists and the inevitability of AI. ↩︎
A panel survey repeatedly collects data from the same group of respondents over time, and LEAP is the first continuous panel survey of AI experts. We have identified one other repeated sample of AI experts in the literature: Zhang et al. (2022) and Stein-Perlman and Grace (2022) conduct “matched panel” analyses by matching their respondents to earlier responses in Grace et al. (2018). ↩︎
Nevertheless, the creation of unambiguous forecasting questions remains difficult. We dropped one question on AI use during K-12 instructional hours from this analysis, due to substantial misinterpretations. ↩︎
Some of the cited studies extrapolate forecasts backwards to obtain estimates for earlier dates, but these methods require parametric assumptions and generally do not allow for discontinuities in the time paths of forecasts. ↩︎
Forecasters are denoted “superforecasters” if they (1) were in the top 2% of the accuracy distribution in a given year of the IARPA ACE tournament (IARPA ACE Program n.d.; Mellers et al. 2014) or (2) were highly accurate performers on Good Judgment Open, an online continuous geopolitical forecasting tournament. Good Judgment Inc., which runs Good Judgment Open, then adds these top forecasters to the “superforecaster” pool. Most superforecasters come from the first selection criterion. Mellers et al. (2015) find persistent performance of these superforecasters across several years of geopolitical forecasting. ↩︎
Highly engaged participants were recruited from previous high-effort projects spanning multiple weeks. ↩︎
Pew describes “raking,” also known as iterative proportional fitting, as the most common approach to reweighting public opinion surveys: “For public opinion surveys, the most prevalent method for weighting is iterative proportional fitting, more commonly referred to as raking” (Mercer et al. 2018). ↩︎
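As a rough illustration of the raking procedure named in the note above, the following sketch repeatedly rescales unit-level weights so that each variable's weighted margins match a set of target shares. It is not the LEAP weighting code: the function names, variables, categories, and target shares are all hypothetical, and the fixed iteration count stands in for a proper convergence check.

```python
def rake(rows, targets, iterations=50):
    """Iterative proportional fitting (raking) on unit-level weights.

    rows: list of dicts mapping variable name -> category.
    targets: dict mapping variable name -> {category: target share}.
    Returns one weight per row, scaled to sum to len(rows).
    """
    weights = [1.0] * len(rows)
    for _ in range(iterations):
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted share of each category of this variable.
            current = {cat: 0.0 for cat in shares}
            for row, w in zip(rows, weights):
                current[row[var]] += w / total
            # Rescale so this variable's margins match its targets.
            weights = [w * shares[row[var]] / current[row[var]]
                       for row, w in zip(rows, weights)]
    scale = len(rows) / sum(weights)
    return [w * scale for w in weights]


def weighted_share(rows, weights, var, cat):
    """Weighted share of respondents with row[var] == cat."""
    total = sum(weights)
    return sum(w for row, w in zip(rows, weights) if row[var] == cat) / total


# Hypothetical respondents and population targets (illustration only).
rows = [
    {"age": "18-44", "party": "A"},
    {"age": "18-44", "party": "A"},
    {"age": "18-44", "party": "B"},
    {"age": "45+", "party": "A"},
    {"age": "45+", "party": "B"},
    {"age": "45+", "party": "B"},
]
targets = {
    "age": {"18-44": 0.4, "45+": 0.6},
    "party": {"A": 0.5, "B": 0.5},
}
weights = rake(rows, targets)
print(weighted_share(rows, weights, "age", "18-44"))
print(weighted_share(rows, weights, "party", "A"))
```

Each pass fixes one variable's margins at a time, which can disturb the margins fixed on the previous pass; cycling through the variables shrinks that disturbance until all margins agree with the targets.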
Effective Altruism (EA) is a philosophical and social movement focused on directing resources towards improving the world. It is not a monolithic group. Many effective altruists strongly disagree about how best to direct resources towards improving the world, the philosophical framework that determines what an ‘improvement’ means, the risks one should take to improve the world, and the focus one should place on the short- and long-run in pursuit of improving the world. We consider someone EA-affiliated if they or their employer have or have had funding ties to EA, they publicly endorse EA, or they self-identify as EA-affiliated or EA-adjacent. Further, if their work focuses on EA, AI safety, global catastrophic risks, or existential risks, we consider them to be EA-affiliated. These criteria are permissive and favor overinclusion, as we chose to measure an upper bound of EA affiliation. Nevertheless, the proportion of our frame tagged as having EA affiliation remains relatively small, at 14.3%. ↩︎
We define “top AI labs” as the 10 unique labs which provide the 20 most computationally intensive models (in terms of training FLOP) on Epoch AI’s Data on AI Models table (Epoch AI 2024b). ↩︎
As seen in the table below, we split our expert population into four categories of expertise: Computer Science, Economics, Industry and Policy. For the purposes of reweighting, we equally weight these categories. Our initial frame slightly overrepresented Computer Science and Industry professionals and underrepresented Economics and Policy professionals. ↩︎
We use the IPUMS USA combined Census and American Community Survey (ACS) data to derive population targets for all variables except party identification (Ruggles et al. 2025). For party identification, we use data from Pew Research Center (Nadeem 2024). ↩︎
Note that the actual proportion of participants invited from Industry was 36%. For the purposes of reweighting, each respondent category of expertise was weighted equally as discussed above. ↩︎
These respondents are either among the 200 top-cited authors according to OpenAlex or part of our age-stratified CS author list. ↩︎
RePEc provides research rankings for economics schools and journals, based on publication records. We consider authors at an institution hosting one of the top-50 economics departments according to RePEc. ↩︎
The percentage decreases in completions from Wave 2 to Wave 3, by category, were as follows: Economics: 1.7%, Superforecasters: 3.5%, Public: 3.8%, Computer Science: 4.7%, Industry: 5.3%, Policy: 7.2%. ↩︎
The survey platform records the active time spent on the survey window on respondents’ devices. If they leave the page (even to do additional research relevant to the survey), the timer is paused. Hence, this measure likely serves as a lower bound for time spent on the survey. ↩︎
We provide expert participants with $2,000 for each year of full participation (prorated by the number of surveys they complete, with an expectation that we will complete 12 surveys in a typical year). In other words, experts receive $166.67 per survey completed. We provide superforecasters with $1,000, prorated similarly, or $83.33 per survey completed. We pay public participants in line with CloudResearch platform norms, or $8 per survey completed ($13.71 per hour). ↩︎
Respondents are prompted to give their forecasts in the form of probabilities. ↩︎
Respondents are prompted to give their forecasts in the form of 25^th, 50^th, and 75^th percentiles. ↩︎
The pooled distribution is not necessarily the optimal way to aggregate forecasts in terms of forecasting accuracy. For example, Ranjan and Gneiting (2010) show that combining well-calibrated forecasts in this fashion yields forecasts that are miscalibrated. Similar results are reported by Lichtendahl et al. (2013). ↩︎
In future work, we will summarize forecaster-level gaps between 25^th and 75^th percentile forecasts. ↩︎
Pooled distribution: IQR (7.3%–34.6%); variance decomposition: 47% between–forecaster disagreement, 53% within–forecaster uncertainty.
Raw data: IQR on the 50^th percentile was (9%–30%); median 25^th and 75^th percentile forecasts were 9% and 28% respectively.
Uncertainty and disagreement metrics on other claims made in this list can be found in the Monthly Reports at https://leap.forecastingresearch.org/reports/. ↩︎
Respondents were shown a historical baseline value of 2%, based on an earlier version of the cited paper. The most recent draft gives a range of 1.6% to 6.6%. We select the midpoint, 4.1%, as the historical baseline value. ↩︎
Sentences of the form, “The median expert gives an X% chance,” report the median of experts’ X^th percentile forecasts. ↩︎
Regarding their private AI investment indicator, Our World in Data (2025) notes: 1. “The data likely underestimates total global AI investment, as it only captures certain types of private equity transactions, excluding other significant channels and categories of AI-related spending;” 2. “The source does not fully disclose its methodology and what’s included or excluded. This means it may not fully capture important areas of AI investment, such as those from publicly traded companies, corporate internal R&D, government funding, public sector initiatives, data center infrastructure, hardware production, semiconductor manufacturing, and expenses for research and talent.” More details on what is likely excluded can be found at Our World in Data (2025). ↩︎
This estimate of 23% reflects the fraction of experts whose median forecast is that AI systems will achieve performance of at least 90% (which we call saturation) on Tiers 1-3 of FrontierMath. We take the average of the proportions calculated under weak and strict inequality. ↩︎
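The averaging described in the note above can be sketched as follows. The function name and the forecasts are hypothetical; only the 90% threshold and the weak/strict averaging come from the note.

```python
def saturation_share(median_forecasts, threshold=90.0):
    """Average of the shares of forecasts >= threshold and > threshold."""
    n = len(median_forecasts)
    weak = sum(1 for f in median_forecasts if f >= threshold) / n
    strict = sum(1 for f in median_forecasts if f > threshold) / n
    return (weak + strict) / 2


# Hypothetical median forecasts (percent accuracy) from four experts.
# Weak inequality counts 90 and 95 (2/4); strict counts only 95 (1/4).
print(saturation_share([50.0, 90.0, 95.0, 80.0]))  # 0.375
```

Averaging the two proportions gives half weight to forecasts that sit exactly on the threshold, which matters when many respondents report round numbers such as 90%.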
Raw data: IQR on the 50^th percentile was (10%–30%). ↩︎
See Appendix E.I. Survey Questions: Wave 1 for background information. We ask participants the probability that LEAP panelists will choose “slow progress,” “moderate progress,” or “rapid progress” as best matching the general level of AI progress. ↩︎
See Appendix E.I. Survey Questions: Wave 1 for background information. ↩︎
Raw data: IQR on the 50^th percentile was (-4%–5%). ↩︎
A complementary survey exploring these topics is currently in the field; we plan to release its results in early 2026. ↩︎
See Appendix E.I. Survey Questions: Wave 1, Question 5, Technological Richter Scale for details. ↩︎
The median forecast for this question was 25%.
Pooled distribution: IQR (8.4%–53.8%); variance decomposition: 52% between–forecaster disagreement, 48% within–forecaster uncertainty.
Raw data: IQR on the 50^th percentile was (10%–50%); median 25^th and 75^th percentile forecasts were 10% and 43% respectively. ↩︎
The median forecast for this question was 60%. ↩︎
See https://leap.forecastingresearch.org/reports/ to access these tables. ↩︎
Raw data: The median forecast for this question was 20%. IQR on the 50^th percentile was (10%–30%). ↩︎
See Appendix E.I. Survey Questions: Wave 1 for background information. We ask participants the probability LEAP panelists will choose “slow progress,” “moderate progress,” or “rapid progress” as best matching the general level of AI progress. ↩︎
Raw data: IQR on the 50^th percentile was (-4%–5%). ↩︎
In a future survey wave, we plan to collect forecasts of the predicted relationship between AI capabilities and employment growth in each sector by asking respondents to forecast employment growth conditional on low-, moderate-, and rapid-progress scenarios. ↩︎
Raw data: IQR on the 50^th percentile was (30%–81%). ↩︎
Raw data: IQR on the 50^th percentile was (10%–50%). ↩︎
The degree to which progress on Millennium Prize Problems is serial or parallel, as well as the general difficulty of the Problems, complicates this comparison. Eliciting forecasts from multiple experts on consistent forecasting questions with clear resolution criteria helps us bring clarity to debates often plagued by ambiguous definitions. ↩︎
Questions include FrontierMath, Autonomous Vehicle Trips, Millennium Prize, Diffusion of AI Across Sciences, Drug Discovery, Electricity Consumption, Cognitive Limitations, AI Investment, Generative AI Use Intensity, Open vs Proprietary Polarity, AI Companions, Barriers to Adoption, General AI Progress, and Technological Richter Scale. ↩︎
Forecasting questions with clear valence have an unambiguous directional association with progress. For some questions, like employment by sector, it is unclear whether higher or lower levels of unemployment would be associated with more advanced or less advanced AI progress, so we exclude those questions from this analysis. We also transform some forecasts to establish the progress valence. First, we take the average across all fields in Diffusion of AI Across Sciences. Second, for Cognitive Limitations and Barriers to Adoption, we average across all categories and use the complementary probability. Third, we take the average of closed- and open-weight performance for Open vs Proprietary Polarity. Lastly, we take the values assigned to the “Rapid” scenario and TRS levels 8 and above for the General AI Progress and Technological Richter Scale questions, respectively. ↩︎
We use Mann–Whitney U tests for equality in distribution unless otherwise stated, with a 5% significance threshold. All Mann–Whitney U tests and Cliff’s δ values are currently unweighted. ↩︎
We say that a group predicts statistically significantly less progress when a Mann–Whitney U test rejects equality in distribution and Cliff’s δ is negative. ↩︎
Cliff’s δ performs pairwise comparisons between all values of two empirical distributions. It takes the number of comparisons where the value from the first distribution exceeds the second and subtracts the number of comparisons where the value from the second distribution exceeds the first and reports this difference as a proportion of the count of comparisons. In other words, it reports the probability that a randomly drawn value from the first distribution exceeds a randomly drawn value from the second distribution, over and above what would be expected by pure random chance if the two distributions were identical. ↩︎
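A minimal sketch of the pairwise calculation described in the note above, using only the Python standard library. The function name and the sample forecasts are hypothetical, and the calculation is unweighted.

```python
from itertools import product


def cliffs_delta(first, second):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / #pairs."""
    if not first or not second:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x, y in product(first, second) if x > y)
    less = sum(1 for x, y in product(first, second) if x < y)
    return (greater - less) / (len(first) * len(second))


# Hypothetical forecasts (in percent) for two illustrative groups.
group_a = [10, 20, 30, 40]
group_b = [5, 15, 20, 25]

# Of 16 pairs, 11 have a > b, 4 have a < b, and 1 is tied.
print(cliffs_delta(group_a, group_b))  # 0.4375
```

The statistic ranges from -1 (every value in the first sample is below every value in the second) to 1 (the reverse). Swapping the two samples flips its sign, so the choice of first distribution determines the direction of the reported effect.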
See https://leap.forecastingresearch.org/reports/ to access these tables. ↩︎
Pooled distribution: IQR (6.9%–46.1%); variance decomposition: 47% between–forecaster disagreement, 53% within–forecaster uncertainty. Raw data: IQR on the 50^th percentile was (10%–40%); median 25^th and 75^th percentile forecasts were 8% and 35% respectively. ↩︎
Pooled distribution: IQR (4%–31%); variance decomposition: 66% between–forecaster disagreement, 34% within–forecaster uncertainty.
Raw data: IQR on the 50^th percentile was (5%–29%); median 25^th and 75^th percentile forecasts were 5% and 20% respectively.
Uncertainty and disagreement metrics on other claims made in this list can be found in the Monthly Reports at https://leap.forecastingresearch.org/reports/. ↩︎
“To gauge the difficulty of FrontierMath problems, we organized a competition at MIT involving around 40 exceptional math undergraduates and subject-matter experts. Participants formed eight teams of four or five members, each with internet access, and had four and a half hours to solve 23 problems. On a subset of 23 tier 1-3 problems, the average team scored 19%, while 35% of the problems were solved collectively across all teams.” (Epoch AI 2025). ↩︎
This estimate of 23% reflects the fraction of experts whose median forecast is that AI systems will achieve performance of at least 90% (which we call saturation) on Tiers 1–3 of FrontierMath. We take the average of the proportions calculated under weak and strict inequality. ↩︎
Physics: 32%; Materials Science: 37%; Medicine: 37%. ↩︎
Physics: 27%; Materials Science: 30%; Medicine: 30%. ↩︎
For this comparison, we switch to reporting Cliff’s δ values calculated with the public as the first distribution. ↩︎
The median forecast was 23%. Raw data: IQR on the 50^th percentile was (12%–35%). ↩︎
The median forecast was 20%. Raw data: IQR on the 50^th percentile was (10%–30%). ↩︎
Pooled distribution: IQR (6.9%–46.2%); variance decomposition: 47% between–forecaster disagreement, 53% within–forecaster uncertainty.
Raw data: IQR on the 50^th percentile was (10%–40%); median 25^th and 75^th percentile forecasts were 8% and 35% respectively. ↩︎
Pooled distribution: IQR (3%–25%); variance decomposition: 48% between–forecaster disagreement, 52% within–forecaster uncertainty.
Raw data: IQR on the 50^th percentile was (3%–25%); median 25^th and 75^th percentile forecasts were 4% and 20% respectively.
Uncertainty and disagreement metrics on other claims made in this paragraph can be found in the Monthly Reports at https://leap.forecastingresearch.org/reports/. ↩︎
See https://leap.forecastingresearch.org/reports/ to access these tables. ↩︎
Experts were asked, “What will be the highest percentage accuracy achieved by an AI model on FrontierMath, by Jan 1 of 2026, 2028, and 2031?” ↩︎
Actual progress was marginally slower. The top Tier 1-3 accuracy rate rose from 1.03% in June of 2024 to 29% in August of 2025, where it remained as of the publication of this paper. ↩︎
Experts were asked, “What percentage of U.S. ride-hailing trips will be provided by autonomous vehicles that are classified SAE Level 4 or above in the years 2027 and 2030?” ↩︎
Experts were asked, “What will the percent change in the number of jobs (compared to Jan 1, 2025) in the U.S. be for white-collar, blue-collar, and service occupations, by Jan 1 of 2028 and 2031?” ↩︎
Experts were presented with three scenarios that detailed the development of AI capabilities and asked, “At the end of 2030, what percent of LEAP panelists will choose “slow progress,” “moderate progress,” or “rapid progress” as best matching the general level of AI progress?” ↩︎
The actual rate of improvement likely falls within this range, especially given recent acceleration trends, but there is considerable uncertainty and domain variability. See https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ for more information. ↩︎
Nate Silver’s book “On the Edge” proposes the technological Richter scale (TRS), which, by analogy with earthquake magnitudes, rates the impact of technologies on a roughly logarithmic scale, with 10 representing the greatest impact (Silver 2024). Experts were asked, “At the end of 2040, what is the probability for AI achieving the following levels of net impact [ranging from 5-10] on human society as compared to the impact of past technological events?” ↩︎
Experts were asked, “What is the probability that AI will solve or substantially assist in solving a Millennium Prize Problem in mathematics by Dec 31 of 2027, 2030, and 2040?” ↩︎
Experts were asked, “What percent of publications in the fields of Physics, Materials Science, and Medicine in 2030 will be ‘AI-engaged’ as measured in a replication of this study?” (Duede et al. 2024) ↩︎
Experts were asked, “What percent of sales of recently approved U.S. drugs will be from AI-discovered drugs and products derived from AI-discovered drugs in the years 2027, 2030 and 2040?” ↩︎
This was true at the time this expert completed the survey. ↩︎
Experts were asked, “What percent of U.S. electricity consumption will be used for training and deploying AI systems in the years 2027, 2030 and 2040?” ↩︎
Experts were asked, “What will be the global private investment (in billion USD) in AI in the years 2027 and 2030?” ↩︎
Experts were asked, “What proportion of U.S. adults will self-report using AI for companionship at least once daily by Dec 31 of 2027, 2030, and 2040?” ↩︎
This claim may refer to the ~23% of U.S. adults who, according to a 2024 KFF (formerly Kaiser Family Foundation) study, “say they received mental health counseling and/or prescription medication for mental health concerns in the last year.” See Panchal and Lo (2024). ↩︎
Link provided to participants: https://www.claymath.org/millennium-problems/ ↩︎
Raw data: IQR on the 50^th percentile was (3.0%–20.0%). 90^th percentile of median forecast: 44.5%. ↩︎
Raw data: IQR on the 50^th percentile was (10.0%–50.0%). 90^th percentile of median forecast: 65.4%. ↩︎
Raw data: IQR on the 50^th percentile was (30.3%–80.8%). 90^th percentile of median forecast: 95.0%. ↩︎
The expert is referring to and quoting from Ansede (2025). ↩︎
The expert is referring to MacKenzie (1999). ↩︎
The expert appears to be referring to an August 2025 post by OpenAI researcher Sebastien Bubeck (Bubeck 2025). ↩︎
Recall from Public Sampling that our public sample consists largely of highly engaged past participants in FRI projects. ↩︎
Examples of questions include: (1) What will be the National Average Temperature Rank for May 2025 in the contiguous United States, according to NOAA’s Climate Data Center, where 1 is the coolest and 130 is the highest rank representing the warmest on record?; and (2) What will be the closing stock price of Meta on 30 May 2025? ↩︎
Survey: https://airtable.com/appGCchUyUTPvT90e/pagPpnUX2SiiTUNpp/form. Project Team Contact: leap@forecastingresearch.org. ↩︎
See https://leap.forecastingresearch.org/forecasts. ↩︎