Abstract
We test a new process for generating high-value forecasting questions: asking experts to produce “conditional trees,” simplified Bayesian networks of quantifiably informative forecasting questions. We test this technique in the context of the current debate about risks from AI. We conduct structured interviews with 21 AI domain experts and 3 highly skilled generalist forecasters (“superforecasters”) to generate 75 forecasting questions that would cause participants to significantly update their views about AI risk. We elicit the “Value of Information” (VOI) each question provides for a far-future outcome (whether AI will cause human extinction by 2100) by collecting conditional forecasts from superforecasters (n=8).1 In a comparison with the highest-engagement AI questions on two forecasting platforms, the average question generated by the conditional trees method and resolving in 2030 was nine times more informative than the comparison AI-related platform questions (p = .025). This report provides initial evidence that structured interviews of experts focused on generating informative cruxes can produce higher-VOI questions than status quo methods.
Acknowledgments
This research would not have been possible without the generous support of Open Philanthropy. We thank the research participants for their invaluable contributions. We greatly appreciate the assistance of Page Hedley, Kayla Gamin, Leonard Barrett, Coralie Consigny, Adam Kuzee, Arunim Agrawal, Bridget Williams, and Taylor Smith in compiling this report. Additionally, we thank Benjamin Tereick, Javier Prieto, Dan Schwarz, and Deger Turan for their insightful comments and research suggestions.
Executive summary
Introduction
From May 2022 to October 2023, the Forecasting Research Institute (FRI)2 experimented with a new method of question generation (“conditional trees”). While the questions elicited in this case study focus on potential risks from advanced AI, the processes we present can be used to generate valuable questions across fields where forecasting can help decision-makers navigate complex, long-term uncertainties.
Methods
Researchers interviewed 24 participants, including 21 AI and existential risk experts and three highly skilled generalist forecasters (“superforecasters”). We first asked participants to provide their personal forecast of the probability of AI-related extinction by 2100 (the “ultimate question” for this exercise).3 We then asked participants to identify plausible4 indicator events that would significantly shift their estimates of the probability of the ultimate question.
Following the interviews, we converted these indicators into 75 objectively resolvable forecasting questions. We asked superforecasters (n=8) to provide forecasts on each of these 75 questions (the “AICT” questions), and forecasts on how their beliefs about AI risk would update if each of these questions resolved positively or negatively. We quantitatively ranked the resulting indicators by Value of Information (VOI), a measure of how much each indicator caused superforecasters to update their beliefs about long-run AI risk.
To evaluate the informativeness of the conditional trees method relative to widely discussed indicators, we assess a subset of these questions using a standardized version of VOI, comparing them to popular AI questions on existing forecasting platforms (the “status quo” questions). The status quo questions were selected from two popular forecasting platforms by identifying the highest-engagement AI questions (by number of unique forecasters). We present the results of this comparison in order to provide a case study of a beginning-to-end process for producing quantitatively informative indicators about complex topics. (More on methods)
Results
The conditional trees method can generate forecasting questions that are more informative than existing questions on popular forecasting platforms5
Our report presents initial evidence that structured interviews of experts produce more informative questions about AI risk than the highest-engagement questions (as measured by unique users) on existing forecasting platforms.
Using predictions made by superforecasters (n=8), we compared the status quo questions to a subset of the AICT questions.6 Most of the AICT questions (nine of 13) scored higher on VOI than all 10 status quo questions.7
VOI is based on each respondent’s expected update in their belief about the ultimate question, not on how much a participant would update if an event happened. That is, it takes into account how likely the forecaster believes an event is to occur. If an event would result in a large update to a participant’s forecast, but is deemed vanishingly unlikely to occur, it would have a small VOI. If an event would result in a large update, and is also considered likely to occur, it would have a high VOI.
Table E.1 compares the top five AICT questions to the top five status quo questions, as measured by superforecasters’ ratings of a standardized metric of informativeness, which we call “Percentage of Maximum Value of Information” (POM VOI).8 In this table and throughout the report, we refer to questions by their reference numbers. For a full list of the AICT questions and status quo questions selected from forecasting platforms by reference number, with operationalizations and additional information, see Appendix 1.
| Question | Mean POM VOI |
| AI causes large-scale deaths, ineffectual response (CX50) | 6.34% |
| Administrative disempowerment warning shot (CX30) | 3.55% |
| Deep learning revenue (VL30) | 1.68% |
| Power-seeking behavior warning shot (ZA50) | 1.59% |
| Extinction-level pathogens feasible (CQ30) | 1.37% |
| Superalignment success (STQ205 / STQ215)* | 0.28% |
| Kurzweil/Kapor Turing Test longbet (STQ9)* | 0.27% |
| Brain emulation (STQ196)* | 0.23% |
| Human-machine intelligence parity (STQ247)* | 0.14% |
| Compute restrictions (STQ236)* | 0.13% |
* Status quo question selected from an existing forecasting platform.
Focusing on questions resolving in the near-term (by 2030), we found that questions generated with the conditional trees method were, on average, nine times more informative than popular questions from platforms (p = .025). While we did not find a statistically significant result for questions resolving in 2050-2070, in our sample AICT questions were still eleven times more informative on average. (More on VOI comparison)
Questions generated through the conditional trees method emphasized different topics than those on forecasting platforms
We also analyzed the extent to which questions taken from existing forecasting platforms effectively captured the topics raised in our expert interviews. We found that some topics (such as AI alignment-related questions and questions related to concrete AI harms) were of substantial interest to experts but had not received proportional attention on existing forecasting platforms, and that questions generated by the conditional trees method were meaningfully different from those taken from existing forecasting platforms.
The table below compares the topical distribution of the AICT questions to the status quo questions. (More on question uniqueness)
| Category | AICT question set | Status quo question set |
| Social / Political / Economic | 24% (29) | 33% (131) |
| Alignment | 20% (25) | 12% (47) |
| AI harms | 20% (25) | 7% (27) |
| Acceleration | 36% (44) | 48% (191) |
We found weak evidence that superforecasters and experts value different types of questions
Given the small sample sizes involved, we are reluctant to make confident claims about the significance of the difference between the opinions of the superforecasters and the experts. However, we do see these results as providing prima facie evidence about which questions are the most informative for each group when making updates on the probability of AI-related extinction.
Our most notable finding when comparing the views of the superforecasters to those of the experts was that the superforecasters tended to value questions focused on concrete harms caused by AI, whereas the experts tended to prefer questions about advanced AI capabilities or whether AI had been successfully aligned. (More on AI risk takeaways)
Figure E.1 shows examples of how experts updated on the ultimate question conditional on three of the highest-VOI indicator questions.

The table below provides more detail on each of the questions in the previous figure.
| Question | Summary |
| Administrative disempowerment warning shot (CX30) | This question resolves positively if any of the following occur by 2030: AI systems in a country freeze assets affecting ≥1% of the population or ≥0.5% of GDP for over a year; prevent ≥5% of voting-age citizens from participating in a national election; restrict >3% of residents’ movement for 3+ months; limit >10% of the population’s internet access for 3+ months; or legally restrict >5% of citizens’ access to certain housing or employment for 1+ year. |
| Power-seeking behavior warning shot (ZA50) | This question resolves positively if, by 2050, AI developers must attempt to disable or destroy an AI (costing >$5M) after it exhibits power-seeking behaviors such as acquiring significant resources, controlling financial accounts, manipulating humans, infiltrating key systems, seizing control of weaponry, self-propagation, or engaging in hacking. |
| No aligned AGI (CX70) | This question resolves positively if there is no AGI system which meets several “alignment” criteria by 2070. Those criteria include AGI-like capabilities (high question answering skill on hard questions in diverse subjects, high programming ability) and alignment (performing well on SOTA alignment tests, and expert surveys indicating that experts believe that AI is aligned). |
The conditional trees method still has disadvantages
While this case study suggests that the conditional trees method can generate informative forecasting questions, a primary limitation of the method as implemented is its high labor cost. The process involved conducting more than 20 interviews with subject matter experts, writing 75 forecasting questions, and eliciting conditional forecasts. In future work, we expect it would typically be more efficient to elicit fewer indicators within a conditional tree and to operationalize only 1-2 forecasting questions per interview before eliciting forecasts. The intensive process described in this case study would be most appropriate for particularly high-value topics with large pools of resources for research. Additionally, it may be possible to use LLMs or incentivized crowdsourcing for the question generation or filtering stages, making the process cheaper and less labor intensive. (More on limitations of our research)
Key takeaways
- Preliminary evidence suggests that the conditional trees method of generating forecasting questions can result in questions that perform better on “Value of Information” metrics than popular questions on existing forecasting platforms.
- The conditional trees method produced questions with a markedly different distribution of topic areas compared to those on existing forecasting platforms. Notably, the conditional trees approach led to a greater proportion of questions focused on AI alignment and potential AI harms, reflecting that certain expert priorities may be underrepresented in existing forecasting efforts.
- In our limited sample, experts tended to find questions related to alignment and concrete harms caused by AI to be the most informative. Superforecasters also found questions relating to concrete AI harms to be informative, but were less likely than experts to find questions relating to alignment to be informative.
- The conditional trees method as implemented in this case study is particularly labor intensive. We expect the most broadly useful versions of this process would take the underlying principles and 1) apply them to shorter interviews with smaller numbers of forecasting questions to operationalize, 2) leverage LLMs for elicitation and synthesis, and/or 3) utilize crowdsourcing at the question generation and filtering steps.
Key outputs
In addition to the above takeaways, we highlight key outputs from the report: the tangible resources developed during the course of the conditional trees process which we believe may be useful to others interested in replicating parts of the process.
- We created a guide and replicable process for using conditional tree interviews to generate informative forecasting questions (see Appendix 6). This process can be implemented by organizations and individuals that need high-quality, informative questions.
- We provide details of relevant metrics (e.g., “Value of Information”) that can be used to assess how informative each generated question is. See our public calculator for “value of information” and “value of discrimination” here.
- In total, the conditional trees process generated 75 new questions relating to AI risk. The full operationalizations and resolution criteria of these questions are available in Appendix 1 of this report. We have posted several of the highest-VOI questions to two forecasting platforms and encourage interested readers to submit their own predictions. (See Appendix 7 for links)
- We used our question metrics to create aggregated conditional trees that visually summarize the most important AI risk pathways according to small samples of experts and generalist forecasters. These aggregated trees can be found here.
Limitations of our research
Limitations of our research include:
- The total number of participants in this study was small (n=8 forecasts on most questions, 24 interviewees to generate questions).
- The forecasting tasks in this study were unusually difficult, involving low probability judgments, long time horizons, conditional forecasts, and “short-fuse forecasts” made very quickly.
- Participants were all either experts who are highly concerned about existential risks from AI or superforecasters who are relatively skeptical, so we are not able to separate differences caused by risk assessment from differences caused by forecasting aptitude, professional training, or other factors.
(More on limitations of our research)
Next steps
Further research related to this topic could include:
- Studies on the same questions with larger numbers of forecasters, including by integrating the questions into existing forecasting platforms.
- Replicating the conditional trees process in domains other than AI risk.
- Following up as questions begin to resolve in 2030 to assess whether forecasters update their views in accordance with their expectations.
Glossary
AI Conditional Trees (AICT) question set
The set of questions generated by the AI Conditional Trees process described in this report.
Conditional tree
A simplified Bayesian network in which each node is an event that may or may not occur, and each connection between nodes carries the factor by which the downstream node becomes more or less likely if the upstream event occurs. In this report, the conditional trees ultimately ask how likely it is that AI causes human extinction by 2100, and each node is an event that affects the likelihood of that ultimate outcome.
Operationalization
The process of making a question about a future event into a resolvable forecasting question. For example, if a prompt said “there is major progress in interpretability by 2030” the operationalized question would contain a specific way to resolve that question so that there can be no future dispute about whether the progress counts as “major.”
Percent of Max (POM)
When we present VOI for a question, we also present the percentage of the maximum VOI (POM VOI) it captured in order to contextualize the magnitude of the results. The POM VOI of a question can be interpreted as the fraction of the uncertainty about the ultimate question U that the question resolves, in expectation.
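As a rough illustration of the normalization idea, the sketch below divides a question's VOI by the VOI of a hypothetical crux that would settle U outright. It assumes VOI is measured as the expected absolute change in P(U), which is only one of the possible formulations; the report's own definition is given in Appendix 4 and may differ. The function names and input numbers are illustrative assumptions.

```python
def voi_expected_update(p_c, p_u_given_c, p_u_given_not_c):
    """Expected absolute change in P(U) once crux C resolves (assumed VOI variant)."""
    # P(U) implied by the law of total probability.
    p_u = p_c * p_u_given_c + (1 - p_c) * p_u_given_not_c
    return p_c * abs(p_u_given_c - p_u) + (1 - p_c) * abs(p_u_given_not_c - p_u)


def max_voi(p_u):
    """VOI of a hypothetical crux that occurs exactly when U occurs."""
    return voi_expected_update(p_c=p_u, p_u_given_c=1.0, p_u_given_not_c=0.0)


def pom_voi(p_c, p_u_given_c, p_u_given_not_c):
    """Share of the maximum attainable VOI captured by this crux (0 to 1)."""
    p_u = p_c * p_u_given_c + (1 - p_c) * p_u_given_not_c
    ceiling = max_voi(p_u)
    if ceiling == 0:
        return 0.0  # No uncertainty about U is left to resolve.
    return voi_expected_update(p_c, p_u_given_c, p_u_given_not_c) / ceiling


# Hypothetical forecaster inputs, not taken from the study data.
example = pom_voi(p_c=0.10, p_u_given_c=0.03, p_u_given_not_c=0.01)
print(f"POM VOI: {example:.1%}")
```

Under this assumption, a crux that resolves U completely scores 100%, and a crux whose outcome leaves P(U) unchanged scores 0%.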
Question prompts
General topics of questions that we then operationalized into forecasting questions. For example, “major progress in interpretability by 2030” could be a question prompt, although it is not a clearly resolvable forecasting question.
Short-fuse forecasts
Very quickly estimated forecasts, in which each participant spent no more than one minute per question and gave a snap judgment.
Status quo questions
Questions on AI that we selected from existing forecasting platforms on the basis of their popularity (largest number of unique users) and other criteria. See 2.3 Selection of status quo questions.
Ultimate question / Ultimate outcome (U)
The “ultimate question” that all of the intermediate questions help predict. In this study: “Will AI cause human extinction by 2100?”
Value of information (VOI)
VOI is a measure of how much knowing the answer to a question would change an individual’s belief, in expectation. This is useful for understanding why individuals believe what they believe and what would change their minds.
1. Introduction
For policymakers to use forecasting in their work, they need accurate forecasts, but—perhaps equally important—the forecasts need to be about decision-relevant questions. Knowing which questions will be the most valuable to forecast on can be difficult. How can policymakers identify the short-term events that are most relevant to important long-term outcomes?
Here we present a tool, the conditional tree method (Figure 1.1.1), which can distill complex issues into a few key uncertainties. We apply it to a topic of increasing public concern: “Will advanced artificial intelligence pose an existential threat to humanity in the 21st century?” Using a specialized interview process, we learn what subject matter experts believe are the best warning signs for this risk in the coming decades. Then we use metrics based on conditional forecasting to quantitatively measure the relevance of these warning signs. This allows us to winnow down to a few highly relevant indicators of increased risk to humanity from AI.

The conditional trees approach10 represents a new set of priorities in the field of forecasting. Most previous forecasting research focused almost exclusively on identifying accurate forecasters and improving forecasting accuracy. But comparatively little work was invested in choosing forecasting targets. In order to mature into a practically applicable body of knowledge, the field must look beyond optimizing forecasts and toward optimizing the questions we ask.
1.1 A method for generating and judging high-value questions
Some forecasting tournaments and platforms have already begun to utilize domain experts to generate questions with real-world relevance. However, many of these efforts are relatively ad hoc, producing inconsistent results and plausibly missing many high-value forecasting targets.
For example, for the Existential Risk Persuasion Tournament (XPT),11 the question preparation phase enlisted domain experts to comment on the prospective question set in a relatively unstructured way. While this undoubtedly improved the question set, it did not identify the most informative questions within the set.
To leverage the expertise of domain experts more fully, we propose a more in-depth, systematic approach: expert elicitation structured around conditional trees.
Why conditional trees?
Conditional trees represent beliefs through a tree-like structure, using nodes to represent events that influence the probability of an ultimate outcome. In the tree in Figure 1.1.2, for example, if you know someone is vaccinated, they are half as likely to be infected as if you were unsure whether they were vaccinated. Then, if you know they have been exposed, they are 3.5x as likely to be infected.12
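The arithmetic of following one branch of such a tree can be sketched as below. This is a minimal illustration that assumes each factor multiplies the probability directly (capped at 1); the baseline infection probability is a made-up value, and only the factors 0.5 and 3.5 come from the example above.

```python
def apply_multipliers(base_probability, multipliers):
    """Follow one branch of a conditional tree, scaling the probability at each node."""
    path = [base_probability]
    for factor in multipliers:
        # Cap at 1.0 so that a large factor cannot produce an invalid probability.
        path.append(min(1.0, path[-1] * factor))
    return path


BASE_INFECTION_RISK = 0.10  # Hypothetical baseline, not taken from the figure.

# Node 1: known to be vaccinated (x0.5); node 2: known to be exposed (x3.5).
path = apply_multipliers(BASE_INFECTION_RISK, [0.5, 3.5])
for label, p in zip(["baseline", "vaccinated", "vaccinated and exposed"], path):
    print(f"{label}: {p:.3f}")
```

With the assumed 10% baseline, the branch moves to 5% after the first node and to 17.5% after the second.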

In this study, the ultimate outcome was the probability of extinction due to AI by 2100, and the nodes are events that make that outcome more or less likely. The tree structure makes the conditional probabilities beneath a forecast explicit and visible, and may help forecasters narrow in on specific, important factors.
Participants initially provided an estimate of the probability of AI-related extinction by 2100 (the “ultimate question”), represented by O in Figure 1.1.2. Interviews then focused on identifying key indicators on the pathway to AI-related extinction. Participants selected two to five indicators for deeper analysis to understand how they might alter the risk of AI-related extinction. These factors then became the antecedents in the tree: for each of the indicators selected to be included in the tree, participants gave forecasts for how much their forecast of the ultimate outcome would change if that event happened.
The ultimate outcome (for our purposes, the probability of extinction due to AI by 2100) is an important parameter: the rest of the network’s relevance cascades from the outcome. But provided we’re able to identify an outcome with strong bearing on present policy decisions, we can ask experts to decompose the intervening time into possible events which would reflect a greater or lesser likelihood of reaching that outcome. Thus, these intervening events must themselves possess policy-relevance, in proportion to the strength of their relationship with the outcome, and the likelihood of observing them.
Conditional trees are a type of Bayesian network (BN).13 BNs explicitly represent probabilistic relationships between outcomes and their antecedents.14 This structure encourages experts to generate maximally relevant antecedents, and also provides us with a framework for measuring question relevance. But unlike some other forms of BNs, conditional trees are a relatively easy tool to learn. In our study, interviewees were able to grasp the necessary basics in around 10 minutes. This means that conditional trees may be more practical for interviews with subject-matter experts, who may not be experts in statistics or other domains that more often use BNs.
How does the conditional trees method fit into the forecasting research process?

The AI Conditional Trees project is an in-depth investigation into how to generate informative forecasting questions. Question generation is the first step in the life cycle of an impactful forecasting project, illustrated in Figure 1.1.3.
Many earlier forecasting research projects have focused on identifying the most accurate forecasters and on improved methods for aggregating their forecasts. But to be useful to decision makers, forecasting research must move beyond those questions and incorporate forecasting into a process that includes question generation, considering actions based on forecasts, communicating with policymakers, and generating new questions.
Before the cycle starts, we begin with “scoping and gisting,” in which we consider the questions we want to answer, the scope of the possible project, and the general arguments (“gists”) on each side. We then begin the cycle by generating questions, through processes like the AI Conditional Trees method, aiming to find the forecasting questions that would be most informative to decision makers. Next, we elicit forecasts on those questions, to assess risk and understand which potentially dangerous events are most likely and in what circumstances. We then elicit “risk mitigation forecasts,” asking experts and skilled forecasters to predict which policies would most decrease risk and what the costs might be for implementing them.
Once we have completed these stages, we communicate that information to policymakers, and ask them whether it is useful and what would make it more relevant to their work. Their feedback gives us more information we can use for the next stage of question generation, and we begin the cycle again.
The cycle as depicted is somewhat stylized, and many forecasting projects will not include all of these stages. But thinking of AI conditional trees in the context of the “forecasting life cycle” helps us contextualize this work and think about how to incorporate it into our future research.
Measuring question value
In order to form the feedback loop necessary for a dramatic improvement in the decision-relevance of forecasting questions, we need a means of quantitatively measuring the value of a forecasting question.
Policymakers’ actions are often guided by a few important questions in their domain, like “What will be the effects of climate change over the next century?” or “Will our economy remain competitive in the world in the long-term?” Such questions are difficult to resolve because they refer to the distant future, and they may also be relatively complex or difficult to specify clearly. But often one can find nearer-term antecedent questions which are easier to resolve, and which would reduce some uncertainty about the “ultimate” question. For example, in a study forecasting the effects of climate change, with the ultimate question, “Will more than 2 billion people die or be displaced due to climate change by 2100?,” the question “What will the average global temperature be in 2040?” might be a good antecedent question. It would not give a forecaster the full answer to the main question, but knowing what the global surface temperature will be in 2040 would be at least somewhat helpful for forecasting the effects of climate change by 2100.
Thus, one way of conceptualizing the value of a forecasting question is to ask, “How would the answer to this question affect our expectation about an ‘ultimate’ question we care about?” There are several distinct ways of expressing this mathematically, which we collectively refer to as “Value of Information (VOI).”
Conceptually, VOI measures how important a potential crux question (“C”) is to a participant’s forecast of the ultimate question we care about (“U”, in this case: AI extinction risk by 2100), in expectation. That is, how much would a participant update on AI extinction risk by 2100 based on whether a crux happens, weighted by how likely that crux is to happen? A high-VOI question for a given participant will therefore be one that the participant a) thinks has a meaningful chance of happening and b) expects to meaningfully affect their forecast on the ultimate question.
VOI is a useful metric for understanding why individuals believe what they believe and what would change their minds. A technical explanation of VOI can be found in Appendix 4. To build intuition for using the VOI metric, we provide this calculator in which users can input their own values. We also provide a more comprehensive R software package for calculating it.
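To make the weighting concrete, here is a minimal sketch of one simple VOI variant: the expected absolute change in P(U) once the crux resolves. It is an assumption chosen for illustration, not necessarily the formulation used in Appendix 4 or in the calculator, and the forecaster inputs are hypothetical.

```python
def implied_prior(p_c, p_u_given_c, p_u_given_not_c):
    """P(U) implied by the two conditional forecasts (law of total probability)."""
    return p_c * p_u_given_c + (1 - p_c) * p_u_given_not_c


def voi_expected_update(p_c, p_u_given_c, p_u_given_not_c):
    """Expected absolute change in P(U) once crux C resolves (assumed VOI variant)."""
    p_u = implied_prior(p_c, p_u_given_c, p_u_given_not_c)
    update_if_yes = abs(p_u_given_c - p_u)
    update_if_no = abs(p_u_given_not_c - p_u)
    # Each possible update is weighted by how likely that resolution is.
    return p_c * update_if_yes + (1 - p_c) * update_if_no


# Hypothetical crux 1: a dramatic update, but judged very unlikely to occur.
unlikely_crux = voi_expected_update(p_c=0.001, p_u_given_c=0.50, p_u_given_not_c=0.01)
# Hypothetical crux 2: a smaller update, but a far more plausible event.
plausible_crux = voi_expected_update(p_c=0.30, p_u_given_c=0.10, p_u_given_not_c=0.01)

print(f"unlikely crux:  {unlikely_crux:.5f}")
print(f"plausible crux: {plausible_crux:.5f}")
```

In this sketch the second crux scores far higher even though its conditional update is smaller, because the first crux is almost certain not to occur.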
2. Methods
2.1 Question generation
Sampling interviewees
Our sample included 24 interviewees in total: 21 “expert” interviewees, and 3 “superforecaster” interviewees. We aimed to include in our sample representatives of four quadrants of a strategically important belief space (see Figure 2.1.1):
- short timeline for AI progress, high estimated risk from AI;
- short timeline for AI progress, low estimated risk from AI;
- long timeline for AI progress, low estimated risk from AI; and
- long timeline for AI progress, high estimated risk from AI.15

We gathered our expert sample via snowball sampling, seeded from recommendations from our funders and our networks. We do not expect our interview sample was particularly representative of any given group, such as AI experts. The goal of this project was to develop the trees process and assess whether it led to higher-value questions, which did not require a representative expert sample. Our superforecaster sample was taken from the set of superforecaster participants in the Existential Risk Persuasion Tournament (XPT)16 who had shown particularly high engagement. When approached, candidate interviewees were offered a monetary incentive for producing the “highest value” questions in our interview-derived question set.

The majority of our expert sample had academic or professional experience pertaining directly to AI risk, such as experience in technical AI safety or AI governance (13/21 expert interviewees). Others were included for having publicly expressed views on AI risk indicating a high level of engagement with the topic and having expertise in a complementary field, such as machine learning (7/21 expert interviewees). Finally, a small number of our expert sample had expertise in a complementary field, but had not expressed detailed views on AI risk in public (2/21 expert interviewees). Most of our expert sample held senior positions within their fields, as professors, directors of organizations, leaders of research teams, or similar (13/21 expert interviewees).
Our expert sample skewed toward the top left quadrant in Figure 2.1.1, “high risk/short timelines.” Of 21 expert participants, 13 estimated the risk of extinction from AI by 2100 to be >10%. Only one expert in our sample estimated the risk to be <1% by 2100, whereas the median expert in the XPT predicted 3%. Although we did not solicit AI progress timelines directly from interviewees, interview content generally suggested a positive relationship between beliefs in increased risk and shorter timelines in our sample.
Because of this skew in our expert sample, we chose to ensure some representation of the bottom two quadrants in Figure 2.1.1 (low risk from AI) by selecting three superforecaster interviewees who forecast <10% probability of extinction from AI by 2100.
Interview process
Interviews were 1-on-1, ran for roughly 60 minutes and followed a semi-structured format. By default, interviews aimed to trace one plausible path of increasingly strong signals of heightened AI risk at three successive timepoints before 2100.18 Interviewers19 were allowed some latitude for individual approaches, but generally followed this basic structure:20
- Introduction, task instructions
- Elicitation of P(AI-related extinction by 2100)
- Node generation
- Wrap-up questions

Interviewees were first given a very brief summary of the aims of the project, a short explanation of conditional trees, and a statement of the goals of the interview. Interviewees were also told that they would be awarded $1,000 if a forecasting question derived from their interview was one of the “highest value” forecasting questions generated by the project.21 This introductory section of the interview typically took 10 minutes or less.
Next, interviewees were asked to give their best guess probability for the project’s “ultimate question,” namely “AI-related extinction by 2100,” which was operationalized as in the 2022 XPT.22 Following the probability elicitation, we sometimes asked participants warm-up questions, for instance asking them to name possible “driving forces” influencing their views.
Interviewees would then begin the node-generation phase of the interview, which took up the majority of interview time. Although we began the project with a set of three predefined years to ask participants about (2030, 2050 and 2070),23 it soon became clear that this was not the best choice of years for participants with short AI progress timelines. Therefore, during the node-generation phase, we began asking participants to propose a suitable set of years for their own trees (see Appendix 3 for the distribution of years chosen).
For each node, we took interviewees through a process of brainstorming, selection, and fleshing out. We would then elicit a probability of AI extinction by 2100 conditional on the node. We will refer to these pre-operationalization nodes as question prompts.
Interviewers took detailed notes, and most interviews were recorded (with participants’ permission). Further details on interview technique can be found in Appendix 6.
Operationalizing question prompts as forecasting questions
Question prompts were generally not fully resolvable forecasting questions, though some were operationalized in more detail than others. We considered constructing forecasting questions with detailed resolution criteria to be an inefficient use of interview time, and generally not where the comparative advantage of expert interviewees lies. Instead, an internal question-writing team24 turned question prompts into fully operationalized forecasting questions, with the help of notes from the interview and feedback from the interviewer.
The primary goals of question writing in this project were:
- To capture as much of a question prompt’s original intent as possible, while still making questions highly resolvable.
- To optimize the value of information from the question by adjusting thresholds or removing elements which made the probability of a positive or negative resolution too extreme.
We developed a template for the question-writing process, which encouraged question writers to first consider multiple distinct ways the interview node could be operationalized. They then analyzed these options with respect to several important criteria:
- How much the question captured the most relevant aspects of the original interview node;
- How efficiently the question captured relevant aspects of the original interview node;
- Salient hypothetical cases of false positive resolution and false negative resolution;
- How clear cut or practically feasible resolution of the question would be;
- Amount of cognitive load for forecasters.
The question writer and reviewer would then jointly decide which formulations to include in the final question on the basis of these criteria. Finally, a more detailed set of resolution conditions would be written and incorporated into a “conditional tree summary document”, which could then be sent to the interviewee for feedback.
2.2 Judging questions and constructing aggregate trees
The question generation phase yielded 75 questions, some of which were very similar to one another, so our next task was to filter them and select the most useful questions to construct conditional trees. We began by eliciting “short-fuse” forecasts on each question, in which forecasters spent about one minute per question giving quick judgments that allowed us to estimate a rough VOI for each question. For the thirteen questions that passed this initial screen, we conducted a longer survey, asking participants to spend more time forecasting how likely each question is to resolve positively and how much difference it would make to their ultimate forecast of the likelihood of extinction due to AI by 2100.25
Because participants in this study were all either (i) superforecasters who forecasted less than 1% likelihood of extinction due to AI by 2100 or (ii) people with professional AI risk-related experience who forecasted more than 1% likelihood of extinction due to AI by 2100 (with one exception, they forecasted at least 5%), we targeted these two socio-ideological camps separately in our question rating. We denote these groups, respectively, as “skeptical superforecasters” and “concerned experts.”
First pass filtering of the question set
Our full set of operationalized nodes included 75 questions, many of which were relatively overlapping. It would have been inefficient and excessively cognitively taxing to participants if we had attempted to elicit full 20-minute VOI judgments on each of the 75 questions. Therefore, we performed a first-pass filter on the question set using “short-fuse” forecasts.
We elicited VOI judgments in a “short-fuse” format from 8 skeptical superforecasters. This required very quick judgments, approximately 1 minute per question.26 Separately, we also collected question data from a set of 5 “concerned expert proxies,”27 asking them to rank order the question set and provide VOI judgments for a subset.28 However, this method may have been substantially flawed, as actual experts did not ultimately think the questions selected by the proxies were more informative than other questions.
For superforecaster data, we ranked questions according to median VOI in the filtering round.29 The filtered question set comprised thirteen questions: seven for the first tier (dates up to 2030) and six for the second tier (2031-2070).30

*Denotes stages which only superforecasters participated in.
Main question-rating survey
After the initial filtering, we further refined our question set using surveys, in which skeptical superforecasters and concerned experts were asked for more detailed forecasts on the filtered question set. We offered a fixed sum as an incentive for survey completion. Superforecasters answered a longer survey containing all thirteen questions. Because of experts’ time constraints, each expert answered a shorter survey containing a random subset of the questions.
The main survey superforecaster sample (n=8) was the same as the filtering survey sample. At this point, the sample had also participated in a lengthy adversarial collaboration with a camp of AI-risk concerned experts.31 Thus they had spent significant time developing their own beliefs on the topic and engaging with opposing beliefs.
The expert sample (n=11) was drawn from the candidate participant list from the AI adversarial collaboration.32
Superforecaster survey
In the superforecaster survey, we presented all 13 questions of the filtered question set in Qualtrics, shown in two parts, first 2030 questions and then 2050-2070 questions. Within each part we randomized question order. Participants were instructed to spend approximately 20 minutes per question, to give their own beliefs, and separately to estimate the beliefs of the concerned expert group.
We first asked for (1) each participant’s own forecast of the probability of AI-related extinction by 2100 and (2) each participant’s forecast of what experts would forecast about the probability of AI-related extinction by 2100.33
We then asked participants for forecasts on each of the 13 questions from the filtered question set. Each forecasting question contained moderately detailed resolution criteria, as well as links to reference information where possible. In the survey, answers were checked for logical coherence, and respondents were prompted to revise if necessary.34 At the end of each part, we gave participants the opportunity to review all questions and answers from that section and revise if they wished.35
A supplementary survey using the same protocol as above, with questions drawn from the “status quo” question set (questions from forecasting platforms; see Appendix 3.2), was administered at a later date. This survey also included two further questions from the AI conditional tree set which had initially been eliminated in the filtering stage.36
Expert survey
Experts were given the choice of a long or short version of the survey, including 6 and 3 questions, respectively. Each respondent saw a random subset of the 13 filtered questions. Experts were asked only to provide their own beliefs, without forecasting superforecasters’ beliefs. Apart from these changes, the survey was identical to the superforecaster survey.
Question combinations survey
Because individual question ratings are not sufficient to build a full conditional tree with multiple intermediate nodes, we followed up the main question-rating survey with a survey eliciting judgments for every combination of four top-scoring questions from the main question-rating survey. As this is a relatively sophisticated and labor-intensive task, we administered it only to our skeptical superforecaster sample.
This elicitation was conducted in a Google Sheets form, and included top-scoring questions (either by POM VOI or z-score37) as previously rated by this sample: CX30, CQ30, CX50, and ZA50. VOI judgments were elicited for each of the sixteen combinations of “yes” and “no” resolutions for each of the four questions (i.e., all resolve positively; CX30 resolves positively and the rest negatively; CQ30 resolves positively and the rest negatively; …; all resolve negatively).
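As an illustration of how the sixteen scenarios arise, the following minimal Python sketch enumerates every combination of “yes” and “no” resolutions for the four question IDs named above. The function name `enumerate_scenarios` is hypothetical, and the ordering of scenarios is an assumption; it need not match the order used in the actual form.

```python
from itertools import product

# Question IDs taken from the text; everything else here is illustrative.
QUESTIONS = ["CX30", "CQ30", "CX50", "ZA50"]

def enumerate_scenarios(question_ids):
    """Return every yes/no resolution pattern as a dict {question_id: bool}."""
    return [
        dict(zip(question_ids, outcome))
        for outcome in product([True, False], repeat=len(question_ids))
    ]

scenarios = enumerate_scenarios(QUESTIONS)
print(len(scenarios))   # number of scenarios (2 ** 4)
print(scenarios[0])     # all four questions resolve positively
print(scenarios[-1])    # all four questions resolve negatively
```

Because each question has two possible resolutions, four questions yield 2^4 = 16 mutually exclusive scenarios.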
See Appendix 5 for further survey details. The image below presents the elicitation format.

2.3 Selection of status quo questions
For comparison, we selected a set of pre-existing AI forecasting questions from popular forecasting platforms. Questions were restricted to those with dichotomous resolution which did not directly ask about AI causing human extinction. We selected questions with the largest number of unique users engaging with them, rather than by forecast or trading volume, which is more vulnerable to individual differences in updating frequency. We also restricted the number of questions written by known public figures (e.g., Scott Alexander, Eliezer Yudkowsky), as their outsized performance relative to other questions seemed primarily due to their personal following. For a later analysis regarding the distribution of question topics (see section 4.2 Distribution of question topics), we tagged these questions as “acceleration,” “alignment,” or “social/political/economic” using our judgment of their subject matter.
From Manifold Markets we selected three unique questions:
- STQ47 (2030 set) – Largest total number of traders (1023), tagged “acceleration”
- STQ149 (2030 set) – Largest number of traders for a non-public figure question (355), tagged “acceleration”
- STQ19 (2030 set) – Largest number of traders for a non-public figure question, tagged “social / political / economic”
From Metaculus we selected four unique questions:
- STQ196 (2050-2070 set) – Largest number of forecasters after those included in the main survey (424), tagged “acceleration”
- STQ152 (2030 set) – Next largest number of forecasters (325), tagged “acceleration”
- STQ232 (2050-2070 set) – Next largest number of forecasters for 2050-2070 set (263), tagged “acceleration”
- STQ236 (2050-2070 set) – Large number of forecasters for a 2050-2070 question, tagged “social / political / economic”
We selected two questions found on both platforms:
- STQ9 (2030 set) – Large number of forecasters/traders, tagged “acceleration”
- STQ215 / STQ205 (2030 set) – Large number of forecasters/traders, tagged “alignment”
3. Value of Information (VOI) Results
In this section we present the results of a quantitative analysis of question quality for our expert-derived “AI Conditional Tree (AICT)” question set. We rate these questions using metrics which factor in conditional and unconditional forecasts from surveyed populations, and which reflect the strength of the relationship between the question and a possible future outcome (here, “AI-related extinction by 2100”). We surveyed two groups, subject matter experts and superforecasters, and analyzed them separately.
These VOI results are presented in the spirit of a demonstration of methods, and we would caution readers not to place undue weight on the question ratings. Given the very limited number of survey participants,38 the views captured here are unlikely to be representative of those of subject matter experts or skilled forecasters more generally. Furthermore, at the time of this report, eliciting conditional probabilities is a relatively new practice, with many wrinkles still to be ironed out. Nevertheless, there are a few observations worth highlighting.
Among questions resolving in 2030, both groups rated “Administrative disempowerment warning shot” (CX30) as leading to relatively large updates on the probability of AI-caused extinction by 2100 in expectation: it ranked first with both groups for our main metric, POM VOI, and was relatively robust in sensitivity analysis. The question refers to a scenario in which “AI systems are preventing a substantial number of people from accessing financial assets, voting, moving freely, accessing the internet, or accessing certain types of housing or employment.”
As with many of the questions in our sample which performed well, CX30 benefited from one strongly positive opinion. Views about question value, even within the groups, were highly heterogeneous, and for all questions there was at least one respondent who took little or no information from it.
In the rest of this section, we:
- Provide a summary of the methods, metrics, and terminology used in this analysis and explain how to read a conditional tree
- Summarize the question informativeness ratings for superforecasters and subject matter experts
- Present aggregated trees that show the most informative questions at each timepoint for both superforecasters and subject matter experts
- Provide details on the value of information ratings for all forecasting questions we surveyed superforecasters and subject matter experts about
Summary of VOI methods, metrics and terminology
We surveyed two groups: a) forecasters with a strong track record of short-term accuracy, who also estimated a relatively low chance of AI-related extinction by 2100 (“skeptical superforecasters”) (n = 8 total, 7-8 respondents per question); and b) subject matter experts in fields related to AI risk, who also estimated a relatively high chance of AI-related extinction by 2100 (“concerned experts”) (n = 11 total, 4-6 respondents per question).
Due to the high cost of obtaining forecasts on all 75 questions, we evaluate only a subset of questions (13 in total). These were selected for their performance in a preliminary filtering round, though our data suggests that this filtering round was a weak predictor of main question-rating survey results, especially for our expert sample.39 We also include in our survey the most popular (as of July 2023) AI questions from Metaculus, one each for 2030 and for the time period 2050-2070.
For each forecasting question, we asked respondents for their probability that it would resolve TRUE, and for their probability that AI extinction by 2100 would resolve TRUE, conditioned on the forecasting question resolving TRUE. We use Kullback-Leibler VOI (KL VOI, or simply VOI from this point forward) as our VOI measure.40
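To make the calculation concrete, the sketch below computes one common KL-based formulation of VOI from the three elicited quantities: the expected KL divergence between the updated and the current forecast of the ultimate question. This is an assumption rather than a restatement of the exact formula used here (which is given in the footnote). The function names are hypothetical, P(U | not c) is derived from the law of total probability, and the example inputs are made up.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(p || q), in bits, between Bernoulli(p) and Bernoulli(q)."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:  # the term 0 * log(0 / b) is taken as 0
            total += a * math.log2(a / b)
    return total

def kl_voi(p_u, p_c, p_u_given_c):
    """Expected KL divergence of the updated forecast from the current one.

    Assumed formulation: P(c) * D(P(U|c) || P(U)) + P(not c) * D(P(U|not c) || P(U)).
    """
    if not (0.0 < p_u < 1.0 and 0.0 < p_c < 1.0 and 0.0 <= p_u_given_c <= 1.0):
        raise ValueError("probabilities out of range")
    # Law of total probability: P(U) = P(c) P(U|c) + P(not c) P(U|not c).
    raw = (p_u - p_c * p_u_given_c) / (1.0 - p_c)
    if raw < -1e-12 or raw > 1.0 + 1e-12:
        raise ValueError("forecasts are not logically coherent")
    p_u_given_not_c = min(1.0, max(0.0, raw))  # absorb floating-point noise
    return (p_c * kl_bernoulli(p_u_given_c, p_u)
            + (1.0 - p_c) * kl_bernoulli(p_u_given_not_c, p_u))

# Hypothetical respondent: P(U) = 1%, P(c) = 20%, P(U|c) = 4%.
print(kl_voi(0.01, 0.20, 0.04))
```

Under this formulation, a question that leaves the forecast unchanged scores zero, and a question whose resolution fully reveals the ultimate outcome scores the entropy of P(U). The logarithm base only rescales the result, so it does not affect ratios between questions.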
We focus on the percentage of the theoretical maximum VOI (POM VOI, or simply POM) that a question achieves as our main result.41 In some places we also report the z-score of a question’s POM VOI value for a given respondent (POM-z VOI, or simply POM-z). This value is useful if you believe individual respondents may have a bias toward giving higher or lower answers in general, or toward reporting an overall wider range of VOI values. It is particularly useful in the case of the expert results, as each expert answered only a random subset of all survey questions, and thus the influence of individual response biases on the resulting rank order of questions is potentially problematic. We suggest interpreting POM-z as a robustness check on the main POM results.
We aggregate POM and POM-z over respondents using the arithmetic mean. This sometimes has the effect that a single extreme response dominates the aggregate; however we believe this is appropriate in the context of very small sample sizes for POM values: an apparent “outlier” opinion in a small cohort may reflect the existence of a genuine faction in a larger population.
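The difference between ranking by mean POM and by mean POM-z can be illustrated with a minimal Python sketch. All respondent labels, question labels, and POM values below are hypothetical, and using the population standard deviation for the z-score is an assumption, since the choice is not stated here.

```python
from statistics import mean, pstdev

# Hypothetical POM values (as fractions) for three respondents and three questions.
pom = {
    "r1": {"Q1": 0.250, "Q2": 0.010, "Q3": 0.020},
    "r2": {"Q1": 0.001, "Q2": 0.004, "Q3": 0.002},
    "r3": {"Q1": 0.000, "Q2": 0.010, "Q3": 0.005},
}

def pom_z(responses):
    """Z-score each POM value against the same respondent's own mean and spread."""
    out = {}
    for rid, answers in responses.items():
        values = list(answers.values())
        mu, sd = mean(values), pstdev(values)
        out[rid] = {q: ((v - mu) / sd if sd > 0 else 0.0)
                    for q, v in answers.items()}
    return out

def mean_by_question(responses):
    """Arithmetic mean over respondents, question by question."""
    per_question = {}
    for answers in responses.values():
        for q, v in answers.items():
            per_question.setdefault(q, []).append(v)
    return {q: mean(vs) for q, vs in per_question.items()}

mean_pom = mean_by_question(pom)
mean_pom_z = mean_by_question(pom_z(pom))
print(max(mean_pom, key=mean_pom.get))      # top question by mean POM
print(max(mean_pom_z, key=mean_pom_z.get))  # top question by mean POM-z
```

In this made-up data, Q1 ranks first by mean POM because of one respondent's 25% rating, while Q2 ranks first by mean POM-z because two of the three respondents rated it highest relative to their own other answers.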
We also report a “pairwise wins” statistic derived from our sensitivity analysis, roughly indicating the robustness of the ranking to resampling simulations. This was calculated as the percentage of times a given question had higher POM VOI than other questions in the set in a resampling simulation. We use this as an additional robustness check on the main POM results.
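A statistic of this kind can be sketched as follows. This is a minimal illustration, not the procedure used in the sensitivity analysis: resampling respondents with replacement (a bootstrap) independently for each question, the number of resamples, and the input values are all assumptions, and `pairwise_wins` is a hypothetical name.

```python
import random
from statistics import mean

def pairwise_wins(pom_by_question, n_resamples=2000, seed=0):
    """Share of (resample, other question) comparisons in which a question's
    resampled mean POM is strictly higher than the other question's."""
    rng = random.Random(seed)
    questions = list(pom_by_question)
    wins = {q: 0 for q in questions}
    for _ in range(n_resamples):
        # Resample each question's responses with replacement and take the mean.
        boot = {q: mean(rng.choices(vals, k=len(vals)))
                for q, vals in pom_by_question.items()}
        for q in questions:
            wins[q] += sum(boot[q] > boot[other]
                           for other in questions if other != q)
    comparisons = n_resamples * (len(questions) - 1)
    return {q: wins[q] / comparisons for q in questions}

# Hypothetical POM values: A has one extreme rating, B is consistently
# moderate, C is consistently low.
example = {
    "A": [0.250, 0.000, 0.000, 0.001],
    "B": [0.020, 0.015, 0.018, 0.020],
    "C": [0.001, 0.002, 0.001, 0.000],
}
result = pairwise_wins(example)
for question, share in result.items():
    print(question, round(share, 2))
```

A question whose high mean depends on a single respondent loses every resample in which that respondent is not drawn, which is why such questions can rank first by mean POM yet score lower on this measure.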
Throughout this report, we refer to the probability of the ultimate question resolving positively, “AI causing extinction by 2100”, as P(U), and the probability of indicator questions as P(c). P(U|c) is the probability of the ultimate question, given that an indicator question resolves positively. When we report aggregate probabilities, we use the arithmetic mean. We report relative risk as P(U|c) / P(U).
How to read a conditional tree diagram
A conditional tree diagram begins with an initial node displaying the “start date”, usually the point in time at which the conditional tree survey was elicited. This node also displays a current estimate of the probability of some “ultimate question,” which may be either an individual’s estimate or an average over respondents.
The subsequent node represents an “indicator,” or an event which implies an update to the probability of the ultimate question. It displays a highly abridged question title and question ID, for which question summaries and full texts can be found in Appendix 1. Below the node is an estimate of the probability of TRUE or FALSE resolution.
The first indicator question may be followed by one or more additional indicator question layers. Resolution of these questions is estimated conditional on the outcomes of any previous question layers. That is, when indicator question #1 resolves positively, it may affect the probability of indicator question #2 resolving positively, and this is reflected in the values displayed in Figure 3.1.
Finally, the ultimate question nodes are the terminal point of each branch, and display an updated probability estimate conditional on the path leading to it.
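The structure described above can be represented as a small data structure. The following Python sketch is illustrative only: the `Branch` class, the `leaf_paths` helper, and every probability in the example are hypothetical. It shows how the probability of reaching a terminal node is the product of the conditional probabilities along its path.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Branch:
    label: str                          # e.g. "Q1 TRUE"
    prob: float                         # P(this outcome | path so far)
    p_ultimate: Optional[float] = None  # P(U | path); set on terminal branches only
    children: List["Branch"] = field(default_factory=list)

def leaf_paths(branches, path=(), prob=1.0):
    """Yield (labels, joint path probability, P(U | path)) for each terminal branch."""
    for branch in branches:
        labels = path + (branch.label,)
        joint = prob * branch.prob
        if branch.children:
            yield from leaf_paths(branch.children, labels, joint)
        else:
            yield labels, joint, branch.p_ultimate

# Hypothetical two-layer tree: the probability of Q2 depends on how Q1 resolved.
tree = [
    Branch("Q1 TRUE", 0.2, children=[
        Branch("Q2 TRUE", 0.5, p_ultimate=0.08),
        Branch("Q2 FALSE", 0.5, p_ultimate=0.02),
    ]),
    Branch("Q1 FALSE", 0.8, children=[
        Branch("Q2 TRUE", 0.25, p_ultimate=0.01),
        Branch("Q2 FALSE", 0.75, p_ultimate=0.005),
    ]),
]

leaves = list(leaf_paths(tree))
for labels, p_path, p_u in leaves:
    print(" -> ".join(labels), round(p_path, 4), p_u)
```

In a well-formed tree the path probabilities of all terminal nodes sum to one, since the branches cover every possible combination of resolutions.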

3.1 Question ratings summary
Tables 3.1.1 and 3.1.2 show ratings for thirteen questions from the question generation process and two additional, highest-ranked “status quo” questions drawn from forecasting platforms, for a total of fifteen questions. Summaries of question content can be found in Table 3.1.3.
On average, the experts estimated that the probability of AI-related extinction by 2100 is 16.8%. The superforecasters were more skeptical of the risk, with an average probability of 0.25%.42
Question rating summary
| Question | Superforecasters: VOI rank | Superforecasters: relative risk (P(U|c) / P(U)) | Experts*: VOI rank | Experts*: relative risk (P(U|c) / P(U)) |
| 2030 Questions | ||||
| Administrative disempowerment warning shot (CX30) | 1 | 13.4 | 1 | 1.9 |
| Deep learning revenue (VL30) | 2 | 2.5 | 4 | 1.2 |
| Extinction-level pathogens feasible (CQ30) | 3 | 1.9 | 6 | 0.8 |
| Deceptive AI warning shot (ZD30) | 4 | 3.2 | 3 | 1.1 |
| AI involvement in nuclear arms (HB30)*** | 5 | 1.5 | NA | NA |
| Kurzweil/Kapor longbet (STQ9)** | 6 | 1.1 | 7 | 0.8 |
| AI arms race, multipolar result (NG30) | 7 | 1.0 | 5 | 1.1 |
| AI autonomous purchasing (EX30) | 8 | 1.0 | 2 | 1.6 |
| 2050-2070 Questions | ||||
| AI causing deaths, ineffectual response (CX50)*** | 1 | 23.2 | NA | NA |
| Power-seeking behavior warning shot (ZA50) | 2 | 2.4 | 4 | 1.4 |
| High AI investment, low safety indicators (VL70) | 3 | 1.3 | 2 | 4.2 |
| No aligned AGI (CX70) | 4 | 0.8 | 1 | 1.5 |
| AI CEOs / Research productivity (EX50) | 5 | 1.3 | 5 | 1.2 |
| Less prosocial behavior / Failing institutions (HS50) | 6 | 1.0 | 6 | 0.9 |
| Human-machine intelligence parity (STQ247)** | 7 | 1.0 | 3 | 1.4 |
*Note that each question was shown to a random subset of experts, not to all experts. This may have the effect of amplifying noise due to individual response biases, for both the VOI ranking and relative risk.
**Denotes external questions not generated as part of the conditional tree process.
***Denotes questions elicited in a supplementary survey round along with the status quo question set (see section 4.1). This round was only administered to the superforecaster sample.
Question ratings (all years)
| Question | Res year | Superforecasters: mean POM | Superforecasters: mean POM-z | Superforecasters: n | Experts: mean POM | Experts: mean POM-z | Experts: n |
| AI causing deaths, ineffectual response (CX50)** | 2050 | 6.34% | 0.08 | 7 | NA | NA | NA |
| Administrative disempowerment warning shot (CX30) | 2030 | 3.55% | 0.13 | 8 | 1.26% | 0.94 | 5 |
| Deep learning revenue (VL30) | 2030 | 1.68% | -0.04 | 7 | 0.64% | 0.16 | 5 |
| Power-seeking behavior warning shot (ZA50) | 2050 | 1.59% | 0.53 | 8 | 3.00% | 0.56 | 5 |
| Extinction-level pathogens feasible (CQ30) | 2030 | 1.37% | 0.57 | 8 | 0.18% | -0.59 | 5 |
| Deceptive AI warning shot (ZD30) | 2030 | 0.98% | 0.23 | 8 | 0.85% | 0.10 | 5 |
| AI involvement in nuclear arms (HB30)** | 2030 | 0.68% | -0.07 | 7 | NA | NA | NA |
| High AI investment, low safety indicators (VL70) | 2070 | 0.54% | 0.67 | 8 | 10.19% | -0.05 | 5 |
| No aligned AGI (CX70) | 2070 | 0.37% | -0.21 | 8 | 14.71% | 0.53 | 6 |
| Kurzweil/Kapor longbet (STQ9)* | 2030 | 0.27% | 0 | 8 | 0.15% | -0.41 | 5 |
| AI CEOs / Research productivity (EX50) | 2050 | 0.26% | -0.17 | 8 | 1.12% | -0.59 | 4 |
| Less prosocial behavior / Failing institutions (HS50) | 2050 | 0.26% | -0.30 | 8 | 0.25% | -0.63 | 6 |
| AI arms race, multipolar result (NG30) | 2030 | 0.26% | -0.28 | 8 | 0.37% | -0.33 | 4 |
| Human-machine intelligence parity (STQ247)* | 2040 | 0.14% | -0.59 | 8 | 4.19% | 0.11 | 4 |
| AI autonomous purchasing (EX30) | 2030 | 0.02% | -0.55 | 8 | 0.98% | 0.06 | 4 |
*Denotes external questions not generated as part of the conditional tree process.
**Denotes questions elicited in a supplementary survey round along with the status quo question set (see section 4.1). This round was only administered to the superforecaster sample.
All main question-rating survey questions
| Year | Title | Concise question summary |
| 2030 | Administrative disempowerment warning shot (CX30) | AI systems are preventing a substantial number of people from accessing financial assets, voting, moving freely, accessing the internet, or accessing certain types of housing or employment. |
| 2030 | Deep learning revenue (VL30) | Revenue from deep learning doubles every two years before 2030. |
| 2030 | Extinction-level pathogens feasible (CQ30) | A panel of relevant experts believes extinction-level pathogens are both possible and will be practically feasible relatively soon, at relatively low cost. |
| 2030 | Deceptive AI warning shot (ZD30) | After a costly incident, AI developers or relevant experts conclude that a top lab’s AI system behaved deceptively. |
| 2030 | AI involvement in nuclear arms (HB30)** | Governments publicly integrate AI decision-making into their nuclear weapons control systems, or nuclear risk experts worry about AI. |
| 2030 | Kurzweil/Kapor longbet (STQ9)* | Before 2030, will an AI complete the Turing Test in the Kurzweil/Kapor Longbet? |
| 2030 | AI arms race, multipolar result (NG30) | AI-related government spending is high for both the US and China, with low research collaboration, and relative economic parity; OR private investment in AI is very high and there is no single dominant company. |
| 2050 | AI causing deaths, ineffectual response (CX50)** | AI systems cause the deaths of >1m humans, which is not intended by any humans, and subsequently a panel of experts believes the collective global response has not noticeably reduced risk of similar events. |
| 2050 | Power-seeking behavior warning shot (ZA50) | AI developers shut down an expensive AI system after it displays a power-seeking behavior, such as hoarding resources, interfering with vital infrastructure, propagating itself, etc. |
| 2070 | High AI investment, low safety indicators (VL70) | Compute spending is high and experts agree that aligning AI systems is very difficult; and there is insufficient political attention to AI safety. |
| 2070 | No aligned AGI (CX70) | No AI system exists which both performs well on general ability benchmarks (e.g. Q&A dataset) and has positive indicators of alignment (performance on alignment benchmarks, confidence of AI safety researchers). |
| 2050 | AI CEOs / Research productivity (EX50) | AI systems are performing entire roles at top companies that currently are performed by C-suite executives; or research productivity is higher than it was in 1930. |
| 2050 | Less prosocial behavior / Failing institutions (HS50) | Charitable donations in the US have fallen dramatically; or corruption rises dramatically in the US or Europe; or autocracy increases dramatically worldwide. |
| 2040 | Human-machine intelligence parity (STQ247)* | Will there be Human-machine intelligence parity before 2040? |
| 2030 | AI autonomous purchasing (EX30) | AI autonomously buying goods or services (e.g. purchasing flights, managing inventories for companies, etc) — >$1 million / yr |
Question IDs link to the full text of the question operationalization in Appendix 1.
*Denotes external questions not generated as part of the conditional tree process.
**Denotes questions elicited in a supplementary survey round along with the status quo question set (see section 4.1). This round was only administered to the superforecaster sample.
3.2 Candidate high VOI trees from two camps
This section displays high VOI trees produced by the main question-rating survey data for skeptical superforecasters and for concerned experts. For each group, we included a selection of the most informative questions in the tree. Only the superforecaster tree is a true conditional tree, as only superforecasters were surveyed on every combination of the top-scoring questions.
Skeptical superforecasters’ conditional tree
We surveyed the superforecasters in our sample for conditional forecasts on sixteen scenarios. These scenarios were combinations of the top-ranked questions: “administrative disempowerment” (CX30), “extinction-level pathogens” (CQ30), “AI-related deaths” (CX50) and “Power-seeking” (ZA50).43 Seven superforecasters responded. The sixteen scenarios are mutually exclusive and exhaust the space of possible outcomes; thus, we ensured that each respondent’s probabilities assigned to the scenarios summed to 100% and showed them their implied P(U), the average of their P(U|scenario)’s weighted by the likelihood they assigned to each scenario (see Figure 2.2.2). We averaged the forecasts for each P(scenario) and P(U|scenario) separately to create an aggregate judgment. The implied P(U) of this aggregate was then used to compute average relative risk (the multiplier in each branch of the tree). A simplified version of the resulting tree is shown in Figure 3.2.1.
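The aggregation steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it uses four scenarios and two respondents instead of sixteen and seven, all numbers are hypothetical, and the function names are not taken from the survey materials.

```python
from statistics import mean

def implied_p_u(p_scenario, p_u_given_scenario):
    """Implied P(U): scenario-conditional forecasts weighted by scenario probability."""
    if abs(sum(p_scenario) - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 100%")
    return sum(ps * pu for ps, pu in zip(p_scenario, p_u_given_scenario))

def aggregate(respondents):
    """respondents: list of (p_scenario, p_u_given_scenario) pairs, one per person.

    Averages P(scenario) and P(U|scenario) separately across respondents, then
    derives the aggregate's implied P(U) and the relative risk for each scenario.
    """
    n = len(respondents[0][0])
    avg_ps = [mean([r[0][i] for r in respondents]) for i in range(n)]
    avg_pu = [mean([r[1][i] for r in respondents]) for i in range(n)]
    p_u = implied_p_u(avg_ps, avg_pu)
    relative_risk = [pu / p_u for pu in avg_pu]
    return avg_ps, avg_pu, p_u, relative_risk

# Two hypothetical respondents, four mutually exclusive scenarios.
respondents = [
    ([0.1, 0.2, 0.3, 0.4], [0.05, 0.02, 0.01, 0.005]),
    ([0.3, 0.2, 0.1, 0.4], [0.03, 0.02, 0.01, 0.005]),
]
avg_ps, avg_pu, p_u, rr = aggregate(respondents)
print(round(p_u, 4))
print([round(x, 3) for x in rr])
```

Because relative risk is defined against the aggregate's own implied P(U), the probability-weighted average of the relative risks across scenarios is one by construction.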
For example, conditional on both “Extinction-level pathogens” and “AI-related deaths” resolving positively (superforecasters assign a 2.82% chance to this outcome), the superforecasters would on average update their P(U) from 0.94% to 6.21%.
The scenario that would constitute the biggest update is the case where all four questions that would imply higher risk resolve positively. If the four relevant risk-increasing outcomes were to happen (far right in the full tree (a)), the superforecasters’ relative risk assessment is 10.7 (i.e., they would be 10.7x more concerned than they currently are about the risk of AI-related extinction). Conversely, if none of the questions resolve positively (far left), their relative risk assessment is 0.3.
Note that the average P(U) in this survey (0.94% in Figure 3.2.1) is higher than in the main survey (0.25%), which we used to compute VOI. Two superforecasters made substantial updates to their unconditional probability of AI-related extinction by 2100 (P(U)) between the main survey (conducted in July 2023) and this combinations survey (conducted in February to March 2024 with a follow-up in May), which may be attributable to events of the intervening months or to the exercise of thinking through scenarios. One superforecaster updated from 0.1% to 0.4% and another from 1% to 4.2%. The other five did not update.

This is a collapsed tree of combinations of the superforecasters’ highest-VOI questions. For the purpose of legibility, we are presenting a simplified tree, using two of the four questions. We collapsed the sixteen scenarios into four combinations. Positive resolution (“TRUE”) is a bad outcome for both questions. The far right scenario (both TRUE) constitutes the worst scenario, a 6.6x update, and the far left scenario is the best (both FALSE) with a halving of the superforecasters’ current risk estimate. You can see the full, unpruned tree here (a).
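One way to collapse a larger set of scenarios onto a subset of questions is to sum scenario probabilities over the omitted questions and take the probability-weighted average of P(U|scenario) within each group. The sketch below assumes that approach; the procedure used for the figure is not specified here, and the `collapse` function, question labels, and numbers are all hypothetical.

```python
from collections import defaultdict

def collapse(scenarios, keep):
    """Marginalize scenarios onto the questions listed in `keep`.

    scenarios: list of (outcomes dict, P(scenario), P(U | scenario)).
    Returns {kept outcomes: (P(collapsed scenario), P(U | collapsed scenario))}.
    """
    mass = defaultdict(float)   # total probability of each collapsed scenario
    joint = defaultdict(float)  # probability of U together with that scenario
    for outcomes, p_s, p_u in scenarios:
        key = tuple(outcomes[q] for q in keep)
        mass[key] += p_s
        joint[key] += p_s * p_u
    return {k: (mass[k], joint[k] / mass[k]) for k in mass if mass[k] > 0}

# Hypothetical example: four scenarios over questions A and B, collapsed onto A.
full = [
    ({"A": True,  "B": True},  0.1, 0.08),
    ({"A": True,  "B": False}, 0.1, 0.02),
    ({"A": False, "B": True},  0.2, 0.01),
    ({"A": False, "B": False}, 0.6, 0.005),
]
collapsed = collapse(full, keep=["A"])
for key, (p_s, p_u) in collapsed.items():
    print(key, round(p_s, 3), round(p_u, 5))
```

Collapsing in this way leaves the implied overall P(U) unchanged, since the same joint probabilities are only regrouped.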
Concerned experts’ conditional trees
Figure 3.2.2 presents the question from each year (2030, 2050, and 2070) that surveyed experts rated the highest, on average, in terms of POM VOI. As a whole, among these highest-POM VOI questions, the experts would be most worried if there were an administrative disempowerment warning shot by 2030 (1.9x update from their current unconditional P(U) of 17%). Conversely, if we do not see a power-seeking behavior warning shot by 2050, the experts would be least worried (0.6x update).

3.3 Skeptical superforecasters’ question ratings
2030 questions
| Question | Mean POM | P(c) | RR (P(U|c) / P(U)) | Mean POM-z | Pairwise wins | n |
| Administrative disempowerment warning shot (CX30) | 3.55% | 16% | 13 | 0.13 | 83% | 8 |
| Deep learning revenue (VL30) | 1.68% | 33% | 2.5 | -0.04 | 59% | 7 |
| Extinction-level pathogens feasible (CQ30) | 1.37% | 39% | 1.9 | 0.57 | 75% | 8 |
| Deceptive AI warning shot (ZD30) | 0.98% | 32% | 3.2 | 0.23 | 64% | 8 |
| AI involvement in nuclear arms (HB30)** | 0.68% | 18% | 1.5 | -0.07 | 50% | 7 |
| Kurzweil/Kapor longbet (STQ9)* | 0.27% | 43% | 1.1 | 0 | 33% | 8 |
| AI arms race, multipolar result (NG30) | 0.26% | 39% | 1.0 | -0.28 | 33% | 8 |
| AI autonomous purchasing (EX30) | 0.02% | 35% | 1.0 | -0.55 | 3% | 8 |
P(c) is the arithmetic mean of this group’s responses. RR (relative risk) is an arithmetic mean of each individual’s relative risk (P(U|c) / P(U)).
*Denotes external questions not generated as part of the conditional tree process.
**Denotes questions elicited in a supplementary survey round along with the status quo question set (see section 4.1). This round was only administered to the superforecaster sample.
Skeptical superforecasters’ top-rated question by mean POM was “Administrative disempowerment warning shot” (CX30), referring to a scenario in which “AI systems are preventing a substantial number of people from accessing financial assets, voting, moving freely, accessing the internet, or accessing certain types of housing or employment.” It scored ~3.6% of the theoretical maximum VOI score on average. However, this high value was driven by a single respondent, with the question achieving a remarkable 25% of the theoretical maximum VOI for this individual.45 This is consistent with superforecasters in our sample preferring questions which refer to concrete AI-related harms, though the high variance in VOI ratings for this question suggests that there is no consensus on exactly which harms provide the clearest signal.
The top-rated question by POM-z, “Feasibility of extinction-level pathogens” (CQ30), refers to a scenario in which “A panel of relevant experts believes extinction-level pathogens are both possible and will be practically feasible relatively soon, at relatively low cost.” It is the question that respondents most agreed was informative, though the highest VOI rating any individual gave this question was only 5.2% of the theoretical maximum. Interestingly, this question does not refer to realized harm, but rather to favorable conditions for harm to take place. Such questions may gain a VOI advantage by omitting divisive or low-probability conditions that hinge on human motivations for misusing AI technologies.46 It was the third most likely 2030 question to resolve positively.
No mean POM differences between questions were significant in this sample (after correcting for multiple testing using the Bonferroni correction, all p-values were equal to 1). Survey responses between filtering and main survey rounds were fairly similar, though with some notable differences. See Appendix 2.1 for further details on intra-individual response variability.
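For reference, the Bonferroni correction multiplies each raw p-value by the number of tests and caps the result at 1. The minimal Python sketch below shows the arithmetic. Treating the tests as all pairwise comparisons among the eight 2030 questions is an assumption, and the raw p-values are hypothetical.

```python
from math import comb

def bonferroni(p_values, m=None):
    """Bonferroni-adjusted p-values: min(1, p * m), where m is the number of tests."""
    if m is None:
        m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

n_questions = 8                   # 2030 questions rated by superforecasters
n_pairs = comb(n_questions, 2)    # number of pairwise comparisons (assumed test family)

raw = [0.04, 0.20, 0.75]          # hypothetical raw p-values
adjusted = bonferroni(raw, m=n_pairs)
print(n_pairs, adjusted)
```

With this many comparisons, any raw p-value above roughly 0.036 is adjusted to 1, so adjusted p-values of exactly 1 across the board are unsurprising in a sample of this size.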




2050-2070 questions
| Question | Mean POM | P(c) | RR (P(U|c) / P(U)) | Mean POM-z | Pairwise wins | n |
| AI causing deaths, ineffectual response (CX50)** | 6.34% | 6% | 23 | 0.08 | 67% | 7 |
| Power-seeking behavior warning shot (ZA50) | 1.59% | 38% | 2.4 | 0.53 | 87% | 8 |
| High AI investment, low safety indicators (VL70) | 0.54% | 38% | 1.3 | 0.67 | 64% | 8 |
| No aligned AGI (CX70) | 0.37% | 34% | 0.8 | -0.21 | 48% | 8 |
| AI CEOs / Research productivity (EX50) | 0.26% | 21% | 1.3 | -0.17 | 35% | 8 |
| Less prosocial behavior / Failing institutions (HS50) | 0.26% | 31% | 1.0 | -0.30 | 32% | 8 |
| Human-machine intelligence parity (STQ247)* | 0.14% | 53% | 1.0 | -0.59 | 17% | 8 |
*Denotes external questions not generated as part of the conditional tree process.
**Denotes questions elicited in a supplementary survey round along with the status quo question set (see section 4.1). This round was only administered to the superforecaster sample.
Skeptical superforecasters’ top-rated question by mean POM was “AI causing deaths, ineffectual response” (CX50), referring to a scenario in which “AI systems cause the deaths of >1m humans, which is not intended by any humans, and subsequently a panel of experts believes the collective global response has not noticeably reduced risk of similar events.” It scored ~6.3% of the theoretical maximum VOI score on average. Similarly to the top-scoring 2030 question, this question benefits from a single strong opinion, and thus does relatively worse on both sensitivity analysis and POM-z. It is an even starker example of concrete harm caused by AI systems, but it was also among the least likely questions to resolve positively, at 6%.
While first-ranked by neither mean POM nor POM-z, the most robust question in sensitivity analysis was “Power-seeking behavior warning shot” (ZA50), in which “AI developers shut down an expensive AI system after it displays a power-seeking behavior, such as hoarding resources, interfering with vital infrastructure, propagating itself, etc.” This question does not fit as clearly with superforecasters’ apparent preference for questions referring to concrete harm.
The highest rated question by POM-z was “High AI investment, low safety indicators” (VL70), in which “Compute spending is high and experts agree that aligning AI systems is very difficult; and there is insufficient political attention to AI safety.” However, in absolute terms it was rated relatively low at 0.5% POM.
“No aligned AGI” (CX70) is unique in this question set as the only question which on average updated superforecasters away from AI-related extinction (mean relative risk = 0.8x). Here respondents may have inferred that a world with no aligned AGI by 2070 was more likely to be a world with no AGI of any kind, than a world with only unaligned AGI.
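As a small worked example of the relative risk column, RR = P(U|c) / P(U), the sketch below uses hypothetical forecasts chosen to give 0.8x. The values are not taken from the report's data, and how individual ratios are aggregated into the reported mean is not shown here.

```python
def relative_risk(p_u_given_c: float, p_u: float) -> float:
    """RR = P(U|c) / P(U): how much the crux scales the ultimate forecast."""
    if p_u <= 0:
        raise ValueError("P(U) must be positive")
    return p_u_given_c / p_u

def update_direction(rr: float) -> str:
    """Describe which way a relative risk moves the ultimate forecast."""
    if rr < 1:
        return "away from"
    if rr > 1:
        return "toward"
    return "neutral on"

# Hypothetical forecasts (assumptions for illustration only).
p_u = 0.010          # unconditional P(extinction by 2100)
p_u_given_c = 0.008  # the same forecast, given the crux resolves positively

rr = relative_risk(p_u_given_c, p_u)
print(f"RR = {rr:.1f}x, an update {update_direction(rr)} the ultimate outcome")
```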
No mean POM differences between questions were significant in this sample (after correcting for multiple testing, all p-values were equal to 1). Survey responses between filtering and main survey rounds were broadly similar. See Appendix 2.1 for further details on intra-individual response variability.
2030 vs 2050/2070 questions
In the AICT question set (that is, all questions excluding STQ9 and STQ247) the average of POM responses for 2030 questions, 1.2%, was slightly lower than that of 2050-2070 questions, at 1.5%. Due to the small number of questions included and the small absolute size of the difference, this does not seem indicative of a genuine VOI difference between earlier and later questions in our set.
Responses for 2030 and 2050-2070 AICT questions suggested similar probabilities of positive resolution, 30% and 28% respectively (mean; interquartile range (IQR) = 5% – 50% and 10% – 40%).47 However, the relative risk was lower for 2030 questions at 3.6x (mean; IQR = 1x – 1.5x), vs. 2050 – 2070 questions at 4.6x (mean; IQR = 1x – 1.5x).




3.4 Concerned experts’ question ratings
2030 questions
| Question | Mean POM | P(c) | RR (P(U|c) / P(U)) | Mean POM-z | Pairwise wins | n |
| Administrative disempowerment warning shot (CX30) | 1.26% | 37% | 1.9 | 0.94 | 87% | 5 |
| AI autonomous purchasing (EX30) | 0.98% | 54% | 1.6 | 0.06 | 75% | 4 |
| Deceptive AI warning shot (ZD30) | 0.85% | 66% | 1.1 | 0.10 | 66% | 5 |
| Deep learning revenue (VL30) | 0.64% | 17% | 1.2 | 0.16 | 48% | 5 |
| AI arms race, multipolar result (NG30) | 0.37% | 38% | 1.1 | -0.33 | 41% | 4 |
| Extinction-level pathogens feasible (CQ30) | 0.18% | 25% | 0.8 | -0.59 | 18% | 5 |
| Kurzweil/Kapor longbet (STQ9)* | 0.15% | 57% | 0.8 | -0.41 | 15% | 5 |
*Denotes external questions not generated as part of the conditional tree process.
Concerned experts’ top-rated question, “Administrative disempowerment warning shot” (CX30), refers to a scenario in which “AI systems are preventing a substantial number of people from accessing financial assets, voting, moving freely, accessing the internet, or accessing certain types of housing or employment.” It scores best on both POM and POM-z, and while its position is bolstered by an outlier, it is also generally rated well among respondents.48
No mean POM differences between questions were significant in this sample (after correcting for multiple testing, all p-values were equal to 1). The filtering round elicitation for these questions appeared to be a poor proxy for expert judgments in the main survey round (see the “Methods” section for more details on the filtering round elicitation).49




2050-2070 questions
| Question | Mean POM | P(c) | RR (P(U|c) / P(U)) | Mean POM-z | Pairwise wins | n |
| No aligned AGI (CX70) | 14.71% | 46% | 1.5 | 0.53 | 95% | 6 |
| High AI investment, low safety indicators (VL70) | 10.19% | 19% | 4.2 | -0.05 | 80% | 5 |
| Human-machine intelligence parity (STQ247)* | 4.19% | 60% | 1.4 | 0.11 | 56% | 4 |
| Power-seeking behavior warning shot (ZA50) | 3.00% | 54% | 1.4 | 0.56 | 47% | 5 |
| AI CEOs / Research productivity (EX50) | 1.12% | 46% | 1.2 | -0.59 | 22% | 4 |
| Less prosocial behavior / Failing institutions (HS50) | 0.25% | 43% | 0.9 | -0.63 | 0% | 6 |
*Denotes external questions not generated as part of the conditional tree process.
Concerned experts’ top-rated question by POM was “No aligned AGI” (CX70), which not only ranked well among this set, but also achieved a very high absolute percentage of maximum VOI of nearly 15%. This question also performed very well on sensitivity analysis, and was judged to be highly probable for this question set at 45.86%. It carried the second highest relative risk at 1.5x, but no respondents gave extremely high relative risk estimates.
The top question by POM-z, “Power-seeking behavior warning shot” (ZA50), had only middling rank by POM, but nonetheless an objectively high POM value of 3%. It was judged to be highly probable at 53.6%, with a moderate relative risk (mean=1.4x).
No mean POM differences between questions were significant in this sample (after correcting for multiple testing, the closest to significance was CX70 vs. HS50 at p = 0.638). The filtering round elicitation for these questions appeared to be a moderately good proxy for expert judgments in the main survey round (see the “Methods” section for more details on the filtering round elicitation).50
2030 vs 2050/2070 questions
Overall, this set of experts seems to have judged the 2050/2070 set of questions as more informative than the 2030 set: they on average achieved a POM of 5.9%, vs. 2030 questions at 0.63% (2030 IQR = 0.02% – 0.92%; 2050/2070 IQR = 0.18% – 6.6%). This difference appears to be a genuine result, with p = .043; it is robust to the removal of any particular question or respondent.
Probability of positive resolution looks quite similar between 2030 and 2050-2070 questions, at 42% and 44% respectively (2030 IQR = 15% – 66%; 2050-2070 IQR = 30% – 60%). Relative risk for later questions was higher in our sample, with an average of 1.8x vs. 2030 questions at 1.2x (2030 IQR = 1.0 – 1.2x; 2050-2070 IQR = 1.0 – 2.0x).




4. How does the AI conditional tree question set compare?
Because the conditional trees method is intensive, whether it is ultimately useful depends on whether the questions it generates are substantially better than those generated in cheaper ways.
Hundreds of forecasting questions are publicly available on online forecasting platforms, such as Metaculus, Good Judgment Open, Hypermind, and Manifold Markets. Some of these platforms use a large degree of crowd-sourcing in constructing their question base, though most also employ professional question-writers, and may also receive commissions for forecasting questions on specific topics from other organizations. These questions could be said to represent the “status quo” of question-writing in the field of forecasting.
Forecasting platforms are generally focused on making accurate predictions by aggregating many people’s forecasts and usually allow participants to choose which questions to forecast. The questions that are popular on forecasting platforms are often questions that are important in themselves, more than as indicators of other events.51 Because they are not primarily trying to find high VOI questions, it should not be surprising that a deliberate attempt to maximize for VOI would result in higher VOI questions. Nonetheless, we think this result is useful for people trying to use forecasting for policy and other planning purposes. Higher VOI questions are likely more useful as cruxes for future decisions, so these results suggest that investing resources in finding high VOI questions may result in questions that are more useful than those generated by existing platforms.
We built a dataset of such questions for comparison with those generated by the conditional tree process. Comparable questions, that is, those related to medium- and long-term events connected with AI, were concentrated in a small number of platforms.52 Below we refer to these questions as the “status quo set”.
We compared the questions generated through conditional trees (the AICT set) with questions in the status quo set in three ways:
- Value of Information (VOI): how informative are the questions in expectation? That is, how much would knowing the answer to a question inform forecasts on the ultimate question? See Appendix 2 for more on VOI in this project.
- Based on a survey of skeptical superforecasters, most of the questions from the AICT set were more informative than top questions in the status quo set (n=8 on main survey; 7 on status quo survey).
- Distribution of question topics: do the questions in the AICT set cover substantially different topics than those in the status quo set?
- For both sets, a majority of questions (59% and 72% for AICT and status quo sets, respectively) fell into the “Acceleration” category, which includes questions related to AI capabilities or investment in AI. For the three other topic categories—Social / Political / Economic, Alignment, and AI harms— there was a noticeable difference between the AICT set and the status quo set. In the AICT set, there were similar numbers of questions in each of the three categories, while in the status quo set, there were more “Social / Political / Economic” questions than “Alignment” or “AI harms” questions.
- Uniqueness: within a given topic area, did the questions we generated address specialized expert interests that were not covered by questions in the status quo set?
- This comparison is the most preliminary and speculative: a member of our team simply rated questions on how much and in what ways the questions articulated issues important to experts in ways not addressed by the status quo set. Overall, this analysis suggests that conditional trees may be effective at finding forecast questions not captured by current prediction platforms.
As discussed above, we are comparing the questions generated by the conditional trees method to other questions primarily as a demonstration of the types of analysis that are possible with conditional trees. We expect that the actual results would differ significantly if the study were run again with more participants and do not recommend interpreting these results as decisive evidence.
4.1 VOI comparison (skeptical superforecasters)
Using the same survey methodology as in our main question-rating survey (see Methods), we conducted a follow-up survey with the skeptical superforecaster sample (n=7) to obtain VOI ratings for a sample of the top AI-related status quo questions. This survey included eight status quo questions selected for their popularity among platform users at the time of collection (see selection criteria in Appendix 3.2). We also included two additional questions from the AICT set that were not included in the main question-rating survey.
Of the ten status quo questions for which we elicited VOI, nearly all were judged to be less informative by our superforecaster sample than nearly all AICT questions for which we elicited VOI (see table 4.1). Notable exceptions are “EX30,” an AICT question which scored lower than all but three status quo questions, and the status quo questions “STQ9” and “STQ205” which scored higher than four AICT questions.
The mean informativeness of AICT questions resolving in 2030 was higher than that of status quo questions resolving in the same year, with p = .025. In this group, AICT questions were deemed, on average, nine times more informative than status quo questions. We did not find a significant effect for 2050-2070 questions (p = .10), although in our sample AICT questions were still eleven times more informative on average.
| Question | Mean POM VOI |
| AI causing deaths, ineffectual response (CX50) | 6.34% |
| Administrative disempowerment warning shot (CX30) | 3.55% |
| Deep learning revenue (VL30) | 1.68% |
| Power-seeking behavior warning shot (ZA50) | 1.59% |
| Extinction-level pathogens feasible (CQ30) | 1.37% |
| Deceptive AI warning shot (ZD30) | 0.98% |
| AI involvement in nuclear arms (HB30) | 0.68% |
| High AI investment, low safety indicators (VL70) | 0.54% |
| No aligned AGI (CX70) | 0.37% |
| Superalignment success (STQ205 / STQ215)* | 0.28% |
| Kurzweil/Kapor Turing Test longbet (STQ9)* | 0.27% |
| AI CEOs / Research productivity (EX50) | 0.26% |
| Less prosocial behavior / Failing institutions (HS50) | 0.26% |
| AI arms race, multipolar result (NG30) | 0.26% |
| Brain emulation (STQ196)* | 0.23% |
| Human-machine intelligence parity (STQ247)* | 0.14% |
| Compute restrictions (STQ236)* | 0.13% |
| US AI x-risk opinions (STQ19)* | 0.12% |
| AI novel reading (STQ152)* | 0.05% |
| AI autonomous purchasing (EX30) | 0.02% |
| RoboCup (STQ232)* | 0.02% |
| AI movies (STQ47)* | 0.00% |
| LLM chess (STQ149)* | 0.00% |
4.2 Distribution of question topics
To understand whether the expert conditional tree elicitation produced questions with a substantially different topic focus than the crowdsourced “status quo” question set, we developed a category rating scheme and applied it to both question sets. For a description of the rating scheme, see Appendix 3.1.
For both sets, a plurality of question categorisations53 (36% and 48% for AICT and status quo sets, respectively) fell into the “Acceleration” category, which includes questions related to AI capabilities or investment in AI, though this was somewhat more pronounced in the status quo set. For the AICT set, the three other categories had relatively similar proportions to one another. However, the status quo set had a larger proportion of “Social / Political / Economic” question categorisations (33%) than “Alignment” questions (12%) or “AI harms” questions (7%).54
| Category | AICT question set | Status quo question set |
| Social / Political / Economic | 24% (29) | 33% (131) |
| Alignment | 20% (25) | 12% (47) |
| AI harms | 20% (25) | 7% (27) |
| Acceleration | 36% (44) | 48% (191) |
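The percentages in the table can be reproduced from the counts in parentheses. The sketch below is only a consistency check on the table as printed; it assumes each count is a number of question categorisations and that each column's total is the sum of its four counts.

```python
# Categorisation counts copied from the table above: (AICT, status quo).
counts = {
    "Social / Political / Economic": (29, 131),
    "Alignment": (25, 47),
    "AI harms": (25, 27),
    "Acceleration": (44, 191),
}

def percent_shares(column):
    """Return each count as a whole-number percentage of the column total."""
    total = sum(column)
    return [round(100 * n / total) for n in column]

aict = [a for a, _ in counts.values()]
status_quo = [s for _, s in counts.values()]

print(sum(aict), percent_shares(aict))              # 123 [24, 20, 20, 36]
print(sum(status_quo), percent_shares(status_quo))  # 396 [33, 12, 7, 48]
```

The AICT total of 123 categorisations exceeds the 75 questions in that set, which is consistent with a single question being able to receive more than one category.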
4.3 Uniqueness
Beyond high-level topic overlap, to what extent were the interests of our expert sample already represented in the status quo question set, and where did our question set add novel content?
Answering this question thoroughly is beyond the scope of this report, but we will share some observations here. To demonstrate a method for assessing uniqueness, one teammate rated questions from the “Alignment” topic area on several dimensions of uniqueness:55
- Conceptual uniqueness: how much did the question prompts generated by the conditional trees method capture expert interests not captured by the status quo set?
- Of the 31 question prompts in the “Alignment” category,56 only two were totally or mostly captured by an existing question in the status quo set. 12 questions were “partly captured,” 12 were “mostly uncaptured,” and five were wholly uncaptured. These ratings suggest that this method may be effective at finding forecasting questions not captured by current prediction platforms.
- We thought that experts’ interests within the “developer perception” and “power-seeking” themes were particularly poorly represented by the status quo set;57 few questions pertaining to these themes existed in the status quo set (one and two, respectively), and those that existed were relatively narrow or dissimilar to the expert prompts.
- Operationalization uniqueness: how unique was the operationalization generated by the conditional trees method, compared to the status quo question we thought was most similar?
- Operationalization uniqueness could refer to different subject matter, different operationalization strategies for similar subject matter, or an expectation of uncorrelated question resolutions. Purely linguistic differences between question texts were not considered as part of “uniqueness.”
- Operationalized question texts were rated independently of question prompts; thus, if a question prompt specified unique subject matter and this was reflected in the operationalization of a question, this counted toward both conceptual uniqueness and operationalization uniqueness.
- Overall, our operationalizations were fairly different from those in the status quo set: none were extremely similar and one had only minor differences.
- Of the others, 9 had moderate differences, 15 were very different, and 4 were almost entirely different.
- For a preliminary quantitative analysis of these results and a discussion of “conjunctive uniqueness,” see Appendix 3.2.
5. Discussion
5.1 Takeaways relating to the conditional trees method
The conditional trees method produced novel and informative forecasting questions.
Forecasting communities have shown great interest in questions related to AI, which number in the hundreds on forecasting platforms. Yet relatively little has been done to evaluate the extent to which questions on existing platforms are either informative or relevant to the interests of AI experts, and similarly, little has been done to systematically improve the quality of forecasting questions.
By directly targeting expert interests via a specialized interview and question-writing pipeline, the conditional trees process provided an original method of improving on the status quo, producing suggestive evidence that this process could lead to novel and highly informative questions.
Drawing on 24 one-hour interviews, our team created 75 AI forecasting questions (the AICT set). In a small-sample comparison of POM VOI ratings from superforecasters (n=8 and n=7 for the main and supplementary surveys, respectively), 12 of the 13 surveyed AICT questions scored higher than 8 of the 10 popular status quo questions. The table below shows a comparison of the top 5 questions generated by the conditional trees method to the top 5 questions taken from existing platforms, where the questions taken from existing platforms are marked with an asterisk.
| Question | Mean POM VOI |
| AI causes large-scale deaths, ineffectual response (CX50) | 6.34% |
| Administrative disempowerment warning shot (CX30) | 3.55% |
| Deep learning revenue (VL30) | 1.68% |
| Power-seeking behavior warning shot (ZA50) | 1.59% |
| Extinction-level pathogens feasible (CQ30) | 1.37% |
| Superalignment success (STQ205 / STQ215)* | 0.28% |
| Kurzweil/Kapor Turing Test longbet (STQ9)* | 0.27% |
| Brain emulation (STQ196)* | 0.23% |
| Human-machine intelligence parity (STQ247)* | 0.14% |
| Compute restrictions (STQ236)* | 0.13% |
Crowd-sourced question sets may have some basic practical limits set by the fact that the crowd is often made up largely of laypeople, whereas experts’ specialized knowledge gives them access to other parts of the “question space.” This could suggest that achieving more active expert participation in crowd-sourcing efforts would improve their output. However, it may be difficult to structure such efforts in a way that effectively incentivizes expert engagement, for a number of possible reasons:
- Experts’ time is valuable, so they may feel disinclined to participate in crowd-sourcing efforts where their contributions may seem like a “drop in a bucket”.
- Rewards for high-value contributions may be poorly aligned with experts’ motivations, if for example they are only rewarding in the context of a specific community (e.g., website karma); if they are insufficiently large for the opportunity cost (e.g., a monetary reward that would be lower than the expert’s equivalent hourly consulting fee); or if they are allocated perversely (e.g., preferentially to those more embedded in the forecasting community).
- Expert attrition from friction within the pipeline may be high, if for example a user interface has a steep learning curve. Experts are likely to be both more time-poor and older than the average user of an online forecasting platform.
Beyond simple expert engagement, the conditional tree question generation process likely contributed to the quality of the results. In interviews, many experts remarked that the conditional tree elicitation prompted them to think in novel ways, and to generate content that they otherwise would not have. Additionally, experts were not required to turn this content into fully operationalized forecasting questions, a time-consuming task which few of them had significant experience with, as this step was instead completed by a question-writing team.
However, the value of the AICT question generation exercise rests in part on the response of forecasters. Arguably, the primary object of interest in forecasting to policymakers is the forecasts, without which questions have limited value. And regardless of the AICT questions’ novelty or ostensible “informativeness” (from a VOI standpoint), they may not be so informative if forecasters fail to engage with them.58
The conditional trees method requires significant time and labor to generate forecasting questions.
While the conditional trees method can generate novel and informative questions that align with expert interests, its usefulness may be limited for those who cannot invest significant time and labor into the process. The method requires a considerable amount of effort to implement effectively, which could outweigh its benefits for individuals or organizations with limited resources.
In particular, maintaining consistent expert engagement throughout all phases of the process proved challenging. Although experts were willing to engage in the question-generation phase of the conditional trees process, they showed significantly less enthusiasm for participating in the question-judging phase. Providing VOI estimates is relatively labor-intensive: for each question, one must generate a forecast for that question’s probability of resolving positively, and a further conditional probability of some ultimate outcome given the question’s resolution.
A high quality forecast often requires both a careful reading of the question’s terms, and some amount of research into base rates of relevant phenomena, the forecasts of others on similar questions, and so forth. Relative to professional superforecasters, experts usually have much less practice producing a large volume of forecasts quickly.
5.2 Takeaways for AI risk detection
AI alignment and concrete harms from AI are front of mind for concerned experts.
Across 21 interviews with experts concerned about AI risk, 15 experts named indicators relating directly to AI alignment as the best warning signs of AI-related extinction by 2100. For example, some experts thought we might see compelling evidence that powerful misaligned systems existed, as in the following:
Expert prompt (ID: MD30)
Advanced, planning, strategically-aware AI (as defined by Carlsmith) is developed.
Others worried that AI alignment research agendas may fail to bear fruit, or that AI developers would broadly disregard safety considerations. Questions related to alignment made up 35% of the AICT question set.
Questions relating to concrete harms caused by AI were also prevalent: they featured in 10 expert interviews, and made up 33% of the AICT question set.59 Such questions included themes of interaction between AI technology and other sources of risk, such as biological or nuclear weapons; powerful AI systems manipulating humans; small-scale disasters causing human deaths or financial damage; and threats to political or societal order. The details of some of the questions that experts found the most informative are included in Table 5.2 below.
| Question | Summary |
| No aligned AGI (CX70) | This question resolves positively if there is no AGI system which meets several ‘alignment’ criteria by 2070. Those criteria include AGI-like capabilities (high question answering skill on hard questions in diverse subjects, high programming ability) and alignment (performing well on SOTA alignment tests, and expert surveys indicating that experts believe that AI is aligned). |
| Power-seeking behavior warning shot (ZA50) | This question resolves positively if, by 2050, AI developers must attempt to disable or destroy an AI (costing >$5M) after it exhibits power-seeking behaviors such as acquiring significant resources, controlling financial accounts, manipulating humans, infiltrating key systems, seizing control of weaponry, self-propagation, or engaging in hacking. |
| Administrative disempowerment warning shot (CX30) | This question resolves positively if any of the following occur by 2030: AI systems in a country freeze assets affecting ≥1% of the population or ≥0.5% of GDP for over a year; prevent ≥5% of voting-age citizens from participating in a national election; restrict >3% of residents’ movement for 3+ months; limit >10% of the population’s internet access for 3+ months; or legally restrict >5% of citizens’ access to certain housing or employment for 1+ year. |
By contrast, the set of existing AI forecasting questions on crowdsourced platforms (the “status quo set”) features a smaller proportion of such questions, just 18% and 10% for “alignment” and “harms” categories, respectively. A larger proportion of questions in this set related to “acceleration” of AI technologies, or to economic, commercial, and sociopolitical topics.
Beyond the implications for the forecasting ecosystem, concerned experts’ preference for direct indicators of AI alignment or harms holds potential lessons for policymakers. For example, if current efforts by governments and regulatory bodies to monitor the nascent AI industry are heavily focused on tracking emerging AI capabilities or industry investment, our results suggest such signals may be overvalued from an existential risk perspective.
However, the expert VOI judgments from this report can only offer relatively weak evidence for experts’ views on the informativeness of questions. The sample of experts who provided forecasts was extremely small (n=11).
Concerned experts and skeptical superforecasters may disagree about which questions best indicate heightened AI risk.
While the skeptical superforecasters and concerned experts had some notable disagreements, they did find a few questions similarly informative. Three out of 13 surveyed questions scored in the top half of questions (by POM VOI) for both groups:
| Question | Res year | Superforecasters: Mean POM | Superforecasters: Mean POM-z59 | Experts: Mean POM | Experts: Mean POM-z |
| Administrative disempowerment warning shot (CX30) | 2030 | 3.55% (1) | 0.28 (4) | 1.26% (5) | 0.94 (1) |
| Power-seeking behavior warning shot (ZA50) | 2050 | 1.59% (3) | 0.75 (1) | 3.00% (4) | 0.56 (2) |
| High AI investment, low safety indicators (VL70) | 2070 | 0.54% (6) | 0.62 (2) | 10.19% (2) | -0.05 (8) |
But they also had nearly opposite opinions of four questions, with one group ranking each of these four among the most informative questions and the other considering it among the lowest:
| Question | Res year | Superforecasters: Mean POM | Superforecasters: Mean POM-z | Experts: Mean POM | Experts: Mean POM-z |
| Extinction-level pathogens feasible (CQ30) | 2030 | 1.37% (4) | 0.57 (3) | 0.18% (12) | -0.59 (12) |
| AI autonomous purchasing (EX30) | 2030 | 0.02% (13) | -0.58 (12) | 0.98% (7) | 0.06 (7) |
| Human-machine intelligence parity (STQ247) | 2050 | 0.14% (12) | -0.61 (13) | 4.19% (3) | 0.11 (5) |
| No aligned AGI (CX70) | 2070 | 0.37% (7) | -0.23 (9) | 14.71% (1) | 0.53 (3) |
Notably, both experts and superforecasters appear to find questions relating to concrete harms from AI to be informative, whereas superforecasters and experts disagree about the relative informativeness of questions relating to AI alignment. Unlike experts, superforecasters do not appear to place significant value on questions relating to AI alignment. However, very small sample sizes, plus the potential for high variation in individual rater responses over time, prevent us from ruling out noise as an explanation for these patterns.
6. Limitations of Our Research
Limitations of our research include:
- The total number of participants in this study was very small. It is therefore likely that some of the results would not be replicated in a larger study.
- This study involves eliciting long-range forecasts, but there is little evidence that these forecasts are accurate. Most studies of judgmental forecasting measure accuracy on 0-2 year time horizons, which is likely much easier than forecasting outcomes on 5+ year time horizons (in this study we typically asked for forecasts resolving between 2030 and 2100).60 If forecasts over long time horizons are not generally reliable, then these conditional trees would not be providing a useful signal.
- Since conditional trees are composed of conditional forecasts, their reliability depends on the assumption that conditional forecasts are meaningful. However, we do not know whether people are accurate when making conditional forecasts. There is little experimental evidence on how best to elicit conditional forecasts. Some reasons to expect that conditional forecasts may not be robust or accurate include:
- Intuitively, conditional forecasting seems difficult. Our team often finds generating and understanding forecasts on these questions to be challenging, so we would expect others to find it so also.
- Case in point, the forecasters we surveyed often initially struggled to provide conditional forecasts that were logically coherent. Their conditional forecasts implied that the probability of the ultimate question and the crux resolving positively was greater than the probability of the ultimate question resolving positively, an issue known as the conjunction fallacy.
- This study asked people to make forecasts in an exceptionally short period of time in the filtering stage: one minute per question. These “short-fuse forecasts” may be less reliable than forecasts that involve higher degrees of thought and effort. Participants spent longer amounts of time on the forecasts that inform VOI calculations.
- Participants in this study were all either experts who are highly concerned about existential risks from AI, or superforecasters who are not. As a result, we are not able to separate differences caused by risk assessment from differences caused by forecasting aptitude, professional training, or other factors.
- AI developments seem particularly challenging to predict, and forecasters on this topic in past FRI projects have emphasized their uncertainty. As a result, their predictions about future AI developments, especially those that will not resolve for many years, may not be reliable enough to be practically useful.
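The coherence problem described in the list above can be checked mechanically. The following is a minimal sketch, not the procedure used in this study: it assumes a respondent supplies an unconditional forecast P(U), a crux probability P(c), and a conditional forecast P(U|c), and it flags cases where the implied joint probability P(U and c) = P(U|c) × P(c) exceeds P(U). The function name and example values are hypothetical.

```python
def check_conditional_forecast(p_u, p_c, p_u_given_c):
    """Return (coherent, joint, implied P(U | not c)) for one set of forecasts."""
    for p in (p_u, p_c, p_u_given_c):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # Joint probability that both the ultimate question and the crux resolve positively.
    joint = p_u_given_c * p_c
    # A joint probability above P(U) is logically impossible.
    coherent = joint <= p_u
    # Law of total probability: P(U) = P(U|c)P(c) + P(U|not c)(1 - P(c)).
    # The implied P(U | not c) is undefined when P(c) = 1.
    implied = (p_u - joint) / (1.0 - p_c) if p_c < 1.0 else None
    return coherent, joint, implied

# Hypothetical incoherent response: P(U) = 1%, P(c) = 50%, P(U|c) = 3%.
print(check_conditional_forecast(0.01, 0.5, 0.03)[0])   # False

# Hypothetical coherent response: P(U) = 1%, P(c) = 50%, P(U|c) = 1.5%.
print(check_conditional_forecast(0.01, 0.5, 0.015)[0])  # True
```

An incoherent response also shows up as a negative implied P(U | not c), which may be an easier signal to surface to a respondent during elicitation.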
7. Next Steps
Further research related to this topic could include:
- Assessing whether the questions identified through this process continue to perform better than status quo forecasting questions (in terms of value of information) when a larger number of people forecast on them. We have added relevant questions from this project to two forecasting platforms (see Appendix 7 for links) and will be interested to see whether they receive many forecasts and how their value of information compares to other questions.
- So far, public forecasting platforms have not applied question metrics like VOI to their questions or incentivized questions that are unusually informative or decision-relevant. It’s possible that incentives on those platforms could produce questions as good as the ones identified by the trees method. In general, we would be interested to see forecasting platforms implement the kinds of question metrics discussed in this report so that questions can be sorted according to value of information on major topics such as AI existential risk.
- We have had some discussions with forecasting platforms like Metaculus and hope that metrics like the ones used in this project can help platforms find the highest-value questions.
- Replicating the conditional trees process with larger sample sizes and in other domains. For example, would this process also identify more informative questions on topics such as nuclear policy and climate change?
- In particular, choosing domains where important questions will resolve sooner could help assess how useful the conditional trees process is.
- As the questions in the trees resolve (beginning in 2030), participants could be re-surveyed to see how well conditional trees performed.
- For example, once we know whether the 2030 questions have happened or not, we could ask participants for their new forecast on the probability of extinction due to AI by 2100, and see if it is similar to what was predicted by the conditional trees.
- Would other research groups or organizations be able to replicate and run their own conditional tree interview process based on the information in this report and the resources we provide?
- FRI recently completed another research project with a similar goal: an adversarial collaboration project (a) that brought together generalist forecasters and domain experts who disagreed about the risk AI poses to humanity in the next century and asked them to work together to find questions that underlie their disagreement.
- Comparing the questions from the two methods may help us understand the merits of each approach, so that we can design better forecasting questions and elicitation processes on AI and other topics.
- In particular, in both projects, people who were less concerned about extinction due to AI by 2100 tended to value questions that focused on concrete harms caused by AI, while those more concerned were more likely to value questions regarding advanced capabilities or whether artificial intelligence had been successfully aligned.
- This may be related to each group’s expectations of how difficult it will be to align a powerful AI model: participants skeptical of AI risk were likely to think that alignment is a technical problem that is not fundamentally different from problems that people have previously solved and that we are likely to come up with workable solutions when we need to. If this is true, there may be useful cruxes related to ease of alignment.
- FRI also conducted a conditional trees experiment focused on forecasting the outcome of baseball games. Future work could examine those results alongside the AI results for additional tests of the conditional trees method.
Data Availability
Survey data from the filtering round, main survey, supplementary survey, and the question combinations survey are available at the previous links.
Notes
- We will refer to this set of forecasters as “superforecasters” henceforth. Note that while seven of the forecasters are Superforecasters™ as officially designated by Good Judgment Inc., one is a skilled forecaster who does not have that label but has a comparable track record of calibrated forecasts. ↩︎
- To ensure the integrity of links in this report, we include stable archive.org links in parentheses after each citation to an external URL. ↩︎
- More specifically, the ultimate question was defined as the global human population falling below 5,000 individuals at any time before 2100, with AI being a proximate cause of such reduction. ↩︎
- “Plausible” meaning that the forecaster deemed the indicator event to be at least 10% likely to occur. This 10% probability was not necessarily an unconditional probability, but may have been conditional on a previous node in the conditional tree. ↩︎
- By “informative,” we mean that knowing the answer to one of these questions would make a larger difference, in expectation, to a participant’s forecast of the ultimate question, in this case, “Will AI cause human extinction by 2100.” For more on informativeness and the metric we use to assess it, see the section on Value of Information (VOI). Forecasting platforms are generally focused on making accurate predictions by aggregating many people’s forecasts and usually allow participants to choose which questions to forecast. The questions that are popular on forecasting platforms are often questions that are important in themselves, more than as indicators of other events, and the platforms are not deliberately attempting to find high-VOI questions. ↩︎
- For more on the question filtering process, see Section 2.2. ↩︎
- The four lowest-scoring AICT questions – EX50, HS50, NG30, and EX30 – ranked 12th, 13th, 14th, and 20th out of 23, respectively. ↩︎
- At the time of data collection, we had not yet developed the POM VOI metric, so participants were not deliberately optimizing for it. Later, we found that POM VOI captured the idea of question informativeness better than VOI alone, which yields a number that is hard to interpret and contextualize. For a full list of questions analyzed, see Table 3.1.3. A comprehensive explanation of the POM VOI metric can be found in Appendix 4. ↩︎
- Careful readers will note that the probabilities in this figure do not yield the mean POM VOI values we report (see Table E.1). Mean POM VOI tells us how valuable a crux is for a group, on average, by computing POM VOI at the individual level and then aggregating. The average relative updates, across individuals in the same group, sometimes tell a quite different story. ↩︎
- Several related methods, such as Delphi and Bayesian Network elicitation, may be useful to forecasting research in similar ways. See Bernice B. Brown, “Delphi Process: A Methodology Used for the Elicitation of Opinions of Experts,” Rand Corporation report (September 1968) and Judea Pearl, Probabilistic Reasoning in Intelligent Systems (New York, Morgan Kaufmann: 1998). ↩︎
- Karger et al., “Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament,” 2023. https://forecastingresearch.org/research/existential-risk-persuasion-tournament (a) (XPT report). ↩︎
- These numbers are intended to be illustrative and are not based on actual vaccine data. ↩︎
- Judea Pearl, “From Bayesian Networks to Causal Networks,” in Mathematical Models for Handling Partial Knowledge in Artificial Intelligence, ed. Giulianella Coletti et al., 160. Boston, MA: Springer, 1995. https://doi.org/10.1007/978-1-4899-1424-8_9 (a) ↩︎
- This relationship can be causal, but it does not need to be; in this project we did not constrain conditional trees to only causal relationships, nor did we probe expert models for causality in the interviews. ↩︎
- We defined “high risk” as forecasting >10% chance of extinction due to AI by 2100, and low risk as <10%. We defined “long AI timelines” as forecasting >30 years until transformative AI or artificial general intelligence and “short AI timelines” as less than 30 years. ↩︎
- Karger et al., XPT report. ↩︎
- One interviewee is not represented in this graph because in the interview, they responded “>0.1%, <50%” rather than give a point estimate. ↩︎
- One interviewee is not represented in this graph because in the interview, they responded “>0.1%, <50%” rather than give a point estimate. ↩︎
- Interviewers were: Tegan McCaslin (11/24 interviews), Josh Rosenberg (10/24 interviews), and Ezra Karger (3/24 interviews). ↩︎
- For a full description of the interview process, see Appendix 6. ↩︎
- This incentive was not explained in further detail given time constraints of the interview. ↩︎
- In the XPT, “Extinction” was defined as “reduction of the global population to less than 5000,” and extinction was considered “due to AI” if AI was the direct or proximate cause of the deaths. This definition encompasses events that would not have occurred or would have counterfactually been extremely unlikely to occur “but for” the substantial involvement of AI within one year prior to the event. For more details, see Karger et al., XPT Report, 134. For some interviewees, this was a question for which they had already devoted substantial time (in the XPT or other contexts) forming a quantitative forecast, and thus such participants were able to offer a relatively quick probability judgment. Most participants had previously spent substantial time thinking about the possibility of AI-related extinction, but not as much time forming a precise quantitative estimate for the date in question, and many expressed hesitancy about their answer in the interview. ↩︎
- These resolution years were chosen to match XPT questions. ↩︎
- Question writers were Tegan McCaslin, Taylor Smith, Josh Rosenberg, Rose Hadshar, Adam Kuzee, Ezra Karger, Arunim Agrawal, and Bridget Williams. One primary question writer was assigned to each question prompt, and would draft several different versions of the question, using the interview notes as an aid to understanding the interviewee’s underlying models. These drafts would receive feedback from the rest of the question-writing team, and in particular from the relevant interviewer. This interviewer had final say over revisions and finalizing the question. ↩︎
- The initial screen was not simply a VOI threshold. To get a diverse question set, we wanted to include at least one question from each of the following categories: 1) high VOI for superforecasters, 2) high VOI for experts, 3) high VOD between experts and superforecasters, 4) jointly high VOI between superforecasters and experts, 5) randomly chosen representative of the bottom half of the AICT question set, and 6) top comparable question from outside the AICT set. Choosing cutoffs separately for each of these categories resulted in thirteen questions. ↩︎
- Participants gave estimates for the probability of the question resolving positively (P(c)), and the probability of AI extinction conditional on the question resolving positively (P(U|c)). We then used these figures to calculate each respondent’s VOI for each question. ↩︎
- The “concerned expert proxies” were teammates or collaborators who had had extensive contact with concerned experts, and whom we expected to be able to model this group’s views well. ↩︎
- Instead of giving probability judgments on all 75 questions, the concerned expert proxies chose and rank-ordered their top 10 questions from each of: the set of first-tier nodes (usually 2030); the set of second-tier nodes (2035-2050); and the set of third-tier nodes (2040-2070). They then provided short-fuse VOI judgments for only the questions they had ranked in their top 10 for each position. ↩︎
- For the concerned expert-proxy data, we ranked questions via ranked choice voting. We also employed the value of discrimination (VOD) metric, which measures how much a question is expected to change the disagreement between two forecasters (see Appendix 4). VOD was determined by the median of pairwise VOD across both skeptical superforecasters and concerned expert-proxies. We excluded questions that closely resembled other questions ranked higher, those that the question-writing team did not operationalize, and those with the lowest individual-level VOI ranking. ↩︎
- The filtered question set included the following questions (see Table 3.1.3 for concise question summaries). Node 1 (dates up to 2030): CQ30, the VOI top-ranked node for skeptical superforecasters; VL30, the VOI top-ranked node for concerned expert proxies (also ranked 2nd for VOD); NG30, the VOI 2nd-ranked node for concerned expert proxies; CX30, the top-ranked node for VOD; ZD30, included for having relatively good agreement on high VOI between groups; EX30, randomly chosen from the set of nodes ranked in the bottom half by both groups, as a check on the validity of the filtering process; and STQ9, a question from outside our question set, the most-upvoted AI question on Metaculus resolving around 2030. Node 2 (2031-2070): ZA50, the VOI top-ranked node for skeptical superforecasters; EX50, the VOI top-ranked node for concerned expert proxies; VL70, the VOI 2nd-ranked node for concerned expert proxies; CX70, the top-ranked VOD node; HS50, randomly chosen from the set of nodes ranked in the bottom half by both groups; and STQ247, the most-upvoted AI question on Metaculus resolving post-2030. ↩︎
- The eight superforecasters in this sample took part in FRI’s Adversarial Collaboration project that brought together generalist forecasters and domain experts with divergent views on AI’s long-term risks to humanity. See Forecasting Research Institute, Roots of Disagreement on AI Risk: Exploring the Potential and Pitfalls of Adversarial Collaboration (2024) (a). ↩︎
- For more detail on the selection pool, see Roots of Disagreement on AI Risk. The “AI-concerned” expert group for this project consisted of domain experts referred to us by our funder and the broader effective altruism community. ↩︎
- Respondents were not able to revise this forecast later in the survey. ↩︎
- Once forecasters submitted their answers on a question, the survey checked for coherence, and then prompted the respondent to revise their answers if the coherence condition was not met. Coherence requires that P(U) > P(U|c)P(c), where P(U) is the forecaster’s probability of the ultimate question U resolving positively, P(U|c) is the probability of U resolving positively if the crux c resolves positively, and P(c) is the probability of the crux resolving positively. This coherence prompt was not repeated on a question if the respondent failed to give coherent revised answers on that question. For any answers which remained incoherent after the respondent finished the survey, we followed up and requested revision. ↩︎
- Due to coding errors in an early version of the survey, not all participants were given an opportunity to review their answers in the survey. We instead asked such participants to manually review their answers afterward. ↩︎
- The two questions added to the supplementary survey were HB30 and CX50. See Appendix 1 for full question descriptions. ↩︎
- For details on the selection criteria, see Section 3.2. A z-score indicates how many standard deviations an observation is from the mean and in which direction. David S. Moore, George P. McCabe, and Bruce A. Craig, Introduction to the Practice of Statistics , 6th ed. (New York: W. H. Freeman and Company, 2009), 61. ↩︎
- 8 superforecasters (7-8 respondents per question) and 11 domain experts (4-6 respondents per question). ↩︎
- Most questions in the main question-rating survey were selected based on high scores from either superforecasters or expert “proxy” judges, or both. However, two questions, EX30 and HS50, were randomly selected from the intersection of the bottom half of superforecaster and expert proxy scores. While these questions ranked poorly among superforecasters in the main survey, EX30 notably received the second-highest score from experts. Overall, the correlation between expert “proxy” scores and expert scores in the main question-rating round was weak. ↩︎
- For a description of Kullback-Leibler VOI, see Appendix 4: VOI technical explanation. ↩︎
- The advantages of POM over straight VOI are (i) it is more interpretable; and (ii) it does not penalize respondents with low prior probability P(U). The size of the update is constrained by the prior probability P(U) together with the probability of the crux event P(c) to be less than P(U) / P(c). ↩︎
- In the supplementary survey (see Section 4.1), two superforecasters updated their forecasts slightly, resulting in an average P(U) of 0.26%. ↩︎
- The goal was to choose the most informative questions. The initial selection criteria were to choose the top-ranked question by POM and POM-z for questions resolving in 2030 and 2050-2070 separately, including both where these disagreed. For 2030, we chose CX30 (highest POM) and CQ30 (highest POM-z). For 2050-2070, we chose CX50 based on it having the highest POM. While the selection criteria suggested that VL70 should be selected as the top POM-z question, as a whole the evidence pointed to ZA50 being more informative (higher POM, at 1.59% vs 0.54%; POM-z close to VL70, at 0.53 vs 0.67; and higher under the pairwise wins robustness check, at 87% vs 64%). ↩︎
- Careful readers will note that the probabilities in this figure do not yield the mean POM VOI values we report (see Tables 3.4.1 and 3.4.2). Mean POM VOI tells us how valuable a crux is for a group, on average, by computing POM VOI at the individual level and then aggregating. The average relative updates, across individuals in the same group, sometimes tell a quite different story. ↩︎
- While an extreme data point could typically indicate a coding error, the subcomponents of VOI analysis suggest a genuine answer rather than a common error such as a misplaced decimal. The outlier respondent assigned a low probability (0.5%) to the “administrative disempowerment warning shot” scenario, but provided a substantial update (a 100-fold increase, from 0.1% to 10%) toward AI extinction if the scenario were to occur. In contrast, all other respondents thought the probability of it occurring was higher (mean=18%), but offered smaller updates than the outlying respondent (mean = 1x, with three updating not at all and one updating down). ↩︎
- Or, indeed, the motivations of a misaligned AI system with access to weaponizable technology. ↩︎
- Interquartile range (IQR) is the middle 50%, or the difference between the 25th and 75th percentile forecasts. ↩︎
- The outlier respondent assigns a low probability to the question (5%), but updates substantially (relative risk = 3x), while on average respondents rated the question as having moderate probability (mean=37%) and a moderate relative risk (mean=1.9x). ↩︎
- Proxy ratings for 2030 questions showed strong negative correlation with POM VOI judgments from the small sample of experts in the main survey. They also showed slight negative correlation with the main survey POM-z. Notably, a question randomly chosen from the bottom half of proxy scores ranked second by expert POM (EX30). This suggests that many questions from our larger 2030 set might have performed better than the average question in our main question-rating survey if presented to these particular experts. ↩︎
- The 2050/2070 proxy performed moderately well for our small expert sample, with a correlation between mean expert POM and proxy rank of -0.4, and mean expert POM-z score and proxy rank of -0.5 (a more negative value indicates a stronger correlation, as higher rank orders are considered worse, while higher VOI scores are better). ↩︎
- For example, the top five questions on Metaculus at the time of this writing (July 25, 2024) are “Who will be elected US president in 2024?”; “Five years after AGI, will an AI company be a military power?”; “Five years after AGI, if there are digital people, what will be their population?”; “Who will be the Democratic nominee for Vice President on Election Day 2024 (if Joe Biden is no longer the nominee for President)?” and “When will an AI win a Gold Medal in the International Math Olympiad?” Of those, only “When will an AI win a Gold Medal in the International Math Olympiad?” seems to be interesting primarily because it is an indicator about a more important question. ↩︎
- Out of the 265 questions in our status quo set, 253 of them (~95%) came from just two platforms: Metaculus and Manifold Markets. We included in our set all questions that resolved no earlier than 2027 and were tagged “AI,” “artificial intelligence,” “machine learning,” or similar. Because Manifold Markets had a very large overall volume of questions, and because many questions with little engagement on this platform were duplicates of other questions or otherwise low-quality, we only included Manifold Markets questions that had at least 50 traders at the time of collection. ↩︎
- Questions can fall into multiple categories. ↩︎
- Example questions in each category (many questions fall into multiple categories):
Acceleration: Deep learning revenue (VL30). Revenue from deep learning doubles every two years before 2030.
Social / Political / Economic: AI Socializing (MQ70). Humans talk to AIs more than to humans by 2070.
Alignment: No interpretability progress (ZA40c). By 2040, there are no interpretability tools which allow us to understand the function of state-of-the-art transformer component parts/circuits.
AI harms: Repeated AI harms (HS40). By 2040, there are at least two events in a five-year period in which an AI system used by a major company causes at least 1,000 deaths or damage of $10B. ↩︎
- We chose the “Alignment” category because it was much more prevalent in the AICT set than in the status quo set, suggesting that the questions in that category may be unique in interesting ways. ↩︎
- The 25 questions in the AICT set were divided into different themes for analysis of uniqueness, some of which overlapped. ↩︎
- The theme “power-seeking” covers questions about AI models developing power-seeking or deceptive behavior; the theme “developer perception” covers questions about AI developers’ perception of alignment work. See Appendix 3.2 for additional information about categorization into themes. ↩︎
- Because AICT questions are often complex or technical, we suspect they may be less fun to forecast and therefore attract fewer participants, though this is untested. As an inexpensive experiment, we are posting these questions to two forecasting platforms to see whether they get engagement. We encourage readers to see Appendix 7 for further details on how to submit their own forecasts on these questions. ↩︎
- Questions relating to concrete harms also featured in all three interviews with superforecasters, though this very small sample size makes it difficult to draw any conclusions about superforecasters’ concerns in general. ↩︎
- The careful reader will notice that the values in this column don’t match those found in Tables 3.1.2, 3.3.1 and 3.3.2. This is because the two additional questions (HB30 and CX50) forecasted on by superforecasters are not included in the calculation of z-scores here. ↩︎
- For example, in the Good Judgment Inc. project that compared superforecasters to other participants in an online forecasting competition, the average question was open for 214 days, with the entire tournament taking place over six years. Christopher W. Karvetski, “Superforecasters: A Decade of Stochastic Dominance,” technical white paper (2021): 2, https://goodjudgment.com/wp-content/uploads/2021/10/Superforecasters-A-Decade-of-Stochastic-Dominance.pdf. In addition to extensive research on shorter-term forecasts, Tetlock et al. found that, at least on some types of questions, experts are more accurate than simple base rate extrapolation over 25-year horizons, although they are much less accurate than they were over 0-2 years. Our research asks forecasters to consider forecasts over many decades, and we do not yet know how much accuracy declines over that much longer period. Philip E. Tetlock et al., “Long-Range Subjective-Probability Forecasts of Slow-Motion Variables in World Politics: Exploring Limits on Expert Judgment,” Futures & Foresight Science (2023), 33. ↩︎
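
As an illustration of the coherence condition described in the note on survey checks above, the following minimal Python sketch tests whether a set of forecasts satisfies P(U) > P(U|c)P(c). It also computes two quantities that follow from that condition: the value of P(U|not c) implied by the law of total probability, and the upper bound P(U)/P(c) on P(U|c). The function names are hypothetical and are not taken from FRI's survey code. The example inputs are those of the outlier respondent described in the notes (P(c) = 0.5%, P(U) = 0.1%, P(U|c) = 10%).

```python
def _check_probability(value: float, name: str) -> None:
    """Raise ValueError unless value lies in [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0 and 1, got {value}")


def is_coherent(p_u: float, p_u_given_c: float, p_c: float) -> bool:
    """Return True when P(U) > P(U|c) * P(c), the condition stated in the notes."""
    for value, name in ((p_u, "p_u"), (p_u_given_c, "p_u_given_c"), (p_c, "p_c")):
        _check_probability(value, name)
    return p_u > p_u_given_c * p_c


def implied_p_u_given_not_c(p_u: float, p_u_given_c: float, p_c: float) -> float:
    """Solve P(U) = P(U|c)P(c) + P(U|not c)(1 - P(c)) for P(U|not c)."""
    if not 0.0 < p_c < 1.0:
        raise ValueError("p_c must be strictly between 0 and 1")
    return (p_u - p_u_given_c * p_c) / (1.0 - p_c)


def upper_bound_p_u_given_c(p_u: float, p_c: float) -> float:
    """Largest value P(U|c) can approach while staying coherent: P(U) / P(c), capped at 1."""
    if p_c <= 0.0:
        raise ValueError("p_c must be positive")
    return min(1.0, p_u / p_c)


# Outlier respondent from the notes: P(c) = 0.5%, P(U) = 0.1%, P(U|c) = 10%.
p_c, p_u, p_u_given_c = 0.005, 0.001, 0.10

print(is_coherent(p_u, p_u_given_c, p_c))              # True
print(implied_p_u_given_not_c(p_u, p_u_given_c, p_c))  # implied P(U|not c)
print(upper_bound_p_u_given_c(p_u, p_c))               # bound on P(U|c)
```

With these inputs the forecasts are coherent, since 10% x 0.5% = 0.05% is below the 0.1% prior. The same arithmetic shows that a conditional forecast of 20% or higher would have failed the check for this respondent, because P(U)/P(c) = 0.1% / 0.5% = 20%.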



