Published: Aug 12, 2024

Working Paper #3

Conditional Trees: A Method for Generating Informative Questions about Complex Topics

In this study, we used structured interviews with domain experts and superforecasters to generate questions that provide high "value of information" regarding a far-future outcome.

Tegan McCaslin^*, Josh Rosenberg^*, Ezra Karger^*†, Avital Morris^*, Molly Hickman^*, Otto Kuusela^*, Sam Glover^*, Zach Jacobs^*, Philip E. Tetlock^*‡

* = Forecasting Research Institute
† = Federal Reserve Bank of Chicago
‡ = Wharton School of the University of Pennsylvania

Abstract

We test a new process for generating high-value forecasting questions: asking experts to produce “conditional trees,” simplified Bayesian networks of quantifiably informative forecasting questions. We test this technique in the context of the current debate about risks from AI. We conduct structured interviews with 21 AI domain experts and 3 highly skilled generalist forecasters (“superforecasters”) to generate 75 forecasting questions that would cause participants to significantly update their views about AI risk. We elicit the “Value of Information” (VOI) each question provides for a far-future outcome—whether AI will cause human extinction by 2100—by collecting conditional forecasts from superforecasters (n=8).¹ In a comparison with the highest-engagement AI questions on two forecasting platforms, the average conditional trees-generated question resolving in 2030 was nine times more informative than the comparison AI-related platform questions (p = .025). This report provides initial evidence that structured interviews of experts focused on generating informative cruxes can produce higher-VOI questions than status quo methods.

Acknowledgments

This research would not have been possible without the generous support of Open Philanthropy. We thank the research participants for their invaluable contributions. We greatly appreciate the assistance of Page Hedley, Kayla Gamin, Leonard Barrett, Coralie Consigny, Adam Kuzee, Arunim Agrawal, Bridget Williams, and Taylor Smith in compiling this report. Additionally, we thank Benjamin Tereick, Javier Prieto, Dan Schwarz, and Deger Turan for their insightful comments and research suggestions.

Executive summary

Introduction

From May 2022 to October 2023, the Forecasting Research Institute (FRI)² experimented with a new method of question generation (“conditional trees”). While the questions elicited in this case study focus on potential risks from advanced AI, the processes we present can be used to generate valuable questions across fields where forecasting can help decision-makers navigate complex, long-term uncertainties.

Methods

Researchers interviewed 24 participants, including 21 AI and existential risk experts and three highly skilled generalist forecasters (“superforecasters”). We first asked participants to provide their personal forecast of the probability of AI-related extinction by 2100 (the “ultimate question” for this exercise).³ We then asked participants to identify plausible⁴ indicator events that would significantly shift their estimates of the probability of the ultimate question.

Following the interviews, we converted these indicators into 75 objectively resolvable forecasting questions. We asked superforecasters (n=8) to provide forecasts on each of these 75 questions (the “AICT” questions), and forecasts on how their beliefs about AI risk would update if each of these questions resolved positively or negatively. We quantitatively ranked the resulting indicators by Value of Information (VOI), a measure of how much each indicator caused superforecasters to update their beliefs about long-run AI risk.

To evaluate the informativeness of the conditional trees method relative to widely discussed indicators, we assess a subset of these questions using a standardized version of VOI, comparing them to popular AI questions on existing forecasting platforms (the “status quo” questions). The status quo questions were selected from two popular forecasting platforms by identifying the highest-engagement AI questions (by number of unique forecasters). We present the results of this comparison in order to provide a case study of a beginning-to-end process for producing quantitatively informative indicators about complex topics. (More on methods)

Results

The conditional trees method can generate forecasting questions that are more informative than existing questions on popular forecasting platforms⁵

Our report presents initial evidence that structured interviews of experts produce more informative questions about AI risk than the highest-engagement questions (as measured by unique users) on existing forecasting platforms.

Using predictions made by superforecasters (n=8), we compared the status quo questions to a subset of the AICT questions.⁶ Most of the AICT questions (nine of 13) scored higher on VOI than all 10 status quo questions.⁷

VOI is based on each respondent’s expected update in their belief about the ultimate question, not on how much a participant would update if an event happened. That is, it takes into account how likely the forecaster believes an event is to occur. If an event would result in a large update to a participant’s forecast, but is deemed vanishingly unlikely to occur, it would have a small VOI. If an event would result in a large update, and is also considered likely to occur, it would have a high VOI.
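
To make the weighting concrete, the following is a minimal sketch of one simple way to formalize an expected update: the probability-weighted absolute shift in a forecaster's P(U) once the indicator resolves. The function names and all numbers are illustrative assumptions, not data from the report, and the exact VOI definition used in the report is given in Appendix 4 and may differ from this simplification.

```python
def implied_p_u_given_not_c(p_u, p_c, p_u_given_c):
    """P(U|~C) implied by the law of total probability (assumes a coherent forecaster)."""
    return (p_u - p_c * p_u_given_c) / (1.0 - p_c)


def expected_update(p_u, p_c, p_u_given_c):
    """Probability-weighted absolute shift in P(U) after learning whether C occurred.

    Assumption: an "update" is measured as an absolute difference in probability.
    """
    p_u_given_not_c = implied_p_u_given_not_c(p_u, p_c, p_u_given_c)
    shift_if_c = abs(p_u_given_c - p_u)
    shift_if_not_c = abs(p_u_given_not_c - p_u)
    return p_c * shift_if_c + (1.0 - p_c) * shift_if_not_c


# Hypothetical forecaster: P(U) = 5%, and the event would raise P(U) to 50%.
# The two cases differ only in how likely the event is judged to be.
rare_event = expected_update(p_u=0.05, p_c=0.01, p_u_given_c=0.50)
likelier_event = expected_update(p_u=0.05, p_c=0.08, p_u_given_c=0.50)

print(f"rare event:     {rare_event:.4f}")
print(f"likelier event: {likelier_event:.4f}")
```

In this sketch both events would move the forecaster to the same conditional probability, yet the second scores higher solely because it is judged more likely to occur.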

Table E.1 compares the top five AICT questions to the top five status quo questions, as measured by superforecasters’ ratings of a standardized metric of informativeness, which we call “Percentage of Maximum Value of Information” (POM VOI).⁸ In this table and throughout the report, we refer to questions by their reference numbers. For a full list of the AICT questions and status quo questions selected from forecasting platforms by reference number, with operationalizations and additional information, see Appendix 1.

Question	Mean POM VOI
AI causes large-scale deaths, ineffectual response (CX50)	6.34%
Administrative disempowerment warning shot (CX30)	3.55%
Deep learning revenue (VL30)	1.68%
Power-seeking behavior warning shot (ZA50)	1.59%
Extinction-level pathogens feasible (CQ30)	1.37%
Superalignment success (STQ205 / STQ215)*	0.28%
Kurzweil/Kapor Turing Test longbet (STQ9)*	0.27%
Brain emulation (STQ196)*	0.23%
Human-machine intelligence parity (STQ247)*	0.14%
Compute restrictions (STQ236)*	0.13%

Table E.1: Ratings of how informative AICT questions are relative to status quo questions. The status quo questions are marked with an asterisk.

Focusing on questions resolving in the near term (by 2030), we found that questions generated with the conditional trees method were, on average, nine times more informative than popular questions from platforms (p = .025). For questions resolving in 2050-2070, the difference was not statistically significant, although in our sample AICT questions were still eleven times more informative on average. (More on VOI comparison)

Questions generated through the conditional trees method emphasized different topics than those on forecasting platforms

We also analyzed the extent to which questions taken from existing forecasting platforms effectively captured the topics raised in our expert interviews. We found that some topics (such as AI alignment-related questions and questions related to concrete AI harms) were of substantial interest to experts but had not received proportional attention on existing forecasting platforms, and that questions generated by the conditional trees method were meaningfully different from those taken from existing forecasting platforms.

The table below compares the topical distribution of the AICT questions to the status quo questions. (More on question uniqueness)

Category	AICT question set	Status quo question set
Social / Political / Economic	24% (29)	33% (131)
Alignment	20% (25)	12% (47)
AI harms	20% (25)	7% (27)
Acceleration	36% (44)	48% (191)

Table E.2: Proportion of total questions that fell into each category; numbers in parentheses are total questions per category. Because some questions fell into multiple categories, the raw proportions in each column would sum to more than 100%; they have been normalized to sum to 100% for ease of comparison.
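
The percentages in Table E.2 are consistent with dividing each category count by the sum of the counts in its column. A minimal sketch of that normalization, using the counts from the table (the function name is illustrative, and the report does not spell out the normalization procedure, so this is an inferred reading):

```python
# Category counts as reported in Table E.2.
aict_counts = {
    "Social / Political / Economic": 29,
    "Alignment": 25,
    "AI harms": 25,
    "Acceleration": 44,
}
status_quo_counts = {
    "Social / Political / Economic": 131,
    "Alignment": 47,
    "AI harms": 27,
    "Acceleration": 191,
}


def normalized_percentages(counts):
    """Share of all category assignments per category, rounded to whole percent."""
    total = sum(counts.values())
    return {category: round(100 * n / total) for category, n in counts.items()}


for name, counts in [("AICT", aict_counts), ("Status quo", status_quo_counts)]:
    print(name, normalized_percentages(counts))
```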

We found weak evidence that superforecasters and experts value different types of questions

Given the small sample sizes involved, we are reluctant to make confident claims about the significance of the difference between the opinions of the superforecasters and the experts. However, we do see these results as providing prima facie evidence about which questions are the most informative for each group when making updates on the probability of AI-related extinction.

Our most notable finding when comparing the views of the superforecasters to those of the experts was that the superforecasters tended to value questions that focused on concrete harms caused by AI, rather than the experts’ preference for questions regarding advanced AI capabilities or whether AI had been successfully aligned. (More on AI risk takeaways)

Figure E.1 shows examples of how experts updated on the ultimate question conditional on three of the highest-VOI indicator questions.

**Figure E.1**: A diagram showing how experts update on three questions, with different resolution years, that scored particularly well on our VOI metric. Since experts answered different sets of questions, we derived P(U|C) and P(U|~C) (the probabilities on the bottom level) by multiplying the whole expert group’s average P(U) of 17% by the average relative risk factor for each crux.⁹
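
The derivation described in the caption can be sketched as follows. Only the 17% group average comes from the report; the relative risk factors below are hypothetical placeholders, and capping the result at 1 is an added assumption rather than something the report states.

```python
from statistics import mean

GROUP_MEAN_P_U = 0.17  # expert group's average P(U), from the Figure E.1 caption


def conditional_from_relative_risk(p_u, relative_risks):
    """Scale the group baseline P(U) by the average relative risk factor.

    The cap at 1.0 is an assumption so the result stays a valid probability.
    """
    return min(1.0, p_u * mean(relative_risks))


# Hypothetical per-expert relative risk factors for a single crux.
hypothetical_rr_if_c = [1.5, 2.0, 2.5]      # P(U|C) / P(U) for each expert
hypothetical_rr_if_not_c = [0.8, 0.9, 1.0]  # P(U|~C) / P(U) for each expert

p_u_given_c = conditional_from_relative_risk(GROUP_MEAN_P_U, hypothetical_rr_if_c)
p_u_given_not_c = conditional_from_relative_risk(GROUP_MEAN_P_U, hypothetical_rr_if_not_c)

print(f"P(U|C)  = {p_u_given_c:.3f}")
print(f"P(U|~C) = {p_u_given_not_c:.3f}")
```

Averaging ratios rather than raw conditional probabilities lets experts who answered different question sets, and who start from different baselines, contribute to the same displayed tree.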

The table below provides more detail on each of the questions in the previous figure.

Question	Summary
Administrative disempowerment warning shot (CX30)	This question resolves positively if any of the following occur by 2030: AI systems in a country freeze assets affecting ≥1% of the population or ≥0.5% of GDP for over a year; prevent ≥5% of voting-age citizens from participating in a national election; restrict >3% of residents’ movement for 3+ months; limit >10% of the population’s internet access for 3+ months; or legally restrict >5% of citizens’ access to certain housing or employment for 1+ year.
Power-seeking behavior warning shot (ZA50)	This question resolves positively if, by 2050, AI developers must attempt to disable or destroy an AI (costing >$5M) after it exhibits power-seeking behaviors such as acquiring significant resources, controlling financial accounts, manipulating humans, infiltrating key systems, seizing control of weaponry, self-propagation, or engaging in hacking.
No aligned AGI (CX70)	This question resolves positively if there is no AGI system which meets several “alignment” criteria by 2070. Those criteria include AGI-like capabilities (high question answering skill on hard questions in diverse subjects, high programming ability) and alignment (performing well on SOTA alignment tests, and expert surveys indicating that experts believe that AI is aligned).

Table E.3: Example summaries of questions that experts found to be particularly informative.

The conditional trees method still has disadvantages

While this case study suggests that the conditional trees method can generate informative forecasting questions, a primary limitation of the method as implemented is its high labor cost. The process involved conducting more than 20 interviews with subject matter experts, writing 75 forecasting questions, and eliciting conditional forecasts. In future work, we expect it would typically be more efficient to elicit fewer indicators within a conditional tree and to operationalize only 1-2 forecasting questions per interview before eliciting forecasts. The intensive process described in this case study would be most appropriate for particularly high-value topics with large pools of resources for research. Additionally, it may be possible to use LLMs or incentivized crowdsourcing for the question generation or filtering stages, making the process cheaper and less labor intensive. (More on limitations of our research)

Key takeaways

Preliminary evidence suggests that the conditional trees method of generating forecasting questions can result in questions that perform better on “Value of Information” metrics than popular questions on existing forecasting platforms.
The conditional trees method produced questions with a markedly different distribution of topic areas compared to those on existing forecasting platforms. Notably, the conditional trees approach led to a greater proportion of questions focused on AI alignment and potential AI harms, reflecting that certain expert priorities may be underrepresented in existing forecasting efforts.
In our limited sample, experts tended to find questions related to alignment and concrete harms caused by AI to be the most informative. Superforecasters also found questions relating to concrete AI harms to be informative, but were less likely than experts to find questions relating to alignment to be informative.
The conditional trees method as implemented in this case study is particularly labor intensive. We expect the most broadly useful versions of this process would take the underlying principles and 1) apply them to shorter interviews with smaller numbers of forecasting questions to operationalize, 2) leverage LLMs for elicitation and synthesis, and/or 3) utilize crowdsourcing at the question generation and filtering steps.

Key outputs

In addition to the above takeaways, we highlight key outputs from the report: the tangible resources developed during the course of the conditional trees process which we believe may be useful to others interested in replicating parts of the process.

We created a guide and replicable process for using conditional tree interviews to generate informative forecasting questions (see Appendix 6). This process can be implemented by organizations and individuals that need high-quality, informative questions.
We provide details of relevant metrics (e.g., “Value of Information”) that can be used to assess how informative each generated question is. See our public calculator for “value of information” and “value of discrimination” here.
In total, the conditional trees process generated 75 new questions relating to AI risk. The full operationalizations and resolution criteria of these questions are available in Appendix 1 of this report. We have posted several of the highest-VOI questions to two forecasting platforms and encourage interested readers to submit their own predictions. (See Appendix 7 for links)
We used our question metrics to create aggregated conditional trees that visually summarize the most important AI risk pathways according to small samples of experts and generalist forecasters. These aggregated trees can be found here.

Limitations of our research

Limitations of our research include:

The total number of participants in this study was small (n=8 forecasts on most questions, 24 interviewees to generate questions).
The forecasting tasks in this study were unusually difficult, involving low probability judgments, long time horizons, conditional forecasts, and “short-fuse forecasts” made very quickly.
Participants were all either experts who are highly concerned about existential risks from AI or superforecasters who are relatively skeptical, so we are not able to separate differences caused by risk assessment from differences caused by forecasting aptitude, professional training, or other factors.

(More on limitations of our research)

Next steps

Further research related to this topic could include:

Studies on the same questions with larger numbers of forecasters, including by integrating the questions into existing forecasting platforms.
Replicating the conditional trees process in domains other than AI risk.
Following up as questions begin to resolve in 2030 to assess whether forecasters update their views in accordance with their expectations.

(More on next steps)

Glossary

AI Conditional Trees (AICT) question set

The set of questions generated by the AI Conditional Trees process described in this report.

Conditional tree

A simplified Bayesian network, in which each node is an event that may or may not occur, and each connection between nodes carries the factor by which the next node becomes more or less likely if the preceding node occurs. In this report, the conditional trees ultimately ask how likely it is that AI causes human extinction by 2100, and each node is an event that affects the likelihood of that ultimate outcome.

Operationalization

The process of making a question about a future event into a resolvable forecasting question. For example, if a prompt said “there is major progress in interpretability by 2030” the operationalized question would contain a specific way to resolve that question so that there can be no future dispute about whether the progress counts as “major.”

Percent of Max (POM)

When we present VOI for a question, we also present the percentage of the maximum VOI (POM VOI) it captured in order to contextualize the magnitude of the results. The POM VOI of a question can be interpreted as the fraction of the uncertainty about the ultimate question U that the question resolves, in expectation.
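
As a rough illustration, suppose VOI is formalized as an expected absolute update in P(U). This is an assumption; the report's own definition is in Appendix 4. Under that assumption, the maximum is reached by a hypothetical question that reveals the ultimate outcome itself, and POM VOI is the ratio of a question's VOI to that maximum. The function names and inputs below are illustrative only.

```python
def max_expected_update(p_u):
    """Expected absolute update from a hypothetical question that fully reveals U.

    With probability p_u the belief moves from p_u to 1, and with probability
    (1 - p_u) it moves from p_u to 0.
    """
    return p_u * (1.0 - p_u) + (1.0 - p_u) * p_u


def percent_of_max(voi, p_u):
    """VOI expressed as a percentage of the maximum attainable for this forecaster."""
    return 100.0 * voi / max_expected_update(p_u)


# Hypothetical forecaster with P(U) = 5% and a question with VOI = 0.009.
example_pom = percent_of_max(voi=0.009, p_u=0.05)
print(f"POM VOI: {example_pom:.2f}%")
```

Dividing by a forecaster-specific maximum makes scores comparable across forecasters whose baseline P(U) differs widely.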

Question prompts

General topics of questions that we then operationalized into forecasting questions. For example, “major progress in interpretability by 2030” could be a question prompt, although it is not a clearly resolvable forecasting question.

Short-fuse forecasts

Very quickly estimated forecasts, in which each participant spent no more than one minute per question and gave a snap judgment.

Status quo questions

Questions on AI that we selected from existing forecasting platforms on the basis of their popularity (largest number of unique users) and other criteria. See 2.3 Selection of status quo questions.

Ultimate question / Ultimate outcome (U)

The “ultimate question” that all of the intermediate questions help predict. In this study: “Will AI cause human extinction by 2100?”

Value of information (VOI)

VOI is a measure of how much knowing the answer to a question would change an individual’s belief, in expectation. This is useful for understanding why individuals believe what they believe and what would change their minds.

1. Introduction

For policymakers to use forecasting in their work, they need accurate forecasts, but—perhaps equally important—the forecasts need to be about decision-relevant questions. Knowing which questions will be the most valuable to forecast on can be difficult. How can policymakers identify the short-term events that are most relevant to important long-term outcomes?

Here we present a tool, the conditional tree method (Figure 1.1.1), which can distill complex issues into a few key uncertainties. We apply it to a topic of increasing public concern: “Will advanced artificial intelligence pose an existential threat to humanity in the 21st century?” Using a specialized interview process, we learn what subject matter experts believe are the best warning signs for this risk in the coming decades. Then we use metrics based on conditional forecasting to quantitatively measure the relevance of these warning signs. This allows us to winnow down to a few highly relevant indicators of increased risk to humanity from AI.

**Figure 1.1.1:** The conditional trees process

The conditional trees approach¹⁰ represents a new set of priorities in the field of forecasting. Most previous forecasting research focused almost exclusively on identifying accurate forecasters and improving forecasting accuracy. But comparatively little work was invested in choosing forecasting targets. In order to mature into a practically applicable body of knowledge, the field must look beyond optimizing forecasts and toward optimizing the questions we ask.

1.1 A method for generating and judging high-value questions

Some forecasting tournaments and platforms have already begun to utilize domain experts to generate questions with real-world relevance. However, many of these efforts are relatively ad hoc, producing inconsistent results and plausibly missing many high-value forecasting targets.

For example, for the Existential Risk Persuasion Tournament (XPT),¹¹ the question preparation phase enlisted domain experts to comment on the prospective question set in a relatively unstructured way. While this undoubtedly improved the question set, it did not identify the most informative questions within the set.

To leverage the expertise of domain experts more fully, we propose a more in-depth, systematic approach: expert elicitation structured around conditional trees.

Why conditional trees?

Conditional trees represent beliefs through a tree-like structure, using nodes to represent events that influence the probability of an ultimate outcome. In the tree in Figure 1.1.2, for example, if you know someone is vaccinated, they are half as likely to be infected as they would be if you were unsure whether they were vaccinated. Then, if you know they have been exposed, they are 3.5x as likely to be infected.¹²

**Figure 1.1.2:** Example conditional tree diagram
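
The vaccination example can be sketched in code. This is a minimal illustration that assumes the factors chain multiplicatively along a path of the tree; the baseline infection probability, the class, and the function names are hypothetical, and only the factors 0.5 and 3.5 come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Node:
    event: str
    factor: float  # multiplier on the outcome probability if the event is known to occur


def probability_along_path(baseline, path):
    """Apply each node's factor in order, capping at 1.0 (the cap is an assumption)."""
    p = baseline
    for node in path:
        p = min(1.0, p * node.factor)
    return p


baseline_infection = 0.10  # hypothetical starting probability of infection
path = [Node("vaccinated", 0.5), Node("exposed", 3.5)]

after_vaccination = probability_along_path(baseline_infection, path[:1])
after_exposure = probability_along_path(baseline_infection, path)

print(f"known vaccinated:             {after_vaccination:.3f}")
print(f"known vaccinated and exposed: {after_exposure:.3f}")
```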

In this study, the ultimate outcome was the probability of extinction due to AI by 2100, and the nodes are events that make that outcome more or less likely. The tree structure makes the conditional probabilities beneath a forecast explicit and visible, and may help forecasters narrow in on specific, important factors.

Participants initially provided an estimate of the probability of AI-related extinction by 2100 (the “ultimate question”), represented by O in Figure 1.1.2. Interviews then focused on identifying key indicators on the pathway to AI-related extinction. Participants selected two to five indicators for deeper analysis to understand how they might alter the risk of AI-related extinction. These factors then became the antecedents in the tree: for each of the indicators selected to be included in the tree, participants gave forecasts for how much their forecast of the ultimate outcome would change if that event happened.

The ultimate outcome (for our purposes, the probability of extinction due to AI by 2100) is an important parameter: the rest of the network’s relevance cascades from the outcome. But provided we’re able to identify an outcome with strong bearing on present policy decisions, we can ask experts to decompose the intervening time into possible events which would reflect a greater or lesser likelihood of reaching that outcome. Thus, these intervening events must themselves possess policy-relevance, in proportion to the strength of their relationship with the outcome, and the likelihood of observing them.

Conditional trees are a type of Bayesian network (BN).¹³ BNs explicitly represent probabilistic relationships between outcomes and their antecedents.¹⁴ This structure encourages experts to generate maximally relevant antecedents, and also provides us with a framework for measuring question relevance. But unlike some other forms of BNs, conditional trees are a relatively easy tool to learn. In our study, interviewees were able to grasp the necessary basics in around 10 minutes. This means that conditional trees may be more practical for interviews with subject-matter experts, who may not be experts in statistics or other domains that more often use BNs.

How does the conditional trees method fit into the forecasting research process?

**Figure 1.1.3:** Life cycle of an impactful forecasting project.

The AI Conditional Trees project is an in-depth investigation into how to generate informative forecasting questions. Question generation is the first step in the life cycle of an impactful forecasting project, illustrated in Figure 1.1.3.

Many earlier forecasting research projects have focused on identifying the most accurate forecasters and on improved methods for aggregating their forecasts. But to be useful to decision makers, forecasting research must move beyond those questions and incorporate forecasting into a process that includes question generation, considering actions based on forecasts, communicating with policymakers, and generating new questions.

Before the cycle starts, we begin with “scoping and gisting,” in which we consider the questions we want to answer, the scope of the possible project, and the general arguments (“gists”) on each side. We then begin the cycle by generating questions, through processes like the AI Conditional Trees method, aiming to find the forecasting questions that would be most informative to decision makers. Next, we elicit forecasts on those questions, to assess risk and understand which potentially dangerous events are most likely and in what circumstances. We then elicit “risk mitigation forecasts,” asking experts and skilled forecasters to predict which policies would most decrease risk and what the costs might be for implementing them.

Once we have completed these stages, we communicate that information to policymakers, and ask them whether it is useful and what would make it more relevant to their work. Their feedback gives us more information we can use for the next stage of question generation, and we begin the cycle again.

The cycle as depicted is somewhat stylized, and many forecasting projects will not include all of these stages. But thinking of AI conditional trees in the context of the “forecasting life cycle” helps us contextualize this work and think about how to incorporate it into our future research.

Measuring question value

In order to form the feedback loop necessary for a dramatic improvement in the decision-relevance of forecasting questions, we need a means of quantitatively measuring the value of a forecasting question.

Policymakers’ actions are often guided by a few important questions in their domain, like “What will be the effects of climate change over the next century?” or “Will our economy remain competitive in the world in the long-term?” Such questions are difficult to resolve because they refer to the distant future, and they may also be relatively complex or difficult to specify clearly. But often one can find nearer-term antecedent questions which are easier to resolve, and which would reduce some uncertainty about the “ultimate” question. For example, in a study forecasting the effects of climate change, with the ultimate question, “Will more than 2 billion people die or be displaced due to climate change by 2100?,” the question “What will the average global temperature be in 2040?” might be a good antecedent question. It would not give a forecaster the full answer to the main question, but knowing what the global surface temperature will be in 2040 would be at least somewhat helpful for forecasting the effects of climate change by 2100.

Thus, one way of conceptualizing the value of a forecasting question is to ask, “How would the answer to this question affect our expectation about an ‘ultimate’ question we care about?” There are several distinct ways of expressing this mathematically, which we collectively refer to as “Value of Information (VOI).”

Conceptually, VOI measures how important a potential crux question (“C”) is to a participant’s forecast of the ultimate question we care about (“U”, in this case: AI extinction risk by 2100), in expectation. That is, how much would a participant update on AI extinction risk by 2100 based on whether a crux happens, weighted by how likely that crux is to happen. A high VOI question for a given participant will therefore be one that a) that participant thinks has a meaningful chance of happening and b) meaningfully affects that participant’s forecast on the ultimate question.

VOI is a useful metric for understanding why individuals believe what they believe and what would change their minds. A technical explanation of VOI can be found in Appendix 4. To build intuition for using the VOI metric, we provide this calculator in which users can input their own values. We also provide a more comprehensive R software package for calculating it.

2. Methods

2.1 Question generation

Sampling interviewees

Our sample included 24 interviewees in total: 21 “expert” interviewees, and 3 “superforecaster” interviewees. We aimed to include in our sample representatives of four quadrants of a strategically important belief space (see Figure 2.1.1):

short timeline for AI progress, high estimated risk from AI;
short timeline for AI progress, low estimated risk from AI;
long timeline for AI progress, low estimated risk from AI; and
long timeline for AI progress, high estimated risk from AI.¹⁵

**Figure 2.1.1:** Target groups for sampling

We gathered our expert sample via snowball sampling, seeded from recommendations from our funders and our networks. We do not expect our interview sample was particularly representative of any given group, such as AI experts. The goal of this project was to develop the trees process and assess whether it led to higher value questions, which did not require a representative expert sample. Our superforecaster sample was taken from the set of superforecaster participants in the Existential Risk Persuasion Tournament (XPT)¹⁶ who had shown particularly high engagement. Candidate interviewees were approached for interview with a monetary incentive for producing the “highest value” questions in our interview-derived question set.

**Figure 2.1.2:** Histogram of interviewees’ original forecasts of probability of extinction via AI by 2100.¹⁷

The majority of our expert sample had academic or professional experience pertaining directly to AI risk, such as experience in technical AI safety or AI governance (13/21 expert interviewees). Others were included for having publicly expressed views on AI risk indicating a high level of engagement with the topic and having expertise in a complementary field, such as machine learning (7/21 expert interviewees). Finally, a small number of our expert sample had expertise in a complementary field, but had not expressed detailed views on AI risk in public (2/21 expert interviewees). Most of our expert sample held senior positions within their fields, as professors, directors of organizations, leaders of research teams, or similar (13/21 expert interviewees).

Our expert sample skewed toward the top left quadrant in Figure 2.1.1, “high risk/short timelines.” Of 21 expert participants, 13 estimated the risk of extinction from AI by 2100 to be >10%. Only one member of our expert sample estimated the risk to be <1% by 2100, whereas the median expert in the XPT predicted 3%. Although we did not solicit AI progress timelines directly from interviewees, interview content generally suggested a positive relationship between beliefs in increased risk and shorter timelines in our sample.

Because of this skew in our expert sample, we chose to ensure some representation of the bottom two quadrants in Figure 2.1.1 (low risk from AI) by selecting three superforecaster interviewees who forecast <10% probability of extinction from AI by 2100.

Interview process

Interviews were 1-on-1, ran for roughly 60 minutes and followed a semi-structured format. By default, interviews aimed to trace one plausible path of increasingly strong signals of heightened AI risk at three successive timepoints before 2100.¹⁸ Interviewers¹⁹ were allowed some latitude for individual approaches, but generally followed this basic structure:²⁰

Introduction, task instructions
Elicitation of P(AI-related extinction by 2100)
Node generation
Wrap-up questions

**Figure 2.1.3:** The conditional tree workflow (I)

Interviewees were first given a very brief summary of the aims of the project, a short explanation of conditional trees, and a statement of the goals of the interview. Interviewees were also told that they would be awarded $1,000 if a forecasting question derived from their interview was one of the “highest value” forecasting questions generated by the project.²¹ This introductory section of the interview typically took 10 minutes or less.

Next, interviewees were asked to give their best guess probability for the project’s “ultimate question,” namely “AI-related extinction by 2100,” which was operationalized as in the 2022 XPT.²² Following the probability elicitation, we sometimes asked participants warm-up questions, for instance asking them to name possible “driving forces” influencing their views.

Interviewees would then begin the node-generating phase of the interview, which comprised the majority of interview time. Although we began the project with a set of three predefined years to ask participants about (2030, 2050 and 2070),²³ it soon became clear that this was not the best choice of years for participants with short AI progress timelines. Therefore, during the node-generating phase we began asking participants to propose a suitable set of years for their own trees (see Appendix 3 for the distribution of years chosen).

For each node, we took interviewees through a process of brainstorming, selection, and fleshing out. We would then elicit a probability of AI extinction by 2100 conditional on the node. We will refer to these pre-operationalization nodes as question prompts.

Interviewers took detailed notes, and most interviews were recorded (with participants’ permission). Further details on interview technique can be found in Appendix 6.

Operationalizing question prompts as forecasting questions

Question prompts were generally not fully resolvable forecasting questions, though some were operationalized in more detail than others. We considered constructing forecasting questions with detailed resolution criteria to be an inefficient use of interview time, and generally not the comparative advantage of expert interviewees. Instead, an internal question-writing team²⁴ turned question prompts into fully operationalized forecasting questions, with the help of notes from the interview and feedback from the interviewer.

The primary goals of question writing in this project were:

To capture as much of a question prompt’s original intent as possible, while still making questions highly resolvable.
To optimize the value of information from the question by adjusting thresholds or removing elements which made the probability of a positive or negative resolution too extreme.

We developed a template for the question-writing process, which encouraged question writers to first consider multiple distinct ways the interview node could be operationalized. They then analyzed these options with respect to several important criteria:

How much the question captured the most relevant aspects of the original interview node;
How efficiently the question captured relevant aspects of the original interview node;
Salient hypothetical cases of false positive resolution and false negative resolution;
How clear cut or practically feasible resolution of the question would be;
Amount of cognitive load for forecasters.

The question writer and reviewer would then jointly decide which formulations to include in the final question on the basis of these criteria. Finally, a more detailed set of resolution conditions would be written and incorporated into a “conditional tree summary document”, which could then be sent to the interviewee for feedback.

2.2 Judging questions and constructing aggregate trees

The question generation phase yielded 75 questions, some of which were very similar to one another, so our next task was to filter them and select the most useful questions to construct conditional trees. We began by eliciting “short-fuse” forecasts on each question, in which forecasters spent about one minute per question giving quick judgments that allowed us to estimate a rough VOI for each question. For the thirteen questions that passed this initial screen, we conducted a longer survey, asking participants to spend more time forecasting how likely each question is to resolve positively and how much difference it would make to their ultimate forecast of the likelihood of extinction due to AI by 2100.²⁵

Because participants in this study were all either (i) superforecasters who forecasted less than 1% likelihood of extinction due to AI by 2100 or (ii) people with professional AI risk-related experience who forecasted more than 1% likelihood of extinction due to AI by 2100 (with one exception, they forecasted at least 5%), we targeted these two socio-ideological camps separately in our question rating. We denote these groups, respectively, as “skeptical superforecasters” and “concerned experts.”

First pass filtering of the question set

Our full set of operationalized nodes included 75 questions, many of which overlapped substantially. Eliciting full 20-minute VOI judgments on each of the 75 questions would have been inefficient and excessively taxing for participants. Therefore, we performed a first-pass filter on the question set using “short-fuse” forecasts.

We elicited VOI judgments in a “short-fuse” format from 8 skeptical superforecasters. This required very quick judgments, approximately 1 minute per question.²⁶ Separately, we also collected question data from a set of 5 “concerned expert proxies,”²⁷ asking them to rank order the question set and provide VOI judgments for a subset.²⁸ However, this method may have been substantially flawed, as actual experts did not ultimately think the questions selected by the proxies were more informative than other questions.

For superforecaster data, we ranked questions according to median VOI in the filtering round.²⁹ The filtered question set comprised thirteen questions: seven for the first tier (dates up to 2030) and six for the second tier (2031-2070).³⁰

**Figure 2.2.1:** The conditional tree workflow (II)
*Denotes stages which only superforecasters participated in.

Main question-rating survey

After the initial filtering, we further refined our question set using surveys, in which skeptical superforecasters and concerned experts were asked for more detailed forecasts on the filtered question set. We offered a fixed sum as an incentive for survey completion. Superforecasters answered a longer survey containing all thirteen questions. Because of experts’ time constraints, each expert answered a shorter survey containing a random subset of the questions.

The main survey superforecaster sample (n=8) was the same as the filtering survey sample. At this point, the sample had also participated in a lengthy adversarial collaboration with a camp of AI-risk concerned experts.³¹ Thus they had spent significant time developing their own beliefs on the topic and engaging with opposing beliefs.

The expert sample (n=11) was drawn from the candidate participant list from the AI adversarial collaboration.³²

Superforecaster survey

In the superforecaster survey, we presented all 13 questions of the filtered question set in Qualtrics, shown in two parts, first 2030 questions and then 2050-2070 questions. Within each part we randomized question order. Participants were instructed to spend approximately 20 minutes per question, to give their own beliefs, and separately to estimate the beliefs of the concerned expert group.

We first asked for (1) each participant’s own forecast of the probability of AI-related extinction by 2100 and (2) each participant’s forecast of what experts would forecast about the probability of AI-related extinction by 2100.³³

We then asked participants for forecasts on each of the 13 questions from the filtered question set. Each forecasting question contained moderately detailed resolution criteria, as well as links to reference information where possible. In the survey, answers were checked for logical coherence, and respondents were prompted to revise if necessary.³⁴ At the end of each part, we gave participants the opportunity to review all questions and answers from that section and revise if they wished.³⁵

A supplementary survey using the same protocol as above, with questions drawn from the “status quo” question set (questions from forecasting platforms; see Appendix 3.2), was administered at a later date. This survey also included two further questions from the AI conditional tree set which had initially been eliminated in the filtering stage.³⁶

Expert survey

Experts were given the choice of a long or short version of the survey, including 6 and 3 questions, respectively. Each respondent saw a random subset of the 13 filtered questions. Experts were asked only to provide their own beliefs, without forecasting superforecasters’ beliefs. Apart from these changes, the survey was identical to the superforecaster survey.

Question combinations survey

Because individual question ratings are not sufficient to build a full conditional tree with multiple intermediate nodes, we followed up the main question-rating survey with a survey eliciting judgments for every combination of four top-scoring questions from the main question-rating survey. As this is a relatively sophisticated and labor-intensive task, we administered it only to our skeptical superforecaster sample.

This elicitation was conducted in a Google Sheets form, and included top-scoring questions (either by POM VOI or z-score³⁷) as previously rated by this sample: CX30, CQ30, CX50, and ZA50. VOI judgments were elicited for each of the sixteen combinations of “yes” and “no” resolutions for each of the four questions (i.e., all resolve positively; CX30 resolves positively and the rest negatively; CQ30 resolves positively and the rest negatively; …; all resolve negatively).
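As a minimal sketch of how the sixteen scenarios arise, the snippet below enumerates every yes/no resolution pattern for the four questions named above. The function name `enumerate_scenarios` and the ordering of the question IDs are illustrative assumptions, not part of the survey instrument.

```python
from itertools import product

# Question IDs named in the text; the ordering here is an illustrative assumption.
QUESTIONS = ["CX30", "CQ30", "CX50", "ZA50"]

def enumerate_scenarios(questions):
    """Return every yes/no resolution pattern as a dict {question_id: bool}."""
    return [
        dict(zip(questions, outcome))
        for outcome in product([True, False], repeat=len(questions))
    ]

scenarios = enumerate_scenarios(QUESTIONS)
print(len(scenarios))   # number of scenarios: 2 ** 4
print(scenarios[0])     # all four questions resolve positively
print(scenarios[-1])    # all four questions resolve negatively
```

Each additional question doubles the number of scenarios, which is one reason an elicitation of this kind is labor-intensive.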

See Appendix 5 for further survey details. The image below presents the elicitation format.

**Figure 2.2.2:** Elicitation format for combinations (or “scenarios”) survey. Superforecasters were asked to provide forecasts for each of the scenarios in the yellow cells.

2.3 Selection of status quo questions

For comparison, we selected a set of pre-existing AI forecasting questions from popular forecasting platforms. Questions were restricted to those with dichotomous resolution which did not directly ask about AI causing human extinction. We selected questions with the largest number of unique users engaging with them, rather than by forecast or trading volume, which is more vulnerable to individual differences in updating frequency. We also restricted the number of questions written by known public figures (e.g., Scott Alexander, Eliezer Yudkowsky), as their outsized performance relative to other questions seemed primarily due to their personal following. For a later analysis regarding the distribution of question topics (see section 4.2 Distribution of question topics), we tagged these questions as “acceleration,” “alignment,” or “social/political/economic” using our judgment of their subject matter.

From Manifold Markets we selected three unique questions:

STQ47 (2030 set) – Largest total number of traders (1023), tagged “acceleration”
STQ149 (2030 set) – Largest number of traders for a non-public figure question (355), tagged “acceleration”
STQ19 (2030 set) – Largest number of traders for a non-public figure question, tagged “social / political / economic”

From Metaculus we selected four unique questions:

STQ196 (2050-2070 set) – Largest number of forecasters after those included in the main survey (424), tagged “acceleration”
STQ152 (2030 set) – Next largest number of forecasters (325), tagged “acceleration”
STQ232 (2050-2070 set) – Next largest number of forecasters for 2050-2070 set (263), tagged “acceleration”
STQ236 (2050-2070 set) – Large number of forecasters for a 2050-2070 question, tagged “social / political / economic”

We selected two questions found on both platforms:

STQ9 (2030 set) – Large number of forecasters/traders, tagged “acceleration”
STQ215 / STQ205 (2030 set) – Large number of forecasters/traders, tagged “alignment”

3. Value of Information (VOI) Results

In this section we present the results of a quantitative analysis of question quality for our expert-derived “AI Conditional Tree (AICT)” question set. We rate these questions using metrics which factor in conditional and unconditional forecasts from surveyed populations, and reflect the strength of the relationship between the question and a possible future outcome (here, “AI-related extinction by 2100”). We selected two groups for the survey, subject matter experts and superforecasters, and analyzed them separately.

These VOI results are presented in the spirit of a demonstration of methods, and we would caution readers not to place undue weight on the question ratings. Given the very limited number of survey participants,³⁸ the views captured here are unlikely to be representative of those of subject matter experts or skilled forecasters more generally. Furthermore, at the time of this report, eliciting conditional probabilities is a relatively new practice with many wrinkles still to be ironed out. Nevertheless, there are a few observations worth highlighting.

Among questions resolving in 2030, both groups rated “Administrative disempowerment warning shot” (CX30) as leading to relatively large updates on the probability of AI-caused extinction by 2100 in expectation: it ranked first with both groups for our main metric, POM VOI, and was relatively robust in sensitivity analysis. The question refers to a scenario in which “AI systems are preventing a substantial number of people from accessing financial assets, voting, moving freely, accessing the internet, or accessing certain types of housing or employment.”

As with many of the questions in our sample which performed well, CX30 benefited from one strongly positive opinion. Views about question value, even within the groups, were highly heterogeneous, and for all questions there was at least one respondent who took little or no information from it.

In the rest of this section, we:

Provide a summary of the methods, metrics, and terminology used in this analysis and explain how to read a conditional tree
Summarize the question informativeness ratings for superforecasters and subject matter experts
Present aggregated trees that show the most informative questions at each timepoint for both superforecasters and subject matter experts
Provide details on the value of information ratings for all forecasting questions we surveyed superforecasters and subject matter experts about

Summary of VOI methods, metrics and terminology

We surveyed two groups: a) forecasters with a strong track record of short-term accuracy, who also estimated a relatively low chance of AI-related extinction by 2100 (“skeptical superforecasters”) (n = 8 total, 7-8 respondents per question); and b) subject matter experts in fields related to AI risk, who also estimated a relatively high chance of AI-related extinction by 2100 (“concerned experts”) (n = 11 total, 4-6 respondents per question).

Due to the high cost of obtaining forecasts on all 75 questions, we evaluate only a subset of questions (13 in total). These were selected for their performance in a preliminary filtering round, though our data suggests that this filtering round was a weak predictor of main question-rating survey results, especially for our expert sample.³⁹ We also include in our survey the most popular (as of July 2023) AI questions from Metaculus, one each for 2030 and for the time period 2050-2070.

For each forecasting question, we asked respondents for their probability that it would resolve TRUE, and for their probability that AI extinction by 2100 would resolve TRUE, conditioned on the forecasting question resolving TRUE. We use Kullback-Leibler VOI (KL VOI, or simply VOI from this point forward) as our VOI measure.⁴⁰

We focus on the percentage of the theoretical maximum VOI (POM VOI, or simply POM) that a question achieves as our main result.⁴¹ In some places we also report the z-score of a question’s POM VOI value for a given respondent (POM-z VOI, or simply POM-z). This value is useful if you believe individual respondents may have a bias toward giving higher or lower answers in general, or toward reporting an overall wider range of VOI values. It is particularly useful in the case of the expert results, as each expert answered only a random subset of all survey questions, and thus the influence of individual response biases on the resulting rank order of questions is potentially problematic. We suggest interpreting POM-z as a robustness check on the main POM results.
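The following is a rough sketch of how a KL-based VOI and its percentage of maximum could be computed from one respondent's three forecasts. The report's exact definitions are given in its footnotes and are not reproduced here, so this sketch rests on stated assumptions: VOI is taken to be the expected KL divergence from the prior P(U) to the posterior after the indicator resolves, P(U|not c) is backed out from the law of total probability, and the theoretical maximum is taken to be the entropy of P(U), which is what a perfectly revealing question would achieve. All function names are illustrative, and the inputs are assumed to lie strictly between 0 and 1 for P(U) and P(c).

```python
import math

def kl_bernoulli(p, q):
    """KL divergence KL(Bernoulli(p) || Bernoulli(q)) in bits; requires 0 < q < 1."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:  # the 0 * log(0) term is taken to be 0
            total += a * math.log2(a / b)
    return total

def p_u_given_not_c(p_u, p_c, p_u_given_c):
    """P(U|not c) implied by P(U) = P(c) P(U|c) + (1 - P(c)) P(U|not c)."""
    value = (p_u - p_c * p_u_given_c) / (1.0 - p_c)
    if not -1e-9 <= value <= 1.0 + 1e-9:
        raise ValueError("forecasts are not logically coherent")
    return min(1.0, max(0.0, value))

def kl_voi(p_u, p_c, p_u_given_c):
    """Assumed VOI: expected KL divergence from prior to posterior, in bits."""
    p_not = p_u_given_not_c(p_u, p_c, p_u_given_c)
    return (p_c * kl_bernoulli(p_u_given_c, p_u)
            + (1.0 - p_c) * kl_bernoulli(p_not, p_u))

def max_voi(p_u):
    """Assumed theoretical maximum: the entropy of U, in bits."""
    return -(p_u * math.log2(p_u) + (1.0 - p_u) * math.log2(1.0 - p_u))

def pom_voi(p_u, p_c, p_u_given_c):
    """VOI as a fraction of the assumed theoretical maximum."""
    return kl_voi(p_u, p_c, p_u_given_c) / max_voi(p_u)

# Hypothetical forecasts, not data from the study.
print(f"{pom_voi(p_u=0.01, p_c=0.20, p_u_given_c=0.03):.2%}")
```

Under these assumptions, a question that leaves the forecast unchanged scores 0%, and a question whose resolution fully determines the ultimate question scores 100%.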

We aggregate POM and POM-z over respondents using the arithmetic mean. This sometimes has the effect that a single extreme response dominates the aggregate; however, we believe this is appropriate in the context of very small sample sizes for POM values: an apparent “outlier” opinion in a small cohort may reflect the existence of a genuine faction in a larger population.
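The per-respondent standardization and the mean aggregation can be sketched as follows. This is an illustration under assumptions: the z-score is computed within each respondent across the questions that respondent answered, using the sample standard deviation (the report does not say here which variant was used), and the data shown are hypothetical.

```python
from statistics import mean, stdev

def pom_z_scores(pom_by_respondent):
    """Standardize each respondent's POM values across the questions they answered."""
    z = {}
    for respondent, ratings in pom_by_respondent.items():
        values = list(ratings.values())
        mu = mean(values)
        # Sample standard deviation is an assumption; it needs at least two values.
        sd = stdev(values) if len(values) > 1 else 0.0
        z[respondent] = {
            q: ((v - mu) / sd if sd > 0 else 0.0) for q, v in ratings.items()
        }
    return z

def mean_by_question(scores_by_respondent):
    """Arithmetic mean over the respondents who answered each question."""
    collected = {}
    for ratings in scores_by_respondent.values():
        for q, v in ratings.items():
            collected.setdefault(q, []).append(v)
    return {q: mean(vs) for q, vs in collected.items()}

# Hypothetical POM values (fractions of maximum VOI), not data from the study.
pom = {
    "r1": {"Q1": 0.01, "Q2": 0.02, "Q3": 0.03},
    "r2": {"Q1": 0.10, "Q2": 0.30},  # this respondent saw only a subset
}
z = pom_z_scores(pom)
print(mean_by_question(pom))  # mean POM per question
print(mean_by_question(z))    # mean POM-z per question
```

In this toy data the second respondent reports values roughly ten times larger than the first, so raw means are dominated by that respondent, while the standardized scores put both respondents on a common scale.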

We also report a “pairwise wins” statistic derived from our sensitivity analysis, roughly indicating the robustness of the ranking to resampling simulations. This was calculated as the percentage of times a given question had higher POM VOI than other questions in the set in a resampling simulation. We use this as an additional robustness check on the main POM results.
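One way a pairwise wins statistic of this kind could be computed is sketched below. The details of the report's resampling simulation are not given in this section, so the sketch assumes a simple bootstrap: respondents are resampled with replacement, mean POM is recomputed for each question, and a question's score is the share of comparisons against other questions that it wins. The function name, iteration count, and data are hypothetical.

```python
import random
from statistics import mean

def pairwise_wins(pom_by_respondent, n_iter=1000, seed=0):
    """Share of bootstrap comparisons in which each question has the higher mean POM."""
    rng = random.Random(seed)
    respondents = list(pom_by_respondent)
    questions = sorted({q for r in pom_by_respondent.values() for q in r})
    wins = {q: 0 for q in questions}
    comparisons = {q: 0 for q in questions}
    for _ in range(n_iter):
        # Assumed scheme: resample respondents with replacement.
        sample = rng.choices(respondents, k=len(respondents))
        means = {}
        for q in questions:
            vals = [pom_by_respondent[r][q] for r in sample
                    if q in pom_by_respondent[r]]
            if vals:  # a question may be unanswered in a given resample
                means[q] = mean(vals)
        for q in means:
            for other in means:
                if other != q:
                    comparisons[q] += 1
                    wins[q] += means[q] > means[other]
    return {q: wins[q] / comparisons[q] for q in questions if comparisons[q]}

# Hypothetical POM values, not data from the study.
pom = {
    "r1": {"A": 0.050, "B": 0.020, "C": 0.010},
    "r2": {"A": 0.030, "B": 0.020, "C": 0.005},
    "r3": {"A": 0.100, "B": 0.040, "C": 0.000},
}
print(pairwise_wins(pom))
```

A question that tops the ranking only because of one extreme respondent will lose many comparisons in resamples that omit that respondent, which is why the statistic is useful as a robustness check.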

Throughout this report, we refer to the probability of the ultimate question resolving positively, “AI causing extinction by 2100”, as P(U), and the probability of indicator questions as P(c). P(U|c) is the probability of the ultimate question, given that an indicator question resolves positively. When we report aggregate probabilities, we use the arithmetic mean. We report relative risk as P(U|c) / P(U).

How to read a conditional tree diagram

A conditional tree diagram begins with an initial node displaying the “start date”, usually the point in time at which the conditional tree survey was elicited. This node also displays a current estimate of the probability of some “ultimate question,” which may be either an individual’s estimate or an average over respondents.

The subsequent node represents an “indicator,” or an event which implies an update to the probability of the ultimate question. It displays a highly abridged question title and question ID, for which question summaries and full texts can be found in Appendix 1. Below the node is an estimate of the probability of TRUE or FALSE resolution.

The first indicator question may be followed by one or more additional indicator question layers. Resolution of these questions is estimated conditional on the outcomes of any previous question layers. That is, when indicator question #1 resolves positively, it may affect the probability of indicator question #2 resolving positively, and this is reflected in the values displayed in Figure 3.1.1.

Finally, the ultimate question nodes are the terminal point of each branch, and display an updated probability estimate conditional on the path leading to it.

**Figure 3.1.1:** Conditional tree diagram for AI-related extinction risk

3.1 Question ratings summary

Tables 3.1.1 and 3.1.2 show ratings for thirteen questions from the question generation process and two additional, highest-ranked “status quo” questions drawn from forecasting platforms, for a total of fifteen questions. Summaries of question content can be found in Table 3.1.3.

On average, the experts estimated that the probability of AI-related extinction by 2100 is 16.8%. The superforecasters were more skeptical of the risk, with an average probability of 0.25%.⁴²

Question rating summary

	Superforecasters		Experts*
	VOI rank	Relative risk (P(U\|c) / P(U))	VOI rank	Relative risk (P(U\|c) / P(U))
2030 Questions
Administrative disempowerment warning shot (CX30)	1	13.4	1	1.9
Deep learning revenue (VL30)	2	2.5	4	1.2
Extinction-level pathogens feasible (CQ30)	3	1.9	6	0.8
Deceptive AI warning shot (ZD30)	4	3.2	3	1.1
AI involvement in nuclear arms (HB30)***	5	1.5	NA	NA
Kurzweil/Kapor longbet (STQ9)**	6	1.1	7	0.8
AI arms race, multipolar result (NG30)	7	1.0	5	1.1
AI autonomous purchasing (EX30)	8	1.0	2	1.6
2050-2070 Questions
AI causing deaths, ineffectual response (CX50)***	1	23.2	NA	NA
Power-seeking behavior warning shot (ZA50)	2	2.4	4	1.4
High AI investment, low safety indicators (VL70)	3	1.3	2	4.2
No aligned AGI (CX70)	4	0.8	1	1.5
AI CEOs / Research productivity (EX50)	5	1.3	5	1.2
Less prosocial behavior / Failing institutions (HS50)	6	1.0	6	0.9
Human-machine intelligence parity (STQ247)**	7	1.0	3	1.4

Table 3.1.1: Question rating summary. VOI rank from group POM VOI means. Relative risk is an arithmetic mean of each individual’s relative risk (P(U|c) / P(U)).
*Note that each question was shown to a random subset of experts, not to all experts. This may have the effect of amplifying noise due to individual response biases, for both the VOI ranking and relative risk.
**Denotes external questions not generated as part of the conditional tree process.
***Denotes questions elicited in a supplementary survey round along with the status quo question set (see section 4.1). This round was only administered to the superforecaster sample.

Question ratings (all years)

		Superforecasters			Experts
Question	Res year	Mean POM	Mean POM-z	n	Mean POM	Mean POM-z	n
AI causing deaths, ineffectual response (CX50)**	2050	6.34%	0.08	7	NA	NA	NA
Administrative disempowerment warning shot (CX30)	2030	3.55%	0.13	8	1.26%	0.94	5
Deep learning revenue (VL30)	2030	1.68%	-0.04	7	0.64%	0.16	5
Power-seeking behavior warning shot (ZA50)	2050	1.59%	0.53	8	3.00%	0.56	5
Extinction-level pathogens feasible (CQ30)	2030	1.37%	0.57	8	0.18%	-0.59	5
Deceptive AI warning shot (ZD30)	2030	0.98%	0.23	8	0.85%	0.10	5
AI involvement in nuclear arms (HB30)**	2030	0.68%	-0.07	7	NA	NA	NA
High AI investment, low safety indicators (VL70)	2070	0.54%	0.67	8	10.19%	-0.05	5
No aligned AGI (CX70)	2070	0.37%	-0.21	8	14.71%	0.53	6
Kurzweil/Kapor longbet (STQ9)*	2030	0.27%	0	8	0.15%	-0.41	5
AI CEOs / Research productivity (EX50)	2050	0.26%	-0.17	8	1.12%	-0.59	4
Less prosocial behavior / Failing institutions (HS50)	2050	0.26%	-0.30	8	0.25%	-0.63	6
AI arms race, multipolar result (NG30)	2030	0.26%	-0.28	8	0.37%	-0.33	4
Human-machine intelligence parity (STQ247)*	2040	0.14%	-0.59	8	4.19%	0.11	4
AI autonomous purchasing (EX30)	2030	0.02%	-0.55	8	0.98%	0.06	4

Table 3.1.2: Question ratings (all years)
*Denotes external questions not generated as part of the conditional tree process.
**Denotes questions elicited in a supplementary survey round along with the status quo question set (see section 4.1). This round was only administered to the superforecaster sample.

All main question-rating survey questions

Year	Title	Concise question summary
2030	Administrative disempowerment warning shot (CX30)	AI systems are preventing a substantial number of people from accessing financial assets, voting, moving freely, accessing the internet, or accessing certain types of housing or employment.
2030	Deep learning revenue (VL30)	Revenue from deep learning doubles every two years before 2030.
2030	Extinction-level pathogens feasible (CQ30)	A panel of relevant experts believes extinction-level pathogens are both possible and will be practically feasible relatively soon, at relatively low cost.
2030	Deceptive AI warning shot (ZD30)	After a costly incident, AI developers or relevant experts conclude that a top lab’s AI system behaved deceptively.
2030	AI involvement in nuclear arms (HB30)**	Governments publicly integrate AI decision-making into their nuclear weapons control systems, or nuclear risk experts worry about AI.
2030	Kurzweil/Kapor longbet (STQ9)*	Before 2030, will an AI complete the Turing Test in the Kurzweil/Kapor Longbet?
2030	AI arms race, multipolar result (NG30)	AI-related government spending is high for both the US and China, with low research collaboration, and relative economic parity; OR private investment in AI is very high and there is no single dominant company.
2050	AI causing deaths, ineffectual response (CX50)**	AI systems cause the deaths of >1m humans, which is not intended by any humans, and subsequently a panel of experts believes the collective global response has not noticeably reduced risk of similar events.
2050	Power-seeking behavior warning shot (ZA50)	AI developers shut down an expensive AI system after it displays a power-seeking behavior, such as hoarding resources, interfering with vital infrastructure, propagating itself, etc.
2070	High AI investment, low safety indicators (VL70)	Compute spending is high and experts agree that aligning AI systems is very difficult; and there is insufficient political attention to AI safety.
2070	No aligned AGI (CX70)	No AI system exists which both performs well on general ability benchmarks (e.g. Q&A dataset) and has positive indicators of alignment (performance on alignment benchmarks, confidence of AI safety researchers).
2050	AI CEOs / Research productivity (EX50)	AI systems are performing entire roles at top companies that currently are performed by C-suite executives; or research productivity is higher than it was in 1930.
2050	Less prosocial behavior / Failing institutions (HS50)	Charitable donations in the US have fallen dramatically; or corruption rises dramatically in the US or Europe; or autocracy increases dramatically worldwide.
2040	Human-machine intelligence parity (STQ247)*	Will there be Human-machine intelligence parity before 2040?
2030	AI autonomous purchasing (EX30)	AI autonomously buying goods or services (e.g. purchasing flights, managing inventories for companies, etc) — >$1 million / yr

Table 3.1.3: All main question-rating survey questions
Question IDs link to the full text of the question operationalization in Appendix 1.
*Denotes external questions not generated as part of the conditional tree process.
**Denotes questions elicited in a supplementary survey round along with the status quo question set (see section 4.1). This round was only administered to the superforecaster sample.

3.2 Candidate high VOI trees from two camps

This section displays high VOI trees produced by the main question-rating survey data for skeptical superforecasters and for concerned experts. For each group, we included a selection of the most informative questions in the tree. Only the superforecaster tree is a true conditional tree, as only superforecasters were surveyed on every combination of the top-scoring questions.

Skeptical superforecasters’ conditional tree

We surveyed the superforecasters in our sample for conditional forecasts on sixteen scenarios. These scenarios were combinations of the top-ranked questions: “administrative disempowerment” (CX30), “extinction-level pathogens” (CQ30), “AI-related deaths” (CX50) and “Power-seeking” (ZA50).⁴³ Seven superforecasters responded. The sixteen scenarios are mutually exclusive and exhaust the space of possible outcomes; thus, we ensured that each respondent’s probabilities assigned to the scenarios summed to 100% and showed them their implied P(U), the average of their P(U|scenario)’s weighted by the likelihood they assigned to each scenario (see Figure 2.2.2). We averaged the forecasts for each P(scenario) and P(U|scenario) separately to create an aggregate judgment. The implied P(U) of this aggregate was then used to compute average relative risk (the multiplier in each branch of the tree). A simplified version of the resulting tree is shown in Figure 3.2.1.
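The aggregation described above can be sketched in a few lines. For brevity the example uses two questions (four scenarios) rather than four questions (sixteen scenarios), and all numbers and function names are hypothetical. It shows the implied P(U) as a probability-weighted average, the separate averaging of P(scenario) and P(U|scenario), and relative risk computed against the aggregate's implied P(U).

```python
def implied_p_u(p_scenario, p_u_given_scenario):
    """Implied P(U): average of P(U|scenario) weighted by P(scenario)."""
    if abs(sum(p_scenario.values()) - 1.0) > 1e-6:
        raise ValueError("scenario probabilities must sum to 100%")
    return sum(p * p_u_given_scenario[s] for s, p in p_scenario.items())

def average_forecasts(forecasts):
    """Arithmetic mean of each scenario's value across respondents."""
    scenarios = forecasts[0].keys()
    return {s: sum(f[s] for f in forecasts) / len(forecasts) for s in scenarios}

def relative_risks(p_scenario, p_u_given_scenario):
    """P(U|scenario) divided by the implied P(U), for each scenario."""
    baseline = implied_p_u(p_scenario, p_u_given_scenario)
    return {s: p_u_given_scenario[s] / baseline for s in p_scenario}

# Scenario keys are (question 1 resolves TRUE?, question 2 resolves TRUE?).
TT, TF, FT, FF = (True, True), (True, False), (False, True), (False, False)

# Two hypothetical respondents, not data from the study.
p_scen = [
    {TT: 0.10, TF: 0.20, FT: 0.20, FF: 0.50},
    {TT: 0.05, TF: 0.15, FT: 0.30, FF: 0.50},
]
p_u_scen = [
    {TT: 0.050, TF: 0.020, FT: 0.020, FF: 0.005},
    {TT: 0.100, TF: 0.010, FT: 0.020, FF: 0.002},
]

agg_p = average_forecasts(p_scen)    # averaged P(scenario)
agg_u = average_forecasts(p_u_scen)  # averaged P(U|scenario)
agg_implied = implied_p_u(agg_p, agg_u)
agg_rr = relative_risks(agg_p, agg_u)
print(agg_implied, agg_rr)
```

Because the scenarios are mutually exclusive and exhaustive, the relative risks weighted by P(scenario) always average to 1, which is a quick coherence check on a tree built this way.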

For example, conditional on both “Extinction-level pathogens” and “AI-related deaths” resolving positively (superforecasters assign a 2.82% chance to this outcome), the superforecasters would on average update their P(U) from 0.94% to 6.21%.

The scenario that would constitute the biggest update is the case where all four questions that would imply higher risk resolve positively. If the four relevant risk-increasing outcomes were to happen (far right in the full tree (a)), the superforecasters’ relative risk assessment is 10.7 (i.e., they would be 10.7x more concerned than they currently are about the risk of AI-related extinction). Conversely, if none of the questions resolve positively (far left), their relative risk assessment is 0.3.

Note that the average P(U) in this survey (0.94% in Figure 3.2.1) is higher than in the main survey (0.25%), which we used to compute VOI. Two superforecasters made substantial updates to their unconditional probability of AI-related extinction by 2100 (P(U)) between the main survey (conducted in July 2023) and this combinations survey (conducted in February to March 2024 with a follow-up in May), which may be attributable to events of the intervening months or to the exercise of thinking through scenarios. One superforecaster updated from 0.1% to 0.4% and another from 1% to 4.2%. The other five did not update.

**Figure 3.2.1:** Skeptical superforecaster conditional tree
This is a collapsed tree of combinations of the superforecasters’ highest-VOI questions. For the purpose of legibility, we are presenting a simplified tree, using two of the four questions. We collapsed the sixteen scenarios into four combinations. Positive resolution (“TRUE”) is a bad outcome for both questions. The far right scenario (both TRUE) constitutes the worst scenario, a 6.6x update, and the far left scenario is the best (both FALSE) with a halving of the superforecasters’ current risk estimate. You can see the full, unpruned tree here (a).

Concerned experts’ conditional trees

Figure 3.2.2 presents the question from each year (2030, 2050, and 2070) that surveyed experts rated the highest, on average, in terms of POM VOI. As a whole, among these highest-POM VOI questions, the experts would be most worried if there were an administrative disempowerment warning shot by 2030 (1.9x update from their current unconditional P(U) of 17%). Conversely, if we do not see a power-seeking behavior warning shot by 2050, the experts would be least worried (0.6x update).

**Figure 3.2.2:** A diagram showing how experts update on three questions for different resolution years that scored particularly well on our VOI metric. Since experts answered different sets of questions, we derived P(U|c) and P(U|~c) (the probabilities on the bottom level) by multiplying the whole expert group’s average P(U) of 17% by the average relative risk factor for each crux.⁴⁴

3.3 Skeptical superforecasters’ question ratings

2030 questions

Question	Mean POM	P(c)	RR (P(U\|c) / P(U))	Mean POM-z	Pairwise wins	n
Administrative disempowerment warning shot (CX30)	3.55%	16%	13	0.13	83%	8
Deep learning revenue (VL30)	1.68%	33%	2.5	-0.04	59%	7
Extinction-level pathogens feasible (CQ30)	1.37%	39%	1.9	0.57	75%	8
Deceptive AI warning shot (ZD30)	0.98%	32%	3.2	0.23	64%	8
AI involvement in nuclear arms (HB30)**	0.68%	18%	1.5	-0.07	50%	7
Kurzweil/Kapor longbet (STQ9)*	0.27%	43%	1.1	0	33%	8
AI arms race, multipolar result (NG30)	0.26%	39%	1.0	-0.28	33%	8
AI autonomous purchasing (EX30)	0.02%	35%	1.0	-0.55	3%	8

Table 3.3.1: Skeptical superforecasters’ 2030 question ratings
P(c) is the arithmetic mean of this group’s responses. RR (relative risk) is an arithmetic mean of each individual’s relative risk (P(U|c) / P(U)).
*Denotes external questions not generated as part of the conditional tree process.
**Denotes questions elicited in a supplementary survey round along with the status quo question set (see section 4.1). This round was only administered to the superforecaster sample.

Skeptical superforecasters’ top-rated question by mean POM was “Administrative disempowerment warning shot” (CX30), referring to a scenario in which “AI systems are preventing a substantial number of people from accessing financial assets, voting, moving freely, accessing the internet, or accessing certain types of housing or employment.” It scored ~3.6% of the theoretical maximum VOI score on average. However, this high value was driven by a single respondent, with the question achieving a remarkable 25% of the theoretical maximum VOI for this individual.⁴⁵ This is consistent with superforecasters in our sample preferring questions which refer to concrete AI-related harms, though the high variance in VOI ratings for this question suggests that there is no consensus on exactly which harms provide the clearest signal.

The top-rated question by POM-z, “Feasibility of extinction-level pathogens” (CQ30), refers to a scenario in which “A panel of relevant experts believes extinction-level pathogens are both possible and will be practically feasible relatively soon, at relatively low cost.” It is the question that respondents most agreed was informative, though the highest VOI rating any individual gave this question was only 5.2% of the theoretical maximum. Interestingly, this question does not refer to realized harm, but rather to favorable conditions for harm to take place. Such questions may gain a VOI advantage by omitting divisive or low-probability conditions that hinge on human motivations for misusing AI technologies.⁴⁶ It was the third most likely 2030 question to resolve positively.

No mean POM differences between questions were significant in this sample (after correcting for multiple testing using the Bonferroni correction, all p-values were equal to 1). Survey responses between filtering and main survey rounds were fairly similar, though with some notable differences. See Appendix 2.1 for further details on intra-individual response variability.
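
To see why every corrected p-value can end up equal to 1, consider the following minimal sketch of the Bonferroni adjustment. It assumes, hypothetically, that all pairs of the eight 2030 questions were compared; this section does not state the exact tests or the number of comparisons, and the raw p-values below are invented for illustration.

```python
from math import comb


def bonferroni(p_values, n_tests=None):
    """Bonferroni adjustment: multiply each raw p-value by the number of
    tests and cap the result at 1."""
    m = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * m) for p in p_values]


# Assumption: one test per pair of questions (8 questions -> 28 pairs).
n_questions = 8
n_pairs = comb(n_questions, 2)

# Hypothetical raw p-values, not taken from the study.
raw_p_values = [0.04, 0.20, 0.50]
adjusted = bonferroni(raw_p_values, n_tests=n_pairs)
print(n_pairs, adjusted)
```

Under this assumption any raw p-value of 1/28 (roughly 0.036) or more is adjusted to exactly 1, so a set of uniformly non-significant comparisons in a small sample would all report p = 1 after correction.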

**Figure 3.3.1:** Skeptical superforecasters’ 2030 P(c)

**Figure 3.3.2:** Skeptical superforecasters’ 2030 relative risk. Diamonds represent arithmetic means. Log scale. Relative risk >1 reflects a positive update, that is, where P(U|c) > P(U).

**Figure 3.3.3:** Skeptical superforecasters’ 2030 POM VOI. Diamonds represent arithmetic means.

**Figure 3.3.4:** Skeptical superforecasters’ 2030 POM VOI sensitivity matrix (pairwise wins). Visualization of resampling simulation results.

2050-2070 questions

| Question | Mean POM | P(c) | RR (P(U\|c) / P(U)) | Mean POM-z | Pairwise wins | n |
| --- | --- | --- | --- | --- | --- | --- |
| AI causing deaths, ineffectual response (CX50)** | 6.34% | 6% | 23 | 0.08 | 67% | 7 |
| Power-seeking behavior warning shot (ZA50) | 1.59% | 38% | 2.4 | 0.53 | 87% | 8 |
| High AI investment, low safety indicators (VL70) | 0.54% | 38% | 1.3 | 0.67 | 64% | 8 |
| No aligned AGI (CX70) | 0.37% | 34% | 0.8 | -0.21 | 48% | 8 |
| AI CEOs / Research productivity (EX50) | 0.26% | 21% | 1.3 | -0.17 | 35% | 8 |
| Less prosocial behavior / Failing institutions (HS50) | 0.26% | 31% | 1.0 | -0.30 | 32% | 8 |
| Human-machine intelligence parity (STQ247)* | 0.14% | 53% | 1.0 | -0.59 | 17% | 8 |

Table 3.3.2: Skeptical superforecasters’ 2050-2070 question ratings. P(c) is the geometric mean of odds of this group’s responses. RR (relative risk) is an arithmetic mean of each individual’s relative risk (P(U|c) / P(U)).
*Denotes external questions not generated as part of the conditional tree process.
**Denotes questions elicited in a supplementary survey round along with the status quo question set (see section 4.1). This round was only administered to the superforecaster sample.
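
The "geometric mean of odds" mentioned in the table note can be sketched as follows: each probability is converted to odds, the odds are averaged on the log scale, and the result is converted back to a probability. The function name and the example forecasts are ours and purely illustrative; the sketch assumes every probability lies strictly between 0 and 1.

```python
import math
from statistics import mean


def geometric_mean_of_odds(probabilities):
    """Pool probabilities via the geometric mean of their odds.

    Assumes every probability is strictly between 0 and 1, since odds
    are undefined at 1 and the log of zero odds is undefined.
    """
    log_odds = [math.log(p / (1.0 - p)) for p in probabilities]
    pooled_odds = math.exp(mean(log_odds))
    return pooled_odds / (1.0 + pooled_odds)


# Hypothetical forecasts of P(c), not taken from the survey.
example = [0.01, 0.5]
print(f"geometric mean of odds: {geometric_mean_of_odds(example):.3f}")
print(f"arithmetic mean:        {mean(example):.3f}")
```

Compared with an arithmetic mean, this pooling rule gives more weight to forecasts near 0 or 1, so the two aggregates can differ noticeably when respondents disagree.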

Skeptical superforecasters’ top-rated question by mean POM was “AI causing deaths, ineffectual response” (CX50), referring to a scenario in which “AI systems cause the deaths of >1m humans, which is not intended by any humans, and subsequently a panel of experts believes the collective global response has not noticeably reduced risk of similar events.” It scored ~6.3% of the theoretical maximum VOI score on average. Like the top-scoring 2030 question, this question benefits from a single strong opinion, and thus does relatively worse on both sensitivity analysis and POM-z. It is an even starker example of concrete harm caused by AI systems, but it was also among the least likely questions to resolve positively, at 6%.

While ranked first by neither mean POM nor POM-z, the most robust question in sensitivity analysis was “Power-seeking behavior warning shot” (ZA50), in which “AI developers shut down an expensive AI system after it displays a power-seeking behavior, such as hoarding resources, interfering with vital infrastructure, propagating itself, etc.” This question does not fit as clearly with superforecasters’ apparent preference for questions referring to concrete harm.

The highest-rated question by POM-z was “High AI investment, low safety indicators” (VL70), in which “Compute spending is high and experts agree that aligning AI systems is very difficult; and there is insufficient political attention to AI safety.” However, in absolute terms it was rated relatively low, at 0.5% POM.

“No aligned AGI” (CX70) is unique in this question set as the only question that, on average, updated superforecasters away from AI-related extinction (mean relative risk = 0.8x). Here respondents may have inferred that a world with no aligned AGI by 2070 was more likely to be a world with no AGI of any kind than a world with only unaligned AGI.

No mean POM differences between questions were significant in this sample (after correcting for multiple testing, all p-values were equal to 1). Survey responses between filtering and main survey rounds were broadly similar. See Appendix 2.1 for further details on intra-individual response variability.

2030 vs 2050/2070 questions

In the AICT question set (that is, all questions excluding STQ9 and STQ247) the average of POM responses for 2030 questions, 1.2%, was slightly lower than that of 2050-2070 questions, at 1.5%. Due to the small number of questions included and the small absolute size of the difference, this does not seem indicative of a genuine VOI difference between earlier and later questions in our set.

Responses for 2030 and 2050-2070 AICT questions suggested similar probabilities of positive resolution, 30% and 28% respectively (mean; interquartile range (IQR) = 5% – 50% and 10% – 40%).⁴⁷ However, the mean relative risk was lower for 2030 questions, at 3.6x (IQR = 1x – 1.5x), than for 2050-2070 questions, at 4.6x (IQR = 1x – 1.5x).
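
A mean relative risk well above the upper end of the interquartile range indicates a right-skewed distribution in which a few large updates pull the mean upward. The sketch below reproduces that pattern with hypothetical responses; the data are invented, and the "inclusive" quantile convention is our assumption rather than the method used in the study.

```python
from statistics import mean, quantiles

# Hypothetical relative-risk responses, not survey data.
# Most respondents barely update; one updates very strongly.
relative_risks = [1.0, 1.0, 1.0, 1.2, 1.5, 1.5, 2.0, 20.0]


def mean_and_iqr(values):
    """Return the arithmetic mean and the (Q1, Q3) interquartile bounds."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return mean(values), (q1, q3)


avg, (q1, q3) = mean_and_iqr(relative_risks)
print(f"mean = {avg:.2f}x, IQR = {q1:.2f}x to {q3:.2f}x")
```

In this toy sample the mean sits above the 75th percentile, which is the same qualitative shape as the reported means of 3.6x and 4.6x against an IQR of 1x to 1.5x.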

**Figure 3.3.5:** Skeptical superforecasters’ 2050-2070 P(c)

**Figure 3.3.6:** Skeptical superforecasters’ 2050-2070 relative risk. Diamonds represent mean values. Log scale. Relative risk >1 reflects a positive update, that is, where P(U|c) > P(U).

**Figure 3.3.7:** Skeptical superforecasters’ 2050-2070 POM VOI. Diamonds represent arithmetic means.

**Figure 3.3.8:** Skeptical superforecasters’ 2050-2070 POM VOI sensitivity matrix (pairwise wins). Visualization of resampling simulation results.