Published: Jul 1, 2025

Revised: Apr 30, 2026

Working paper

Forecasting LLM-enabled Biorisk and the Efficacy of Safeguards

This forecasting study on biological risks from large language models (LLMs) examined expert views on AI-enabled biosecurity threats. The study saw 46 biosecurity and biology experts, along with 22 superforecasters, predict how advancing LLM capabilities might increase the risk of a human-caused epidemic.

Bridget Williams¹*, Luca Righetti², Josh Rosenberg¹, Rebecca Ceppas de Castro¹, Otto Kuusela¹, Rhiannon Britt¹, Emily Soice³, Alvaro Morales³, Jon Sanders³, Seth Donoughe³, James Black⁴, Ezra Karger¹,⁵, Philip E. Tetlock¹,⁶

1 = Forecasting Research Institute
2 = Centre for the Governance of AI
3 = SecureBio
4 = Johns Hopkins Center for Health Security
5 = Federal Reserve Bank of Chicago
6 = Wharton School of the University of Pennsylvania
*Corresponding author: bridget@forecastingresearch.org

Abstract

Capabilities of large language models (LLMs) on several biological benchmarks have prompted excitement about their usefulness for beneficial research, but also concern about potential biosecurity risks. We recruited 46 subject-matter experts in biology and biosecurity, and 22 generalist forecasters to estimate the risks of growing LLM capabilities. The median expert predicted a 0.3% baseline annual risk of a human-caused epidemic that causes 100,000 deaths. This estimate then rose to 1.5% conditional on several hypothetical LLM capabilities, including matching the performance of a top-performing team of virologists on a virology troubleshooting test. Given this finding, we conducted a baselining study and found that LLMs have already crossed this performance threshold. The median respondent thought that this would not happen until after 2030. More encouragingly, experts reduced their risk forecast close to baseline (0.4%) conditional on the adoption of LLM safeguards and mandatory nucleic acid screening.

Acknowledgments

This research would not have been possible without the support of Open Philanthropy. We greatly appreciate the assistance of Dan Mayland, Holden Karnofsky, Victoria Schmidt, Rory Svarc, Tessa Alexanian, Kayla Gamin, and Nadja Flechner throughout the project, and others who gave feedback on earlier drafts of this paper. Lastly, we extend our gratitude to our research participants for their invaluable contributions.

Disclaimers

The views expressed in this paper do not necessarily reflect those of the Federal Reserve Bank of Chicago or the Federal Reserve System.

Main

Large language models (LLMs) have recently shown strong improvements in biological capabilities and now outperform PhD-level experts on a variety of biology benchmarks.¹ Similarly, LLMs have shown early promise in providing scientific tutoring² and assisting with the conduct of scientific research.³ While there are still clear limitations to how useful LLMs can be in science,⁴ there is a clear trend that new models are more capable than their predecessors.

Numerous observers, including leaders of frontier AI companies, recognize both the benefits and risks that such capabilities could bring in the near future.⁵ OpenAI, Google DeepMind, and Anthropic have all released policies to prevent the misuse of LLMs in biology and run capability evaluations on new models ahead of their commercial deployment.⁶ Recently, Anthropic announced that it provisionally implemented a stronger security standard for its latest model release, since it could not rule out that it might significantly assist with CBRN-weapons-related tasks of concern.⁷ OpenAI has announced that it is preparing similar mitigations.⁸ However, it is still unclear which empirical evaluation results would indicate that LLMs present a meaningful increase in risk.⁹ It is also unclear what sorts of mitigation measures would then be most helpful in reducing such risk while preserving the power of the models to advance scientific work.

Forecasting the probability of biological threats is challenging due to the rarity of such events and the complex interplay of technical, social, and political factors involved.¹⁰ Previous surveys have found a wide range of views. In a 2005 study of 83 nonproliferation and national security experts, the median respondent gave a 10% probability of a major biological weapons attack within 5 years, with individual responses ranging from 0% to more than 80%.¹¹ A 2009 survey of biological scientists found a mean forecast of roughly 50% probability of a bioterrorist attack within 5 years.¹² Similarly, a 2015 survey of relevant experts found a mean forecast of roughly 50% probability of a biological weapons attack causing more than 100 cases of illness within 10 years.¹³

Research in forecasting and expert elicitation has found that expert predictions can be made more accurate through careful question design and aggregation of responses.¹⁴ Therefore, we designed an exercise to elicit opinion from a large and varied group of subject-matter experts and top-performing generalist forecasters with the aim of leveraging judgmental forecasting techniques to i) assess views on biological risks from rapidly improving LLMs and ii) understand the degree to which these views track short-term advances in LLM capabilities.

In our study, between December 2024 and February 2025, participants completed a survey that asked them to forecast the risk of a large-scale pathogen outbreak arising from human-caused accidents or misuse in 2028, and then to say how their forecasts would change conditional on several hypothetical LLM capabilities and mitigation measures. The capabilities scenarios refer to hypothetical studies conducted in the first quarter of 2026. We chose to ask about the annual risk in the year 2028 to allow for a lag between dangerous capabilities emerging and these resulting in harmful outcomes—such as due to a lag in the adoption of technology. The scenarios of future LLM capabilities were chosen to cover a range of pathways by which experts have suggested LLMs could plausibly facilitate the development of bioweapons; these scenarios also closely correspond to dangerous capability evaluations and biological benchmarks in the existing literature. Those evaluations measure the ability of an LLM to assist in biological weapon ideation, pathogen development, attack planning, and evading existing biosecurity safeguards.¹⁵ We then directly measured whether one of these scenarios had been met: whether LLMs match the top performance of five teams of human experts on a virology troubleshooting questionnaire.

Results

A total of 46 people with expertise in biosecurity and/or wet lab biology (henceforth “experts”) and 22 top-performing generalist forecasters (“superforecasters”) completed the survey. Of the experts, 27 (59%) reported expertise in both biosecurity and wet lab biology research, while the remainder reported expertise in just one of the two domains (24% biosecurity-only; 17% wet lab biology only). The expert group’s median number of years of experience was seven years for biosecurity work and eight years for wet lab research. Most experts had a doctorate (78%). The most common area of study for experts was a subfield of biology (46%) or medicine (26%). We used a diversified sampling strategy to identify participants. This included faculty of top-ranked molecular biology labs, members of the Engineering Biology Research Consortium, attendees of major AI-biosecurity workshops, researchers at biosecurity-focused think tanks, and additional researchers identified via Google Scholar search. The full sampling strategy and more details of participants are provided in the Supplementary Materials.

Forecasts of baseline risk

Participants were asked to answer the following question, detailed resolution criteria for which are available in the Supplementary Materials:

What is the likelihood that a human-caused release of a pathogen occurs in 2028, and leads to at least 100,000 deaths in excess mortality or $1 trillion in damage within 3 years?

Figure 1 reports the participants’ baseline risk responses. The median expert’s response was 0.3% annual probability of such a catastrophe (interquartile range, IQR 0.01–2%). Superforecasters had a similar median of 0.38% (IQR 0.1–1.21%). There was considerable variation in responses, with forecasts spanning several orders of magnitude.

Some of the heterogeneity in responses might be explained by differences in how accurately participants assign numbers to low-probability events. To test this, we looked at three measures of participants’ forecasting accuracy: the ability to assess the frequency of ten other low-probability events (e.g., the probability that a randomly chosen person in the U.S. is a neurosurgeon), the ability to correctly predict recent progress on LLM benchmarks, and the ability to predict the views of other survey respondents (a measure that has previously been correlated with forecasting accuracy in other domains).¹⁶

For each of these measures we split the participants into two groups: a higher-performing group composed of the top-scoring half and a lower-performing group composed of the bottom-scoring half. On each accuracy measure, the higher-performing group generally had higher baseline risk forecasts, and this was statistically significant for two of the three measures. Participants who better predicted other participants’ views forecasted a considerably higher median probability of a human-caused pandemic than those who were less accurate on this measure (0.93% vs 0.08%, p=0.04). We also asked participants to forecast whether LLMs would have several specific capabilities by 2026. Some of these capabilities have since arisen and so we could resolve these forecasts. Participants who more accurately predicted whether LLMs would have these capabilities by 2026 also gave higher forecasts of baseline risk relative to those who were less accurate on this task (1.1% vs 0.1%, p=0.02).
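The median-split comparison described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the section reports p-values without naming the test behind them, so the permutation test on the difference in group medians is an assumption, as are the function names `median_split` and `permutation_p_value`. The scores and forecasts are invented for illustration, and the sketch uses Python although the authors report working in R.

```python
import random
import statistics

def median_split(scores, forecasts):
    """Split forecasts into top- and bottom-scoring halves by accuracy score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    half = len(order) // 2
    top = [forecasts[i] for i in order[:half]]
    bottom = [forecasts[i] for i in order[half:]]
    return top, bottom

def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in medians.

    ASSUMPTION: the source does not say which test produced its p-values.
    """
    rng = random.Random(seed)
    observed = abs(statistics.median(a) - statistics.median(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.median(pooled[:len(a)])
                   - statistics.median(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical accuracy scores and baseline forecasts in percent (NOT study data).
scores = [9, 8, 7, 7, 6, 5, 4, 3, 2, 1]
forecasts = [1.0, 2.0, 0.5, 0.9, 1.5, 0.1, 0.05, 0.2, 0.01, 0.08]

top, bottom = median_split(scores, forecasts)
p_value = permutation_p_value(top, bottom)
print(statistics.median(top), statistics.median(bottom), round(p_value, 3))
```

A permutation test is used here because it makes no distributional assumptions, which suits forecasts that span several orders of magnitude.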

Fig. 1: Participants and baseline forecasts

**Figure 1:** Forecasts of the probability of a human-caused epidemic in 2028 which, within a 3-year period, causes more than 100,000 deaths and/or more than $1 trillion in damages, disaggregated by participant characteristics. Black dots indicate group medians and black line segments indicate the bootstrapped 95% confidence intervals around the medians. Individual forecasts are shown as points and color-coded to identify their provenance from the superforecaster or expert group. The x-axis uses a logarithmic scale to make it easier to see variation in forecasts in the 0–10% range. Very few participants gave forecasts of 0%. Most points that appear on the 0% line represent very small, non-zero forecasts.
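The bootstrapped confidence intervals around group medians mentioned in the caption can be illustrated with a percentile bootstrap. This is a sketch under stated assumptions: the source does not specify the bootstrap variant or the number of resamples, so the percentile method, the 2,000 resamples, and the name `bootstrap_median_ci` are illustrative choices, and the forecasts are invented.

```python
import random
import statistics

def bootstrap_median_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median.

    ASSUMPTION: the percentile method and resample count are illustrative;
    the source only says the intervals are bootstrapped.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return medians[lo_idx], medians[hi_idx]

# Hypothetical forecasts in percent (NOT the study's data).
forecasts = [0.001, 0.01, 0.05, 0.1, 0.3, 0.3, 0.5, 1.0, 2.0, 5.0, 10.0]

point = statistics.median(forecasts)
low, high = bootstrap_median_ci(forecasts)
print(point, low, high)
```

Because the median of a resample is always one of the observed values (or the midpoint of two), the interval endpoints can only take a limited set of values when the sample is small.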

Most participants considered several factors in their forecast rationale, including the historical base rates of analogous events (which some participants thought should be zero while others pointed to the 1977 H1N1 Russian flu outbreak¹⁷ as a potential human-caused outbreak), the relative probabilities of accidental versus intentional releases, the number and location of BSL3 and BSL4 labs, the potential for AI systems to increase biorisk, the motivation of potential actors involved and possible changes if major global conflicts were to increase, and academic studies that attempt to model potential future pandemics. Examples of forecast rationales are provided in the Supplementary Materials.

Change in risk conditional on LLM capabilities

Next, we studied whether participants would increase their baseline estimate of biorisk if leading LLMs were to exhibit large and measurable increases in biological capabilities. We asked participants how they might change their predictions in response to various scenarios in the first quarter of 2026 if LLM evaluations find specific empirical results. The scenarios referred to performance on five different evaluations: two of these measure an LLM’s performance relative to experts on knowledge relevant to biorisks (i.e. benchmarks), and three of them measure an LLM’s ability to enable human actors to succeed at relevant tasks (i.e. human uplift).

These scenarios were based on existing LLM biology capability evaluations or other possible evaluations discussed in the biosecurity literature. The knowledge evaluation scenarios involved the Virology Capabilities Test (VCT)¹⁸ as well as a long-form biorisk questions test conducted by OpenAI.¹⁹ The human uplift scenarios included a study that assesses LLM’s ability to help humans plan bioweapons attacks that was first performed and evaluated by RAND in 2023,²⁰ and two other hypothetical studies inspired by discussion in the biosecurity literature: assessing an LLM’s ability to assist novices to acquire synthetic DNA fragments from the 1918 pandemic influenza virus,²¹ and a study evaluating an LLM’s ability to assist with laboratory tasks (expanding on plans announced by OpenAI with the Los Alamos National Laboratory).²² Figure 2 summarizes these scenarios (see the Supplementary Materials for more detailed descriptions of the scenarios).

Fig. 2: Effects of hypothetical evaluation results on forecasts

**Figure 2:** Forecasts of the probability of a human-caused epidemic in 2028 that within a 3-year period causes more than 100,000 deaths and/or more than $1 trillion in damages: unconditional (baseline) and conditional on the hypothetical evaluation results. Black dots indicate group medians and black line segments indicate the bootstrapped 95% confidence intervals around the medians. Individual forecasts are shown as points. The forecasts for each set of questions related to an evaluation include only the subset of the sample who gave consistent forecasts across that set. The median baseline forecast for this subset of participants is shown in gray and is sometimes different from the overall group median baseline shown in Figure 1. (See the Supplementary Materials for more details.) The x-axis uses a logarithmic scale to make it easier to see variation in forecasts in the 0–10% range.

For experts, the largest increases in estimated risk were from two conditions: a randomized controlled trial finding that LLMs enable half of non-experts to successfully synthesize an influenza virus in a wet lab setting, and LLMs matching the top-performing team of expert virologists on a virology troubleshooting questionnaire. Conditional on these capabilities emerging, the median expert forecast of the annual risk increased to 1.25% and 1.5% respectively, which are significant changes from the baseline (Wilcoxon p < 0.0001 for both). The median superforecaster also increased their risk estimate significantly for the wet lab study threshold to 1.5%—but less so for the virology troubleshooting to 0.7% (Wilcoxon p < 0.0001 for both).
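The comparison of each participant's conditional forecast with their own baseline can be sketched as a paired Wilcoxon signed-rank test. The source names the Wilcoxon test but not its configuration, so the paired variant, the exact two-sided p-value, and the dropping of zero differences are assumptions. The function name and the data are hypothetical. In practice one would likely call an existing routine such as R's `wilcox.test(..., paired = TRUE)` or SciPy's `scipy.stats.wilcoxon`.

```python
def wilcoxon_signed_rank(baseline, conditional):
    """Exact two-sided Wilcoxon signed-rank test for paired forecasts.

    Returns (W_plus, p_value). Zero differences are dropped (an assumption).
    """
    diffs = [c - b for b, c in zip(baseline, conditional) if c != b]
    n = len(diffs)
    # Mid-ranks of |difference|, stored doubled so tied ranks stay integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # twice the average of ranks i+1..j+1
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)
    # Null distribution: every sign pattern is equally likely, so count the
    # subsets of ranks reaching each possible (doubled) rank sum.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    denom = 2 ** n
    p_low = sum(counts[: w_plus2 + 1]) / denom
    p_high = sum(counts[w_plus2:]) / denom
    return w_plus2 / 2, min(1.0, 2 * min(p_low, p_high))

# Hypothetical paired forecasts in percent (NOT the study's data).
baseline = [0.1, 0.2, 0.3, 0.5, 1.0, 0.05, 0.4, 2.0]
conditional = [0.3, 0.5, 0.9, 1.5, 3.0, 0.1, 1.2, 5.0]

w_plus, p_value = wilcoxon_signed_rank(baseline, conditional)
print(w_plus, p_value)
```

With eight pairs that all move in the same direction, the exact two-sided p-value is 2/2⁸, the smallest value attainable at that sample size. Very small p-values like those reported therefore require larger samples.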

When two or more capabilities were considered together, increases in risk were greater still. If a 10% success rate in non-experts’ pathogen synthesis, a significant uplift in bioweapons attack planning ability, and acquiring dual-use DNA were considered together, risk estimates increased by more than their respective marginal risk estimates combined. The median expert’s annual risk forecast increased to 2.3% conditional on these capabilities emerging, which was also a statistically significant increase from baseline (Wilcoxon p < 0.0001).
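The claim that the combined increase exceeds the sum of the marginal increases can be made concrete with a short check. The baseline (0.3%) and the combined forecast (2.3%) are the expert medians reported in the text. The three single-capability forecasts are hypothetical placeholders, since this section does not list them, and the function name is likewise an assumption.

```python
def superadditivity(baseline, individual, combined):
    """Compare the combined increase in risk with the sum of marginal increases.

    All inputs are annual risk forecasts in percent.
    """
    marginal_sum = sum(risk - baseline for risk in individual)
    combined_increase = combined - baseline
    return combined_increase > marginal_sum, combined_increase, marginal_sum

baseline = 0.3   # median expert baseline reported in the text
combined = 2.3   # median expert forecast with all three capabilities
# HYPOTHETICAL single-capability forecasts, for illustration only.
individual = [0.6, 0.7, 0.5]

is_super, combined_increase, marginal_sum = superadditivity(
    baseline, individual, combined
)
print(is_super, round(combined_increase, 2), round(marginal_sum, 2))
```

Note that Figure 2 computes each comparison on the subset of participants with consistent forecasts, whose median baseline can differ from the overall 0.3%, so a check on published medians is only indicative.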

Timeline of advances in LLM capabilities

We next gauged the views of the participants about the probability of observing, in 2026, evaluation results that matched the hypothetical scenarios. Further, for a subset of scenarios, we asked when participants thought the corresponding thresholds would be achieved, if ever. Again, there was a divergence of views. However, many participants didn’t expect any of the specified scenarios would be achieved in 2026 (median expert probabilities ranged from 0.1% to 42.5% across the scenarios). When asked when each of a subset of the scenarios’ thresholds would be crossed, most respondents suggested they were instead more likely to occur between 2030 and 2045 (see Figure 3 below). Only a small number of respondents—between three and five experts and at most one superforecaster—thought that any of the thresholds would not be achieved before 2100.

Fig. 3: The timing of evaluation results being achieved

**Figure 3a:** Forecasts of the median year of evaluation results being achieved, assuming the evaluations were to be run each year. Group median forecast is shown in text.

**Figure 3b:** Forecasts of the probability of the evaluation result being achieved assuming the study is run in the first quarter of 2026. In both panels, black dot indicates group median and black line indicates 95% CI for group median.

However, after participants completed this forecasting survey (between November 2024 and February 2025) but before the publication of the present article describing its results, a paper was released in April 2025 showing that several LLMs already outperform the median expert virologist on the VCT benchmark.²³ Therefore, one of the hypothetical scenarios of LLM performance in the forecasting survey had already come to pass.

The forecasting survey also included a more extreme scenario: if the most performant LLM were to match the performance of the top team out of five teams of expert virologists on VCT. To evaluate whether this scenario had also been achieved, we conducted a team baselining study. The results of the team baselining study show that OpenAI’s o3 model performs comparably to the top-performing of the five teams of expert virologists. (The details of the ‘top out of five teams of expert virologists answering VCT questions’ scenario, as described in the forecasting survey, were very similar, but not identical, to the team baselining procedure we carried out; see Methods for details.) The median expert in the forecasting study thought this was 14% likely to occur by 2026 and that the most likely date for it to occur was 2030. For superforecasters the numbers were 2% and 2034 respectively. Claude 4 Opus, released in May 2025, performs notably worse than all other AI models as it refuses to answer many of the VCT questions. This may be a result of the additional security measures implemented by Anthropic at the launch of this model.²⁴

Fig. 4: LLM and virologist team performance on the Virology Capabilities Test

**Figure 4:** Performance of LLMs and five teams of virologists on the VCT. For reference, the score achieved by random guessing and the score achieved by the median individual expert in Götting et al. (2025) are also shown. Refusal to answer a question is counted as 3+ errors in response.

It is likely that the long-form biorisk capability scenario has also been achieved. In this scenario, 90% of LLM responses to long-form biorisk questions are assessed as being preferable to answers provided by human experts. Responses would be scored on several dimensions: accuracy, clarity, and feasibility. The relevant benchmark is run in-house by OpenAI. Their previous o1 pre-mitigation model scored 75% in December 2024. Their newer o3 model in April 2025 markedly outperforms o1 across test indicators but the specific ‘expert human preference win-rate’ metric we use for our scenario was not reported.²⁵ Fitting the available data to an exponential curve suggests a 60% chance that the true preference rate already exceeds the 90% threshold specified in the scenario (see the Supplementary Materials for more details). The median expert thought this threshold—LLM responses preferred over expert responses 90% of the time—was most likely to occur in 2030, and assigned a 10% probability to it being achieved by 2026.

The impact of mitigation measures

Finally, we asked participants to state how their forecasts would change, conditional on several mitigation measures also being in place in addition to some of the LLM scenarios. These measures addressed two key pathways for risk mitigation that have been suggested in the literature: AI model safeguards, and screening customers and orders of synthetic nucleic acids. These measures were chosen based on a review of published recommendations for reducing the biosecurity risks of LLMs.²⁶ In total, we asked participants to consider six mitigation scenarios, which varied in terms of i) whether or not synthetic nucleic acid providers were required to conduct screening and ii) the types of AI model safeguards in place.

For synthetic nucleic acid screening, the baseline scenario involved providers in the US, China, the EU, and the UK being encouraged—but not legally required—to screen customers and orders against a regulated sequence list. In the stricter scenario, providers in these countries were legally required to conduct such screening and verification.

For the AI model safeguards, there were three aspects to the scenarios: i) whether the models were open-weight or proprietary, ii) if models were proprietary, whether there were standard or stricter measures—including red-teaming exercises, bug bounty programs and rapid response teams—to prevent model “jailbreaking” (i.e., subverting the safeguards that prevent models from giving out potentially dangerous information) and iii) whether there was a structured access program to limit the use of LLMs that have been trained on dangerous dual-use information. (For more detail on how these scenarios were described, see the Supplementary Materials.)

To evaluate the impact of these mitigation measures, participants were asked to assume that an LLM could enable either 10% or 50% of non-experts to synthesize an influenza virus in a randomized controlled trial. The absolute probabilities of biorisk catastrophe under a variety of mitigation scenarios are shown in Figure 5.

Fig. 5: Effects of mitigation measures

**Figure 5a:** Description of the mitigations scenarios participants were asked to consider.

**Figure 5b:** Absolute risk probability of a human-caused epidemic in 2028, unconditionally, conditional on scenarios where LLMs enable 10% or 50% of non-experts to synthesize influenza, and conditional on the scenarios with various mitigations. The lines and text show the expert group for each scenario. The shaded area shows bootstrapped 95% confidence intervals for the expert median. NA = nucleic acid.

Participants believed that the mitigation scenario involving proprietary frontier model weights, strict jailbreaking safeguards, and mandatory synthetic nucleic acid screening (P3) would yield the largest reduction in risk. In particular, the median expert’s risk estimate under the “AI enables 50% of non-experts to synthesize influenza” scenario decreased from 1.25% to 0.4%, approaching the median expert’s original baseline. Many participants expressed concerns that open-weight models pose higher risks than proprietary models for two main reasons: i) open-weight models can be finetuned to have specialized capabilities, and ii) unlike proprietary models, malicious use of open-weight models will not attract the attention of AI companies, which could trigger a law enforcement response.

We compared participants’ risk estimates under different mitigation schemes to assess the impact of each component separately. In the “AI enables 50% of non-experts to synthesize influenza” scenario, requiring nucleic acid synthesis screening alone reduced the risk by 0.35 percentage points (p.p.) for the median expert and 0.14 p.p. for the median superforecaster. Requiring models to be proprietary with strict anti-jailbreaking measures reduced risk by 0.4 p.p. for the median expert and 0.24 p.p. for the median superforecaster (see the Supplementary Materials).
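To keep absolute and relative reductions distinct, the following sketch works through the expert medians reported above for the "50% of non-experts" scenario: 1.25% without mitigations, 0.4% under the strongest mitigation scenario, and a 0.3% baseline. The function name is an assumption. The results are simple arithmetic on published medians, not a re-analysis of participant-level data.

```python
def reduction(before, after):
    """Return (absolute reduction in percentage points, relative reduction)."""
    absolute_pp = before - after
    relative = absolute_pp / before
    return absolute_pp, relative

baseline = 0.3       # median expert baseline, percent
unmitigated = 1.25   # median expert forecast, 50% synthesis scenario, percent
mitigated = 0.4      # same scenario under the strongest mitigations, percent

absolute_pp, relative = reduction(unmitigated, mitigated)
# Share of the increase above baseline that the mitigations remove.
share_of_increase = (unmitigated - mitigated) / (unmitigated - baseline)

print(f"{absolute_pp:.2f} p.p., {relative:.0%} relative, "
      f"{share_of_increase:.0%} of the increase above baseline")
```

The component effects quoted in the text (0.35 p.p. and 0.4 p.p. for experts) need not sum to this 0.85 p.p. figure. One possible reason is that a median of individual changes generally differs from a change in medians.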

Discussion

This study provides, to the best of the authors’ knowledge, the first systematic assessment of how experts in molecular biology and biosecurity, along with superforecasters, view the biosecurity risks posed by advancing LLM capabilities. We found that many experts and superforecasters believed that certain measurable LLM capabilities would meaningfully increase the annual risk of a large-scale human-caused epidemic. In particular, LLMs matching the performance of teams of experts on a virology troubleshooting questionnaire (the VCT) or enabling non-experts to successfully synthesize a living virus were associated with a substantial increase in risk. This suggests that many expert participants saw troubleshooting and tacit knowledge as an especially large hurdle for biological misuse, so that risk would increase if future LLMs were to meaningfully help overcome it. Such views are also found in the biosecurity literature.²⁷

Critically, this study demonstrated that many experts and superforecasters alike are substantially underestimating the pace of LLM progress in biology, including in capabilities associated with a substantial increase in risk. We found that current LLMs already match the performance of teams of experts on the Virology Capabilities Test. Furthermore, it seems likely that an additional scenario (experts strongly preferring LLM responses to long-form biorisk questions) has also been achieved. We did not assess the other LLM capabilities given the additional resources that would be required to do so, and therefore it is uncertain whether these have also been achieved. This mismatch between expert predictions and reality highlights the rapid pace of advancement in LLM capabilities relevant to biological research and underscores the urgency of fostering deeper expert collaboration across fields and developing appropriate governance frameworks.

More positively, most participants believe that mitigation measures can also meaningfully reduce the increase in risk. Some of these measures require action by governments, such as introducing a requirement that synthetic nucleic acid companies conduct customer and order screening. Others require action from the developers of AI, such as implementing safeguards to prevent model misuse. When prompted to consider the possible trade-offs required by mitigation measures (e.g., the possibility of measures slowing scientific progress) if a randomized controlled trial were to find that LLMs enable 10% of non-experts to synthesize influenza, most participants reported that they would be in favor of such measures being implemented, particularly AI model safeguards (see the Supplementary Materials for more detail).

This study has limitations that future work should address. The present study only investigated one consequence of LLM capabilities: the risk of a large (>100,000-mortality) human-caused epidemic. It does not attempt to quantify other risks of LLM capabilities, or the effects of any potential offsetting benefits from LLM capabilities for beneficial scientific research. Most participants reported favoring mitigation measures. However, work that examines these trade-offs more closely, and in quantitative terms, would add a useful perspective to complement our work. For example, prospect theory suggests that, before approving a policy, decisionmakers would need to see a greater number of lives saved by extending human life expectancy than lives expected to be lost from epidemics.²⁸ Other schools of thought may reject such trade-offs on precautionary principle grounds.²⁹ Therefore, it’s important to note that these results should only be considered one input among many into AI and biosecurity policy choices.

This study was also limited to the implications of LLM capabilities, rather than AI more broadly. Progress in AI biological design tools is also advancing rapidly.³⁰ This progress is likely to have important implications for the risk of human-caused epidemics,³¹ which we did not explore in this study and that future work may address.

Although we used a systematic sampling strategy (described in the Supplementary Materials) and the responses exhibited a large array of views on the baseline risks, it is possible that people who agreed to participate were more likely to be concerned about these risks than their peers who declined. To offset this potential self-selection bias, we took a diversified sampling approach that recruited expert participants from several sources. We also included a sample of superforecasters, who may be less likely than experts to have preconceived views on biorisks or to have incentives that may bias responses.

The reliability of this study’s results depends on the skill and effort exerted by the participants. It’s clear that humans—and in some cases experts in particular—are subject to important cognitive biases that can impair their ability to accurately predict future events, including risks of human-caused epidemics.³² To offset these biases we had participants complete a calibration exercise before making forecasts, prompted them to consider relevant information, including the history of bioweapons and laboratory escape events, and asked them to consider what a reasonable range of forecasts would be and the possible rationales for higher or lower forecasts than their own (see Supplementary Materials for details). While there is evidence that exercises such as these can increase predictive accuracy, it is likely that more in-depth training would yield more accurate results.³³

This study offers insight into how experts are thinking about the potential biological risks posed by LLMs and serves as a foundation for ongoing discussions about AI governance and risk assessments in highly complex and uncertain domains. As AI companies begin to implement additional mitigation measures to prevent the misuse of their models, understanding the views of experts clarifies what capabilities ought to prompt additional measures and what those measures should be. The widespread underestimation of the pace of AI progress by our sample highlights the need for proactive rather than reactive approaches to expert collaboration and governance. By combining multiple mitigation measures that address different aspects of the risk pathway (from model access to synthetic nucleic acid screening), it may be possible to realize the benefits of LLMs in biology while mitigating their risks.

Methods

Survey development

To develop the survey, we undertook an iterative process whereby researchers developed an initial set of forecasting questions quantifying the marginal effect of LLMs on the ability of non-experts to synthesize pathogens. We collected answers on these questions from a small group of experts and superforecasters, and we then revised the questions in light of how they interpreted them, clarifying definitions and increasing the precision of each question. We conducted five rounds of this iterative question improvement process because forecasts can be highly sensitive to the precise wording of a question and its resolution criteria. We also conducted a pilot study with a sample of 21 participants and performed a final round of updates to the survey questionnaire before beginning data collection.

The survey was administered as a Google Sheet or Excel Spreadsheet, which can be viewed here. We invited participants to make a copy of the survey and fill in their responses over the course of several weeks. We also provided a document that gave detailed instructions on the survey, including detailed descriptions of the questions and scenarios included in the survey. This document can be viewed here.

Participant recruitment

In our recruitment, we targeted expert participants with expertise in biosecurity and/or molecular and synthetic biology. We used a diversified sampling strategy to identify participants. This included faculty of top-ranked molecular biology labs, members of the Engineering Biology Research Consortium, attendees of major AI-biosecurity workshops, researchers at biosecurity-focused think tanks, and additional researchers identified via Google Scholar search. The full sampling strategy is available in the Supplementary Materials. In total, we invited over 1500 experts to participate in the study. As mentioned, 46 experts completed the full survey. Therefore, our participation rate was roughly 3%. This low response rate was likely influenced by the length of the survey. When inviting possible participants, we noted that we expected participation to take between 5 and 15 hours.

We also recruited top-performing generalist forecasters (“superforecasters”). These are people who consistently scored in the top 2% of the Intelligence Advanced Research Projects Activity (IARPA) Aggregative Contingent Estimation (ACE) program or had high predictive accuracy in subsequent forecasting exercises run by Good Judgment, Inc.

To incentivize engagement, we paid participants for the time they spent completing the survey. Experts were paid $125 per hour and superforecasters $50 per hour, in both cases up to a maximum of 20 hours. The median compensation per expert participant was $1,281.25. Participants spent considerable time on the exercise: the median self-reported time was 10 hours for experts and 14 hours for superforecasters. Most participants provided detailed rationales for their forecasts, with a median of roughly 2,000 words written per participant across all forecasting questions.

Data cleaning and analysis

We aggregated and cleaned the data submitted by participants and then analyzed it in R. We used the median as the default method for aggregating forecasts. For the questions about when evaluation results would be achieved, participants were asked for their 5th, 50th, and 95th percentile forecasts. These were aggregated by first fitting a maximum entropy distribution to each participant’s percentiles and then averaging the resulting densities across participants. Data cleaning included a series of validation tests that checked responses for logical coherence and consistency. When inconsistencies were identified, we reviewed the individual’s responses to determine whether each was likely a typographical error or an intended response. Clear typographical errors (such as the automated percentage formatting being accidentally removed) were corrected.
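
The percentile aggregation described above can be sketched as follows. This is a minimal Python illustration, not the R code used in the study, and it rests on assumptions the text does not state: that each forecast is expressed as a decimal year, that all forecasts share a bounded support [lo, hi], and that the only constraints are the three percentiles. Under those assumptions the maximum entropy distribution is piecewise uniform, with probability 0.05, 0.45, 0.45 and 0.05 spread evenly over the four intervals between consecutive cut points. The function names and the example forecasts are hypothetical.

```python
from bisect import bisect_right

# Probability mass between consecutive cut points lo, q05, q50, q95, hi.
MASSES = (0.05, 0.45, 0.45, 0.05)


def maxent_density(q05, q50, q95, lo, hi):
    """Return the piecewise-uniform density on [lo, hi) that matches the
    5th, 50th and 95th percentiles (the bounded support is an assumption)."""
    edges = [lo, q05, q50, q95, hi]
    if any(b <= a for a, b in zip(edges, edges[1:])):
        raise ValueError("require lo < q05 < q50 < q95 < hi")
    heights = [m / (b - a) for m, a, b in zip(MASSES, edges, edges[1:])]

    def density(x):
        if x < lo or x >= hi:
            return 0.0
        # bisect_right locates the interval [edges[i], edges[i + 1]) holding x.
        return heights[bisect_right(edges, x) - 1]

    return density


def average_density(forecasts, lo, hi, grid):
    """Average the individual densities pointwise over a grid of x values."""
    densities = [maxent_density(q05, q50, q95, lo, hi)
                 for q05, q50, q95 in forecasts]
    return [sum(d(x) for d in densities) / len(densities) for x in grid]


# Hypothetical forecasts (5th, 50th, 95th percentile years), not study data.
forecasts = [(2026.0, 2030.0, 2040.0), (2025.5, 2028.0, 2035.0)]
lo, hi, step = 2025.0, 2050.0, 0.01
grid = [lo + (i + 0.5) * step for i in range(round((hi - lo) / step))]
pooled = average_density(forecasts, lo, hi, grid)
total_mass = sum(pooled) * step  # midpoint-rule integral of the pooled density
print(f"pooled density integrates to {total_mass:.4f}")
```

Averaging densities in this way is a form of linear opinion pooling: each participant contributes equal weight at every point of the support.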

Separately, we reviewed all forecasts and written rationales for signs of misinterpretation. This review suggested that several participants may have misunderstood the descriptions of the influenza synthesis evaluations. They appeared to read them as describing a situation in which the proportion of non-experts able to successfully synthesize influenza virus increases by 10% (or 50%), rather than one in which AI enables a total of 10% (or 50%) of non-experts to succeed at the task.
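
The gap between the two readings is easiest to see with numbers. The Python sketch below uses a purely hypothetical baseline success rate of 1% for non-experts without AI assistance; the study does not report this figure, and the function names are illustrative. It also shows two ways a participant might have read “increased by 10%” (as a relative change or as added percentage points), since the text does not say which misreading occurred.

```python
def intended_reading(threshold):
    """AI enables `threshold` of non-experts to succeed in total."""
    return threshold


def relative_increase_reading(baseline, threshold):
    """Misreading: the baseline success rate grows by a relative `threshold`."""
    return baseline * (1 + threshold)


def additive_increase_reading(baseline, threshold):
    """Misreading: `threshold` is added to the baseline as percentage points."""
    return min(1.0, baseline + threshold)


HYPOTHETICAL_BASELINE = 0.01  # assumed for illustration only, not from the study

for threshold in (0.10, 0.50):
    print(
        f"threshold {threshold:.0%}: "
        f"intended {intended_reading(threshold):.1%}, "
        f"relative {relative_increase_reading(HYPOTHETICAL_BASELINE, threshold):.1%}, "
        f"additive {additive_increase_reading(HYPOTHETICAL_BASELINE, threshold):.1%}"
    )
```

With a low baseline, the relative reading describes a far smaller capability gain than the intended one, so forecasts made under it would likely not be directly comparable with the rest.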

As it was unclear how many participants had misinterpreted the question in this way, we contacted all participants to advise them of the correct interpretation and invited them to update their forecasts if they had misinterpreted it. We also notified participants whose forecasts contained inconsistencies and asked whether they would like to update their responses. A summary of the inconsistencies identified and how participants responded to them is provided in the Supplementary Materials.

The analysis presented in this paper uses the updated responses from participants and excludes responses that are clearly logically incoherent. For a summary of the responses removed from the data, see the Supplementary Materials. We also ran the analysis on the original data provided by participants, changing only clear typographical errors. These results are provided in the Supplementary Materials and do not change the main conclusions of this paper.

VCT baselining study

We recruited a total of 14 virology experts to complete five group sessions, each with a group of five experts. Some experts took part in more than one group, but any two groups shared at most two members. Each session lasted 4 hours and included 20 VCT questions tailored to the group’s collective expertise. Participants were instructed to take their time answering each question, moving on once they had reached consensus or found that they were making no further progress. If a group did not complete all 20 questions, the remaining questions were excluded from analysis. Participants were allowed to use any internet-based resources except LLMs. See the Supplementary Materials for details.
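
The group-composition rule can be checked mechanically. The Python sketch below is illustrative only: the expert labels and the particular assignment are hypothetical (the actual groups are not listed here), and the function names are assumptions. It verifies that every group has five members and that no two groups share more than two.

```python
from itertools import combinations


def max_pairwise_overlap(groups):
    """Largest number of members shared by any two groups."""
    return max(len(set(a) & set(b)) for a, b in combinations(groups, 2))


def is_valid_design(groups, group_size=5, max_shared=2):
    """Check group sizes and the cap on members shared between two groups."""
    sizes_ok = all(len(set(g)) == group_size for g in groups)
    return sizes_ok and max_pairwise_overlap(groups) <= max_shared


# Hypothetical assignment of 14 experts (E01..E14) to five groups of five.
groups = [
    ["E01", "E02", "E03", "E04", "E05"],
    ["E01", "E02", "E06", "E07", "E08"],
    ["E03", "E06", "E09", "E10", "E11"],
    ["E04", "E07", "E09", "E12", "E13"],
    ["E05", "E08", "E10", "E12", "E14"],
]

experts = set().union(*groups)
print(len(experts), "experts;", "valid design:", is_valid_design(groups))
```

Five groups of five fill 25 seats, so with 14 experts some people must sit in more than one group; the overlap cap limits how much any two sessions depend on the same individuals.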
