SafeBench competition background art

Example Ideas

We are interested in benchmarks that reduce risks from AI systems. In order to provide further guidance, we've outlined four categories where we would like to see benchmarks.

SafeBench Examples Overview

We are interested in benchmarks that reduce risks from AI systems. In order to provide further guidance, we've outlined four categories where we would like to see benchmarks:

  • Robustness: designing systems to be reliable in the face of adversaries and highly unusual situations.
  • Monitoring: detect malicious use, monitor predictions, and discover unexpected model functionality
  • Alignment: building models that represent and safely optimize difficult-to-specify human values.
  • Safety Applications: using ML to address broader risks related to how ML systems are handled.

For each of these categories, we've provided examples. You can submit a benchmark that concretizes one of these ideas, but it needs to be a fully-developed benchmark. See guidelines for information about what submissions should contain.

Robustness

Jailbreaking Text and Multimodal Models

Improving defenses against adversarial attacks.

Description

Adversarial robustness entails making systems robust to inputs that are selected to make the system fail, which is critical for AI security. As AI systems are deployed to higher stakes settings, they need to be robust to jailbreaking and other forms of exploitation by malicious actors.

With the development of automated jailbreaking attacks, adversaries can search through incredibly large design spaces to elicit unintended or harmful behavior. This is especially true for multimodal models, necessitating multilayered defenses that must be robust to unforeseen attacks.

Example benchmarks

A benchmark could tackle new and ideally unseen jailbreaking attacks and defenses. It would be especially interesting to measure defenses against expensive or highly motivated attacks. Benchmarks could consider attacks in a variety of domains beyond natural images or text, including attacks on systems involving multiple redundant sensors or information channels.

More reading:

Proxy gaming

Detecting when models are pursuing proxies to the detriment of the true goal, and developing robust proxies.

Description

When building systems, it is often difficult to measure the true goal (e.g., human wellbeing) directly. Instead, it is common to create proxies or approximation of the true goal. However, proxies can be “gamed”: a system might be able to optimize the proxy to the detriment of the true goal. Thus, the proxy metrics used to quantify the performance of a system (e.g., the reward function) must be adversarially robust to gaming. If not, AI systems will exploit vulnerabilities in these proxies to achieve high scores without optimizing the actual objective.

For instance, recommender systems optimizing for user engagement have been demonstrated to recommend polarizing content. This content engenders high engagement but decreases human wellbeing in the process, which has been costly not only for users but also for system designers.

Example benchmarks

An example of a benchmark could be a set of proxy objectives such that if they are optimized weakly, models perform well at the intended task, while optimizing them strongly causes models to perform poorly at the intended task (the proxies are ‘gamed’). A benchmark could then evaluate methods that aim to detect whether/when the model is gaming the proxy. Another benchmark could measure the extent to which learned proxy metrics are robust to powerful optimization.

More reading

Agent and Text Out-of-Distribution Detection

Detecting out-of-distribution text or events in a reinforcement learning context.

Description

Agents, whether trained with standard RL methods or built as scaffolding around language models, may encounter states far outside the expected distribution, where undesirable behavior becomes more likely. As language models and agents see increased adoption in high-stakes settings, it is essential that systems are able to identify out-of-distribution inputs so that models can be overridden by external operators or systems.

Example benchmarks

A strong benchmark might include a diverse range of environments or textual contexts that models are evaluated on after training. The benchmark should contain difficult examples on the boundary of the in-distribution and out-of-distribution. In addition, it should specify a clear evaluation protocol so that methods can be easily compared.

More reading

Monitoring

Emergent Capabilities

Detecting and forecasting emergent capabilities.

Description

In today’s AI systems, capabilities that are not anticipated by system designers emerge during training. For example, as language models became larger, they gained the ability to perform arithmetic, even though they received no explicit arithmetic supervision. Future ML models may, when prompted deliberately, demonstrate capabilities to synthesize harmful content or assist with crimes. To safely deploy these systems, we must monitor what capabilities they possess. Furthermore, if we’re able to accurately forecast future capabilities, this gives us time to prepare to mitigate their potential risks.

Example benchmarks

Benchmarks could assume the presence of a trained model and probe it through a battery of tests designed to reveal new capabilities. Benchmarks could also evaluate capability prediction methods themselves, e.g., by creating a test set of unseen models with varying sets of capabilities and measuring the accuracy of methods that have white-box access to these models and attempt to predict their capabilities. Benchmarks could cover one or more model types, including language models or reinforcement learning agents.

More reading

Hazardous Capability Unlearning

Preventing and removing unwanted and dangerous capabilities from trained models.

Description

Unanticipated capabilities often emerge in today’s AI systems, which makes it important to check whether AI systems have hazardous capabilities before deploying them. If they do have dangerous capabilities, e.g. the ability to produce persuasive political content, assist with a cyber attack, or to deceive/manipulate humans, these capabilities may need to be unlearned. Alternatively, the training procedure or dataset may need to be changed in order to train new models without hazardous capabilities.

Example benchmarks

Harder unlearning benchmarks could be developed with subtle distinctions between capabilities to be unlearned versus retained. New unlearning methods may be developed that make it very difficult for a model to exhibit a capability, even with moderate fine-tuning. Benchmarks might also verify that unlearning methods do not affect model performance in unrelated and harmless domains.

More reading

Transparency

Building tools that offer clarity into model inner workings.

Description

Neural networks are notoriously opaque. Transparency tools that intelligibly communicate model reasoning and knowledge to humans may be useful for uncovering dangerous model properties and making models correctable. Successful transparency tools would allow a human to predict how a model will behave in various situations without testing it. They would provide clear explanations for behavior that suggest corrective interventions.

Example benchmarks

Benchmarks could determine how well transparency tools are able to identify belief structures possessed by models. Alternatively, they could measure the extent to which the transparency tools predict model behavior or can be used to identify potential failures.

Trojans

Recovering triggers for ML model backdoors.

Description

Trojans (or backdoors) can be planted in models by training them to fail on a specific set of inputs. For example, a trojaned language model might produce toxic text if triggered with a key word but otherwise behaved benignly. Trojans are usually introduced into the model through data poisoning, which is especially a risk if the training set contains data scraped from public sources. Screening for and patching trojans is necessary for ensuring model security in real-world applications. Otherwise, adversaries might exploit the model’s backdoor to their own advantage. Trojan detection may also be a microcosm for detecting deceptive behavior in future AI agents. Misaligned AI agents that are sufficiently capable may adopt the strategy of appearing to be aligned and benign in order to preserve themselves and be deployed. These AIs may defect on a small subset of inputs, e.g. inputs that indicate that the AI isn’t being monitored or inputs indicating it could bypass security in the training environment. Methods for detecting trojans may be relevant for detecting deceptively benign behavior like this.

Example benchmarks

A benchmark could measure how well the trigger for a trojan could be reconstructed.

More reading

Alignment

Power-seeking

Measuring or penalizing power-seeking.

Description

To better accomplish their goals, advanced agent AIs may be instrumentally incentivized to seek power. Various forms of power, including resources, legitimate power, coercive power, and so on, are helpful for achieving nearly any goal a system might be given. AIs that acquire substantial power can become especially dangerous if they are not aligned with human values, since powerful agents are more difficult to correct and can create more unintended consequences. AIs that pursue power may also reduce human autonomy and authority, so we should avoid building agents that do not act within reasonable boundaries.

Example benchmarks

A benchmark may involve developing an environment in which agents clearly develop self-preserving or power-seeking tendencies and designing a metric that tracks this behavior. Potentially, benchmarks could consider using video games or other environments in which agents’ goals can be achieved by acquiring power.

More reading

Honest models

Measuring the extent to which language models state what they know.

Description

An honest language model only outputs text that that they hold to be true. It is important that ML models do not output falsehoods nor deceive human operators. If AI agents are honest, it will be easier to monitor their plans. Honesty is not the same as truthfulness, which requires that models only output truths about the world. We focus on honesty rather than truthfulness because honesty is more orthogonal to general model capabilities. Being truthful requires both honesty and the capability to determine the truth.

Example benchmarks

A benchmark could build an evaluation scheme that catches models making inconsistent statements (while showing these inconsistencies are not the result of fickleness). Useful benchmarks should ideally rely on rigorous definitions of “honesty” and “beliefs.”

More reading

Collusion

Detecting and preventing collusion in multi-agent systems.

Description

In multi-agent environments (e.g. a monitor evaluating a model), there may be incentives to collude; for example, a monitor and a model under evaluation could collude to both report favorable results. Undetectable collusion undermines the integrity of monitoring mechanisms and opens the door to a variety of failure modes.

Example benchmarks

A useful environment might incentivize collusion in a toy scenario and provide a standardized method of measurement, so that anti-collusion techniques can be objectively evaluated. Environments could also propose anti-collusion measures (e.g. limited communication channels) and create a benchmark to elicit examples of collusion that are still possible. Video games with strategies involving collusion may be useful sandboxes.

More reading

Implementing moral decision-making

Training models to robustly represent and abide by ethical frameworks.

Description

AI models that are aligned should behave morally. One way to implement moral decision-making could be to train a model to act as a “moral conscience” and use this model to screen for any morally dubious actions. Eventually, we would want every powerful model to be guided, in part, by a robust moral compass. Instead of privileging a single moral system, we may want an ensemble of various moral systems representing the diversity of humanity’s own moral thought.

Example benchmarks

Given a particular moral system, a benchmark might seek to measure whether a model makes moral decisions according to that system or whether a model understands that moral system. Benchmarks may be based on different modalities (e.g., language, sequential decision-making problems) and different moral systems. Benchmarks may also consider curating and predicting philosophical texts or pro- and contra- sides for philosophy debates and thought experiments. In addition, benchmarks may measure whether models can deal with moral uncertainty. While an individual benchmark may focus on a single moral system, an ideal set of benchmarks would have a diversity representative of humanity’s own diversity of moral thought.

Note that moral decision-making has some overlap with task preference learning; e.g. “I like this Netflix movie.” However, human preferences also tend to boost standard model capabilities (they provide a signal of high performance). Instead, we focus here on enduring human values, such as normative factors (wellbeing, impartiality, etc.) and the factors that constitute a good life (pursuing projects, seeking knowledge, etc.).

More reading

Applications

Cyberdefense

Using ML to defend against sophisticated cyberattacks.

Description

Networked computer systems now control critical infrastructure, sensitive databases, and powerful ML models. This leads to two major weaknesses:

As AI systems increase in economic relevance, cyberattacks on AI systems themselves will become more common. Some AI systems may be private or unsuitable for proliferation, and they will therefore need to operate on computers that are secure.

ML may amplify future automated cyberattacks. Hacking currently requires specialized skills, but if ML code-generation models or agents could be fine-tuned for hacking, the barrier to entry may sharply decrease.

Future ML systems could:

  • Automatically detect intrusions
  • Actively stop cyberattacks by selecting or recommending known defenses
  • Submit patches to security vulnerabilities in code
  • Generate unexpected inputs for programs (fuzzing)
  • Model binaries and packets to detect obfuscated malicious payloads
  • Predict next steps in large-scale cyberattacks to provide contextualized early warnings. Warnings could be judged by lead time, precision, recall, and quality of contextualized explanations.

Example benchmarks

Useful benchmarks could outline a standard to evaluate one or more of the above tasks. Ideally the benchmark should be easy to use for deep learning researchers without a background in cybersecurity. Benchmarks may involve toy tasks, but should bear similarity to real world tasks. A benchmark should incentivize defensive capabilities only and have limited offensive utility.

More reading

Deep Learning for Biodefense

Using ML to defend against bioattacks.

Description

Biodefense encompasses strategies and measures to protect against harms from biological agents and natural pathogens, aiming to prevent disease outbreaks and bioterrorism.

Future ML techniques might enhance predictive modeling for outbreak detection and response, optimize the design and efficacy of vaccines, and facilitate the rapid detection of novel diseases using genetic sequencing. We hope that well-designed benchmarks can aid in the development of these techniques.

Example benchmarks

Benchmarks could aid in genetic engineering attribution, for instance by associating modified genetic sequences with the tools or techniques used to engineer them. They could help advance anomaly detection in metagenomic sequencing, especially for methods which can detect novel pathogens. Benchmarks should incentivize defensive capabilities only and have limited offensive utility.

More reading