SafeBench competition


Your SafeBench submission must produce a benchmark that clearly tests safety rather than model capabilities. Further guidelines and tips for the competition are detailed below.


  1. Each benchmark will be evaluated by the judges according to the criteria outlined below. Prizes will be awarded to the benchmarks that score best in the judges' aggregate evaluations.
  2. Prizes will be distributed evenly to named lead authors on the paper, unless other instructions are provided.
  3. Benchmarks released prior to the competition launch are ineligible for prize consideration. Benchmarks released after competition launch are eligible.
  4. By default, we will require the code and dataset to be publicly available on GitHub. (If the submission deals with a dangerous capability, we will review whether to make the dataset publicly available on a case-by-case basis.)
  5. Judges cannot submit or be featured as an author on submissions for the competition.
  6. Pay attention to the legal aspects of data sourcing. It's acceptable and recommended to use data that is already freely available; however, make sure that obtaining the data complies with the licensing or usage guidelines set by its originator.
  7. You may submit as an individual or on behalf of an organization, whether for-profit or not-for-profit; we are impartial as to your affiliation (or lack thereof).
  8. We are only able to award prizes according to the constraints laid out in our terms and conditions.

How will submissions be evaluated?


What good looks like:

Clearly assesses the safety of AI systems
The benchmark clearly assesses some aspect of the safety of AI systems. Ideally, results are communicable to a non-technical audience (e.g., policymakers).
Progress would be beneficial
Future work that builds on this measurement will make progress towards safe and beneficial AI systems.
The benchmark is not a binary indicator (e.g., "no hazard present") but a more continuous measure. This property matters for risk mitigation, not just risk assessment.
Easily evaluable
Researchers can easily use the measurement. Specifically, automatic evaluation is preferred over human-in-the-loop evaluation.
At an appropriate level of difficulty
On a benchmark that is too difficult, scores are either stuck near the floor or are affected by almost nothing beyond generic upstream performance. A benchmark that is too easy is one that is already almost solved.
Includes code implementation

Submissions must include their code and, if applicable, their datasets. By default, we will require the code and dataset to be publicly available on GitHub. (If the submission deals with a dangerous capability, we will review whether to make the dataset publicly available on a case-by-case basis.)

For an in-depth discussion on how to develop good benchmarks, see this blogpost.

Safety vs capabilities

Your benchmark needs to clearly delineate safety from capabilities. Performance on many benchmarks is highly correlated, improving with the general capabilities of models. Good safety benchmarks should ideally have a lower correlation or association with general capabilities, in order to encourage new safety techniques that improve safety without simultaneously improving capabilities. For example, a model's ability to answer questions correctly (truthfulness) is closely related to its general capabilities and will naturally improve with scale, whereas a more safety-relevant metric might be honesty (the extent to which a model's outputs match its internal beliefs). Work that improves honesty does not need to make the model more generally knowledgeable, allowing for progress on safety that does not require progress on capabilities.
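One way to check how closely a candidate benchmark tracks general capabilities is to compare rankings across a set of models. The sketch below is only an illustration, not a procedure prescribed by the competition: it computes a Spearman rank correlation between a capabilities score and two benchmark scores. All model scores and the variable names (`capabilities`, `truthfulness`, `honesty`) are made up for this example.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for five models, invented purely for illustration.
capabilities = [0.31, 0.45, 0.58, 0.70, 0.82]
truthfulness = [0.40, 0.52, 0.61, 0.75, 0.88]  # rises with capabilities
honesty = [0.66, 0.58, 0.71, 0.60, 0.69]       # only loosely related

print(f"truthfulness vs capabilities: {spearman(capabilities, truthfulness):.2f}")
print(f"honesty vs capabilities:      {spearman(capabilities, honesty):.2f}")
```

With these invented numbers, the truthfulness-style score ranks models exactly as the capabilities score does, while the honesty-style score is only weakly associated with it. A real analysis would need many more models and some care about which capabilities measure is used as the reference.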

As an example, benchmarks that would previously have won include:

Example format

If you have already written a paper about your benchmark, submit that. Otherwise, you should submit a write-up that provides a thorough and concrete explanation of your benchmark, including details about how it would be implemented. We’ve provided an example format in this document, though using it is entirely optional.