Research Projects

Guiding Principles

At the Center for AI Safety, our research focuses on mitigating high-consequence, societal-scale risks posed by AI.

We seek to develop foundational benchmarks and methods. To ensure that our work differentially improves the safety of AI systems, we do not pursue research that improves safety as a result of improving a model’s underlying general capabilities. Through our work, we strive to solve the technical challenge at the heart of AI safety.

In addition to our technical research, we pursue conceptual research, examining AI safety from a multidisciplinary perspective and incorporating insights from safety engineering, complex systems, international relations, philosophy, and other fields. Through our conceptual research, we create frameworks that aid in understanding the current technical challenges and publish papers that provide insight into the societal risks posed by future AI systems.

Technical AI Research

Research which improves the safety of existing AI systems.

Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark

Biosecurity

Jasper Götting, Pedro Medeiros, Jon G Sanders, Nathaniel Li, Long Phan, Karam Elabd, Lennart Justen, Dan Hendrycks, Seth Donoughe

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges

Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Humanity's Last Exam

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Tamper-Resistant Safeguards for Open-Weight LLMs

Improving Alignment and Robustness with Circuit Breakers

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Can LLMs Follow Simple Rules?

A Recipe for Improved Certifiable Robustness: Capacity and Data

Representation Engineering: A Top-Down Approach to AI Transparency

Universal and Transferable Adversarial Attacks on Aligned Language Models

Testing Robustness Against Unforeseen Adversaries

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Forecasting Future World Events with Neural Networks

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

Dreamlike Pictures Comprehensively Improve Safety Measures

Scaling Out-of-Distribution Detection for Real-World Settings

What Would Jiminy Cricket Do? Towards Agents That Behave Morally

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

Natural Adversarial Examples

Aligning AI With Shared Human Values

Pretrained Transformers Improve Out-of-Distribution Robustness

A Simple Data Processing Method to Improve Robustness and Uncertainty

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

Using Pre-Training Can Improve Model Robustness and Uncertainty

Deep Anomaly Detection with Outlier Exposure

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

Conceptual Research

Superintelligence Strategy

Unsolved Problems in ML Safety

X-Risk Analysis

Natural Selection Favors AIs Over Humans

An Overview of Catastrophic AI Risks

AI Deception: A Survey of Examples, Risks, and Potential Solutions
