CAIS AI Safety Research

We pursue impactful research on a variety of topics in AI safety.

Guiding Principles

At the Center for AI Safety, our research focuses on mitigating high-consequence, societal-scale risks posed by AI.

We seek to develop foundational benchmarks and methods that can have a significant impact and be used by others in their own work. To ensure that our work differentially improves the safety of AI systems, we do not pursue research that improves safety only as a byproduct of improving a model’s underlying general capabilities. Through our work, we strive to solve the technical challenge at the heart of AI safety.

In addition to our technical research, we pursue conceptual research, examining AI safety from a multidisciplinary perspective and incorporating insights from safety engineering, complex systems, international relations, philosophy, and other fields. Through our conceptual research, we create frameworks that aid in understanding the current technical challenges, and we publish papers that provide insight into the societal risks posed by future AI systems.

Technical AI Research

Research that improves the safety of existing AI systems.

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

MACHIAVELLI

Alexander Pan*, Chan Jun Shern*, Andy Zou*, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

Forecasting Future World Events with Neural Networks

Forecasting

Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

Wellbeing

Mantas Mazeika*, Eric Tang*, Andy Zou, Steven Basart, Dawn Song, David Forsyth, Jacob Steinhardt, Dan Hendrycks

Dreamlike Pictures Comprehensively Improve Safety Measures

PixMix

Dan Hendrycks*, Andy Zou*, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, Jacob Steinhardt

Scaling Out-of-Distribution Detection for Real-World Settings

Scaling OOD

Dan Hendrycks*, Steven Basart*, Mantas Mazeika, Andy Zou, Joe Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, Dawn Song

What Would Jiminy Cricket Do? Towards Agents That Behave Morally

Jiminy Cricket

Dan Hendrycks*, Mantas Mazeika*, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

Many Faces

Dan Hendrycks, Steven Basart*, Norman Mu*, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer

Natural Adversarial Examples

Dan Hendrycks, Kevin Zhao*, Steven Basart*, Jacob Steinhardt, Dawn Song

Aligning AI With Shared Human Values

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt

Pretrained Transformers Improve Out-of-Distribution Robustness

Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song

A Simple Data Processing Method to Improve Robustness and Uncertainty

Dan Hendrycks*, Norman Mu*, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, Balaji Lakshminarayanan

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

Dan Hendrycks, Mantas Mazeika*, Saurav Kadavath*, Dawn Song

Using Pre-Training Can Improve Model Robustness and Uncertainty

Dan Hendrycks, Kimin Lee, Mantas Mazeika

Deep Anomaly Detection with Outlier Exposure

Dan Hendrycks, Mantas Mazeika, Thomas Dietterich

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

Dan Hendrycks, Thomas Dietterich

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

Dan Hendrycks, Kevin Gimpel

Conceptual Research

We approach AI safety from a multidisciplinary perspective.

Unsolved Problems in ML Safety

Conceptual Research

This paper provides a roadmap for ML Safety and refines the technical problems that the field needs to address. It presents four problems ready for research, namely withstanding hazards (“Robustness”), identifying hazards (“Monitoring”), steering ML systems (“Alignment”), and reducing deployment hazards (“Systemic Safety”). Throughout, it clarifies each problem’s motivation and provides concrete research directions.

Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt

X-Risk Analysis

Conceptual Research

This paper analyzes how to navigate tail risks, including speculative long-term risks. The discussion has three parts: applying concepts from hazard analysis and systems safety to make systems safer today, strategies for having long-term impacts on the safety of future systems, and improving the balance between safety and general capabilities.

Dan Hendrycks, Mantas Mazeika

Natural Selection Favors AIs over Humans

Conceptual Research

By analyzing the environment that is shaping the evolution of AIs, this paper argues that the most successful AI agents will likely have undesirable traits, as selfish species typically have an advantage over species that are altruistic to other species. The paper considers various interventions to counteract these risks and Darwinian forces. Resolving this challenge will be necessary to ensure that the development of artificial intelligence is a positive one.

Dan Hendrycks
