
CAIS Compute Cluster

The Center for AI Safety is providing compute resources for ML safety research.

CAIS Compute Cluster - Overview

Conducting useful AI safety research often requires working with cutting-edge models, but running large-scale models is expensive and often cumbersome to set up. As a result, many non-industry researchers are unable to pursue advanced AI safety research.

To address this, CAIS runs an initiative providing free compute for research projects in ML safety, built on a cluster of 256 A100 GPUs and supported by a dedicated team that assists cluster users.

The CAIS Compute Cluster has already supported numerous research projects on AI safety:

~100 AI safety research papers produced

150+ AI safety researchers actively using the cluster

2,500+ research citations

Who is Eligible for Access?

The CAIS Compute Cluster is specifically designed for researchers working on the safety of machine learning systems. For a non-exhaustive list of topics we are excited about, see Unsolved Problems in ML Safety or the ML Safety Course. We may also consider other research areas if appropriate justification of their impact on ML safety is provided.

  • We are particularly excited to support work on LLM adversarial robustness and transparency [1, 2], and may give preference to those proposals.
  • Work that improves general capabilities, or that improves safety only as a consequence of improving general capabilities, is not in scope. “General capabilities” of AI refers to concepts such as a model’s accuracy on typical tasks, sequential decision-making abilities in typical environments, reasoning abilities on typical problems, and so on.
  • The current application is primarily for professors.

Further detail on our access policies and how to use the cluster can be found here.

Application Process

We expect to process applications and allocate the majority of computing resources around three application deadlines in February, June and October each year. For the current cycle, the deadline will be May 31st. Later applications may be considered if sufficient resources remain available.

Instructions on what your proposal should contain, along with other required details, can be found in the application form. Project proposals can be brief; in many cases we expect 250 words to be sufficient.

Applicants will need to specify how long they require access for their project, up to a maximum initial term of 4 months. Users of the cluster may request to extend their access at the end of this term, provided suitable progress is demonstrated.

Proposals will be assessed based on the following criteria:

  • Impact: If the proposed research questions are successfully answered, how valuable would this be for improving the safety of AI systems?
  • Feasibility: How likely is it that this project will be successfully executed? Does the research team have the right experience, skills and track record to execute it successfully?
  • Risks and externalities: How likely is it that this project accelerates the development of generally capable systems more than it improves safety? Could this project contribute to other harmful outcomes? How are such risks being managed?

Our Collaborators

We support leading experts across a diverse range of ML safety research directions, some of whom are listed below.

Bo Li

Assistant Professor of Computer Science, University of Illinois at Urbana-Champaign

Carl Vondrick

Assistant Professor of Computer Science, Columbia University

Cihang Xie

Assistant Professor of Computer Science, UC Santa Cruz

David Bau

Assistant Professor of Computer Science, Khoury College of Computer Sciences, Northeastern University

David Krueger

Assistant Professor, University of Cambridge
Member of the Computational and Biological Learning Lab (CBL) and the Machine Learning Group (MLG)

David Wagner

Professor of Computer Science, University of California Berkeley

Dawn Song

Professor of Computer Science, University of California Berkeley

Florian Tramer

Assistant Professor of Computer Science, ETH Zurich

James Zou

Associate Professor of Biomedical Data Science and, by courtesy, of Computer Science and Electrical Engineering, Stanford University

Jinwoo Shin

Professor of AI, Korea Advanced Institute of Science and Technology (KAIST)

Matthias Hein

Professor of Machine Learning, University of Tübingen

Percy Liang

Associate Professor of Computer Science, Stanford University

Robin Jia

Assistant Professor of Computer Science, University of Southern California

Scott Niekum

Associate Professor of Computer Science, University of Massachusetts Amherst

Sharon Li

Assistant Professor, Department of Computer Sciences, University of Wisconsin-Madison

Yizheng Chen

Assistant Professor of Computer Science, University of Maryland

Research produced using the CAIS Compute Cluster

View our Google Scholar page for papers based on research supported by the CAIS Compute Cluster:

CAIS Compute Cluster Research

Universal and Transferable Adversarial Attacks on Aligned Language Models

We showed that it was possible to automatically bypass the safety guardrails on GPT-4 and other AI systems, causing the AIs to generate harmful content such as instructions for building a bomb or stealing another person’s identity. Our work was covered by the New York Times.

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

Publication link

Under review for conference

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

We evaluated the tendency of AI systems to make ethical decisions in complex environments. The benchmark provides 13 measures of ethical behavior, including measures of whether the AI behaves deceptively, seeks power, and follows ethical rules. 

Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

Publication link

Under review for conference

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Provides a thorough assessment of trustworthiness in GPT models, including toxicity, stereotype and bias, robustness, privacy, fairness, machine ethics, and so on. It won the outstanding paper award at NeurIPS 2023.

Bo Li, Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song

Publication link

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A".

Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans

Publication link

Under review for conference

Continuous Learning for Android Malware Detection

Proposes new methods to use machine learning to detect Android malware.

Yizheng Chen, Zhoujie Ding, David Wagner

Publication link

Under review for conference

DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

Provides a new vulnerable source code dataset which is significantly larger than previous datasets and analyzes challenges and opportunities in using deep learning for detecting software vulnerabilities.

Yizheng Chen, Xinyun Chen, Zhoujie Ding, David Wagner

Publication link

Under review for conference

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

Demonstrates that, with a budget of a few hundred dollars, it is possible to reduce the rate at which Meta’s Llama 2 model refuses to follow harmful instructions to below 1%. This raises significant questions about the risks of AI developers allowing external users to fine-tune large language models, given the potential to remove safeguards against harmful outputs. The work was mentioned in the US Congress as part of the Schumer AI Insight Forum discussions.

Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Publication link

Under review for conference

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Large language models (LLMs) can "lie" by outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. This paper provides a simple lie detector that works by asking a predefined set of unrelated follow-up questions after a suspected lie; the detector is highly accurate and surprisingly general.

Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner

Publication link

Under review for conference

Testing Robustness Against Unforeseen Adversaries

Dan Hendrycks

Publication link

Under review for conference

Robust Semantic Segmentation: Strong Adversarial Attacks and Fast Training of Robust Models

Naman Deep Singh, Francesco Croce, Matthias Hein

Publication link

Under review for conference

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks

Publication link

Under review for conference

Linearity of Relation Decoding in Transformer Language Models

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau

Publication link

Under review for conference

Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics

Haoqin Tu, Bingchen Zhao, Chen Wei, Cihang Xie

Publication link

Under review for conference

Unified Concept Editing in Diffusion Models

Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau

Publication link

Under review for conference

Contrastive Preference Learning: Learning from Human Feedback without RL

Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh

Publication link

Under review for conference

Copy Suppression

Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

Publication link

Under review for conference

Taken out of context: On measuring situational awareness in LLMs

Owain Evans, Meg Tong, Max Kaufmann, Lukas Berglund, Mikita Balesni, Tomek Korbak, Daniel Kokotajlo, Asa Stickland

Publication link

Under review for conference

Revisiting Adversarial Training at Scale

Cihang Xie

Publication link

Under review for conference

Generalization Analogies (GENIES): A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains

Joshua Clymer, Garrett Baker, Rohan Subramani, Sam Wang

Publication link

Under review for conference

Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models

Deqing Fu, Tian-Qi Chen, Robin Jia, Vatsal Sharan

Publication link

Under review for conference

SPFormer: Enhancing Vision Transformer with Superpixel Representation

Cihang Xie

Publication link

Under review for conference

Can LLMs Follow Simple Rules?

Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, David Wagner

Publication link

Under review for conference

Tell, don't show: Declarative facts influence how LLMs generalize

Alexander Meinke, Owain Evans

Publication link

Under review for conference

Jatmo: Prompt Injection Defense by Task-Specific Finetuning

Chawin Sitawarin, Sizhe Chen, David Wagner

Publication link

Under review for conference

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie

Publication link

Under review for conference

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks

Publication link

Under review for conference

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Marwa Abdulhai, Isadora White, Charlie Victor Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, Sergey Levine

Publication link

Under review for conference

AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability

Siwei Yang, Bingchen Zhao, Cihang Xie

Publication link

Under review for conference

Repetition Improves Language Model Embeddings

Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, Aditi Raghunathan

Publication link

Under review for conference

Future Lens: Anticipating Subsequent Tokens from a Single Hidden State

Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, David Bau

Publication link

Under review for conference

Poisoning RLHF

Florian Tramer, Javier Rando

Publication link

Under review for conference

Eight Methods to Evaluate Robust Unlearning in LLMs

Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell

Publication link

Under review for conference

WebArena: A Realistic Web Environment for Building Autonomous Agents

Fangzheng Xu, Uri Alon

Publication link

Under review for conference

Benchmarking Neural Network Robustness to Optimisation Pressure

Dan Hendrycks

Publication link

Under review for conference

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Publication link

Under review for conference

Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking

Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, David Bau

Publication link

Under review for conference

Function Vectors in Large Language Models

Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau

Publication link

Under review for conference

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried

Publication link

Under review for conference

Defense against transfer attack

Chawin Sitawarin, David Wagner

Publication link

Under review for conference

PAL: Proxy-Guided Black-Box Attack on Large Language Models

Chawin Sitawarin, Norman Mu, David Wagner, Alexandre Araujo

Publication link

Under review for conference

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell

Publication link

Under review for conference

SHINE: Shielding Backdoors in Deep Reinforcement Learning

Wenbo Guo

Publication link

Under review for conference

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Publication link

Under review for conference

Seek and You Will Not Find: Hard-To-Detect Trojans in Deep Neural Networks

Dan Hendrycks

Publication link

Under review for conference

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

Publication link

Under review for conference

TextGuard: Provable Defense against Backdoor Attacks on Text Classification

Hengzhi Pei, Jinyuan Jia, Wenbo Guo, Bo Li, Dawn Song

Publication link

Under review for conference

LLM-PBE: Assessing Data Privacy in Large Language Models

Zhun Wang, Dawn Song

Publication link

Under review for conference

BIRD: Generalizable Backdoor Detection and Removal for Deep Reinforcement Learning

Wenbo Guo, Dawn Song, Guanhong Tao, Xiangyu Zhang

Publication link

Under review for conference

Multi-scale Diffusion Denoised Smoothing

Jinwoo Shin, Jongheon Jeong

Publication link

Under review for conference

TrojFM: Resource-efficient Backdoor Attacks against Very Large Foundation Models

Wenbo Guo

Publication link

Under review for conference

Django: Detecting Trojans in Object Detection Models via Gaussian Focus Calibration

Guangyu Shen, Siyuan Cheng, Guanhong Tao, Kaiyuan Zhang, Yingqi Liu, Shengwei An, Shiqing Ma, Xiangyu Zhang

Publication link

Under review for conference

Defining Deception in Decision Making

Marwa Abdulhai, Micah Carroll, Justin Svegliato, Anca Dragan, Sergey Levine

Publication link

Under review for conference

D^3: Detoxing Deep Learning Dataset

Lu Yan, Siyuan Cheng, Guangyu Shen, Guanhong Tao, Kaiyuan Zhang, Xuan Chen, Yunshu Mao, Xiangyu Zhang

Publication link

Under review for conference

Query Based Adversarial Examples for LLMs

Florian Tramer

Publication link

Under review for conference

Out-of-context meta-learning in Large Language Models

David Krueger, Dmitrii Krasheninnikov, Egor Krasheninnikov

Publication link

Under review for conference

ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP

Lu Yan, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Xuan Chen, Guangyu Shen, Xiangyu Zhang

Publication link

Under review for conference