We seek to develop foundational benchmarks and methods that have a significant impact and that others can use in their own work. To ensure that our work differentially improves the safety of AI systems, we do not pursue research that improves safety only as a byproduct of improving a model’s underlying general capabilities. Through our work, we strive to solve the technical challenges at the heart of AI safety.
In addition to our technical research, we pursue conceptual research, examining AI safety from a multidisciplinary perspective and incorporating insights from safety engineering, complex systems, international relations, and philosophy, among other fields. Through our conceptual research, we create frameworks that aid in understanding current technical challenges and publish papers that provide insight into the societal risks posed by future AI systems.
Research that improves the safety of existing AI systems.
Alexander Pan*, Chan Jun Shern*, Andy Zou*, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Mantas Mazeika*, Eric Tang*, Andy Zou, Steven Basart, Dawn Song, David Forsyth, Jacob Steinhardt, Dan Hendrycks
/
Dan Hendrycks*, Andy Zou*, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, Jacob Steinhardt
/
Dan Hendrycks*, Steven Basart*, Mantas Mazeika, Andy Zou, Joe Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, Dawn Song
/
Dan Hendrycks*, Mantas Mazeika*, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt
/
Dan Hendrycks, Steven Basart*, Norman Mu*, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer
/
Dan Hendrycks, Kevin Zhao*, Steven Basart*, Jacob Steinhardt, Dawn Song
/
Dan Hendrycks*, Norman Mu*, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, Balaji Lakshminarayanan
/
Dan Hendrycks, Mantas Mazeika*, Saurav Kadavath*, Dawn Song
/
Dan Hendrycks, Kimin Lee, Mantas Mazeika
/
Dan Hendrycks, Thomas Dietterich
/
Dan Hendrycks, Kevin Gimpel
We approach AI safety from a multidisciplinary perspective.
This paper provides a roadmap for ML Safety and refines the technical problems the field needs to address. It presents four problems ready for research: withstanding hazards (“Robustness”), identifying hazards (“Monitoring”), steering ML systems (“Alignment”), and reducing deployment hazards (“Systemic Safety”). For each problem, it clarifies the motivation and provides concrete research directions.
Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt
This paper analyzes how to navigate AI tail risks, including speculative long-term risks. The discussion covers three parts: applying concepts from hazard analysis and systems safety to make today’s systems safer, strategies for having a long-term impact on the safety of future systems, and improving the balance between safety and general capabilities.
Dan Hendrycks, Mantas Mazeika
By analyzing the environment shaping the evolution of AIs, this paper argues that the most successful AI agents will likely have undesirable traits, as selfish species typically have an advantage over species that behave altruistically toward others. The paper considers various interventions to counteract these risks and Darwinian forces. Resolving this challenge will be necessary to ensure that the development of artificial intelligence is a positive one.
Dan Hendrycks