CS 362: Research in AI Alignment

Course Information

  • Instructor Name(s): Scott Viteri

  • Teaching Assistants: Victor Lecomte, Gabe Mukobi, Peter Chatain

  • Course Faculty Sponsor: Clark Barrett

  • Lecture: Monday 3:00-4:30pm, B067 Mitchell Earth Science (In-person only)

  • Optional Office Hours: Large Language Model productivity meetings at 1-2PM in Gates 200

  • Graduate-level course, open to advanced undergraduates (contact course instructor)

  • 3 Units, Spring 2023, ExploreCourses

AI Alignment

Course Description

In this course we will explore the current state of research in the field of AI alignment, which seeks to bring increasingly intelligent AI systems in line with human values and interests. The purpose of this course is to encourage the development of new ideas in this field, where a dominant paradigm has not yet been established. The format will be weekly lectures in which speakers present their current research approaches.

The assignment structure will be slightly unusual: each week students will have a choice between a problem set and a short research assignment based on the weekly guest speaker’s research area. For the research assignment, students will start with the abstract of a relevant AI alignment paper or blog post and create a blog post or GitHub repository describing how they would continue the paper. The final weekly assignment will be an extension of one of the previous weeks’ work. Therefore, this course requires research experience, preferably using mathematical and programming tools (e.g. Python, PyTorch, calculus), and is a graduate-level course, open to advanced undergraduates.

Prerequisites:

Any one of the following: CS223a, CS224n, CS224w, CS224u, CS227b, CS228, CS229, CS229t, CS229m, CS231a, CS231n, CS234, CS236, CS237a, CS281

In addition to the above, a strong ability to do independent research is necessary, preferably using mathematical and programming tools (e.g. Python, PyTorch, calculus).

Syllabus

Date | Week | Name | Topic | Suggested Assignment Prompt
April 3 (Mon) | 1 | Scott Viteri | Overview of Course and AI Safety | Bowman 2022, Steinhardt 2022, Carlsmith 2022, Gates 2022
April 10 | 2 | Adam Gleave (UC Berkeley) | Inverse Reinforcement Learning | Gleave, Toyer 2022, Gleave 2022
April 17 | 3 | Andrew Critch (UC Berkeley) | Multiagent problems | Critch 2019, Fickinger, Zhuang, Critch et al 2020, Garrabrant, Critch et al 2016
April 24 | 4 | Andy Jones (Anthropic) | Empirical alignment - interpretability | Askell 2021, Elhage, Nanda 2021
May 1 | 5 | Dan Hendrycks (Center for AI Safety) | Robustness and Generalization in AI Systems | Hendrycks 2022, Hendrycks 2021a, Hendrycks 2021b
May 8 | 6 | Alex Turner (UC Berkeley) | Shard theory | Turner 2022, Pope, Turner 2022
May 15 | 7 | Laria Reynolds (Conjecture) | Empirical alignment research with LLMs | Reynolds, McDonell 2021
May 22 | 8 | John Wentworth (independent researcher) | Agent Foundations and Abstractions | Wentworth 2022a, Wentworth 2022b
May 29 | 9 | Memorial Day (no class) | |
June 5 | 10 | Evan Hubinger (Anthropic) | Mesa-Optimization and Inner Alignment | Hubinger, Mikulik et al 2019, Hubinger 2021