CS 362: Research in AI Alignment

Course Information (Spring 2023)

  • Instructor: Scott Viteri

  • Teaching Assistants: Victor Lecomte, Gabe Mukobi, Peter Chatain

  • Course Faculty Sponsor: Clark Barrett

  • Lecture: Monday 3:00-4:30pm, B067 Mitchell Earth Science (In-person only)

  • Optional Office Hours: Large Language Model productivity meetings, 1-2 PM in Gates 200

  • Graduate-level course, open to advanced undergraduates (contact the course instructor)

  • 3 Units, Spring 2023, ExploreCourses

Course Description (Fall 2024)

In this course we will explore the current state of research in the field of AI alignment, which seeks to bring increasingly intelligent AI systems in line with human values and interests. As energy in the AI alignment landscape has increasingly focused on political considerations, we want to create a space to discuss which direction the field should be pointing in, now that we have a better idea of what AI scaling will look like in the near future. This is a philosophical task, and we will invite several speakers of a philosophical persuasion, but we also find that many of the most relevant philosophical questions cannot be asked without strong technical familiarity with the specifics of language models and reinforcement learning. The format will consist of weekly lectures in which speakers present their relationship to the alignment problem and their current research approaches.

Before each speaker, we will assign corresponding readings along with some form of active engagement with the material: we will accept a blog post responding to the ideas in the readings, but we encourage Jupyter notebooks that engage with the technical material directly. This course therefore requires research experience, preferably with mathematical and programming tools (e.g. Python, PyTorch, calculus). It is a graduate-level course, open to advanced undergraduates.

Course Description (Spring 2023)

In this course we will explore the current state of research in the field of AI alignment, which seeks to bring increasingly intelligent AI systems in line with human values and interests. The purpose of this course is to encourage the development of new ideas in this field, where a dominant paradigm has not yet been established. The format will be weekly lectures in which speakers present their current research approaches.

The assignment structure will be slightly unusual: each week students will choose between a problem set and a short research assignment based on the weekly guest speaker's research area. For the research assignment, students will start with the abstract of a relevant AI alignment paper or blog post and create a blog post or GitHub repository describing how they would continue the work. The final weekly assignment will be an extension of one of the previous weeks' work. This course therefore requires research experience, preferably with mathematical and programming tools (e.g. Python, PyTorch, calculus). It is a graduate-level course, open to advanced undergraduates.

Prerequisites:

Any one of the following: CS223a, CS224n, CS224w, CS224u, CS227b, CS228, CS229, CS229t, CS229m, CS231a, CS231n, CS234, CS236, CS237a, CS281

In addition to the above, a strong ability to do independent research is necessary, preferably with mathematical and programming tools (e.g. Python, PyTorch, calculus).

Fall 2024 Schedule

Week Date Speaker Topic
1 09/24/24 Scott Viteri Course and Alignment Landscape Overview
2 10/01/24 Joe Carlsmith Otherness and Control in the Age of AGI
3 10/08/24 Joscha Bach Life, Intelligence, Consciousness, AI & the Future of Humans
4 10/15/24 Jesse Hoogland Singular Learning Theory
5 10/22/24 Jack Lindsey Towards Monosemanticity
6 10/29/24 Rachel Freedman Reinforcement Learning and Value Alignment
8 11/12/24 Dan Hendrycks California Senate Bill 1047
9 11/19/24 Jessica Taylor Fixed Points in AI Alignment
11 12/03/24 Student Presentations Course Takeaways

Note: Weeks 7, 10, and 12 are skipped for Election Day, Thanksgiving, and Finals Week, respectively.

Fall 2024 Assignment Structure

For each speaker, we will assign corresponding online material to engage with. We expect the readings to take up to two hours. For each reading, there will be a corresponding written assignment.

We are open to many formats of written assignment, as long as we feel it represents at least two hours' worth of effort and shows understanding of the relevant material. Sample acceptable formats include:

  • Blog post / book report style – Write about related work, a summary, and some novel analysis of the reading content
  • Respond to the readings with a small research project of your own – e.g. a GitHub repo, a Google Colab notebook, an Overleaf document, or a small interactive JavaScript demo

We will drop the lowest score, and the course score will be the average of the remaining assignment scores. Each assignment is due at 4 PM PT, 30 minutes before the start of class. Note that the readings and assignment corresponding to a given speaker are due before their lecture, so that we can engage with the speaker more productively as a class.
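As a concrete illustration of the grading rule above, here is a minimal Python sketch; the function name, the 0-100 scale, and the example scores are hypothetical and only meant to show the drop-lowest-then-average computation.

```python
def course_score(assignment_scores):
    """Drop the single lowest assignment score and average the rest."""
    scores = sorted(assignment_scores)
    remaining = scores[1:] if len(scores) > 1 else scores  # drop the lowest score
    return sum(remaining) / len(remaining)

# Hypothetical example: eight assignment scores on a 0-100 scale.
print(course_score([92, 85, 60, 78, 88, 95, 90, 100]))  # the 60 is dropped; prints ~89.7
```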

Fall 2024 Reading Schedule

Week Due Reading/Material
2 10/01/24 Gentleness and the Artificial Other and Loving a World You Don't Trust
3 10/08/24 AGI Series 2024 - Joscha Bach: Is Consciousness a Missing Link to AGI?
4 10/15/24 Neural Networks Generalize Because of This One Weird Trick and Distilling Singular Learning Theory
5 10/22/24 Towards Monosemanticity
6 10/29/24 Choice Set Misspecification in Reward Inference
8 11/12/24 SB 1047: Safe and Secure Innovation for Frontier Artificial Intelligence Models Act
9 11/19/24 Reflective Oracles, Logical Induction, and The Obliqueness Thesis
11 12/03/24 Previous Course Material

Spring 2023 Schedule

Date Week Speaker Topic Suggested Assignment Prompts
April 3 (Mon) 1 Scott Viteri Overview of Course and AI Safety Bowman 2022, Steinhardt 2022, Carlsmith 2022, Gates 2022
April 10 2 Adam Gleave (UC Berkeley) Inverse Reinforcement Learning Gleave, Toyer 2022, Gleave 2022
April 17 3 Andrew Critch (UC Berkeley) Multiagent Problems Critch 2019, Fickinger, Zhuang, Critch et al 2020, Garrabrant, Critch et al 2016
April 24 4 Andy Jones (Anthropic) Empirical Alignment - Interpretability Askell 2021, Elhage, Nanda 2021
May 1 5 Dan Hendrycks (Center for AI Safety) Robustness and Generalization in AI Systems Hendrycks 2022, Hendrycks 2021a, Hendrycks 2021b
May 8 6 Alex Turner (UC Berkeley) Shard Theory Turner 2022, Pope, Turner 2022
May 15 7 Laria Reynolds (Conjecture) Empirical Alignment Research with LLMs Reynolds, McDonell 2021
May 22 8 John Wentworth (independent researcher) Agent Foundations and Abstractions Wentworth 2022a, Wentworth 2022b
May 29 9 Memorial Day (no class)
Jun 5 10 Evan Hubinger (Anthropic) Mesa-Optimization and Inner Alignment Hubinger, Mikulik et al 2019, Hubinger 2021