CS 362: Research in AI Alignment

Mar 1, 2023

Course Information (Fall 2024)

Instructor: Scott Viteri (sviteri@stanford.edu)
Course Assistant: Kai Fronsdal (kaif@stanford.edu)
Lecture: Tuesday 4:30-5:50pm, 200-034
Graduate-level course, open to advanced undergraduates (contact the course instructor)
3 Units, Autumn 2024, ExploreCourses

Course Information (Spring 2023)

Instructor: Scott Viteri
Teaching Assistants: Victor Lecomte, Gabe Mukobi, Peter Chatain
Course Faculty Sponsor: Clark Barrett
Lecture: Monday 3:00-4:30pm, B067 Mitchell Earth Science (In-person only)
Optional Office Hours: Large Language Model productivity meetings at 1-2PM in Gates 200
Graduate-level course, open to advanced undergraduates (contact the course instructor)
3 Units, Spring 2023, ExploreCourses

Course Description (Fall 2024)

In this course we will explore the current state of research in the field of AI alignment, which seeks to bring increasingly intelligent AI systems in line with human values and interests. As the energy in the AI alignment landscape has been increasingly focused on political considerations, we seek to create a space to discuss which direction we should be pointing in, now that we have a better idea of what AI scaling will look like in the near future. This is a philosophical task, and we will invite several speakers who are philosophical in persuasion, but we also find that several of the most relevant philosophical questions cannot be asked without strong technical familiarity with the specifics of language models and reinforcement learning. The format will consist of weekly lectures in which speakers present their relationships to the alignment problem and their current research approaches.

Before each speaker, we will have some corresponding assigned readings and we will assign some form of active engagement with the material: we will accept a blog post in response to the ideas in the readings, but we will encourage Jupyter notebooks that engage with the technical material directly. Therefore, this course requires research experience, preferably using mathematical and programming tools (e.g. Python, PyTorch, calculus), and is a graduate-level course, open to advanced undergraduates.

Course Description (Spring 2023)

In this course we will explore the current state of research in the field of AI alignment, which seeks to bring increasingly intelligent AI systems in line with human values and interests. The purpose of this course is to encourage the development of new ideas in this field, where a dominant paradigm has not yet been established. The format will be weekly lectures in which speakers present their current research approaches.

The assignment structure will be slightly unusual: each week students will have a choice between a problem set and a short research assignment based on the weekly guest speaker’s research area. For the research assignment, students will start with the abstract of a relevant AI alignment paper or blog post and create a blog post or GitHub repository describing how they would continue the paper. The final weekly assignment will be an extension of one of the previous weeks’ work. Therefore, this course requires research experience, preferably using mathematical and programming tools (e.g. Python, PyTorch, calculus), and is a graduate-level course, open to advanced undergraduates.

Prerequisites:

Any one of the following: CS223a, CS224n, CS224w, CS224u, CS227b, CS228, CS229, CS229t, CS229m, CS231a, CS231n, CS234, CS236, CS237a, CS281

In addition to the above, strong ability to do independent research will be necessary, preferably using mathematical and programming tools (e.g. Python, PyTorch, calculus).

Fall 2024 Schedule

Week	Date	Speaker	Topic
1	09/24/24	Scott Viteri	Course and Alignment Landscape Overview
2	10/01/24	Joe Carlsmith	Otherness and Control in the Age of AGI
3	10/08/24	Joscha Bach	Life, Intelligence, Consciousness, AI & the Future of Humans
4	10/15/24	Jesse Hoogland	Singular Learning Theory
5	10/22/24	Jack Lindsey	Towards Monosemanticity
6	10/29/24	Rachel Freedman	Reinforcement Learning and Value Alignment
8	11/12/24	Dan Hendrycks	California Senate Bill 1047
9	11/19/24	Jessica Taylor	Fixed Points in AI Alignment
11	12/03/24	Student Presentations	Course Takeaways

Note: Weeks 7, 10, and 12 are skipped for Election Day, Thanksgiving, and Finals Week, respectively.

Fall 2024 Assignment Structure

For each speaker, we will assign corresponding online material to engage with. We expect the readings to take up to two hours. For each reading, there will be a corresponding written assignment.

We are open to many formats of written assignment, as long as we feel it represents at least two hours' worth of effort and shows understanding of the relevant material. Sample acceptable formats include:

Blog post / book report style – Write about related work, a summary, and some novel analysis of the reading content
Respond to the readings with a small research project of your own – e.g. a GitHub repo, a Google Colab, an Overleaf document, or a small interactive JavaScript demo

We will drop the lowest score, and the course score will be the average score among the remaining assignments. Each assignment will be due at 4PM PT, which is 30 minutes before the start of the class. Note that the readings corresponding to a given lecturer are due before their lecture, so that we can engage more productively as a class.

Fall 2024 Reading Schedule

Week	Due	Reading/Material
2	10/01/24	Gentleness and the Artificial Other and Loving in a World You Don’t Trust
3	10/08/24	AGI Series 2024 - Joscha Bach: Is Consciousness a Missing Link to AGI?
4	10/15/24	Neural Networks Generalize Because of This One Weird Trick and Distilling Singular Learning Theory
5	10/22/24	Towards Monosemanticity
6	10/29/24	Choice Set Misspecification in Reward Inference
8	11/12/24	SB 1047: Safe and Secure Innovation for Frontier Artificial Intelligence Models Act
9	11/19/24	Reflective Oracles, Logical Induction, and The Obliqueness Thesis
11	12/03/24	Previous Course Material

Spring 2023 Schedule

Date	Week	Name	Topic	Suggested Assignment Prompt
April 3 (Mon)	1	Scott Viteri	Overview of Course and AI Safety	Bowman 2022, Steinhardt 2022, Carlsmith 2022, Gates 2022
April 10	2	Adam Gleave (UC Berkeley)	Inverse Reinforcement Learning	Gleave, Toyer 2022, Gleave 2022
April 17	3	Andrew Critch (UC Berkeley)	Multiagent problems	Critch 2019, Fickinger, Zhuang, Critch et al 2020, Garrabrant, Critch et al 2016
April 24	4	Andy Jones (Anthropic)	Empirical alignment - interpretability	Askell 2021, Elhage, Nanda 2021
May 1	5	Dan Hendrycks (Center for AI Safety)	Robustness and Generalization in AI Systems	Hendrycks 2022, Hendrycks 2021a, Hendrycks 2021b
May 8	6	Alex Turner (UC Berkeley)	Shard theory	Turner 2022, Pope, Turner 2022
May 15	7	Laria Reynolds (Conjecture)	Empirical alignment research with LLMs	Reynolds, McDonell 2021
May 22	8	John Wentworth (independent researcher)	Agent Foundations and Abstractions	Wentworth 2022a, Wentworth 2022b
May 29	9	Memorial Day — no class
June 5	10	Evan Hubinger (Anthropic)	Mesa-Optimization and Inner Alignment	Hubinger, Mikulik et al 2019, Hubinger 2021