There’s a lot of AI alignment work I think is useful. For instance, I’m a big fan of Chris Olah and Shan Carter’s work.
But a main thrust of AI alignment work suffers, in my view, from four deep misconceptions.
First, researchers generally try to align AI either with something we tell it to do (i.e., a goal we give it) or, in the case of IRL/CIRL, with the preference profile we reveal through our actions and demonstrations (our behavior). But an important kind of information is missing from both goals and preference profiles: neither captures what it is about the goal, or about our behavior, that matters to us.
Say I’m making coffee, and a CIRL agent is watching me and trying to guess my reward function. Is my reward simply that coffee gets made? Is it the ritual of making it? Or is it something further downstream that the coffee merely serves?
There may be clues in my behavior as to which of these is my “true reward function.” If you watch me make coffee and see me smell the beans, carefully pack the grounds into a Bialetti Moka Pot, whistle while adjusting the stovetop flame, and so on, you might conclude that I enjoy the process, not just the result.
But this conclusion is not one that a CIRL agent is predisposed to make. Behavior of this type does not enjoy any kind of special status in CIRL.
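To make the point concrete, here is a toy sketch of Bayesian reward inference over the coffee demonstration. It is my own illustration, not code from any CIRL paper: the candidate reward functions, action names, and Boltzmann-rationality model are all assumptions. One candidate cares only about the outcome; the other also values the process. Inference weighs them purely by how well they predict the observed actions under a uniform prior.

```python
import numpy as np

# Toy Bayesian reward inference over the coffee demonstration.
# (Hypothetical illustration: the reward functions, action names, and
# Boltzmann-rationality model are assumptions, not from any specific paper.)

demo = ["smell_beans", "pack_grounds_carefully", "whistle_at_stove", "pour_coffee"]
alternatives = ["skip_step", "rush"]  # actions I could have taken instead

def outcome_only(action):
    # Reward only the final result: coffee gets poured.
    return 1.0 if action == "pour_coffee" else 0.0

def process_and_outcome(action):
    # Reward every step of the ritual as well as the result.
    return 0.0 if action in alternatives else 1.0

candidates = {"outcome_only": outcome_only,
              "process_and_outcome": process_and_outcome}
prior = {name: 0.5 for name in candidates}  # uniform prior: no special status

def boltzmann_likelihood(reward, beta=2.0):
    """P(demonstration | reward), modeling the human as choosing among
    available actions at each step with probability softmax(beta * reward)."""
    p = 1.0
    for chosen in demo:
        options = [chosen] + alternatives
        weights = np.exp(beta * np.array([reward(a) for a in options]))
        p *= weights[0] / weights.sum()
    return p

unnorm = {n: prior[n] * boltzmann_likelihood(r) for n, r in candidates.items()}
total = sum(unnorm.values())
posterior = {n: round(v / total, 3) for n, v in unnorm.items()}
print(posterior)  # the process-valuing hypothesis dominates, roughly 0.93 vs 0.07
```

The point is not that this kind of inference can never land on the process-valuing hypothesis; it can, if the behavior discriminates strongly enough. The point is that it has no native notion of which features of the behavior are the ones the human cares about: it just scores hypotheses by predictive fit.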
The situation is even worse when trying to align an agent with stated goals, since goals never capture what it is we want to do ourselves. A goal we give to a machine is always instrumental. We imagine it as a step towards something we want to do ourselves, but we always omit that part. The AI will end up as a kind of accelerationist, making as much coffee as quickly as possible, without understanding that my real value is the camaraderie I hope to enjoy after this Alignment Forum post is live, and that the coffee is just a tiny (and likely unnecessary) step in that direction.
<aside> ☝ See more in my special note on cooperative inverse reinforcement learning.
</aside>
Second, these alignment researchers are politically and economically naive. Powerful AIs are built and owned by nation-states and corporations, and those actors are not aligned with the general interest. Hedge funds, for instance, want to do things that are not aligned with human values. So even if an AI fully aligned with human values were available on the market, they’d prefer to buy an unaligned one.
The alignment problem, then, is to ensure that organizations and nation-states make the right choice. It is a market alignment problem. What successor to the market would encourage at least the large actors to act in alignment with human values?
Third, these researchers talk about human values, but when pushed to name one, they struggle, often naming what I’d call a goal. To align something with human values, it seems sensible to look into the best theories of what values are, how you’d specify one (even in vague and imprecise language), and how you’d tell the difference between values and other kinds of pursuits.
As it is, many researchers seem to mistake values for preferences or goals.