CIRL is really cool! But it has some holes. These holes have been noted by other authors: for instance, in Rohin Shah’s review of Human Compatible.
What holes?
Here I’ll restate them in a way that points more clearly towards my preferred alignment research program.
First, revealed preference is a complex sum and doesn’t differentiate between what is actually meaningful to us, what we are being pressured to do, what we have been habituated to do, etc.
The human might be making coffee (or demonstrating coffee-making for the robot to imitate) to please their boss. But pleasing their boss might be an instrumental concern, dependent on the human’s current beliefs or social situation, which the robot cannot observe. The robot will then infer the wrong “reward function” for the human.
More generally: does the human want coffee made? Coffee drunk (by anyone)? Does the human want to drink coffee themselves? Want coffee drunk by the boss? Want to get on the boss’ good side?
Observing behavior, even deliberate demonstrations, doesn’t tell us which.
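To make the ambiguity concrete, here is a toy sketch in Python. It assumes a Boltzmann-rational human (a common modeling choice in inverse reinforcement learning, not something the argument above depends on), three made-up actions, and five candidate one-feature reward functions matching the questions above. Every name and number in it is hypothetical.

```python
import math

# Hypothetical outcome features for each action the human could take.
ACTIONS = {
    "make_coffee_give_boss": {"coffee_made": 1, "coffee_drunk": 1,
                              "human_drank": 0, "boss_drank": 1, "boss_pleased": 1},
    "make_coffee_drink_self": {"coffee_made": 1, "coffee_drunk": 1,
                               "human_drank": 1, "boss_drank": 0, "boss_pleased": 0},
    "do_nothing": {"coffee_made": 0, "coffee_drunk": 0,
                   "human_drank": 0, "boss_drank": 0, "boss_pleased": 0},
}

# Each hypothesis says the human's reward is exactly one of these features.
HYPOTHESES = ["coffee_made", "coffee_drunk", "human_drank",
              "boss_drank", "boss_pleased"]


def likelihood(hypothesis, observed, beta=1.0):
    """P(observed action | hypothesis) for a Boltzmann-rational human."""
    weights = {a: math.exp(beta * feats[hypothesis]) for a, feats in ACTIONS.items()}
    return weights[observed] / sum(weights.values())


def posterior(observed, beta=1.0):
    """Posterior over hypotheses after one observation, with a uniform prior."""
    scores = {h: likelihood(h, observed, beta) for h in HYPOTHESES}
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}


post = posterior("make_coffee_give_boss")
for name, p in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{name:13s} {p:.3f}")
```

In this toy, `boss_drank` and `boss_pleased` end up with identical posterior mass, and so do `coffee_made` and `coffee_drunk`. Repeating the same demonstration would not separate them, because the tied hypotheses assign the same likelihood to every observation.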
Second, revealed preference fails to model issues of social welfare. As the previous example shows, our wellbeing is deeply social and interdependent. It doesn’t decompose person-by-person. There are distributional concerns. There are also strange loops where people’s welfare is mutually dependent. My girlfriend and I are best optimized together, not separately and then merged. For these reasons, it’s better to define a social welfare function than an individual reward function.
<aside> ☝ See papers by Amartya Sen and Martha Nussbaum
</aside>
CIRL has a third problem: it’d be easy for a CIRL agent to overfit to what appears to be the human’s current reward function. It might be quite surprised when it eventually discovers that the human is growing and changing, revising their beliefs, and so on, and has completely shifted their apparent reward function. The agent would need to watch many such moments of emotional growth and belief revision before it had a chance to discern the true value function of a human being. (Probably infeasibly many.)
<aside> ☝ See papers by Ruth Chang on Hard Choices, Transformative Choices, etc.
</aside>
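The overfitting worry in the third problem can be shown with a deliberately crude sketch. It assumes a learner that treats the human’s preference as fixed and estimates it as the running fraction of times the human chose option A. This learner is a hypothetical stand-in, not CIRL itself, and the function name is made up for illustration.

```python
def steps_until_belief_flips(steps_before_shift: int) -> int:
    """Count the post-shift observations a running-average learner needs
    before it stops believing the human prefers A.

    Before the shift the human always picks A; afterwards, always B.
    """
    if steps_before_shift < 1:
        raise ValueError("need at least one observation before the shift")

    chose_a = steps_before_shift   # every pre-shift choice was A
    total = steps_before_shift
    steps_after = 0
    # The learner still leans towards A while A makes up at least half the data.
    while chose_a / total >= 0.5:
        total += 1                 # one more observation, and it is a B
        steps_after += 1
    return steps_after


print(steps_until_belief_flips(50))  # prints 51
```

The longer the learner has watched the “old” human, the longer it takes to catch up with the new one. And this toy only tracks a single shift after the fact; it learns nothing about why or how the person changed, which is what the agent would actually need.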
All three problems connect to CIRL’s emphasis on reward-seeking behavior. People use language to sync up on what’s good. We use a shared conceptual language (the language of reasons and values) to align with one another. This is very different from demonstration.
(This problem is alluded to in the original CIRL paper as one of “coordination”. If there’s work being done to rid CIRL of these problems, I would love to know about it.)
Any human-in-the-loop (HITL) alignment approach will be slow to collect data. It will collect far less data from its humans than, say, the terabytes consumed by LLMs.
That means HITL approaches can’t get away with vague talk of a “reward function” learned iteratively. The mathematical space of such reward functions is so vast that the meager amount of data provided by HITL interactions like those in CIRL cannot suffice.
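A back-of-the-envelope count gives a feel for the mismatch. The sketch below assumes the simplest possible setup: a tabular reward function that assigns one of a few discrete levels to each state, and a human who answers noiseless yes/no questions. These assumptions and the function name are mine, chosen only to make the counting easy. With `reward_levels ** num_states` candidate functions and at most one bit per answer, the learner needs at least the base-2 logarithm of that many answers.

```python
import math


def min_binary_answers(num_states: int, reward_levels: int) -> int:
    """Lower bound on noiseless yes/no answers needed to single out one
    tabular reward function among reward_levels ** num_states candidates."""
    if num_states < 1 or reward_levels < 2:
        raise ValueError("need at least 1 state and 2 reward levels")
    # log2(reward_levels ** num_states), computed without building the huge integer.
    bits = num_states * math.log2(reward_levels)
    return math.ceil(bits)


print(min_binary_answers(1000, 4))  # prints 2000
```

Even a tiny world with 1,000 states and 4 reward levels needs at least 2,000 perfectly informative answers under these assumptions, and realistic state spaces are astronomically larger. A learner can only beat this bound by committing in advance to strong structure in the space of reward functions, which is the part that can’t be left vague.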