TH1.R1.1

Asymptotics of Language Model Alignment

Joy Yang, University of Sydney, Australia; Salman Salamatian, MIT, United States; Ziteng Sun, Ananda Theertha Suresh, Ahmad Beirami, Google, Australia

Session:
Language Models

Track:
8: Machine Learning

Location:
Ballroom II & III

Presentation Time:
Thu, 11 Jul, 09:45 - 10:05

Session Chair:
Homa Esfahanizadeh

Abstract
Let $p$ denote a reference generative language model. Let $r$ denote a reward model that returns a scalar capturing the degree to which a draw from $p$ is preferred. The goal of {\em language model alignment} is to alter $p$ to a new distribution $\phi$ that results in a higher expected reward while keeping $\phi$ close to $p$. A popular alignment method is {\em KL-constrained reinforcement learning (RL)}, which chooses a distribution $\phi_\Delta$ that maximizes $E_{y \sim \phi_\Delta}[r(y)]$ subject to a relative entropy constraint $D_{\text{KL}}(\phi_\Delta \| p) \leq \Delta$. Another simple alignment method is {\em best-of-$N$}, where $N$ samples are drawn from $p$ and the one with the highest reward is selected. In this paper, we offer a closed-form characterization of the optimal KL-constrained RL solution. We then demonstrate that any alignment method achieving a comparable trade-off between KL divergence and expected reward must approximate the optimal KL-constrained RL solution in terms of relative entropy. To analyze the properties of alignment methods, we introduce two simplifying assumptions: the language model is memoryless, and the reward model is linear. Although these assumptions may not reflect complex real-world scenarios, they enable a precise characterization of the asymptotic (in the sequence length) behavior of both the best-of-$N$ and the KL-constrained RL methods in terms of information-theoretic quantities.
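
For readers who want to experiment with the two alignment methods compared in the abstract, below is a minimal, self-contained sketch (not taken from the paper's code). It implements best-of-$N$ sampling and the exponentially tilted distribution $\phi_\beta(y) \propto p(y)\exp(r(y)/\beta)$, the standard closed-form solution of KL-regularized reward maximization. The memoryless reference model and linear reward mirror the abstract's simplifying assumptions; all names and numbers here (TOKEN_PROBS, TOKEN_REWARD, SEQ_LEN, beta) are illustrative choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Memoryless (i.i.d.-token) reference model p and a linear reward r,
# mirroring the abstract's simplifying assumptions. Values are illustrative.
TOKEN_PROBS = np.array([0.4, 0.3, 0.2, 0.1])   # p over a 4-token vocabulary
TOKEN_REWARD = np.array([0.0, 1.0, 2.0, 3.0])  # per-token scores
SEQ_LEN = 5

def sample_sequence():
    """Draw one length-SEQ_LEN sequence i.i.d. from the reference model p."""
    return rng.choice(len(TOKEN_PROBS), size=SEQ_LEN, p=TOKEN_PROBS)

def reward(seq):
    """Linear reward: sum of per-token scores."""
    return TOKEN_REWARD[seq].sum()

def best_of_n(n):
    """Best-of-N: draw n samples from p and keep the highest-reward one."""
    return max((sample_sequence() for _ in range(n)), key=reward)

# Monte Carlo estimate of the expected reward achieved by best-of-N.
N = 8
bon_avg = np.mean([reward(best_of_n(N)) for _ in range(2000)])
print(f"best-of-{N}: estimated expected reward = {bon_avg:.3f}")

# Exponentially tilted distribution phi_beta(y) proportional to p(y) exp(r(y)/beta).
# Because p is memoryless and r is linear, the tilt factorizes per token,
# so expected reward and KL(phi_beta || p) can be computed exactly.
beta = 1.0
tilted = TOKEN_PROBS * np.exp(TOKEN_REWARD / beta)
tilted /= tilted.sum()
exp_reward = SEQ_LEN * float(tilted @ TOKEN_REWARD)
kl = SEQ_LEN * float((tilted * np.log(tilted / TOKEN_PROBS)).sum())
print(f"tilted (beta={beta}): expected reward = {exp_reward:.3f}, KL = {kl:.3f} nats")
```

Because $p$ is memoryless and $r$ is linear, the tilt factorizes across token positions, which is why the sketch evaluates the tilted distribution's expected reward and $D_{\text{KL}}(\phi_\beta \| p)$ exactly while best-of-$N$ is estimated by Monte Carlo; sweeping $N$ and $\beta$ traces out the reward-versus-KL trade-off that the paper analyzes asymptotically in the sequence length.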