These types of methods aim to fine-tune part of the LLM to produce text that satisfies the control conditions. The conventional fine-tuning approach is to train the pre-trained LLM on task-specific target data, shifting its weights towards generating text that aligns with the control conditions.
Adapted Modules
These methods attach a task-related adapter network module to the LLM, and the two are usually fine-tuned together.
One example is Auxiliary Tuning @zeldesTechnicalReportAuxiliary2020, which trains a second, smaller network whose logits are added to those of the LLM, shifting its predictions towards outputs that better reflect some property the text is expected to have.
The basic assumption made by this technique is that, to learn the conditional probability modeled by the CTG system, expressed as $p(x_t \mid x_{<t}, s)$, where $x$ is the token sequence and $s$ is the aforementioned property the text needs to have, we can decompose the problem into two steps:
- Learn to generate fluent, natural language, i.e. $p(x_t \mid x_{<t})$;
- Learn to shift this probability as a function of $s$ to obtain $p(x_t \mid x_{<t}, s)$.
Given that a pre-trained LLM already satisfies the first step, the second is addressed by implementing a model that shifts the probability of the pre-trained LLM as a function of $s$.
The auxiliary module is composed of an embedding table ($E_{AUX}$), which projects the input tokens to the hidden size of the AUX model, and Transformer layers ($T_{AUX}$), which are standard transformer blocks; the dimensionality of the AUX model ($d_{AUX}$) can be smaller than that of the LM ($d_{LM}$).
The AUX module is trained to modify the output distribution of the LM in a way that the final generated text reflects the desired attributes. This involves training the module to predict how the presence of an attribute should change the probabilities of different tokens being the next token in the sequence. The logits produced by the AUX module are added to the logits from the pre-trained LM before the softmax operation. This means that the AUX module effectively adjusts the probability distribution produced by the LM, skewing it towards sequences that align with the specified attribute.
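A minimal sketch of this logit-combination step, assuming `lm` and `aux` are callables returning next-token logits (the function names and signatures are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def next_token_distribution(lm, aux, input_ids, attribute_ids):
    """Combine the frozen LM's next-token logits with the AUX module's logits."""
    with torch.no_grad():                        # the pre-trained LM stays frozen
        lm_logits = lm(input_ids)                # shape: (batch, vocab_size)
    aux_logits = aux(input_ids, attribute_ids)   # trained to encode the attribute shift
    # Logits are summed *before* the softmax, so the AUX module re-weights the
    # LM's distribution towards tokens consistent with the desired attribute.
    return F.softmax(lm_logits + aux_logits, dim=-1)
```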
One of the advantages of the AUX module is that it can be trained with relatively modest amounts of data and compute resources compared to training a large LM from scratch. This is because the AUX module leverages the existing capabilities of the pre-trained LM and only needs to learn the adjustments necessary to reflect the specified attributes.
TL;DR:
Adapted modules aim to bridge the gap between the control attributes and the LLM, guiding the language model to generate text that aligns with the corresponding control conditions.
Prompting
Prompting is a way to keep the downstream objective consistent with the pre-training objective: the idea is to use next-token prediction to model any kind of task. For CTG, the model is prompted to generate text with certain characteristics.
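As a toy illustration (the wording is my own, not taken from a specific paper), a sentiment-controlled prompt might simply state the control condition in natural language:

```python
# A sentiment-controlled prompt: the control condition is stated in natural
# language and the LLM continues the text via ordinary next-token prediction.
attribute = "positive"
topic = "the new phone"
prompt = f"Write a {attribute} review about {topic}:\n"
# generation = llm.generate(prompt)   # any autoregressive LM (placeholder call)
```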
Inverse prompting @zouControllableGenerationPretrained2021 is a technique based on the beam search decoding algorithm, in which the generated text is evaluated by using it to reconstruct the prompt. This strengthens the relevance between the prompt and the generation, providing better controllability.
Normally, during a beam search with width $b$, at each decoding step we find the $b$ best possible sequences starting from the current $b$ best sequences: for each of the $b$ sequences kept so far, candidate next tokens are generated, the extended sequences are scored, and only the $b$ best are retained for the next step. To decide which generations are the best, a scoring function is used; the baseline is the log-likelihood of the generated text given the prompt.
Inverse prompting changes this baseline scoring function by inverting the direction of the log-likelihood: instead of scoring a candidate by the likelihood of the generated text given the prompt, we compute the likelihood of the prompt given the generated text.
This scoring function ranks the different beams by how likely they are to generate the prompt back, i.e. in the inverse direction.
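A minimal sketch of the difference between the two scoring functions, assuming a generic `log_likelihood(context, continuation)` helper that returns the log-probability of the continuation given the context under the LM (the helper and its signature are illustrative, not from the paper):

```python
def baseline_score(log_likelihood, prompt, generation):
    """Standard beam-search score: log p(generation | prompt)."""
    return log_likelihood(context=prompt, continuation=generation)

def inverse_prompting_score(log_likelihood, prompt, generation):
    """Inverse prompting score: log p(prompt | generation), i.e. how well
    the generated text 'generates back' the original prompt."""
    return log_likelihood(context=generation, continuation=prompt)
```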
TL;DR:
Most prompt-based methods show a degree of versatility. From a CTG point of view, these methods exploit the characteristics of the LLM's pre-training phase to generate constrained text by selecting appropriate prompts in the fine-tuning stage.
Reinforcement Learning
Reinforcement Learning is used to feed back to the model whether or not the control conditions are respected during its generations.
@zieglerFineTuningLanguageModels2020 use reinforcement learning to fine-tune an LLM with a reward model trained from human preferences.
Given a dataset $D$, the goal is to fine-tune a policy $\pi$, initialized from a pre-trained LLM $\rho$, so that it approximates the distribution in $D$. This is done with reinforcement learning, optimizing the expectation of the reward $r$:

$$\mathbb{E}_{\pi}[r] = \mathbb{E}_{x \sim D,\; y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]$$

The reward model $r(x, y)$ is trained on samples $(x, y_0, y_1, y_2, y_3)$, where each continuation $y_i$ is generated from $\rho$. Human labelers are required to choose the human-preferred continuation among $\{y_0, y_1, y_2, y_3\}$. Also, to prevent $\pi$ from moving too far away from the original LLM $\rho$ and losing fluency, a penalty term is added to the reward function during the fine-tuning of $\pi$:

$$R(x, y) = r(x, y) - \beta \log \frac{\pi(y \mid x)}{\rho(y \mid x)}$$

where $R$ is the re-defined reward function, $\beta$ is the penalty coefficient, and the KL-style term ensures that the two distributions stay as close as possible.
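A small sketch of this penalized reward, assuming placeholder callables `reward_model`, `policy_logprob`, and `ref_logprob` for the learned reward $r$ and the log-probabilities under $\pi$ and $\rho$ (the names are illustrative):

```python
def penalized_reward(reward_model, policy_logprob, ref_logprob, x, y, beta=0.1):
    """R(x, y) = r(x, y) - beta * log( pi(y|x) / rho(y|x) ).

    The penalty keeps the fine-tuned policy pi close to the original
    pre-trained LM rho, so reward optimization does not destroy fluency.
    """
    r = reward_model(x, y)                              # learned from human preference labels
    kl_term = policy_logprob(y, x) - ref_logprob(y, x)  # log pi(y|x) - log rho(y|x)
    return r - beta * kl_term
```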
More on Reinforcement Learning with Human Feedback.
@rafailovDirectPreferenceOptimization2023 present Direct Preference Optimization (DPO), a reinforcement-learning-inspired technique that, instead of training a reward model on preference data, uses that data directly to compute a different fine-tuning loss and updates the policy directly, while still enforcing a KL constraint so the policy does not drift too far from the original model.
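A minimal sketch of the resulting DPO loss for a batch of preference pairs, assuming the summed log-probabilities of the chosen ($y_w$) and rejected ($y_l$) responses under the policy and the frozen reference model have already been computed (variable names are mine):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over a batch of preference pairs (chosen y_w vs. rejected y_l).

    Each argument is a tensor of summed log-probabilities log p(y|x) under
    either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: beta * log( pi(y|x) / pi_ref(y|x) )
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss on the reward margin; the KL constraint is implicit in
    # the comparison against the reference model's log-probabilities.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```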
TL;DR:
While applying RL to LLM-based CTG is a natural idea, the central challenge is ensuring that the LLM is optimized towards the RL reward while maintaining the fluency of the generated text.
Instruction Tuning
Instruction tuning provides an avenue to align language models with user intents, i.e. to make the LLM generate content that complies with human instructions.
FLAN was proposed by Google Research in 2022 @weiFinetunedLanguageModels2022. It involves fine-tuning large language models on a mixture of more than 60 NLP datasets, where each task is expressed through natural language instructions. The results demonstrated that language models are capable of performing tasks described purely through human instructions and can generalize to previously unseen tasks through instruction tuning. @chungScalingInstructionFinetunedLanguage2022 further scaled up the number of tasks and the model size beyond what FLAN achieved, mixing standard instruction-tuning prompts with chain-of-thought prompts and showcasing strong few-shot performance.
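As an illustration of how a conventional NLP task can be rephrased as a natural-language instruction, a generic FLAN-style template for natural language inference might look like the following (this exact template is an example of mine, not one of FLAN's published templates):

```python
# A generic FLAN-style instruction template for natural language inference:
# the task is described entirely in natural language, so an instruction-tuned
# model can generalize to unseen tasks phrased the same way.
premise = "The cat sat on the mat."
hypothesis = "An animal is resting."
prompt = (
    f"Premise: {premise}\n"
    f"Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes, no, or maybe."
)
```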
TL;DR:
Instruction tuning seems to enable LLMs to understand human directions expressed in natural-language formats, offering approaches that appear more general for CTG. However, these systems require carefully designed human prompts, and how to safely align the LLM with human instructions remains an open problem @ouyangTrainingLanguageModels2022 that demands further exploration.