Post-training

Reading

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Anthropic, 2022). Mainly RLHF; also tried OOD detection techniques, which worked to some extent as well.

Tulu 3: Pushing Frontiers in Open Language Model Post-Training (Allen Institute for AI, 2025). The training algorithms for Tulu 3 include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a new method the paper calls Reinforcement Learning with Verifiable Rewards (RLVR). Tulu 3 is a family of open SoTA post-trained models, released alongside all of the data, training recipes, code, infrastructure, and evaluation framework. ▷ Explaining the Tulu 3 recipe: (i) Data curation: curate a variety of prompts to be allocated across multiple stages of optimization; (ii) supervised finetuning, followed by using mergekit to merge experiments; (iii) preference fine-tuning, specifically DPO: PPO (Proximal Policy Optimization) uses online RL, while DPO (Direct Preference Optimization) directly optimizes the RLHF objective (cf. the paper explored two promising DPO variants, SimPO and length-normalized DPO); (iv) RLVR: a new method for training LMs on tasks with verifiable outcomes, such as math problem-solving and instruction following (see the sketch below). ▷ The evaluation setup has four parts: (i) a development partition, (ii) safety evaluation, (iii) unseen evaluation, and (iv) the paper's novel evaluations. Many safety evaluation sets are included, mostly from after 2023. One of their novel evaluations uses LM judges for the instruction-following task.
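
To make the RLVR idea concrete, here is a minimal sketch of a verifiable reward for math-style tasks: the policy receives a fixed bonus only when its extracted final answer matches the ground truth. The function names, the answer-extraction heuristic, and the bonus value are illustrative assumptions, not the paper's actual implementation.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Heuristic: take the last number in the completion as the final answer.
    (Illustrative only; real verifiers handle formats much more carefully.)"""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None

def verifiable_reward(completion: str, ground_truth: str, bonus: float = 10.0) -> float:
    """Binary verifiable reward: a constant bonus if the answer checks out, else 0.
    The bonus value is a placeholder hyperparameter."""
    answer = extract_final_answer(completion)
    return bonus if answer is not None and answer == ground_truth.strip() else 0.0

# Example: reward a GSM8K-style completion only when the final answer is correct
print(verifiable_reward("... so the total is 42", "42"))  # 10.0
print(verifiable_reward("... so the total is 41", "42"))  # 0.0
```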

Lecture

Introduction (and some past works). ▷ RLHF (Christiano et al., 2017). ▷ InstructGPT (Ouyang et al., 2022): (a) collect a dataset of human-written demonstrations and use SFT to train an initial policy; (b) train a reward model (RM) on human-labeled comparisons of different model outputs; (c) use that RM to fine-tune the LM with RL (PPO). However, this approach improves the model's helpfulness but does not address its harmlessness.
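
As a reference for step (b), below is a minimal sketch of the standard pairwise (Bradley-Terry style) reward-model loss: the RM scores the chosen and rejected responses and is trained to rank the chosen one higher. The `reward_model` interface is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise RM loss: -log sigmoid(r_chosen - r_rejected).
    `reward_model` is assumed to map token ids to one scalar score per sequence."""
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```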

Anthropic 2022 paper. This follows a similar approach to InstructGPT, but tackles both helpfulness and harmlessness. They use Elo ratings and perform some calibration.
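
For reference, Elo ratings and pairwise win probabilities are interchangeable via the standard Elo conversion; the sketch below shows that conversion as an assumption about how "Elo rating" is used here, not as the paper's exact procedure.

```python
import math

def elo_diff_from_win_rate(p_win: float) -> float:
    """Elo score difference implied by a pairwise win probability."""
    return 400.0 * math.log10(p_win / (1.0 - p_win))

def win_rate_from_elo_diff(diff: float) -> float:
    """Inverse: expected win probability given an Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

print(elo_diff_from_win_rate(0.64))   # ~100 Elo points
print(win_rate_from_elo_diff(100.0))  # ~0.64
```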

Tulu 3. SFT teaches 'valid' answers; preference tuning teaches which valid answers people most want. Their DPO dataset includes prompts from SFT (Figure 7). Figure 20: comparison of RLVR.

Some critiques. Figure 2 of Tulu 3: the data they train on is at most about 8k tokens long, which is much shorter than what many open-source models support. Benchmark coverage is not sufficient. PPO works well with model-based (learned) rewards, while GRPO (a PPO variant) works well with rule-based rewards, so it would be more natural to use something like GRPO instead of the PPO they used.
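
To illustrate the GRPO point: rather than training a value model, GRPO samples a group of completions per prompt and uses the group-normalized reward as the advantage. A minimal sketch, assuming rule-based rewards have already been computed per completion:

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: standardize each completion's reward against
    the mean/std of its own group, so no value network is needed.
    `group_rewards` has shape (num_groups, group_size)."""
    mean = group_rewards.mean(dim=1, keepdim=True)
    std = group_rewards.std(dim=1, keepdim=True)
    return (group_rewards - mean) / (std + eps)

# Example: 1 prompt, 4 sampled completions scored by a rule-based verifier
print(grpo_advantages(torch.tensor([[10.0, 0.0, 0.0, 10.0]])))
```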

Some advocacy. ▷ Anthropic 2022: 'wisdom of crowds', and separate datasets for helpfulness and harmlessness. ▷ Tulu 3: open-source data, and a persona-driven approach for more diversity in the data. ▷ Regarding the short-context data in Tulu 3, Llama 3 did something relevant: they found that adding 0.1% of synthetic long-context data was helpful.

Some related follow-up. ▷ DPO has the same KL-regularized objective as RLHF but without training an RM: same goal as RLHF, optimized via a supervised pairwise loss. ▷ RL objective: the LM must pick responses that are highly preferred while not drifting away from a baseline model (typically the SFT model). This objective has a closed-form solution. ▷ Plugging the closed-form solution into the Bradley-Terry preference model gives the DPO loss function (written out below). ▷ The DPO paper has an IMDB sentiment example.
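
Writing that chain out explicitly (standard DPO derivation; notation follows the DPO paper):

```latex
% KL-regularized RLHF objective
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[r(x,y)\big]
  - \beta \, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(\cdot\mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot\mid x)\big]

% Closed-form optimal policy, and the implied reward
\pi^*(y\mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y\mid x)\exp\!\Big(\tfrac{1}{\beta} r(x,y)\Big)
\;\;\Longrightarrow\;\;
r(x,y) = \beta \log \frac{\pi^*(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + \beta \log Z(x)

% Plugging into the Bradley--Terry model p(y_w \succ y_l \mid x) = \sigma(r(x,y_w) - r(x,y_l))
% (the Z(x) terms cancel) gives the DPO loss:
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
 = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
 \left[ \log \sigma\!\left(
   \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
   - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
 \right) \right]
```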


Caveat. This is (part of) a scribing note from Data-Centric Large Language Models (CS 294-288; Fall 2025; UC Berkeley) taught by Professor Sewon Min. The note may contain typos, inaccurate information, or inadvertent errors; any mistakes or errors are mine. The note does not represent the views of the instructor, guest speakers, or any other participants of the course. The note is published with the instructor's permission to post publicly. The content of the note has not been reviewed or endorsed by the instructor, guest speakers, or any other participants of the course. If you spot an error, I'd appreciate a correction. For any comments, here is my email address.