LLM Post-Training
This chapter examines post-training methods for large language models from a reinforcement learning perspective. Pre-training gives LLMs strong language capabilities, but "predicting the next token" is not the same as "behaving as humans expect." Post-training closes that gap: it uses RL and RL-derived methods to align LLM behavior with human preferences.
Contents
- LLM Post-Training & Reinforcement Learning: from RLHF to GRPO and from DPO to RLVR, covering the RL principles and mathematical derivations behind each post-training method