Reasoning


Lecture: Part 1

Omitted.

Lecture: Part 2

Introduction. ▷ The OpenAI o1 paper was published in October 2024. ▷ Qwen model family: an open-source family of models from Alibaba with strong performance on math datasets. ▷ RLVR = training LLMs with verifiable outcome rewards (e.g., correctness checks for math; a minimal example of such a reward check follows). But the question is: are we teaching new reasoning, or just surfacing latent capabilities?
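To make "verifiable outcome reward" concrete, here is a minimal sketch of a binary correctness reward for math outputs. The `\boxed{...}` extraction convention and the normalization are my assumptions for illustration, not the exact check used in any of the papers below.

```python
import re

def extract_final_answer(response: str) -> str | None:
    """Pull the content of the last \\boxed{...} span, a common convention
    for final answers in math chain-of-thought outputs (assumed format)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Binary outcome reward: 1.0 if the extracted answer matches the
    ground truth after light normalization, else 0.0."""
    pred = extract_final_answer(response)
    if pred is None:
        return 0.0
    normalize = lambda s: s.replace(" ", "").lower()
    return 1.0 if normalize(pred) == normalize(gold_answer) else 0.0
```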

Spurious rewards paper (Shao et al., 2025). Spurious rewards (i.e., rewards other than answer correctness) can produce large MATH gains in Qwen2.5, but the effect is model-dependent. ▷ Experimental setup: models are Qwen2.5-Math, the general Qwen2.5 variants, Llama3, and OLMo2; the RL algorithm is GRPO. ▷ How are weak rewards and spurious rewards applied in training? Weak rewards: majority vote, one-shot RL, format reward. Spurious rewards: incorrect label, random reward (sketched below). ▷ Main claim: even random, incorrect, or format-only rewards can yield large improvements on MATH-500. However, this does not generalize to the other models (Llama and OLMo), and the non-math (general) Qwen models were not improved. ▷ They also compare the learning curves for each type of reward. ▷ Why do incorrect rewards work? One reasonable intuition: the incorrect labels are obtained from majority voting and may still be close to the ground-truth answer.
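For concreteness, here is a hedged sketch of how these reward variants could be implemented on top of the correctness reward above (reusing `extract_final_answer` / `verifiable_reward` from the previous snippet; the function names and details are mine, not the paper's):

```python
import random
from collections import Counter

def format_reward(response: str) -> float:
    """Weak reward: 1.0 if the response contains a \\boxed{...} answer at all,
    regardless of correctness."""
    return 1.0 if "\\boxed{" in response else 0.0

def majority_vote_reward(response: str, group_responses: list[str]) -> float:
    """Weak reward: 1.0 if this response agrees with the majority answer
    among a group of rollouts; no ground-truth label is used."""
    counts = Counter(a for a in map(extract_final_answer, group_responses) if a is not None)
    if not counts:
        return 0.0
    majority_answer, _ = counts.most_common(1)[0]
    return 1.0 if extract_final_answer(response) == majority_answer else 0.0

def incorrect_label_reward(response: str, wrong_answer: str) -> float:
    """Spurious reward: rewards agreement with a deliberately wrong label."""
    return verifiable_reward(response, wrong_answer)

def random_reward(response: str, p: float = 0.5) -> float:
    """Spurious reward: a Bernoulli(p) reward independent of the response."""
    return 1.0 if random.random() < p else 0.0
```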

Spurious rewards: ablation study. ▷ Code reasoning (RLVR upweights pre-training biases): Qwen2.5-Math frequently produces Python-like code as its CoT. When you compare the accuracy of CoTs written as code vs. in natural language, Qwen is more accurate with code, while the other models are more accurate with language. This suggests that RLVR (even with spurious rewards) increases the frequency of code reasoning, and that this correlates with improved performance. ▷ Causal intervention on code reasoning. Their hypothesis: code reasoning drives the performance gain from spurious rewards. To test it, they prepend a prompt saying "Let's solve this using Python." (a sketch of both measurements follows).
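A rough sketch of how the frequency of code-style reasoning could be measured before vs. after RLVR, plus the prompt-based intervention mentioned in the lecture. The heuristic marker list is my assumption; the paper's actual classification of code vs. language CoTs may differ.

```python
import re

# Heuristic markers of Python-like reasoning in a CoT (assumed, not the paper's exact criteria).
CODE_MARKERS = [r"\bdef \w+\(", r"\bimport \w+", r"\bprint\(", r"^\s*for .+ in .+:", r"\brange\("]

def is_code_reasoning(cot: str) -> bool:
    """Flag a chain of thought as 'code reasoning' if it contains Python-like constructs."""
    return any(re.search(pat, cot, flags=re.MULTILINE) for pat in CODE_MARKERS)

def code_reasoning_rate(cots: list[str]) -> float:
    """Fraction of sampled CoTs that look like code reasoning; compare this
    before vs. after RLVR to see whether training upweights the behavior."""
    return sum(is_code_reasoning(c) for c in cots) / max(len(cots), 1)

def with_code_hint(question: str) -> str:
    """Causal intervention from the lecture: explicitly nudge the model toward code reasoning."""
    return question + "\n\nLet's solve this using Python."
```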

Reasoning or Memorization? (Wu et al., 2025) Revisits the spurious rewards paper. ▷ Models can be further improved through some RL. ▷ Rewards don't have to be accurate to improve Qwen's performance. ▷ The paper proposes two hypotheses: (i) data contamination (parts of the evaluation sets leaked into the pre-training data); (ii) strong math capability (Qwen is a strong math model and can tolerate noisy updates). They conclude that the first hypothesis is likely correct and the second is not.

a. 1st hypothesis. ▷ Qwen memorizes prompts: when the authors feed partial question statements from MATH-500, Qwen outputs the remaining question statements. ▷ They compare this partial-prompt completion rate between Qwen and Llama; Qwen shows a significantly higher completion rate (a sketch of this probe follows).
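A hedged sketch of this contamination probe: feed the model the first part of each benchmark question and measure how closely its continuation matches the held-out remainder. The 60% prefix split and the similarity metric are my assumptions, and `generate` stands in for whatever model inference wrapper is used.

```python
from difflib import SequenceMatcher
from typing import Callable

def partial_prompt_completion_score(
    question: str,
    generate: Callable[[str], str],  # assumed model wrapper: prefix -> continuation
    frac: float = 0.6,               # assumed prefix fraction
) -> float:
    """Feed the first `frac` of the question and measure how similar the model's
    continuation is to the true remainder. Consistently high scores across a
    benchmark suggest the questions were seen during pre-training."""
    cut = int(len(question) * frac)
    prefix, remainder = question[:cut], question[cut:]
    continuation = generate(prefix)[: len(remainder)]
    return SequenceMatcher(None, continuation, remainder).ratio()
```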

b. 2nd hypothesis. ▷ They construct the RandomCalculation dataset, which consists of synthetic arithmetic problems (written in LaTeX). ▷ On this dataset, spurious rewards still work but much worse than correct rewards, and incorrect rewards do not work at all (a construction sketch follows).
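A minimal sketch of how a RandomCalculation-style example could be constructed: sample a random arithmetic expression, render it in LaTeX, and compute the exact answer. The operators, operand ranges, and prompt wording here are my assumptions; the paper's construction may differ.

```python
import random

def make_random_calculation(n_ops: int = 3, seed: int | None = None) -> dict:
    """Generate one synthetic arithmetic problem with a LaTeX-formatted
    question and an exact ground-truth answer (standard operator precedence)."""
    rng = random.Random(seed)
    ops = [("+", "+"), ("-", "-"), ("*", r"\times")]
    first = rng.randint(1, 99)
    py_expr, tex_expr = str(first), str(first)
    for _ in range(n_ops):
        op_py, op_tex = rng.choice(ops)
        operand = rng.randint(1, 99)
        py_expr += f" {op_py} {operand}"
        tex_expr += f" {op_tex} {operand}"
    answer = eval(py_expr)  # safe here: the expression contains only integers and +, -, *
    return {"question": f"Compute ${tex_expr}$.", "answer": str(answer)}

# Usage: print(make_random_calculation(seed=0))
```

Because fresh problems come with exact ground-truth answers, this setup rules out memorization of the test questions, which is why the contrast between correct and incorrect rewards here is informative.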


Caveat. This is a (part of a) scribing note from Data-Centric Large Language Models (CS 294-288; Fall 2025; UC Berkeley) taught by Professor Sewon Min. The note may contain typos, inaccurate information, or inadvertent errors; any mistakes or errors are mine. The note does not represent the views of the instructor, guest speakers, or any other participants of the course. The note is published with the instructor's permission to post publicly. The content of the note has not been reviewed or endorsed by the instructor, guest speakers, or any other participants of the course. If you spot an error, I'd appreciate a correction. For any comments, here is my email address.