Mixture of experts


Lecture

Mixture of Experts (MoE). ▷ The ReLU Strikes Back (2023) paper was about activation sparsity. ▷ Mixture of Experts: a sparse feed-forward network (FFN) that replaces a single FFN with many experts and a router. This increases model capacity without a proportional increase in FLOPs, because there are many total parameters but only a few active parameters per token.
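Below is a minimal sketch of a sparse MoE feed-forward layer with top-k routing, assuming PyTorch; the module structure, dimensions, and hyperparameters are illustrative, not taken from any specific paper.

```python
# A minimal sketch of a sparse MoE FFN with top-k routing (illustrative sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)               # routing probabilities
        weights, idx = probs.topk(self.top_k, dim=-1)           # keep top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out
```

Total parameters grow with the number of experts, while per-token FLOPs grow only with top_k, which is exactly the capacity-vs-compute trade-off described above.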

OLMoE. (2024) OLMoE-1B-7B: about 1B active parameters out of about 7B total parameters. ▷ Key decisions in designing an MoE model: the number of activated and total parameters; the design of the experts; the choice of routing algorithm; whether to initialize from a dense model; whether to change the training objective, e.g., with auxiliary losses. ▷ OLMoE's choices: 1.3B active parameters out of a total of 6.9B; two auxiliary losses. They used 3x fewer FLOPs and 2x less time compared to the dense 1.3B model. They used only routed experts, without any shared experts. They also observe router saturation, where routing decisions largely stabilize early in training.
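As a hedged sketch of the two auxiliary losses typically paired with this kind of routing in OLMoE-style training, a load-balancing loss and a router z-loss, here is a PyTorch version; the exact coefficients and reductions used in the paper may differ.

```python
# Sketches of the two common MoE auxiliary losses (coefficients are applied elsewhere).
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, n_experts):
    """Encourages token assignments to be spread evenly across experts.
    router_logits: (n_tokens, n_experts); expert_idx: (n_tokens, top_k)."""
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of token-to-expert assignments that went to expert i
    counts = torch.bincount(expert_idx.flatten(), minlength=n_experts).float()
    f = counts / counts.sum()
    # P_i: mean routing probability assigned to expert i
    p = probs.mean(dim=0)
    return n_experts * torch.sum(f * p)

def router_z_loss(router_logits):
    """Keeps router logits small so the routing softmax stays well-behaved."""
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()
```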

DeepSeek-V3 Report. (2024) OLMoE includes a load-balancing loss, but this can hurt quality when the router optimizes for balance over good routing. DeepSeek-V3 instead uses auxiliary-loss-free balancing and node-limited routing, with 671B total parameters and 37B active per token. They observed expert specialization patterns.
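Below is a minimal sketch in the spirit of DeepSeek-V3's auxiliary-loss-free balancing: a per-expert bias is added to the routing scores only for expert selection and is nudged according to observed load. The update rule, step size, and the use of raw scores for gating here are assumptions for illustration, not the exact published recipe.

```python
# Sketch of bias-adjusted ("loss-free") top-k routing; gamma is an illustrative step size.
import torch

def biased_topk_routing(scores, bias, top_k, gamma=1e-3):
    """scores: (n_tokens, n_experts) routing affinities; bias: (n_experts,) running buffer."""
    _, idx = (scores + bias).topk(top_k, dim=-1)   # bias only influences which experts are picked
    gate = torch.gather(scores, 1, idx)            # gating weights come from the raw scores
    # Nudge the bias toward balance: down for over-loaded experts, up for under-loaded ones.
    counts = torch.bincount(idx.flatten(), minlength=scores.size(1)).float()
    bias += gamma * torch.sign(counts.mean() - counts)
    return gate, idx, bias
```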

Branch-Train-MiX. (2024) Problem: training LLMs to perform well across multiple specialized domains in a synchronized manner is costly and hard. ▷ Prior work: BTM (Branch-Train-Merge) branches a seed LLM into domain models, trains the dense models in parallel, then averages their weights. ▷ Does this resemble the post-training philosophy?
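To make the contrast with BTM concrete, here is a hedged sketch of the Branch-Train-MiX step: instead of averaging everything, each branch-trained dense model's FFN becomes one expert in an MoE layer, while the remaining weights are averaged. Module names such as ffn, attn, and experts are hypothetical placeholders for the actual model layout.

```python
# Sketch of merging branch-trained dense layers into an MoE layer (hypothetical module names).
import copy
import torch

def btx_merge_layer(dense_layers, moe_layer):
    """dense_layers: the same transformer layer taken from each domain-specialized dense model."""
    # 1. Each domain model's FFN becomes one expert.
    for expert, layer in zip(moe_layer.experts, dense_layers):
        expert.load_state_dict(layer.ffn.state_dict())
    # 2. The remaining (e.g., attention) weights are averaged across domain models.
    avg_state = copy.deepcopy(dense_layers[0].attn.state_dict())
    for key in avg_state:
        avg_state[key] = torch.stack(
            [layer.attn.state_dict()[key] for layer in dense_layers]
        ).mean(dim=0)
    moe_layer.attn.load_state_dict(avg_state)
    return moe_layer
```

In BTX, the router is newly initialized after this merge and the combined model is fine-tuned so the router learns to mix the domain experts.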


Caveat. This is (part of) a scribing note from Data-Centric Large Language Models (CS 294-288; Fall 2025; UC Berkeley), taught by Professor Sewon Min. The note may contain typos, inaccurate information, or inadvertent errors; any mistakes or errors are mine. The note does not represent the views of the instructor, guest speakers, or any other participants of the course. The note is published with the instructor’s permission to post publicly. The content of the note has not been reviewed or endorsed by the instructor, guest speakers, or any other participants of the course. If you spot an error, I’d appreciate a correction. For any comments, here is my email address.