Memory


Lecture

This note closely follows this (guest lecture) slide deck from the course website.

Memorization. LLMs memorize facts in their parameters. Is scaling all we need for factuality? Tail facts are hard to remember (SimpleQA, from OpenAI), but the scaling law is still winning. Notably, more than 75% of the answers can be found in Wikipedia.

How to learn facts in parametric memory efficiently? ▷ A series of papers from Meta: Physics of Language Models (Part 3.1, Knowledge Storage and Extraction; Part 3.3, Knowledge Capacity Scaling Laws). ▷ Some of their claims: GPT-2, trained with standard AdamW, consistently achieves a 2 bit/param capacity ratio across all data settings after sufficient training. ▷ However, achieving 2 bit/param requires each knowledge piece to be visited 1,000 times; when this is reduced to 100 times, the capacity drops to 1 bit/param.
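As a back-of-envelope illustration of what the bit/param claim implies (my own sketch; the ratios are the paper's numbers, but the model size and GB conversion below are illustrative assumptions):

```python
# Rough capacity estimate under the claimed bit/param ratios.
# Model size and the GB conversion are illustrative assumptions.

def knowledge_capacity_bits(num_params: float, bits_per_param: float) -> float:
    """Total storable knowledge, in bits, under a fixed bits/param ratio."""
    return num_params * bits_per_param

num_params = 1.5e9  # e.g., a GPT-2-XL-scale model (assumed size)
for exposures, ratio in [(1000, 2.0), (100, 1.0)]:
    bits = knowledge_capacity_bits(num_params, ratio)
    print(f"~{exposures} exposures/fact -> {ratio} bit/param "
          f"-> {bits / 8 / 1e9:.2f} GB of storable facts")
```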

Learning facts at scale with active reading. ▷ Parametric memory, facts: dense parameters store information efficiently and are good at tail facts, but recall still does not match RAG. ▷ Parametric memory, questions: what about procedural and episodic memory? Can we match RAG in recall? What about continual learning? ▷ Expanding memory without more compute: dense models couple memory and compute, since FFNs can be interpreted as key-value memory (see the sketch below).
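A minimal sketch of the "FFN as key-value memory" reading (shapes and the ReLU nonlinearity are my assumptions, not from the lecture): rows of the first weight matrix act as keys matched against the hidden state, rows of the second act as values mixed by the match scores, so memory size and compute grow together.

```python
import torch
import torch.nn.functional as F

d_model, d_ff = 16, 64
W_in = torch.randn(d_ff, d_model)    # each row acts as a "key"
W_out = torch.randn(d_ff, d_model)   # each row acts as a "value"

x = torch.randn(d_model)             # hidden state of one token
scores = F.relu(W_in @ x)            # key-match scores, shape (d_ff,)
y = W_out.t() @ scores               # weighted sum of values, shape (d_model,)

# Equivalent to the usual 2-layer FFN: F.relu(x @ W_in.t()) @ W_out.
# Adding memory slots (larger d_ff) also adds compute: the coupling that
# motivates dedicated memory layers.
assert torch.allclose(y, F.relu(x @ W_in.t()) @ W_out)
```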

Memory layers. ▷ The concept of 'memory layers' goes back to a 2015 paper from Meta. You can use product keys, i.e., combining two sets of 1k sub-keys to index 1M memory slots (sketched below). ▷ UltraMemV2 by ByteDance adopts a variation of PEER, places memory layers at every layer, and improves initialization; it is competitive with strong MoE baselines and works well across knowledge, coding, etc.
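A rough sketch of product-key lookup in that spirit (simplified; the dimensions, top-k size, and scoring below are my assumptions): two codebooks of 1k sub-keys index 1k × 1k = 1M slots, so the query is only ever scored against 2k sub-keys rather than 1M full keys.

```python
import torch

n_sub, d_half, d_value, topk = 1000, 32, 64, 4
subkeys1 = torch.randn(n_sub, d_half)                 # first half-key codebook
subkeys2 = torch.randn(n_sub, d_half)                 # second half-key codebook
values = torch.nn.Embedding(n_sub * n_sub, d_value)   # 1M value slots

q = torch.randn(2 * d_half)                           # query from the hidden state
q1, q2 = q[:d_half], q[d_half:]

s1, i1 = (subkeys1 @ q1).topk(topk)                   # score 1k sub-keys, not 1M keys
s2, i2 = (subkeys2 @ q2).topk(topk)

# Combine the two top-k lists into k*k candidate slots out of the full 1M.
cand_scores = (s1[:, None] + s2[None, :]).flatten()
cand_slots = (i1[:, None] * n_sub + i2[None, :]).flatten()
best = cand_scores.topk(topk)
weights = torch.softmax(best.values, dim=0)
out = (weights[:, None] * values(cand_slots[best.indices])).sum(0)
print(out.shape)  # torch.Size([64])
```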

Continual learning. Memory layers have natural advantages for continual learning: they are efficient at learning facts (more learning) and they are sparsely updated (less forgetting); this is shown empirically against full fine-tuning and LoRA, and illustrated in the sketch below. ▷ There are currently six paradigms of continual learning; more in this blog post. ▷ Is in-context learning really learning? (by Adrian de Wynter, Microsoft) ICL imitates gradient-based learning to some extent but degrades over time. ▷ What is needed for continual learning? We may need 'real' gradient-based learning (not just ICL), parametric memory, and selective memory updates.
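A toy illustration (my own, not from the lecture) of the "sparsely updated" point: with a sparse embedding table as the value store, a backward pass only produces gradients for the few slots that were actually retrieved, leaving the rest of the table untouched.

```python
import torch
import torch.nn.functional as F

values = torch.nn.Embedding(1_000_000, 64, sparse=True)  # 1M value slots
slot_ids = torch.tensor([3, 17, 42, 99])                 # slots picked by the key lookup
target = torch.zeros(64)

out = values(slot_ids).mean(0)            # read only the selected slots
loss = F.mse_loss(out, target)
loss.backward()

grad = values.weight.grad.coalesce()      # gradient is a sparse tensor
print(grad.is_sparse, grad.indices().shape[1])  # True, 4 rows touched out of 1,000,000
```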


Caveat. This is (part of) a scribing note from Data-Centric Large Language Models (CS 294-288; Fall 2025; UC Berkeley), taught by Professor Sewon Min. The note may contain typos, inaccurate information, or inadvertent errors; any mistakes are mine. The note does not represent the views of the instructor, guest speakers, or any other participants of the course. It is published with the instructor’s permission to post publicly. Its content has not been reviewed or endorsed by the instructor, guest speakers, or any other participants of the course. If you spot an error, I’d appreciate a correction. For any comments, here is my email address.