Evaluation
Lecture
NLP benchmarks before MMLU. ▷ GLUE (and SuperGLUE): combines multiple English sentence/sentence-pair classification and similarity tasks; SuperGLUE includes tasks that demand deeper reasoning. ▷ And more.
Modern NLP benchmarks. ▷ MMLU (Measuring Massive Multitask Language Understanding): multiple-choice benchmark covering 57 academic subjects. ▷ LiveBench: newer benchmark with frequently updated questions, aiming to avoid test-set contamination. ▷ GPQA (Graduate-Level Google-Proof Q\&A): very difficult questions written by domain experts. ▷ MT-Bench. ▷ And more.
Data collection (by type). ▷ From existing natural sources: MMLU questions were collected online by grad and undergrad students; SWE-bench; AIME. ▷ Experts manually create the dataset: GPQA hires PhD students; SimpleQA asked AI trainers. ▷ LLM-assisted dataset curation: ToolLLM prompts ChatGPT to generate the data.
MMLU. ▷ Motivation: previous benchmarks evaluated linguistic skills more than overall language understanding. ▷ Their experiments showed that LLMs at the time performed well on common-sense and linguistic tasks, but not on knowledge-intensive ones (which MMLU targets). ▷ Questions were collected manually by grad and undergrad students from sources such as GRE and USMLE practice exams; about 15k questions in total. Human-level accuracy: 35\% for unspecialized humans; 87\% on USMLE questions. ▷ The paper includes many experiments. ▷ Few-shot prompting increases both accuracy and confidence (i.e., the probability assigned to the chosen answer).
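A common way to score multiple-choice benchmarks like MMLU is to compare the model's per-option scores (e.g., log-probabilities of the option letters) and take the argmax; the softmax of those scores gives the "confidence" mentioned above. Below is a minimal toy sketch of this scoring step. The scores here are hypothetical mock values, not real model outputs, and the function names are my own, not from the MMLU codebase.

```python
import math

def softmax(scores):
    # Turn raw scores (e.g., log-probs of options A-D) into a probability distribution.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grade_mc(option_scores, gold_index):
    # Pick the highest-scoring option; return (is_correct, probability of the chosen option).
    probs = softmax(option_scores)
    pred = max(range(len(probs)), key=lambda i: probs[i])
    return pred == gold_index, probs[pred]

# Mock per-option scores for a question whose gold answer is option A (index 0).
zero_shot = [-2.1, -1.9, -2.0, -2.2]   # nearly flat distribution: low confidence
few_shot  = [-0.3, -2.5, -2.8, -3.0]   # sharper distribution after few-shot examples

correct_0, conf_0 = grade_mc(zero_shot, gold_index=0)
correct_k, conf_k = grade_mc(few_shot, gold_index=0)
```

In this toy example the few-shot scores yield both the correct answer and a higher probability on the chosen option, illustrating the accuracy/confidence effect reported in the paper.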
SimpleQA. (2024, OpenAI) A benchmark that evaluates the ability of language models to answer short, fact-seeking questions. Other well-known factual QA benchmarks: TriviaQA, Natural Questions (Google). For example, DeepSeek performs well on TriviaQA but not on Natural Questions.
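Short-answer factual QA benchmarks like TriviaQA and Natural Questions are typically scored with normalized exact match: lowercase the answer, strip punctuation and articles, and compare against the reference answers. (SimpleQA itself uses a model-based grader rather than string matching; the sketch below is the simpler string-match convention.)

```python
import re
import string

def normalize(text):
    # SQuAD-style normalization: lowercase, drop punctuation, remove articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    # A prediction is correct if it matches any reference answer after normalization.
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # → True
```

Normalization matters because models phrase the same fact differently ("The Eiffel Tower." vs. "Eiffel Tower"); without it, exact match would undercount correct answers.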
Caveat. This is (part of) a scribing note from Data-Centric Large Language Models (CS 294-288; Fall 2025; UC Berkeley) taught by Professor Sewon Min. The note may contain typos, inaccurate information, or inadvertent errors; any mistakes are mine. The note does not represent the views of the instructor, guest speakers, or any other participants of the course. It is published with the instructor’s permission to post publicly; its content has not been reviewed or endorsed by the instructor, guest speakers, or any other participants of the course. If you spot an error, I’d appreciate a correction. For any comments, here is my email address.