Evaluating Language Models Is Like Playing Minesweeper

2023-10-13

Introduction

Testing matters. We write test cases for software because reliability and standards matter, and the same is true for language-model–powered systems: they’re exciting and useful, but how do we evaluate them? A helpful analogy is Minesweeper: every prompt change can reveal safe ground or blow up your assumptions.


Evaluation Challenges

1) Prompt Sensitivity

Ask: Are we measuring the model or the prompt? Tiny prompt changes can swing quality drastically—making experiments brittle and results hard to compare across teams. [1]
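To make this concrete, here is a minimal sketch of how one might quantify prompt sensitivity: score the same items under several paraphrased prompts and report the spread. The prompt wordings, the exact-match metric, and the model callable are illustrative assumptions, not anything prescribed by [1].

    import statistics
    from typing import Callable

    # Illustrative paraphrases of the same task; not taken from any benchmark.
    PROMPT_VARIANTS = [
        "Answer the question: {q}",
        "Q: {q}\nA:",
        "You are a careful expert. {q}",
        "Please answer concisely. {q}",
    ]

    def exact_match(prediction: str, gold: str) -> float:
        return float(prediction.strip().lower() == gold.strip().lower())

    def prompt_sensitivity(model: Callable[[str], str],
                           dataset: list[tuple[str, str]]) -> None:
        """Score each prompt variant on the same items and report the spread."""
        scores = []
        for template in PROMPT_VARIANTS:
            correct = [exact_match(model(template.format(q=q)), gold)
                       for q, gold in dataset]
            scores.append(sum(correct) / len(correct))
        print("per-prompt accuracy:", [round(s, 3) for s in scores])
        print("spread (max - min):", round(max(scores) - min(scores), 3))
        print("std dev across prompts:", round(statistics.pstdev(scores), 3))

If the spread rivals the gap between two models you are comparing, the benchmark is measuring the prompt as much as the model.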

2) Construct Validity

If two researchers want to measure the “same thing” but use different prompts and get different outputs, what exactly is the construct being measured? We need clearer definitions, better transparency reports, and datasets reflecting real usage. Traditional exams assess human generalization; LLMs excel at recall and patterning, so we must rethink what “ability” means in this context. [1]

3) Contamination (Data Leakage)

Did GPT-4 truly pass USMLE—or did it memorize answers? Performance drops on post-cutoff coding questions suggest training data effects. In medicine and law, it’s still difficult to produce clean, decisive measurements without contamination. [1, 2]
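A common (and admittedly imperfect) contamination heuristic is to check n-gram overlap between benchmark items and whatever slice of the training corpus you can access. Here is a minimal sketch under that assumption; the function names are illustrative, and word-level 13-grams are just one conventional choice.

    def word_ngrams(text: str, n: int = 13) -> set:
        """Word-level n-grams of a text; 13-grams are a common choice for overlap checks."""
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def overlap_fraction(benchmark_item: str, corpus_docs: list[str], n: int = 13) -> float:
        """Fraction of the item's n-grams that also appear somewhere in the corpus slice."""
        item_grams = word_ngrams(benchmark_item, n)
        if not item_grams:
            return 0.0  # item shorter than n words; inconclusive
        corpus_grams = set()
        for doc in corpus_docs:
            corpus_grams |= word_ngrams(doc, n)
        return sum(g in corpus_grams for g in item_grams) / len(item_grams)

Items with high overlap are candidates for removal, or at least for separate reporting, since strong scores on them may reflect memorization rather than ability.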

4) Reproducibility

ML results have long suffered from train–test leakage, which can produce misleadingly high scores. Non-determinism compounds the issue for LLMs; re-running the same setup should yield comparable outcomes, yet often doesn’t without careful controls. [3]
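One concrete habit that helps: pin every setting you can and log it next to the results, so a re-run is at least comparing like with like. Below is a minimal sketch of such a run record; the field names and defaults are assumptions, not a standard schema.

    import datetime
    import hashlib
    import json

    def dataset_fingerprint(items: list[str]) -> str:
        """Hash the evaluation items so a later run can confirm it used the same data."""
        return hashlib.sha256("\n".join(items).encode("utf-8")).hexdigest()[:16]

    def run_record(model_name: str, prompt_template: str, items: list[str],
                   temperature: float = 0.0, seed: int = 1234) -> dict:
        """Everything needed to audit, and with luck repeat, an evaluation run."""
        return {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "model": model_name,
            "prompt_template": prompt_template,
            "dataset_fingerprint": dataset_fingerprint(items),
            "num_items": len(items),
            "temperature": temperature,  # 0.0 reduces, but may not eliminate, nondeterminism
            "seed": seed,  # only helps if the serving stack actually honors it
        }

    print(json.dumps(run_record("example-model", "Q: {q}\nA:", ["What is 2+2?"]), indent=2))

Even this much does not guarantee identical outputs from a hosted model, but it makes discrepancies visible instead of invisible.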

Bottom line: current evaluations often mix prompting, memorization, and metric brittleness—obscuring what models actually understand.


Closed-Source LLMs (and Why Open Source Matters)

Open source is crucial for science and resilience. If widely used closed models were discontinued, research and products would suffer. Policies like “AI developer licensing” can unintentionally tilt the field toward closed vendors, harming openness and competition. Supporting open models keeps the ecosystem healthy and verifiable.


What’s the Next Square?

As in Minesweeper, each evaluation “click” must be careful and principled. The path forward:

  • Define constructs precisely (what ability are we measuring?).
  • Report prompts, seeds, datasets, and settings.
  • Test for contamination and perform ablations (one illustrative ablation is sketched after this list).
  • Prefer tasks that reward reasoning and grounded answers, not just recall.
  • Invest in open-source tooling, datasets, and models.
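As one illustration of the ablation point above: hide the question entirely and show the model only the shuffled answer choices. Accuracy well above chance on a multiple-choice benchmark is a red flag for leakage or annotation artifacts. This is a minimal sketch assuming a simple item format and a caller-supplied model function; none of the names come from [1].

    import random
    from typing import Callable

    def choices_only_ablation(model: Callable[[str], str], items: list[dict]) -> float:
        """Accuracy when the model sees only shuffled answer choices, never the question.

        Each item is assumed to look like:
            {"question": "...", "choices": ["...", "..."], "answer": "..."}
        """
        correct = 0
        for item in items:
            choices = item["choices"][:]
            random.shuffle(choices)  # remove positional cues
            prompt = ("Pick the most likely correct option:\n"
                      + "\n".join(f"- {c}" for c in choices))
            if model(prompt).strip().lower() == item["answer"].strip().lower():
                correct += 1
        return correct / len(items) if items else 0.0

If this ablation scores far above 1 / len(choices), the benchmark (or the model’s training data) is giving the model more information than the question alone should.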

References

[1] Arvind Narayanan, “Evaluating LLMs is a minefield.” https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/
[2] Mbakwe, A. B., Lourentzou, I., Celi, L. A., Mechanic, O. J., & Dagan, A. (2023). ChatGPT passing USMLE shines a spotlight on the flaws of medical education. PLOS Digital Health, 2(2), e0000205.
[3] The Reproducible ML Initiative (Princeton). https://reproducible.cs.princeton.edu/