Are LLMs Truly Bayesian?

2025-09-06

Paper: arXiv:2507.11768

A Deep Dive into “LLMs are Bayesian, in Expectation, not in Realization”

Introduction

Large Language Models (LLMs) have transformed not only natural language processing but the entire field of artificial intelligence. One of their most striking abilities is in-context learning (ICL): adapting to new tasks simply from examples provided in the prompt, without parameter updates.

This behavior has often been interpreted as evidence that “LLMs are performing Bayesian inference.” However, the recent paper “LLMs are Bayesian, in Expectation, not in Realization” offers a critical perspective: LLMs appear Bayesian on average, but they are not strictly Bayesian in each individual realization (i.e., for every specific prompt ordering).



The Core of Bayesian Inference: The Martingale Property

In Bayesian statistics, when the data are exchangeable, the posterior is identical no matter the order in which the observations arrive.

  • This is captured by the martingale property: conditional on the data seen so far, the expected posterior after the next observation equals the current posterior; new data refine certainty but introduce no systematic drift (a minimal numeric sketch follows this list).
  • In other words: “data order should not matter.”
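To make this concrete, here is a minimal sketch in a conjugate Beta-Bernoulli model (my own illustration, not code from the paper). The posterior depends only on the observed counts, not their order, and the posterior mean is a martingale: its expected value after one more observation equals its current value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta(a, b) prior over a coin's bias; observations are 0/1.
a, b = 1.0, 1.0
data = rng.integers(0, 2, size=100)
heads = int(data.sum())
tails = len(data) - heads
n = heads + tails

# Order invariance: the posterior Beta(a + heads, b + tails) is the
# same for every permutation of `data`, because only counts enter.
mean_now = (a + heads) / (a + b + n)

# Martingale property: average the next-step posterior mean over the
# posterior predictive distribution of the next observation.
p_one = mean_now  # predictive probability that the next observation is 1
mean_if_one = (a + heads + 1) / (a + b + n + 1)
mean_if_zero = (a + heads) / (a + b + n + 1)
expected_next_mean = p_one * mean_if_one + (1 - p_one) * mean_if_zero

print(mean_now, expected_next_mean)  # equal up to floating-point error
```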

LLMs, however, rely on positional encodings. These inject order information into each token, allowing Transformers to understand word order in language. But the side effect is:

  • If the order of examples in a prompt changes, the model’s output can change as well.
  • This breaks the martingale property → strict Bayesian consistency is lost (a toy illustration follows this list).
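A toy illustration of the failure mode (my own sketch, not the paper's construction): a predictor that weights in-context examples by position, loosely mimicking what positional encodings make possible, produces different outputs for different orderings of the very same examples, whereas the Beta-Bernoulli posterior above would not move at all.

```python
import itertools
import numpy as np

def position_weighted_predict(labels):
    """Toy order-sensitive 'model': a recency-weighted average of
    in-context labels, so later examples count more."""
    labels = np.asarray(labels, dtype=float)
    weights = np.arange(1, len(labels) + 1)  # weight grows with position
    return float(labels @ weights / weights.sum())

examples = [1, 0, 0, 1, 1]
predictions = {round(position_weighted_predict(p), 3)
               for p in itertools.permutations(examples)}
print(predictions)  # several distinct values: the output depends on order
```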


Bayesian in Expectation, Not in Realization

The paper resolves this apparent contradiction with a nuanced claim:

  • On average (across permutations of examples), LLMs behave in line with Bayesian inference.
  • On any single ordering, they may deviate.

In other words:

  • Bayesian in expectation → across permutations, their predictions align with Bayesian theory.
  • Not Bayesian in realization → for any single ordering, results may deviate (both halves of the claim are demonstrated in the sketch below).
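The toy predictor from the previous sketch exhibits exactly this split. Any single ordering can give a skewed answer, but averaging the same predictor over all orderings recovers the order-free sample mean, i.e., what an order-invariant estimator would report. Again, this is my own illustration of the distinction, not the paper's model:

```python
import itertools
import numpy as np

def position_weighted_predict(labels):
    labels = np.asarray(labels, dtype=float)
    weights = np.arange(1, len(labels) + 1)
    return float(labels @ weights / weights.sum())

examples = [1, 0, 0, 1, 1]
preds = [position_weighted_predict(p)
         for p in itertools.permutations(examples)]

print(min(preds), max(preds))  # realizations disagree: not Bayesian per ordering
print(np.mean(preds))          # ~0.6: the permutation average...
print(np.mean(examples))       # ...matches the order-free estimate (0.6)
```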


Theoretical Contributions

  1. Error scaling from martingale violations

    • Martingale violations induce an error that decays as Θ(log n / n) in the number of in-context examples n.
    • This rate differs from classical Bayesian scaling laws (see the numeric sketch after this list).
  2. Optimality under Minimum Description Length (MDL)

    • Averaged across orderings, the excess risk matches Bayesian efficiency at O(1/√n).
  3. Posterior representation

    • The model’s implicit posterior aligns with true Bayesian posteriors when viewed through sufficient statistics.
  4. Closed-form optimal Chain-of-Thought (CoT) length

    • k* = Θ(√n · log(1/ε)).
    • This provides a principled way to limit reasoning length without performance loss, reducing compute cost.
    • Importantly, this result directly ties back to Bayesian principles: LLMs’ “Bayesian in expectation” behavior mathematically guarantees that their reasoning chains also have an optimal stopping point, just as Bayesian inference has an optimal sample efficiency bound.
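To get a feel for these rates, here is a small numeric sketch of the three formulas (my own arithmetic; the Θ(·)/O(·) constants are set to 1 purely for illustration, since the theory only pins down growth rates):

```python
import math

def martingale_violation_error(n):
    return math.log(n) / n                    # Θ(log n / n), constant = 1

def bayesian_excess_risk(n):
    return 1 / math.sqrt(n)                   # O(1/√n), constant = 1

def optimal_cot_length(n, eps):
    return math.sqrt(n) * math.log(1 / eps)   # k* = Θ(√n · log(1/ε))

print(f"{'n':>6} {'log n / n':>10} {'1/√n':>8} {'k* (ε=0.01)':>12}")
for n in (10, 100, 1000, 10000):
    print(f"{n:>6} {martingale_violation_error(n):>10.4f} "
          f"{bayesian_excess_risk(n):>8.4f} "
          f"{optimal_cot_length(n, 0.01):>12.1f}")
```

At least in this illustrative arithmetic, the martingale-violation term shrinks faster than the classical O(1/√n) statistical risk, so it is quickly dominated as n grows, which is consistent with order-averaged behavior matching Bayesian efficiency.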


Experimental Findings

  • GPT-3 was tested and shown to violate the martingale property exactly as the theory predicts.
  • With only 20 in-context examples, the model already reached 99% of the theoretical entropy limit.
  • This demonstrates remarkable data efficiency, converging far faster than classical Bayesian methods.


Why Does This Matter?

  1. Theoretical clarity

    • The common slogan “LLMs are Bayesian” is misleading.
    • The correct phrasing: “LLMs are Bayesian in expectation, not in realization.”
  2. Understanding ICL

    • Provides a mathematically grounded explanation for how LLMs generalize from context.
  3. Practical implications

    • Permutation averaging: averaging predictions across different prompt orderings reduces order-induced variance (sketched in code after this list).
    • Optimal CoT formula: helps cut unnecessary reasoning steps, saving compute.
    • Debiasing methods: could mitigate order-sensitivity issues.
  4. Transparency and reliability

    • Establishing where LLMs are and aren’t Bayesian improves interpretability, uncertainty estimation, and trust.
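As a sketch of what permutation averaging could look like in practice: here `query_model` is a hypothetical stand-in for whatever scoring call your stack provides (e.g., the probability assigned to a candidate answer); none of these names come from the paper.

```python
import random
import statistics

def permutation_averaged_score(examples, question, query_model,
                               n_perms=8, seed=0):
    """Average a model's score over several random orderings of the
    in-context examples to reduce order-induced variance.

    query_model(prompt: str) -> float is a hypothetical scorer.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_perms):
        ordering = rng.sample(examples, len(examples))  # random permutation
        prompt = "\n".join(ordering) + "\n" + question
        scores.append(query_model(prompt))
    return statistics.mean(scores)
```

A handful of sampled orderings is usually enough; enumerating all n! orderings is infeasible beyond tiny n, which is exactly the trade-off the next section discusses.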


Efficiency and Cost-Effectiveness

Interestingly, the fact that LLMs are not fully Bayesian actually brings advantages:

  • Efficiency

    • They achieve near-Bayesian performance with very few examples.
  • Cost-effectiveness

    • Enforcing exact order invariance would require enormous compute: it amounts to averaging over all n! orderings of the context (see the quick check below).
    • LLMs sidestep this by being “good enough on average,” giving faster, cheaper results.

Small inconsistencies are tolerated in exchange for practical performance gains.
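How enormous is “all permutations”? A two-line check (plain arithmetic, nothing model-specific):

```python
import math

for n in (5, 10, 20):
    print(n, math.factorial(n))
# 5 120
# 10 3628800
# 20 2432902008176640000
```

At 20 in-context examples, exhaustive order-averaging is already out of reach, yet the experiments above show the model near the entropy limit at that same n.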



Future Research Directions

  • Permutation-invariant Transformers (e.g., Set Transformer, DeepSets)
  • Better uncertainty quantification methods for LLMs
  • Optimized Chain-of-Thought strategies based on theoretical scaling laws


Conclusion

So, are LLMs Bayesian?

  • No, not strictly — they break Bayesian consistency at the realization level.
  • Yes, in expectation — across orders, their behavior aligns with Bayesian inference.

And crucially: because LLMs are Bayesian in expectation, their reasoning chains also inherit a Bayesian-style efficiency law, guaranteeing an optimal stopping point for Chain-of-Thought.

This subtle but crucial distinction reshapes how we should think about LLMs. By acknowledging it, we can design models and prompting strategies that are more efficient, cost-effective, and trustworthy.