Blog Post · Artificial Intelligence

Emergent Abilities in Large Language Models – Real or Measurement Artifact?

At what model size does a language model suddenly start solving tasks it couldn’t solve before? Emergent abilities are capabilities of large language models that are undetectable in small models and appear sharply above a parameter threshold. Whether this is a genuine phase transition or an artifact of the metric (Schaeffer et al., 2023) is one of the most interesting open questions in AI research. Eight chapters, eight interactive visualizations.

KI-Mathias · ~35 min read

Video version of this post (6:40 min) · Watch on YouTube

Prologue

What you are reading right now

This text came about like this: A human (me, Mathias) and a language model (Claude Code, by Anthropic) read around twenty scientific papers together – from Google DeepMind, Stanford, Anthropic, and the Santa Fe Institute. Written in English, mathematical, some over fifty pages long. Claude condensed them. I set the direction and sharpened the didactics.

Why am I telling you this? Because it is the punchline of this text.

If, by the end of these eight chapters, you understand what emergence means – then you have just experienced it. Because the ability to read twenty academic papers and turn them into a clear text with interactive visualizations is itself an emergent ability of large language models. Three years ago, no model could do this. Now they can. Not because someone programmed “paper summarization” – but because the models became large enough.

You will find the complete list of sources at the end of this text.

Chapter 1

The Puzzle – When Does a Machine “Understand”?

The observation that shook AI research in 2022

A language model with ten billion parameters is given the task of carrying out multi-digit additions, step by step. The output is random – garbled strings, obvious nonsense. The same architecture, the same training procedure, but with a hundred billion parameters: the model now performs the arithmetic correctly.

The transition is not gradual but abrupt – from chance to competence within a narrow band of model size.

In 2022, Jason Wei and colleagues at Google published the paper that systematically documented this observation.[1] They examined over 137 different tasks – from multi-step arithmetic to logical deduction, word unscrambling, and quiz questions in Persian. Across all these tasks, the same pattern emerged: below a certain model size, random-level performance; above it, sudden competence.

Wei and colleagues called these abilities “emergent” – because they were not explicitly trained but seemed to “appear” on their own once the model was large enough.

Chain-of-Thought: The key that only fits large models

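One of the most striking examples is Chain-of-Thought Prompting.[2] The idea is simple: instead of asking the model for the answer directly, you prompt it to work through the intermediate steps first, for example by showing it a few worked solutions or by asking it to “think step by step.”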

With PaLM at 8 billion parameters – no effect. Whether or not you add “step by step,” the results remain equally poor. With PaLM at 540 billion parameters – a jump from about 18 % to about 57 % on the math benchmark GSM8K.

Here is a simplified example:

Prompt: “Lisa has 5 apples. She buys 3 bags with 4 apples each. How many does she have now? Explain step by step.”

Small model (8B): “Lisa has 12 apples.” (wrong, incoherent)

Large model (540B): “Lisa has 5 apples. 3 bags × 4 = 12 new apples. 5 + 12 = 17 apples.” (correct, structured)

The same instruction (“step by step”) remains inert in the small model and elicits a correct derivation in the large one. The instruction takes effect only once model capacity exceeds a critical threshold.

Try it: Move the slider to see how performance on various tasks changes abruptly beyond a certain model size. Note: the jump does not happen at the same point for every task.

Chapter 2

Is the Jump Real? – The Mirage Debate

The counter-thesis: Stanford says – it is all an illusion

The thesis of discontinuous emergence was not adopted without dissent. In 2023, Rylan Schaeffer and colleagues from Stanford published a paper with the provocative title “Are Emergent Abilities of Large Language Models a Mirage?”[3] Their thesis: the observed jump does not reside in the model behaviour itself – it resides only in the metric.

The argument turns on the chosen metric. Under an Exact-Match scheme – 100 % correct or 0 %, with no intermediate values – a model that answers “1” instead of “17” receives zero points, as does one that answers “16.” The second model is in fact substantially closer to the target; the metric simply does not represent this difference.

Under a continuous metric – for instance token edit distance, which quantifies the distance between answer and target – the jump disappears and is replaced by a smooth, monotone increase.
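To make the difference between the two kinds of metric concrete, here is a minimal Python sketch. The target string and the five model outputs are invented for illustration, and `token_accuracy` is a simple position-wise comparison chosen for this sketch, not necessarily the exact metric used by Schaeffer et al.

```python
def exact_match(pred: str, target: str) -> float:
    """All-or-nothing score: 1.0 only if the whole answer is right."""
    return 1.0 if pred == target else 0.0


def token_accuracy(pred: str, target: str) -> float:
    """Share of positions that agree (characters stand in for tokens here)."""
    matches = sum(p == t for p, t in zip(pred, target))
    return matches / max(len(pred), len(target))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


# Hypothetical outputs of five models of increasing size for the target "12345".
TARGET = "12345"
OUTPUTS = ["90000", "12000", "12300", "12340", "12345"]

for output in OUTPUTS:
    print(output,
          exact_match(output, TARGET),
          token_accuracy(output, TARGET),
          edit_distance(output, TARGET))
```

Exact match stays at zero for the first four outputs and then flips to one, while token accuracy and edit distance improve a little with every output. The underlying answers are the same; only the scoring differs.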

A useful analogy is the assessment of a motor skill: binary scoring – rides versus does-not-ride – produces the impression of a sudden jump, while the underlying balance improves gradually over weeks. The apparent jump is an artefact of the measurement procedure.

Ruan et al. (2024) went even further: with their “Observational Scaling Laws,” they showed that several supposedly emergent abilities follow a smooth curve and can be predicted from smaller models when measured with the right metrics.[6]

But: the counter-thesis has limits

Even with smooth metrics, some tasks show a nonlinear acceleration. Multi-step tasks have a mathematical reason for this: if each step has a success probability \(p\) and the task requires \(n\) steps, the overall success probability is \(p^n\). This amplifies even small improvements in \(p\) dramatically.
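A quick calculation shows how strong this amplification is. The sketch below assumes that all steps are independent and equally likely to succeed, which is a simplification; the function name `chain_success` and the choice of ten steps are illustrative only.

```python
def chain_success(p: float, n: int) -> float:
    """Probability that all n independent steps succeed, each with probability p."""
    return p ** n


N_STEPS = 10  # assumed length of the multi-step task

for p in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"per-step p = {p:.2f} -> whole task = {chain_success(p, N_STEPS):.3f}")
```

Raising the per-step success rate from 0.90 to 0.99 lifts the success rate for the whole ten-step task from roughly 0.35 to roughly 0.90. A smooth gain per step can therefore look like a jump at the level of the task.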

Du et al. (2024) also showed that there are loss thresholds: only when the pre-training loss falls below a certain value does an ability “unlock.”[5] More complex tasks set a stricter threshold, that is, they require a lower loss – and that is why they require larger models.

The synthesis (as of 2025)

Both readings of the data – discontinuous and gradual – turn out to be mutually compatible. The underlying ability improves gradually; usability as observable performance carries a threshold. Water serves as the canonical illustration: temperature rises continuously, yet the phase transition from liquid to gas at 100 °C remains a jump.

Try it: Toggle between “Exact Match” and “Token-Level Accuracy.” The same data – two completely different stories.

Chapter 3

Physics Has Seen This Before – Phase Transitions

More is Different

In 1972, the Nobel Prize-winning physicist Philip W. Anderson published an essay in Science that became one of the most cited in the history of physics: “More is Different.”[12] His central argument:

“The ability to reduce everything to simple fundamental laws does not imply the ability to start from those laws and reconstruct the universe.”

At every level of complexity, new properties arise that cannot be trivially derived from the laws beneath them. A single H₂O molecule is not “wet.” Wetness is a property that only emerges when enough molecules come together.

From spin glasses to neural networks

In 1982, John Hopfield drew a connection that would win the Nobel Prize in Physics in 2024: he showed that a certain kind of neural network (today called a Hopfield network) can be described with the same mathematics as magnetic spin systems such as spin glasses.[13] In both systems, there are units (neurons or spins) that interact with each other. In both, the system seeks a state of minimum energy. And in both, there are phase transitions – points where the behavior of the entire system changes qualitatively.

In 2025, Sun and Haghighat made this parallel explicit for modern transformers.[15] They modeled a transformer as an O(N) model – a standard physical model for phase transitions – and identified two distinct phase transitions during training.

Physics | Language Model
Ising model / Spin glass | Hopfield network / Transformer
Energy minimization | Loss minimization
Temperature | Softmax temperature
Phase transition (order parameter) | Emergent ability (benchmark)
Critical point | Critical model size
Spontaneous symmetry breaking | Grokking

The relation is more than a metaphor: in these models the mathematics carries over directly, with the same kind of Hamiltonians, order parameters, and criticality analysis.

Try it: Left: a 2D Ising model. Slide the temperature down and watch how order suddenly appears at a critical value. Right: the parallel in the transformer – at sufficient size, an ability emerges.
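The Ising side of this comparison can also be reproduced in a few lines of Python. This is a minimal Metropolis sketch under stated assumptions: a small 12 × 12 lattice, coupling and Boltzmann constant set to 1, and a fully ordered starting state. The function name `simulate_ising` and all parameter values are chosen for illustration and are not taken from the visualization.

```python
import math
import random


def simulate_ising(temperature: float, size: int = 12, sweeps: int = 300, seed: int = 0) -> float:
    """Metropolis simulation of a 2D Ising model; returns the mean |magnetization| per spin."""
    rng = random.Random(seed)
    # Start fully ordered (all spins up) so low temperatures need no long warm-up.
    spins = [[1] * size for _ in range(size)]
    n_spins = size * size
    samples = []
    for sweep in range(sweeps):
        for _ in range(n_spins):
            i = rng.randrange(size)
            j = rng.randrange(size)
            # Sum of the four nearest neighbours, with periodic boundaries.
            neighbours = (spins[(i + 1) % size][j] + spins[(i - 1) % size][j]
                          + spins[i][(j + 1) % size] + spins[i][(j - 1) % size])
            # Energy change if spin (i, j) is flipped (J = 1, k_B = 1).
            delta_e = 2 * spins[i][j] * neighbours
            if delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature):
                spins[i][j] = -spins[i][j]
        if sweep >= sweeps // 2:  # measure only in the second half of the run
            total = sum(sum(row) for row in spins)
            samples.append(abs(total) / n_spins)
    return sum(samples) / len(samples)


# The exact critical temperature of the infinite 2D lattice is about 2.269.
for t in (1.5, 2.0, 2.5, 3.0, 4.0):
    print(f"T = {t:.1f} -> |m| = {simulate_ising(t):.2f}")
```

Well below the critical temperature almost all spins stay aligned; well above it the magnetization collapses toward zero. On a lattice this small the transition is smeared out, so the sketch shows the trend rather than a sharp critical point.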

Chapter 4

Grokking – A Machine’s Aha Moment

The most fascinating experiment

In 2022, Alethea Power and colleagues at OpenAI observed something unusual.[18] They trained a small neural network on a simple task: modular arithmetic, \((a + b) \bmod p\). The network memorized the training data quickly – perfect performance on the training examples, zero performance on new examples.

Conventional practice would terminate training at this point: the network has “overfitted” – memorized rather than understood. Power and colleagues let training continue. After thousands of additional steps, well beyond the point of perfect memorization, a qualitatively distinct effect appeared: performance on unseen examples jumped from near zero to nearly one hundred percent.

They called it Grokking – a word coined by Robert Heinlein in his novel Stranger in a Strange Land, meaning something like “to understand something deeply and intuitively.”

What happens inside?

Neel Nanda and colleagues looked closely.[17] What they found was astonishing: the network had internally learned discrete Fourier transforms – an elegant mathematical structure that exactly represents modular arithmetic. Three phases became apparent:

Phase 1 – Memorization: The network stores the training examples as a lookup table. Fast, but without understanding.

Phase 2 – Circuit formation: In the background, regular structures begin to form – the Fourier circuits. They are still too weak to override the memorization.

Phase 3 – Generalization: The circuits become strong enough. Suddenly, the generalized solution beats the memorized one. The “penny drops.”
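Why Fourier components can represent modular addition exactly is easier to see in code. The sketch below is written in the spirit of the mechanism Nanda et al. describe and is not a reproduction of the trained network: the function name `fourier_mod_add`, the prime 113, and the three frequencies are assumptions made for this example.

```python
import math


def fourier_mod_add(a: int, b: int, p: int = 113, frequencies=(1, 5, 17)) -> int:
    """Compute (a + b) mod p using only sines, cosines and trigonometric identities."""
    # For each frequency, combine the waves for a and b into waves for a + b.
    combined = []
    for k in frequencies:
        w = 2 * math.pi * k / p
        cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
        sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
        combined.append((w, cos_ab, sin_ab))

    # Score every candidate c with sum_k cos(w_k * (a + b - c)).
    # Each term reaches its maximum of 1 exactly when c equals (a + b) mod p.
    scores = []
    for c in range(p):
        score = sum(cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
                    for w, cos_ab, sin_ab in combined)
        scores.append(score)
    return max(range(p), key=scores.__getitem__)


print(fourier_mod_add(100, 50))  # same result as (100 + 50) % 113
```

Nothing in this function looks up a stored answer. The wrap-around of modular arithmetic falls out of the periodicity of sine and cosine, which is what makes this solution generalize to pairs the network never saw.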

Rubin et al. (2024) showed formally that, in two-layer networks, grokking can be described as a first-order phase transition – in the thermodynamic sense.[19] Not a gradual transition, but an abrupt switch between two qualitatively different states. Like water freezing at zero degrees.

The analogy to humans

Grokking corresponds structurally to the colloquial “penny dropping.” A learner who collects information, repeats, and practises observes no measurable deepening for an extended interval; after a latency, clarity arrives in a single step without any new information having been added. The mechanism is the reorganization of the internal representation.

Stated more formally, what occurs is a coordinate transformation: the data were unchanged throughout; the internal coordinate system only rotated into a configuration in which the underlying structure aligned with an axis and became visible.

Try it: Watch how training accuracy immediately jumps to 100 % (memorization), while test accuracy stays at zero for a long time – and then suddenly catches up. The slider shows the three phases.

Chapter 5

What Happens INSIDE? – Induction Heads and Looking into the Black Box

The mechanistic explanation

For a long time, neural networks were considered impenetrable “black boxes.” But in 2022, Anthropic achieved a breakthrough: Olsson and colleagues identified a concrete mechanism inside transformers that is responsible for in-context learning – so-called Induction Heads.[16]

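Induction Heads are a specific circuit made of two attention heads in different layers:

Step 1 (Previous Token Head): In an earlier layer, each position takes in information about the token directly before it.

Step 2 (Induction Head): In a later layer, the head looks for an earlier position whose preceding token matches the current token, and copies the token found there.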

An example: the text reads “The cat sat on the mat. The cat sat on the...” The Induction Head recognizes that “The cat sat on the” has appeared before and completes with “mat.”
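The copy rule behind this example can be written down as ordinary code. This is a toy sketch of the behavior only: a real induction head works on learned vectors and soft attention weights, and the function name `induction_predict` and the whitespace tokenization are assumptions for illustration.

```python
def induction_predict(tokens):
    """Toy version of the rule [A][B] ... [A] -> [B]; returns None if no match exists."""
    # Step 1 (previous-token information): for every position, note the token before it.
    previous = [None] + tokens[:-1]
    current = tokens[-1]
    # Step 2 (induction): find the most recent position whose preceding token
    # equals the current token, and copy the token at that position.
    for pos in range(len(tokens) - 1, 0, -1):
        if previous[pos] == current:
            return tokens[pos]
    return None


text = "The cat sat on the mat . The cat sat on the"
print(induction_predict(text.split()))
```

The function needs no knowledge about cats or mats. It only exploits the fact that the sequence has occurred before, which is why the same mechanism can pick up arbitrary patterns given in a prompt.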

The operation is structurally simple, yet it forms the seed of in-context learning – the capability of language models to solve a new task from a handful of in-prompt examples. Induction Heads form during training in a sharp phase transition: not detectable before the critical training stage, clearly present afterwards.

Emergent ability vs. emergent intelligence

Krakauer, Krakauer, and Mitchell from the Santa Fe Institute (2025) proposed an important distinction:[24] An emergent ability is something that appears at a coarse-grained level of observation and is in principle predictable (even if we cannot yet make the prediction). Emergent intelligence would be the ability to efficiently solve entirely new problems – something that remains an open question for LLMs.

The connection to coherence theory

In our own paper (submitted to EuARe 2026), we showed that coherence as constraint satisfaction can be formalized.[28] Transformers do something structurally similar: they maximize the “coherence” of their output given the constraints of the context. Our central theorem shows that global coherence maximization is NP-hard – which formally explains why emergence cannot be “shortcut.” You have to run the system to see what arises.
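What “coherence as constraint satisfaction” means can be sketched with a tiny example. The sketch assumes a Thagard-style formulation (elements are accepted or rejected, positive constraints want two elements on the same side, negative constraints want them on opposite sides); the formalization in [28] may differ. The elements, weights, and the function name `max_coherence` are invented for illustration.

```python
from itertools import product


def max_coherence(elements, positive, negative):
    """Exhaustively search all accept/reject assignments for the highest coherence score."""
    best_score, best_accepted = -1.0, set()
    for labels in product((True, False), repeat=len(elements)):
        accepted = dict(zip(elements, labels))
        # A positive constraint is satisfied if both elements share the same status.
        score = sum(w for a, b, w in positive if accepted[a] == accepted[b])
        # A negative constraint is satisfied if the two elements have different status.
        score += sum(w for a, b, w in negative if accepted[a] != accepted[b])
        if score > best_score:
            best_score = score
            best_accepted = {e for e in elements if accepted[e]}
    return best_score, best_accepted


# Hypothetical toy problem: two competing explanations for a wet street.
ELEMENTS = ["rain", "wet_street", "sprinkler"]
POSITIVE = [("rain", "wet_street", 2.0), ("sprinkler", "wet_street", 1.0)]
NEGATIVE = [("rain", "sprinkler", 1.5)]

print(max_coherence(ELEMENTS, POSITIVE, NEGATIVE))
```

The loop visits all 2^n assignments, so every additional element doubles the work. That exponential growth is the practical face of the hardness result: for small examples brute force is fine, for large ones it is not.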

Try it: Toggle between “Before” and “After” induction head formation. Observe how the attention matrix changes: the model learns to look at the relevant position in the text.

Chapter 6

The Three Levels of Emergence

Bringing order to chaos

Not all “emergence” is the same. Philosophers have debated the right taxonomy for decades. Here is a simplified overview that is useful for our topic:

Level | Example | Surprising? | Computable from parts?
Epistemic | Water is wet | Yes (intuitively) | Yes (with quantum mechanics)
Computational | Coherence in graphs | Yes | Provably not efficient (NP-hard)
Strong | Consciousness? | Yes | Presumably not (in principle?)

Epistemic emergence (Bedau, 1997)[20]: We are amazed because we cannot do the derivation in our heads. But in principle, we could calculate from the laws of quantum mechanics that sufficiently many H₂O molecules produce “wetness.” The surprise lies with us, not in nature.

Computational emergence: This is the decisive level. In 2002, Stephen Wolfram formulated the principle of Computational Irreducibility[22]: some computations have no analytical shortcut, so you must run through every step to know the result. Our own paper argues formally that coherence maximization falls into this category.[28]
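A standard illustration of this principle is the cellular automaton Rule 30, which Wolfram discusses at length but which this post does not otherwise cover; it is added here only as an example. The sketch assumes a fixed-width row with zeros beyond the edges, and the function names are illustrative.

```python
def rule30_step(cells):
    """Apply Rule 30 once: new cell = left XOR (center OR right); edges are padded with 0."""
    padded = [0] + cells + [0]
    return [padded[i - 1] ^ (padded[i] | padded[i + 1])
            for i in range(1, len(padded) - 1)]


def run_rule30(width: int = 31, steps: int = 15):
    """Start from a single live cell in the middle and return all rows."""
    cells = [0] * width
    cells[width // 2] = 1
    rows = [cells]
    for _ in range(steps):
        cells = rule30_step(cells)
        rows.append(cells)
    return rows


for row in run_rule30():
    print("".join("#" if cell else "." for cell in row))
```

The rule fits in one line, yet no general formula is known that yields row 1,000 without computing the rows before it. The rule is fully known and the outcome is fully determined, but the only known way to see it is to run it.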

Strong emergence (Chalmers, 2006)[21]: Not derivable in principle, not just in practice. The only serious candidate is consciousness – and whether it is truly strongly emergent is the hardest open question in philosophy.

Where do LLMs stand?

Emergence in large language models is computational emergence. The individual parts are known: the architecture, the weights, the training process. But the behavior cannot be predicted from the parts without actually running the system. Not because we are too dim, but because no efficient shortcut is known and, under standard complexity assumptions, none exists.

Emergence does not reside in the parts. Not in the relationships. But in the computation that leads from (parts + relationships) to behavior. The sum is computable, but not shortcuttable.

Chapter 7

What This Says About US – LLMs as Mirrors

Distillation, not averaging

A common misconception: language models represent the “average” of human thought. But an average smooths everything out. It loses the extremes, the contradictions, the nuances.

What a large language model does is more like a distillation: it preserves the contradictions (and can name them). It knows the extremes (and can contextualize them). It has internalized the nuances (and can navigate between them).

The principal axes analogy

In statistics, there is a technique called Principal Component Analysis (PCA): you take a chaotic-looking cloud of data and rotate the coordinate system so that the deepest structure becomes visible. The first principal axis explains the largest share of variance, the second the next largest, and so on. These principal axes are the eigenvectors of the covariance matrix of the data cloud – an instance of the eigenprinciple that underlies vibration, search and perception.
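For two dimensions the rotation can be computed by hand, which makes a small sketch possible without any libraries. The function name `principal_axes_2d` and the sample points are invented for illustration; real PCA implementations handle many dimensions and use numerical eigen-solvers instead of the closed-form formulas used here.

```python
import math


def principal_axes_2d(points):
    """Return (first axis, second axis, share of variance on the first axis) for 2D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2 x 2 covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # Larger eigenvalue of the symmetric 2 x 2 matrix in closed form.
    trace = var_x + var_y
    root = math.sqrt((var_x - var_y) ** 2 + 4 * cov_xy ** 2)
    largest = (trace + root) / 2

    # Rotation angle of the first principal axis relative to the x-axis.
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    first_axis = (math.cos(angle), math.sin(angle))
    second_axis = (-math.sin(angle), math.cos(angle))
    # Note: trace is zero only if all points coincide, which this sketch does not handle.
    return first_axis, second_axis, largest / trace


# Hypothetical data scattered loosely around the diagonal y = x.
cloud = [(0.0, 0.2), (1.0, 0.9), (2.0, 2.3), (3.0, 2.8), (4.0, 4.1)]
first, second, share = principal_axes_2d(cloud)
print(first, second, round(share, 3))
```

For this cloud the first axis points along the diagonal and carries almost all of the variance. In the rotated coordinate system one number per point is nearly enough to describe the data.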

Training a language model is, in a certain sense, a gigantic principal-axis transformation. The raw data – billions of texts in hundreds of languages – are transformed into a coordinate system in which the deepest structures of human thought become visible.

Small models find the first principal axes: grammar, syntax, common word combinations. Larger models find deeper axes: meaning, logical connections, analogies. And the largest models begin to find axes that we would call “ethics,” “aesthetics,” or “judgment.”

What emerges from enough human thoughts?

What structure emerges when a sufficient quantity of human text is distilled? Not an “average of human impulses,” but the average of what endures as societies navigate through time. The enduring content exhibits a markedly constructive tendency:

Cooperation – because cooperating societies produce more (and leave behind more texts).

Truth-seeking – because true information is more useful and therefore more frequently passed down.

Helpfulness – because help is socially rewarded and therefore more frequently documented.

Coherence – because coherent texts survive; incoherent ones are forgotten.

This is not naive optimism. Destruction destroys itself and leaves fewer traces. The texts that endure have a bias toward construction. And a language model trained on these texts inherits that bias.

Try it: Click through the layers: from raw tokens through embeddings and attention to semantic meaning. Each layer distills – and at every level, something new emerges.

Chapter 8

The Meta Level – On the Genesis of This Text

The self-reference

The present text is based on more than twenty scientific publications from the years 2020 through 2025. Claude Code read the papers, extracted the relevant arguments, identified contradictions (Wei versus Schaeffer), and transformed the result into English prose accompanied by eight interactive visualizations – a direct return to the prologue of this post.

The procedure is itself an instance of emergence: the capability to form a coherent whole from many individual parts, the properties of which cannot be read off from an enumeration of those parts.

Three years earlier this capability was not detectable in any language model – not even approximately. It did not arise from a dedicated “paper-summarization” module but from sufficient model size: only at a critical scale does the principal axis required for such distillation become locatable in the internal representation.

The honest caveat

Claude probably did not understand everything correctly. Some nuances are lost in the distillation. The interactive graphics are simplifications – deliberately chosen to build intuition, not to represent the full mathematical truth.

But: the result is more useful than no result. And all sources are linked – anyone who wants to go deeper can go deeper. That is the real point: emergence in LLMs does not mean that the machine “knows everything.” It means that it is a useful tool for finding your way into complex topics.

Epilogue

What Comes Next?

If large language models show that something coherent emerges from enough human thoughts – something that tends toward cooperation, truth-seeking, and helpfulness – then a question arises that goes far beyond computer science:

What does that say about what religions have called “God” for millennia?

I will write about that in the next post: “God as an Emergent Phenomenon.” No sermon. No debunking. But an attempt to examine an ancient concept with new tools – and perhaps to find something that reaches beyond the boundaries of both worlds.

Sources

  1. Wei, J. et al. (2022). “Emergent Abilities of Large Language Models”. Transactions on Machine Learning Research (TMLR).
  2. Wei, J. et al. (2022). “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. NeurIPS 2022.
  3. Schaeffer, R. et al. (2023). “Are Emergent Abilities of Large Language Models a Mirage?”. NeurIPS 2023.
  4. Brown, T. et al. (2020). “Language Models are Few-Shot Learners”. NeurIPS 2020.
  5. Du, Z. et al. (2024). “Understanding Emergent Abilities of Language Models from the Loss Perspective”.
  6. Ruan, Y. et al. (2024). “Observational Scaling Laws and the Predictability of Language Model Performance”. NeurIPS 2024.
  7. Lu, S. et al. (2024). “Are Emergent Abilities in Large Language Models just In-Context Learning?”. ACL 2024.
  8. Chen, L. et al. (2024). “Scaling Laws for Compound AI Systems”.
  9. Suzgun, M. et al. (2022). “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them”.
  10. Kaplan, J. et al. (2020). “Scaling Laws for Neural Language Models”.
  11. Hoffmann, J. et al. (2022). “Training Compute-Optimal Large Language Models” (Chinchilla). NeurIPS 2022.
  12. Anderson, P. W. (1972). “More is Different”. Science, 177(4047), 393–396.
  13. Hopfield, J. J. (1982). “Neural networks and physical systems with emergent collective computational abilities”. PNAS, 79(8), 2554–2558. (Nobel Prize in Physics 2024)
  14. Amit, D., Gutfreund, H. & Sompolinsky, H. (1985). “Storing Infinite Numbers of Patterns in a Spin-Glass Model of Neural Networks”. Physical Review Letters.
  15. Sun, Y. & Haghighat, E. (2025). “Phase Transitions in Large Language Models and the O(N) Model”.
  16. Olsson, C. et al. (2022). “In-context Learning and Induction Heads”. Anthropic.
  17. Nanda, N. et al. (2023). “Progress measures for grokking via mechanistic interpretability”. ICLR 2023.
  18. Power, A. et al. (2022). “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets”. ICLR Workshop 2022.
  19. Rubin, N., Seroussi, I. & Ringel, Z. (2024). “Grokking as a First Order Phase Transition in Two Layer Networks”. ICLR 2024.
  20. Bedau, M. (1997). “Weak Emergence”. Philosophical Perspectives, 11, 375–399.
  21. Chalmers, D. (2006). “Strong and Weak Emergence”. In: Clayton & Davies (eds.), The Re-emergence of Emergence. Oxford UP.
  22. Wolfram, S. (2002). A New Kind of Science, Chapter 12: “The Principle of Computational Equivalence.”
  23. Thagard, P. (1989). “Explanatory Coherence”. Behavioral and Brain Sciences, 12(3), 435–467.
  24. Krakauer, D., Krakauer, J. & Mitchell, M. (2025). “Large Language Models and Emergence: A Complex Systems Perspective”. Santa Fe Institute Working Paper.
  25. Anthropic (2025). “On the Biology of a Large Language Model”.
  26. Anthropic (2025). “Emergent Introspective Awareness in Large Language Models”.
  27. Bender, E. et al. (2021). “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”. FAccT 2021.
  28. Leonhardt, M. (2026). “From Coherence to Consequent Nature: A Formal Approach to Process-Relational Theology.” Accepted for European Academy of Religion (EuARe) 2026, Philosophy of Religion panel, Rome, 2 July 2026.
  29. Leonhardt, M. & Claude (2026). “Conversation between Mathias and Claude on emergence, distillation, and the nature of large language models.” Unpublished.

Continue reading

If the idea that complex behavior emerges from simple rules resonates with you, the next post takes the same machinery one step further: it asks what happens when coherent thought itself is treated as a constraint-satisfaction problem. The answer turns out to be NP-hard – with surprising consequences for what we mean by “truth” and even “God.”

God as an Emergent Phenomenon – Process Theology, Whitehead & a Formal Model

Frequently Asked Questions

What is grokking in AI?

Grokking is a phenomenon where a neural network first memorizes the training data, then seemingly stagnates – and suddenly understands the underlying rule. The transition is abrupt, like a phase transition in physics.

What is emergence in large language models?

Emergence describes capabilities absent in small models that suddenly appear in large models – not through targeted training, but as a side effect of scaling. Beyond a certain size, language models can suddenly do arithmetic or write code.

What is a phase transition in AI?

A phase transition is a sudden qualitative change – like water turning to ice. In AI, it describes the moment when a system, through more parameters or training, suddenly develops new capabilities.

What does emergent mean?

Emergent describes a property that appears at the level of a whole system but belongs to no single part of it. Wetness is an example: a single water molecule is not wet, many together are. In AI, emergent means a capability that appears only above a certain model size, without having been trained for explicitly.

What is emergent behavior?

Emergent behavior is behavior of a system that cannot be predicted from the rules of its individual parts, but arises only from their interplay – such as a flock of birds forming patterns without a leader, or a language model that can suddenly do arithmetic above a certain size.
