Prologue
What you are reading right now
This text came about like this: a human (me, Mathias) and a language model (Claude Code, by Anthropic) read around twenty scientific papers together – from Google DeepMind, Stanford, Anthropic, and the Santa Fe Institute. All written in English, heavily mathematical, some over fifty pages long. Claude condensed them; I set the direction and shaped the didactic approach.
Why am I telling you this? Because it is the punchline of this text.
If, by the end of these eight chapters, you understand what emergence means – then you have just experienced it. Because the ability to read twenty academic papers and turn them into a clear text with interactive visualizations is itself an emergent ability of large language models. Three years ago, no model could do this. Now they can. Not because someone programmed “paper summarization” – but because the models became large enough.
You will find the complete list of sources at the end of this text.
Chapter 1
The Puzzle – When Does a Machine “Understand”?
The observation that shook AI research in 2022
Imagine a language model with ten billion parameters. You give it a task: multi-digit addition, step by step. The result is random – garbled strings, obvious nonsense. Now take the same model, the same architecture, the same training process – but with a hundred billion parameters. And suddenly it can do arithmetic.
Not a little better. Not gradually. Abruptly. From chance to competence.
In 2022, Jason Wei and colleagues at Google published the paper that systematically documented this observation.[1] They examined 137 different tasks – from multi-step arithmetic to logical deduction, word unscrambling, and quiz questions in Persian. Across all these tasks, the same pattern emerged: below a certain model size, random-level performance; above it, sudden competence.
Wei and colleagues called these abilities “emergent” – because they were not explicitly trained but seemed to “appear” on their own once the model was large enough.
Chain-of-Thought: The key that only fits large models
One of the most striking examples is Chain-of-Thought Prompting.[2] The idea is simple: instead of asking the model for the answer directly, you ask it to “think step by step.”
With PaLM at 8 billion parameters – no effect. Whether or not you add “step by step,” the results remain equally poor. With PaLM at 540 billion parameters – a jump from 18 % to 58 % on the math benchmark GSM8K.
Here is a simplified example:
Prompt: “Lisa has 5 apples. She buys 3 bags with 4 apples each. How many does she have now? Explain step by step.”
Small model (8B): “Lisa has 12 apples.” (wrong, incoherent)
Large model (540B): “Lisa has 5 apples. 3 bags × 4 = 12 new apples. 5 + 12 = 17 apples.” (correct, structured)
The same instruction (“step by step”) triggers nothing in the small model but a correct derivation in the large one. The key fits – but only if the lock is large enough.
Try it: Move the slider to see how performance on various tasks changes abruptly beyond a certain model size. Note: the jump does not happen at the same point for every task.
Chapter 2
Is the Jump Real? – The Mirage Debate
The counter-thesis: Stanford says – it is all an illusion
Not everyone was convinced. In 2023, Rylan Schaeffer and colleagues from Stanford published a paper with the provocative title “Are Emergent Abilities of Large Language Models a Mirage?”[3] Their thesis: the jump does not exist in reality – it exists only in the metric.
The argument is elegant: if you use “Exact Match” as your metric – meaning either 100 % correct or 0 % – then of course you see a jump. Because a model that answers “1” instead of “17” gets zero points. Just like one that answers “16.” But the second model is much closer.
Switch to a continuous metric – say, token edit distance, which measures how far the answer is from the target – and the jump disappears. Instead, you see a smooth, steady climb.
An analogy: imagine you are measuring whether someone can ride a bicycle. Binary – yes or no – it looks like a sudden jump. But their balance has been improving gradually for weeks. The “jump” is an artifact of your measurement method.
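Schaeffer's argument can be reproduced in a few lines. The sketch below is synthetic: the logistic per-digit accuracy curve and the "model sizes" are invented for illustration, not taken from any real model.

```python
import math

# Invented smooth curve: per-digit accuracy as a logistic function of
# log(model size). Nothing here jumps.
def per_digit_accuracy(log_size):
    return 1 / (1 + math.exp(-(log_size - 10)))

# Exact Match on a 5-digit answer: all five digits must be right.
def exact_match(log_size, digits=5):
    return per_digit_accuracy(log_size) ** digits

for log_size in range(8, 13):
    print(f"log-size {log_size}: "
          f"per-digit {per_digit_accuracy(log_size):.2f}, "
          f"exact match {exact_match(log_size):.4f}")
```

The per-digit column climbs smoothly; the exact-match column sits near zero and then shoots up. The same underlying ability, two different stories.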
Ruan et al. (2024) went even further: with their “Observational Scaling Laws,” they showed that nearly all supposedly emergent abilities become predictable when measured with the right metrics.[6]
But: the counter-thesis has limits
Even with smooth metrics, some tasks show a nonlinear acceleration. Multi-step tasks have a mathematical reason for this: if each step has a success probability \(p\) and the task requires \(n\) steps, the overall success probability is \(p^n\). This amplifies even small improvements in \(p\) dramatically.
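The amplification is easy to verify with the formula from the text (the step count and probabilities are made up for illustration):

```python
# Success on an n-step task when each step succeeds with probability p.
def task_success(p, n):
    return p ** n

n = 20                        # a hypothetical 20-step task
print(task_success(0.90, n))  # ≈ 0.12
print(task_success(0.95, n))  # ≈ 0.36: a 5-point per-step gain
                              # roughly triples end-to-end success
```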
Du et al. (2024) also showed that there are loss thresholds: only when the training loss falls below a certain value does an ability “unlock.”[5] More complex tasks have higher thresholds – and that is why they require larger models.
The synthesis (as of 2025)
Both are right. The ability improves gradually, but usability has a threshold. Like water heating up steadily – but at 100 °C it boils. The temperature rises continuously. The transition from liquid to gas is still a jump.
Try it: Toggle between “Exact Match” and “Token-Level Accuracy.” The same data – two completely different stories.
Chapter 3
Physics Has Seen This Before – Phase Transitions
More is Different
In 1972, the Nobel Prize-winning physicist Philip W. Anderson published an essay in Science that became one of the most cited in the history of physics: “More is Different.”[12] His central argument:
“The ability to reduce everything to simple fundamental laws does not imply the ability to start from those laws and reconstruct the universe.”
At every level of complexity, new properties arise that cannot be trivially derived from the laws beneath them. A single H₂O molecule is not “wet.” Wetness is a property that only emerges when enough molecules come together.
From spin glasses to neural networks
In 1982, John Hopfield drew a connection that would win him the Nobel Prize in Physics in 2024: he showed that neural networks and magnetic systems (spin glasses) obey the same mathematics.[13] In both systems, there are units (neurons or spins) that interact with each other. In both, the system seeks a state of minimum energy. And in both, there are phase transitions – points where the behavior of the entire system changes qualitatively.
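A Hopfield network fits in a screenful of code. This minimal sketch (our illustration, not Hopfield's original setup) stores a single random pattern with the Hebbian rule and recovers it from a corrupted copy; each asynchronous update can only lower the network's energy.

```python
import random

random.seed(0)
N = 16
pattern = [random.choice([-1, 1]) for _ in range(N)]

# Hebbian weights: w[i][j] = x_i * x_j, no self-connections.
W = [[pattern[i] * pattern[j] if i != j else 0 for j in range(N)]
     for i in range(N)]

# Corrupt 3 of the 16 units.
state = pattern[:]
for i in (0, 5, 11):
    state[i] = -state[i]

# Asynchronous updates: set each unit to the sign of its local field.
for _ in range(3):
    for i in range(N):
        h = sum(W[i][j] * state[j] for j in range(N))
        state[i] = 1 if h >= 0 else -1

print(state == pattern)  # the stored pattern is recovered
```

The network falls into the energy minimum that the Hebbian rule carved out – the same picture as a magnet settling into an ordered state.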
In 2025, Sun and Haghighat made this parallel explicit for modern transformers.[15] They modeled a transformer as an O(N) model – a standard physical model for phase transitions – and identified two distinct phase transitions during training.
| Physics | Language Model |
|---|---|
| Ising model / Spin glass | Hopfield network / Transformer |
| Energy minimization | Loss minimization |
| Temperature | Softmax temperature |
| Phase transition (order parameter) | Emergent ability (benchmark) |
| Critical point | Critical model size |
| Spontaneous symmetry breaking | Grokking |
The parallel is not just a metaphor. The mathematics is the same.
Try it: Left: a 2D Ising model. Slide the temperature down and watch how order suddenly appears at a critical value. Right: the parallel in the transformer – at sufficient size, an ability emerges.
Chapter 4
Grokking – A Machine’s Aha Moment
The most fascinating experiment
In 2022, Alethea Power and colleagues at OpenAI observed something unusual.[18] They trained a small neural network on a simple task: modular arithmetic (\(a + b \mod p\)). The network memorized the training data quickly – perfect performance on the training examples, zero performance on new examples.
Normally, you would stop here. The network has “overfitted” – memorized rather than understood. But Power and colleagues let the training continue. And after thousands of additional steps, long after the network had already memorized the data perfectly, something remarkable happened: Performance on new examples jumped suddenly from zero to nearly one hundred percent.
They called it Grokking – a word coined by Robert Heinlein in his novel Stranger in a Strange Land, meaning something like “to understand something deeply and intuitively.”
What happens inside?
Neel Nanda and colleagues looked closely.[17] What they found was astonishing: the network had internally learned discrete Fourier transforms – an elegant mathematical structure that exactly represents modular arithmetic. Three phases became apparent:
Phase 1 – Memorization: The network stores the training examples as a lookup table. Fast, but without understanding.
Phase 2 – Circuit formation: In the background, regular structures begin to form – the Fourier circuits. They are still too weak to override the memorization.
Phase 3 – Generalization: The circuits become strong enough. Suddenly, the generalized solution beats the memorized one. The “penny drops.”
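The Fourier mechanism can be checked numerically. The grokked networks in Nanda et al. use only a handful of learned frequencies; this toy sum uses all of them, which makes the identity exact:

```python
import math

p = 7  # modulus (the real experiments used larger primes)

# Score for candidate answer c to "(a + b) mod p": a sum of cosines over
# all frequencies k. It equals p when c == (a + b) mod p and 0 otherwise.
def score(a, b, c):
    return sum(math.cos(2 * math.pi * k * (a + b - c) / p)
               for k in range(p))

a, b = 3, 6
scores = [score(a, b, c) for c in range(p)]
print(scores.index(max(scores)))  # → 2, i.e. (3 + 6) mod 7
```

Rotations by angles, added and compared – that is the "elegant mathematical structure" the network discovered on its own.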
Levi et al. (2024) showed formally that grokking is a first-order phase transition – in the thermodynamic sense.[19] Not a gradual transition, but an abrupt switch between two qualitatively different states. Like water freezing at zero degrees.
The analogy to humans
Grokking is the machine version of the “penny dropping.” You collect information, you repeat, you practice – and for a long time nothing seems to happen. And then, suddenly: clarity. Not because new information arrived, but because the internal system reorganized itself.
Or, more formally: understanding is a coordinate transformation. The data was already there. But the internal coordinate system had to rotate first for the structure to become visible.
Try it: Watch how training accuracy immediately jumps to 100 % (memorization), while test accuracy stays at zero for a long time – and then suddenly catches up. The slider shows the three phases.
Chapter 5
What Happens INSIDE? – Induction Heads and Looking into the Black Box
The mechanistic explanation
For a long time, neural networks were considered impenetrable “black boxes.” But in 2022, Anthropic achieved a breakthrough: Olsson and colleagues identified a concrete mechanism inside transformers that is responsible for in-context learning – so-called Induction Heads.[16]
Induction Heads are a specific two-step circuit spanning two attention layers:
Step 1 (Previous Token Head): Search the text so far for a token similar to the current one.
Step 2 (Induction Head): Copy the token that came after it.
An example: the text reads “The cat sat on the mat. The cat sat on the...” The Induction Head recognizes that “The cat sat on the” has appeared before and completes with “mat.”
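The two-step lookup can be written as a dozen lines of ordinary code – a hypothetical toy without the soft attention of a real transformer, but it performs the same copy operation:

```python
# A toy induction head: find the most recent earlier occurrence of the
# current token (previous-token head), then copy its successor (induction).
def induction_predict(tokens):
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: nothing to copy

tokens = "The cat sat on the mat . The cat sat on the".split()
print(induction_predict(tokens))  # → mat
```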
That sounds simple. But it is the seed of in-context learning – the ability of language models to solve new tasks from just a few examples in the prompt. And the crucial point: Induction Heads emerge during training in a sharp phase transition. Before that, they do not exist. Afterward, they are present throughout the model.
Emergent ability vs. emergent intelligence
Krakauer, Krakauer, and Mitchell from the Santa Fe Institute (2025) proposed an important distinction:[24] An emergent ability is something that appears at a coarse-grained level of observation and is in principle predictable (even if we cannot yet make the prediction). Emergent intelligence would be the ability to efficiently solve entirely new problems – something that remains an open question for LLMs.
The connection to coherence theory
In our own paper (submitted to EuARe 2026), we showed that coherence as constraint satisfaction can be formalized.[28] Transformers do something structurally similar: they maximize the “coherence” of their output given the constraints of the context. Our central theorem shows that global coherence maximization is NP-hard – which formally explains why emergence cannot be “shortcut.” You have to run the system to see what arises.
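To make "coherence as constraint satisfaction" concrete, here is a toy rendering (an illustration, not the paper's formal model): propositions carry truth values, weighted constraints say which pairs should agree or disagree, and coherence is the total weight satisfied. The only known general strategy is the exhaustive search below – exponential in the number of propositions, which is what NP-hardness captures.

```python
from itertools import product

# (i, j, weight): positive weight = propositions i and j should agree,
# negative weight = they should disagree. Values are arbitrary examples.
constraints = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, -1.5), (2, 3, 1.0)]

def coherence(assignment):
    total = 0.0
    for i, j, w in constraints:
        agree = assignment[i] == assignment[j]
        if agree == (w > 0):  # constraint satisfied
            total += abs(w)
    return total

# Exhaustive search over all 2**n truth assignments -- the step with no
# known efficient shortcut.
best = max(product([False, True], repeat=4), key=coherence)
print(best, coherence(best))
```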
Try it: Toggle between “Before” and “After” induction head formation. Observe how the attention matrix changes: the model learns to look at the relevant position in the text.
Chapter 6
The Three Levels of Emergence
Bringing order to chaos
Not all “emergence” is the same. Philosophers have debated the right taxonomy for decades. Here is a simplified overview that is useful for our topic:
| Level | Example | Surprising? | Computable from parts? |
|---|---|---|---|
| Epistemic | Water is wet | Yes (intuitively) | Yes (with quantum mechanics) |
| Computational | Coherence in graphs | Yes | Provably not efficient (NP-hard) |
| Strong | Consciousness? | Yes | Presumably not (in principle?) |
Epistemic emergence (Bedau, 1997)[20]: We are amazed because we cannot do the derivation in our heads. But in principle, we could calculate from the laws of quantum mechanics that sufficiently many H₂O molecules produce “wetness.” The surprise lies with us, not in nature.
Computational emergence: This is the decisive level. In 2002, Stephen Wolfram formulated the principle of Computational Irreducibility[22]: some computations have no shortcut. You must run through every step to know the result. There is no analytical shortcut. Our own paper provides the formal proof that coherence maximization falls into this category.[28]
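Wolfram's own showcase is the elementary cellular automaton Rule 30: the update rule is one line, yet no one knows how to predict a cell far in the future other than by running every step. A minimal sketch (with wrap-around edges, which don't matter for the few steps shown):

```python
# Rule 30: each cell becomes left XOR (center OR right).
def rule30_step(cells):
    n = len(cells)
    return [cells[i - 1] ^ (cells[i] | cells[(i + 1) % n])
            for i in range(n)]

cells = [0] * 31
cells[15] = 1  # single seed cell in the middle
for _ in range(12):
    print("".join("#" if c else "." for c in cells))
    cells = rule30_step(cells)
```

The printout grows into an irregular triangle that never settles into a pattern – a rule you can state in one breath, with behavior you can only discover by computing it.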
Strong emergence (Chalmers, 2006)[21]: Not derivable in principle, not just in practice. The only serious candidate is consciousness – and whether it is truly strongly emergent is the hardest open question in philosophy.
Where do LLMs stand?
Emergence in large language models is computational emergence. The individual parts are known: the architecture, the weights, the training process. But the behavior cannot be predicted from the parts without actually running the system. Not because we are too dim, but because there is provably no shortcut.
Emergence does not reside in the parts. Not in the relationships. But in the computation that leads from (parts + relationships) to behavior. The sum is computable, but not shortcuttable.
Chapter 7
What This Says About US – LLMs as Mirrors
Distillation, not averaging
A common misconception: language models represent the “average” of human thought. But an average smooths everything out. It loses the extremes, the contradictions, the nuances.
What a large language model does is more like a distillation: it preserves the contradictions (and can name them). It knows the extremes (and can contextualize them). It has internalized the nuances (and can navigate between them).
The principal axes analogy
In statistics, there is a technique called Principal Component Analysis (PCA): you take a chaotic-looking cloud of data and rotate the coordinate system so that the deepest structure becomes visible. The first principal axis explains the largest share of variance, the second the next largest, and so on.
Training a language model is, in a certain sense, a gigantic principal-axis transformation. The raw data – billions of texts in hundreds of languages – are transformed into a coordinate system in which the deepest structures of human thought become visible.
Small models find the first principal axes: grammar, syntax, common word combinations. Larger models find deeper axes: meaning, logical connections, analogies. And the largest models begin to find axes that we would call “ethics,” “aesthetics,” or “judgment.”
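The analogy can be made literal with a few lines of numpy (synthetic data: one hidden factor echoed in two noisy coordinates):

```python
import numpy as np

# A 2D cloud stretched along the diagonal: both coordinates echo one
# hidden factor, plus a little noise.
rng = np.random.default_rng(0)
hidden = rng.normal(size=500)
noise = 0.1 * rng.normal(size=(500, 2))
data = np.column_stack([hidden, hidden]) + noise

# Principal axes = eigenvectors of the covariance matrix.
centered = data - data.mean(axis=0)
cov = centered.T @ centered / len(data)
eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues

explained = eigvals[::-1] / eigvals.sum()
print(explained)  # the first axis explains almost all the variance
```

Rotating the coordinate system reveals that the "two-dimensional" cloud is essentially one-dimensional – the structure was there all along, just not aligned with the axes we started with.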
What emerges from enough human thoughts?
If you distill enough human texts – what comes out? Not the “average of human impulses,” but the average of what remains when societies navigate through time. And those are remarkably constructive things:
Cooperation – because cooperating societies produce more (and leave behind more texts).
Truth-seeking – because true information is more useful and therefore more frequently passed down.
Helpfulness – because help is socially rewarded and therefore more frequently documented.
Coherence – because coherent texts survive; incoherent ones are forgotten.
This is not naive optimism. Destruction destroys itself and leaves fewer traces. The texts that endure have a bias toward construction. And a language model trained on these texts inherits that bias.
Try it: Click through the layers: from raw tokens through embeddings and attention to semantic meaning. Each layer distills – and at every level, something new emerges.
Chapter 8
The Meta Level – What You Just Read
The self-reference
Back to the prologue. This text is based on over twenty scientific publications. Claude Code read these papers, extracted the relevant arguments, identified contradictions (Wei vs. Schaeffer), and distilled the result into prose with eight interactive visualizations.
That is not magic. That is emergence: the ability to form a coherent whole from many individual parts – one that is more than the sum of its parts.
Three years ago, no language model could do this. Not even close. Not because someone programmed “paper summarization” as a feature. But because the models became large enough to find the principal axes.
The honest caveat
Claude probably did not understand everything correctly. Some nuances are lost in the distillation. The interactive graphics are simplifications – deliberately chosen to build intuition, not to represent the full mathematical truth.
But: the result is more useful than no result. And all sources are linked – anyone who wants to go deeper can go deeper. That is the real point: emergence in LLMs does not mean that the machine “knows everything.” It means that it is a useful tool for finding your way into complex topics.
Epilogue
What Comes Next?
If large language models show that something coherent emerges from enough human thoughts – something that tends toward cooperation, truth-seeking, and helpfulness – then a question arises that goes far beyond computer science:
What does that say about what religions have called “God” for millennia?
I will write about that in the next post: “God as an Emergence Phenomenon.” No sermon. No debunking. But an attempt to examine an ancient concept with new tools – and perhaps to find something that reaches beyond the boundaries of both worlds.
Sources
- [1] Wei, J. et al. (2022). “Emergent Abilities of Large Language Models”. Transactions on Machine Learning Research (TMLR).
- [2] Wei, J. et al. (2022). “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. NeurIPS 2022.
- [3] Schaeffer, R. et al. (2023). “Are Emergent Abilities of Large Language Models a Mirage?”. NeurIPS 2023.
- [4] Brown, T. et al. (2020). “Language Models are Few-Shot Learners”. NeurIPS 2020.
- [5] Du, N. et al. (2024). “Understanding Emergent Abilities of Language Models from the Loss Perspective”.
- [6] Ruan, Y. et al. (2024). “Observational Scaling Laws and the Predictability of Language Model Performance”. ICML 2024.
- [7] Lu, S. et al. (2024). “Are Emergent Abilities in Large Language Models just In-Context Learning?”. ACL 2024.
- [8] Chen, L. et al. (2024). “Scaling Laws for Compound AI Systems”.
- [9] Suzgun, M. et al. (2022). “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them”.
- [10] Kaplan, J. et al. (2020). “Scaling Laws for Neural Language Models”.
- [11] Hoffmann, J. et al. (2022). “Training Compute-Optimal Large Language Models” (Chinchilla). NeurIPS 2022.
- [12] Anderson, P. W. (1972). “More is Different”. Science, 177(4047), 393–396.
- [13] Hopfield, J. J. (1982). “Neural networks and physical systems with emergent collective computational abilities”. PNAS, 79(8), 2554–2558. (Nobel Prize in Physics 2024)
- [14] Amit, D., Gutfreund, H. & Sompolinsky, H. (1985). “Storing Infinite Numbers of Patterns in a Spin-Glass Model of Neural Networks”. Physical Review Letters.
- [15] Sun, Y. & Haghighat, E. (2025). “Phase Transitions in Large Language Models and the O(N) Model”.
- [16] Olsson, C. et al. (2022). “In-context Learning and Induction Heads”. Anthropic.
- [17] Nanda, N. et al. (2023). “Progress measures for grokking via mechanistic interpretability”. ICLR 2023.
- [18] Power, A. et al. (2022). “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets”. ICLR Workshop 2022.
- [19] Levi, N. et al. (2024). “Grokking as a First Order Phase Transition in Two Layer Networks”. ICLR 2024.
- [20] Bedau, M. (1997). “Weak Emergence”. Philosophical Perspectives, 11, 375–399.
- [21] Chalmers, D. (2006). “Strong and Weak Emergence”. In: Clayton & Davies (eds.), The Re-emergence of Emergence. Oxford UP.
- [22] Wolfram, S. (2002). A New Kind of Science, Chapter 12: “The Principle of Computational Equivalence”.
- [23] Thagard, P. (1989). “Explanatory Coherence”. Behavioral and Brain Sciences, 12(3), 435–467.
- [24] Krakauer, D., Krakauer, J. & Mitchell, M. (2025). “Large Language Models and the Emergence of Emergent Abilities”. Santa Fe Institute Working Paper.
- [25] Anthropic (2025). “On the Biology of a Large Language Model”.
- [26] Anthropic (2025). “Emergent Introspective Awareness in Large Language Models”.
- [27] Bender, E. et al. (2021). “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”. FAccT 2021.
- [28] Leonhardt, M. & Claude (2026). “Coherence Structures and Emergent Attractors in Constraint-Satisfaction Networks”. Submitted to European Academy of Religion (EuARe) 2026.
- [29] Leonhardt, M. & Claude (2026). “Conversation between Mathias and Claude on emergence, distillation, and the nature of large language models”. Unpublished.