Blog post · Mathematics · AI

Hopfield Networks – From Spin Glass to Attention

Why every language model is, at its core, a Hopfield network from 1982 – and what that reveals about memory, pattern recognition and attention. A chain in seven chapters, with interactive MNIST demos.

KI-Mathias · ~45 min read

Chapter 1

Three demos in thirty seconds

In October 2024 John Hopfield and Geoffrey Hinton received the Nobel Prize in Physics. The committee's citation: “for foundational discoveries and inventions that enable machine learning with artificial neural networks”. An unusual award. Neither is a physicist in the conventional sense – Hopfield started out as a solid-state theorist, Hinton is a cognitive scientist. The prize honoured something both had begun some forty years earlier: a mathematical model originally intended only to remember images.

Today a modernised variant of this very model is the mechanism that drives virtually every large language model: chatbots, many image generators, long-context Transformers. The underlying operation runs in the world's data centres around the clock – a quantitative estimate follows in Chapter 7.

Three examples that show why:

Demo 1: A noisy image becomes legible again

A vivid example: a photo overlaid with noise – a QR code with scratches, a digit scribbled too quickly. A Hopfield network from 1982 reconstructs a legible image from it, provided the original pattern was stored in the network.

The next chapter builds this concretely: ten handwritten digits are stored, one of them is corrupted with noise, the network runs – and after roughly two hundred pixel flips the original has reappeared. The mathematics behind this reconstruction is a spin model of the kind Ernst Ising studied in 1925 for magnetic crystals, and which spin-glass theory later extended to disordered couplings. Hopfield showed in 1982 that the same mathematics can also act as a kind of image memory, as soon as the spins are read as pixels and the magnetic couplings as connection patterns between pixels. An analytic line thus led, almost unnoticed, from the iron magnet to pattern recognition.

Demo 2: A language model finds the right word

When a language model answers, it looks back at its context for each individual word it generates: “which of the previous 8000 tokens is currently relevant?” This is called attention. Vaswani et al. made it the sole building block of the Transformer in their 2017 paper “Attention Is All You Need”, and it has been the central operation of every GPT, Claude, Llama or Gemini architecture ever since.

In 2020 Hubert Ramsauer and his team in Linz showed that the attention operation is mathematically identical to a modernised version of Hopfield's model. Not analogous – but the same equation, written with different letters. Hence the title of their paper: “Hopfield Networks Is All You Need”. A small provocation aimed at Vaswani, but above all a serious observation.

The consequence: what remembered pixels in the spin glass in 1982 remembers words in the language model today. The architecture is not merely comparable but the same – only the contents of the stored patterns differ.

Demo 3: Memristor chips solve NP-hard problems in a single step

A third application that shows the broader picture: in 2020 engineers at HP Labs demonstrated a chip made of memristors – analogue components whose resistance can be tuned electrically. In a cross-bar arrangement, the physics of this circuit is mathematically equivalent to a Hopfield network. The optimisation problem is programmed directly into the memristor values, current flows, and the solution is read off at the voltages. This happens in a single analogue step, without iterative computation.

For NP-hard problems such as VLSI layout routing or the travelling-salesman problem, the resulting energy efficiency reported in the measurements is about four orders of magnitude better than for digital procedures. In September 2024 a group at Peking University showed that a column of memristors is mathematically equivalent to a Hopfield attractor network. Hopfield's model thus no longer describes only an algorithm running on a computer, but a physical component.

The thread

Three very different scenes: remembering images, generating words, solving NP-hard problems. One mathematics. This post builds the bridge step by step: we start with the 1982 original and its simplest form (Chapter 2), look at where it breaks (Chapter 3), repair it with two leaps (Chapters 4 and 5), and finally see a surprising property that carries the network beyond what we showed it (Chapter 6). Chapter 7 shows where the whole thing is actually used today – sometimes under the name “Hopfield”, often under another.

If you have read the Eigenvalues post: much here is a continuation. Hopfield's network is also an eigenvalue problem – but one with provable convergence and a non-linear sign step. The pseudoinverse from Chapter 4 is ridge regression with \(\lambda = 0\). The kernel trick from Chapter 6 is mathematically the same as there. Anyone who has read the Eigenvalues post can read this one as the vertical companion.

Chapter 2

The original – energy that rolls

A landscape with valleys

A useful image for what follows is a hilly landscape with valleys of varying depth. If a marble is dropped into it, it does not roll at random but systematically downhill, and it does not roll forever: it comes to rest in a valley.

A Hopfield network can be described in exactly the same structural way. The geographical landscape is replaced by an energy landscape. The marble is replaced by a state of the network – an assignment of values to the neurons. The “rolling” is replaced by an update rule that gradually changes this state so that the energy decreases. And “coming to rest in a valley” is replaced by fixed points of the update rule, at which the network stays.

The decisive choice lies in the construction of the energy landscape: if the valleys are placed exactly where the stored patterns lie, then the update rule becomes a recognition procedure. A noisy state is led by “rolling” into the nearest valley – that is, to the stored pattern closest to the noisy input. Recognition as gravity.

The ingredients, one by one

More concretely. A Hopfield network has:

  • \(N\) neurons, each taking a value in \(\{-1, +1\}\). In the MNIST demos that follow, \(N = 784\), i.e. \(28 \times 28\) pixels, where \(-1\) = white, \(+1\) = black.
  • A state \(\mathbf{v} \in \{-1,+1\}^N\), the current assignment of all neurons. This is the “marble” in the landscape.
  • A weight matrix \(W \in \mathbb{R}^{N\times N}\) that describes the connections between neurons. The energy landscape is built from it; the construction rule follows shortly.
  • An energy function:
$$E(\mathbf{v}) \;=\; -\tfrac{1}{2}\,\mathbf{v}^\top W\,\mathbf{v}$$

This so-called quadratic form turns the vector and the matrix into a scalar. \(E(\mathbf{v})\) describes how well the state \(\mathbf{v}\) fits the connection structure \(W\): low when it agrees with the couplings laid down in the network, high when it contradicts them.

The Hebb rule: putting the valleys in the right places

This raises the construction question: how is \(W\) chosen so that the valleys of the energy landscape lie at the stored patterns \(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_p\)? Donald Hebb had already given a biologically motivated answer in his book The Organization of Behavior (1949): “Neurons that fire together, wire together.” When two neurons take the same value in a stored pattern, their connection should be strengthened; otherwise weakened. Mathematically in one line:

$$W \;=\; \frac{1}{N}\sum_{\mu=1}^{p}\, \boldsymbol{\xi}_\mu\, \boldsymbol{\xi}_\mu^\top, \qquad W_{ii} = 0$$

This describes a summation: each stored pattern contributes its own outer product to the matrix, and all these outer products are superimposed. Two neurons that carry the same sign across many patterns accumulate a strong positive coupling; two with opposite signs across many patterns accumulate a strong negative one. The diagonal is set to zero so that neurons do not couple with themselves – a convention without which the dynamics could get stuck in a trivial self-fixed point.
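The construction rule and the energy function can be written down in a few lines. The following is a minimal sketch in NumPy, assuming toy random patterns rather than the MNIST digits of the demo; the function names `hebb_weights` and `energy` are illustrative choices, not part of any library.

```python
import numpy as np

def hebb_weights(patterns: np.ndarray) -> np.ndarray:
    """Hebb rule: W = (1/N) * sum_mu xi_mu xi_mu^T with zero diagonal.

    `patterns` has shape (p, N) with entries in {-1, +1}.
    """
    p, n = patterns.shape
    w = patterns.T @ patterns / n    # superimposed outer products, scaled by 1/N
    np.fill_diagonal(w, 0.0)         # no self-coupling: W_ii = 0
    return w

def energy(w: np.ndarray, v: np.ndarray) -> float:
    """Quadratic energy E(v) = -1/2 * v^T W v."""
    return float(-0.5 * v @ w @ v)

# Toy data (an assumption for illustration, not the MNIST digits).
rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(3, 64))
W = hebb_weights(patterns)
print(energy(W, patterns[0]))
```

For a single stored pattern the energy of that pattern works out to \(-\tfrac{1}{2}(N-1)\), which is a quick sanity check for any implementation.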

The update rule: let it roll

With \(W\) as the landscape, the state is iterated. Per step, exactly one neuron is checked and updated if necessary:

$$v_i \;\leftarrow\; \mathrm{sign}\!\left(\sum_{j=1}^{N} W_{ij}\, v_j\right)$$

The sum \(\sum_j W_{ij}\,v_j\) describes the weighted vote of all other neurons for neuron \(i\): every other neuron \(j\) pulls it along its coupling \(W_{ij}\) in one direction. If the sum is positive, \(v_i\) is set to \(+1\); if it is negative, to \(-1\). So each neuron makes a local decision that takes into account the current configuration of the rest.

The updates are asynchronous: in each micro-step a single randomly chosen neuron is updated, then the next, then the next. A pass in which each of the \(N\) neurons has been visited once is called a sweep. The network has converged once a full sweep produces no further flip.
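The update loop itself is short. The sketch below assumes a single stored toy pattern and treats a vote of exactly zero as "keep the current value"; that tie rule is an assumption, since the text does not specify one. The name `recall` is illustrative.

```python
import numpy as np

def recall(w, v0, rng, max_sweeps=50):
    """Asynchronous recall: update one neuron at a time until a sweep changes nothing."""
    v = v0.copy()
    n = len(v)
    for _ in range(max_sweeps):
        flipped = False
        for i in rng.permutation(n):       # one sweep: every neuron once, random order
            h = w[i] @ v                   # weighted vote of the other neurons
            if h != 0 and np.sign(h) != v[i]:
                v[i] = -v[i]               # conflict with the vote: flip
                flipped = True
        if not flipped:                    # a full sweep without a flip: converged
            break
    return v

rng = np.random.default_rng(1)
xi = rng.choice([-1, 1], size=64)          # one stored toy pattern
W = np.outer(xi, xi) / 64.0
np.fill_diagonal(W, 0.0)

noisy = xi.copy()
noisy[rng.choice(64, size=10, replace=False)] *= -1   # flip 10 of 64 pixels
restored = recall(W, noisy, rng)
print(np.array_equal(restored, xi))
```

With only one stored pattern every vote points towards that pattern, so the noisy input is repaired within the first sweep. With several correlated patterns this is no longer guaranteed.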

The Lyapunov guarantee: convergence as theorem

The central mathematical statement of the Hopfield model is this: every accepted flip lowers the energy.

This can be shown in a short calculation. Suppose neuron \(i\) is flipped from \(v_i\) to \(v_i' = -v_i\) during a sweep. Then the change in energy is:

$$\Delta E \;=\; E(\mathbf{v}') - E(\mathbf{v}) \;=\; (v_i - v_i') \cdot \sum_{j \neq i} W_{ij}\, v_j \;=\; 2\,v_i \cdot \sum_{j \neq i} W_{ij}\, v_j$$

But the update rule says: we flip \(v_i\) precisely when the sign of \(v_i\) no longer agrees with the sign of the sum \(\sum_j W_{ij} v_j\). That is: after the flip they agree; before, they did not. Precisely in this case \(2 v_i \sum_j W_{ij} v_j\) is negative – and hence \(\Delta E < 0\). The energy falls.

The calculation relies on \(W\) being symmetric with zero diagonal, which the Hebb construction guarantees. With an energy function that is bounded on a finite state space (\(2^N\) possible states) and strictly decreases with every flip, it follows that the network must converge. It cannot enter an infinite loop, because that would require a state, and hence an energy value, to recur – in contradiction with the strict decrease.
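The energy argument can also be checked numerically. The sketch below uses a random symmetric coupling matrix with zero diagonal (an assumption for illustration, not a Hebb matrix) and compares the measured energy change of every flip with the predicted value \(2\,v_i \sum_j W_{ij} v_j\).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
g = rng.normal(size=(n, n))
W = (g + g.T) / 2             # symmetric couplings (random, for illustration only)
np.fill_diagonal(W, 0.0)      # zero diagonal, as in the Hebb construction

def energy(v):
    return float(-0.5 * v @ W @ v)

v = rng.choice([-1.0, 1.0], size=n)
measured, predicted = [], []
for _ in range(100):                       # sweeps; stops early at a fixed point
    flipped = False
    for i in rng.permutation(n):
        h = W[i] @ v                       # vote of the other neurons
        if v[i] * h < 0:                   # sign of v_i disagrees with the vote
            before = energy(v)
            predicted.append(2 * v[i] * h) # Delta E from the formula above
            v[i] = -v[i]
            measured.append(energy(v) - before)
            flipped = True
    if not flipped:
        break

print(len(measured), "flips; largest energy change:", max(measured))
```

Every recorded change is negative and matches the prediction up to rounding, which is the Lyapunov property in miniature.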

A function with this property is called a Lyapunov function, after the Russian mathematician Alexander Lyapunov (1857–1918), who developed the concept for general dynamical systems. What Hopfield essentially showed is that the network has a Lyapunov function, namely its own energy. Convergence follows from that.

Interim summary. We have a network with \(N\) binary neurons, a weight matrix \(W\), an energy function, and an update rule. The update rule lowers the energy monotonically, so the network always comes to rest in a fixed point. If the valleys of the energy lie where our stored patterns are, this is a pattern-recognition machine. That is Hopfield 1982, almost in its entirety.

Watching it: recognition as gravity

The theoretical apparatus is complete – and can be observed concretely. In the following demo, ten handwritten MNIST digits (0 through 9, one of each) are stored as patterns in the network. The noisy input is led back step by step, with the energy staircase live alongside.

Try it. Pick a digit, push the noise slider to the right (e.g. to 20 %), click “Start”. Watch the pixels rearrange themselves – and how the energy slides one little staircase step lower with each flip. At the end: standstill, as soon as the nearest valley has been reached.

What becomes visible here is an ordered sequence of local decisions: noisy state → one neuron detects its conflict with the rest of the vote → flip → energy falls by a small amount → next neuron. The iteration ends as soon as no neuron detects a conflict any more. The energy staircase on the right is not a didactic prop pasted on afterwards but the Lyapunov function in real time.

This is the idealised picture. The reality of Hopfield networks on real data is harsher, and the next chapter brings that out. The mathematics behind it – energy, asynchronous update, Lyapunov convergence – nevertheless remains exactly the one just derived.

Chapter 3

Where it breaks – the Hebb trap

The previous chapter ended with a remark: the picture shown was idealised. The following makes concrete what this idealisation runs into. To that end, the Hopfield model is applied to exactly the setting in which it should seemingly do its job – recovering ten handwritten digits – and the result is recorded matter-of-factly.

A test that breaks

In the demo of Chapter 2, one digit was noised at a time and led back. Everything ran cleanly. The natural next step: store all ten digits in parallel, recall each one from its noisy version, and check whether each finds its own attractor.

Try it. Pick a noise level and click “Recall all”. In the Hebb mode all ten queries end up in the same final state – an image that is no longer any of the stored digits. Switching to “Pseudoinverse” on the right makes the problem disappear; why, is the subject of the next chapter.

The observation is briefly stated: Hebb produces on this data set not ten distinct attractors but a single one. Every noisy input is drawn into the same state. Visually this state is neither a 0 nor a 1 nor a 2 – it is a shape that does not appear in any of the stored patterns. In the literature, such a state is called a spurious state: a valley of the energy landscape that arose from the construction without corresponding to any storage intent.

A diagnosis through the spectrum

The cause can be read off the weight matrix \(W\). To this end, the eigenvalues and eigenvectors of \(W_{\mathrm{Hebb}}\) are computed; for the first ten MNIST digits and \(N=784\) the following picture emerges:

$$\lambda_1 \approx 6.65, \quad \lambda_2 \approx 0.65, \quad \lambda_3 \approx 0.48, \quad \ldots$$

Between the first and the second eigenvalue there is a factor of about ten. This describes an energy landscape that possesses one dominant direction while all others are comparatively flat. Exactly along this dominant direction the state is asymptotically pulled, independently of where it started.

Which direction this is can be checked concretely. Let

$$\bar{\boldsymbol{\xi}} \;=\; \frac{1}{p}\sum_{\mu=1}^{p} \boldsymbol{\xi}_\mu$$

be the mean over all stored patterns. \(\bar{\boldsymbol{\xi}}\) describes an “average digit” that contains all per-pixel tendencies of the ten originals: background pixels mostly white (\(-1\)), ink only at locations where many digits overlap (such as the centre of the image). The cosine between \(\bar{\boldsymbol{\xi}}\) and the first eigenvector of \(W_{\mathrm{Hebb}}\) is about \(0.9999\). The two vectors are therefore effectively identical: the dominant eigenvector is the mean vector.

Interim summary. The tenfold spectral gap is not an implementation bug. It arises systematically because Hebb adds each pattern individually and shared pixel tendencies pile up in the sum – ten roughly parallel bias contributions become a single dominant bias contribution. This is what makes the “average digit” the deepest valley of the energy landscape. Every query rolls there.
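The diagnosis can be reproduced without MNIST. The sketch below assumes synthetic patterns in which each pixel is ink (\(+1\)) with probability 15 %, which imitates the shared background bias; the sizes and the ink fraction are illustrative assumptions, so the numbers will differ from the MNIST values quoted above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, ink = 400, 10, 0.15          # assumed toy sizes; 15 % ink pixels per pattern
patterns = np.where(rng.random((p, n)) < ink, 1.0, -1.0)

W = patterns.T @ patterns / n      # Hebb matrix
np.fill_diagonal(W, 0.0)

eigvals, eigvecs = np.linalg.eigh(W)       # ascending order for symmetric W
lam1, lam2 = eigvals[-1], eigvals[-2]
top = eigvecs[:, -1]                       # unit-norm dominant eigenvector

mean_pattern = patterns.mean(axis=0)
cosine = abs(top @ mean_pattern) / np.linalg.norm(mean_pattern)

print(f"lambda_1 / lambda_2 = {lam1 / lam2:.1f}, |cos| = {cosine:.4f}")
```

The absolute value of the cosine is taken because the sign of an eigenvector is arbitrary.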

The structural conditions that are violated

The problem can be made more precise by naming the preconditions under which the Hebb rule provably works. These are two:

  1. Orthogonality. The stored patterns should be approximately pairwise perpendicular, that is, \(\boldsymbol{\xi}_\mu^\top \boldsymbol{\xi}_\nu \approx 0\) for \(\mu \neq \nu\). For MNIST digits the pairwise inner products lie between roughly 400 and 600 (with \(N = 784\)). Since the inner product of two \(\pm 1\) patterns counts agreements minus disagreements, this means they agree on roughly 76 to 88 % of their pixels – anything but orthogonal.
  2. Zero mean. The mean of each individual pattern should be near zero, i.e. each pattern should contain roughly as many \(+1\) as \(-1\) pixels. The ten MNIST digits have mean pixel values between \(-0.63\) and \(-0.90\): in each image there is substantially more background (white, i.e. \(-1\)) than ink (black, i.e. \(+1\)). This deviates considerably from the required zero mean.

Both preconditions are therefore violated, simultaneously and substantially. The failure of the Hebb rule on MNIST is hence not a surprise but expected – it is part of the mathematically well-understood behaviour of the construction outside its range of validity.
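Both numbers follow from simple counting on \(\pm 1\) images. The two helper functions below are hypothetical names introduced for illustration; the ink-pixel counts 39 and 146 are the range quoted later in this chapter.

```python
def mean_from_ink(n_plus: int, n: int = 784) -> float:
    """Mean pixel value of a +/-1 image with n_plus pixels equal to +1."""
    return (2 * n_plus - n) / n

def agreement_fraction(inner_product: float, n: int = 784) -> float:
    """Fraction of pixels on which two +/-1 patterns agree, given their inner product."""
    return (inner_product + n) / (2 * n)

print(round(mean_from_ink(39), 2), round(mean_from_ink(146), 2))             # -0.9 -0.63
print(round(agreement_fraction(400), 3), round(agreement_fraction(600), 3))  # 0.755 0.883
```

The first line reproduces the quoted mean range of \(-0.90\) to \(-0.63\); the second shows that an inner product of 400 to 600 corresponds to agreement on about 76 to 88 % of the pixels.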

An attempted repair and its new problem

An obvious repair suggests itself: centre the patterns before summing them. Concretely, the shared mean vector \(\bar{\boldsymbol{\xi}}\) is subtracted from each pattern, and the Hebb matrix is then formed from the centred patterns \(\tilde{\boldsymbol{\xi}}_\mu = \boldsymbol{\xi}_\mu - \bar{\boldsymbol{\xi}}\):

$$W_{\mathrm{centred}} \;=\; \frac{1}{N}\sum_{\mu=1}^{p} \tilde{\boldsymbol{\xi}}_\mu\, \tilde{\boldsymbol{\xi}}_\mu^\top, \qquad W_{ii} = 0$$

This operation describes the removal of the shared bias. In the spectrum it shows up clearly: the ratio \(\lambda_1 / \lambda_2 \approx 10\) becomes roughly \(1.4\). The dominant direction has disappeared, the spectrum is markedly flatter.
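The effect of centring on the spectrum can be sketched on synthetic data. As before, the biased toy patterns (15 % ink) are an assumption standing in for the MNIST digits, so the concrete ratios will not match the values above; `spectral_gap` and `hebb` are illustrative helper names.

```python
import numpy as np

def spectral_gap(w):
    """Ratio of the two largest eigenvalues of a symmetric matrix."""
    eigvals = np.linalg.eigvalsh(w)      # ascending
    return eigvals[-1] / eigvals[-2]

def hebb(patterns):
    """Hebb matrix with zero diagonal; rows of `patterns` are the stored vectors."""
    w = patterns.T @ patterns / patterns.shape[1]
    np.fill_diagonal(w, 0.0)
    return w

rng = np.random.default_rng(4)
patterns = np.where(rng.random((10, 400)) < 0.15, 1.0, -1.0)   # assumed toy data
centred = patterns - patterns.mean(axis=0)                      # subtract the shared mean

gap_raw = spectral_gap(hebb(patterns))
gap_centred = spectral_gap(hebb(centred))
print(f"gap raw: {gap_raw:.1f}, gap centred: {gap_centred:.1f}")
```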

However, in place of one defect another now appears. If the same recall experiment is run with the centred matrix, the network now lands at the negations of the stored patterns: the stored 7 becomes its inverted image – a black background with white strokes where the original 7 had black ink. This shows in the pixel counts: the reconstructed images have between 507 and 618 black (\(+1\)) pixels, while the originals had only 39 to 146.

Behind this behaviour stands a fundamental property of the Hopfield model: the energy function \(E(\mathbf{v}) = -\tfrac{1}{2}\mathbf{v}^\top W \mathbf{v}\) is symmetric under \(\mathbf{v} \to -\mathbf{v}\), i.e. \(E(\mathbf{v}) = E(-\mathbf{v})\). It follows that along with every pattern, its negation is automatically also an attractor. For strongly unbalanced originals with much background and little ink, the negation is assigned the same energy as the original. As long as a bias term had favoured the original, this symmetry existed only formally; after centring it takes full effect. The network therefore snaps just as often onto the negative side.
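The sign symmetry is easy to verify directly. The sketch below assumes a random symmetric coupling matrix, lets a random state relax to a fixed point, and then checks that the negated state has the same energy and is a fixed point as well.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
g = rng.normal(size=(n, n))
W = (g + g.T) / 2                 # random symmetric couplings (illustration only)
np.fill_diagonal(W, 0.0)

def energy(v):
    return float(-0.5 * v @ W @ v)

def is_fixed_point(v):
    """True if no neuron disagrees with the vote of the others."""
    return bool(np.all(v * (W @ v) >= 0))

v = rng.choice([-1.0, 1.0], size=n)
changed = True
while changed:                    # relax to a fixed point; terminates because E decreases
    changed = False
    for i in range(n):
        if v[i] * (W[i] @ v) < 0:
            v[i] = -v[i]
            changed = True

print(np.isclose(energy(v), energy(-v)), is_fixed_point(v), is_fixed_point(-v))
```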

That is: centring removes the bias sink, but not the actual structural problem. The latter is the strong correlation between the stored patterns, which subtracting the mean does not eliminate. One defect was traded for another.

What remains

The task for the next chapter follows from this finding. A correct construction of \(W\) must compensate for two defects simultaneously: the dominant bias contribution and the correlation between the patterns. The next chapter shows that a single modification of the construction does both – and that this modification is mathematically exactly a well-known operation from linear regression. Anyone who has read the Eigenvalues post will recognise it.

Chapter 4

First leap – the pseudoinverse

From the previous chapter a precise task remains: the weight matrix \(W\) must be constructed so that neither a dominant bias sink can form nor the correlation between the stored patterns can produce cross-talk. The following shows that a single modification of the construction does both. This is a classical result of Personnaz, Guyon and Dreyfus from 1985 – long before Hopfield's network attained its present status as a precursor of the Transformer.

The single place of difference

Let \(X \in \{-1,+1\}^{N \times p}\) be the matrix whose \(p\) columns are the stored patterns \(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_p\). The two construction formulas can then be placed directly side by side:

$$W_{\mathrm{Hebb}} \;=\; \frac{1}{N}\,X\,X^\top$$
$$W_{\mathrm{PI}} \;\;\,= \;X\,(X^\top X)^{-1}\,X^\top$$

The only difference is the factor \((X^\top X)^{-1}\), which is inserted between the two \(X\) factors. In Hebb, the identity matrix (times \(1/N\)) effectively sits there. In the pseudoinverse, the inverse of the Gram matrix of the stored patterns does.

\(X^\top X\) describes the pairwise geometry of the stored patterns: each entry \((X^\top X)_{\mu\nu} = \boldsymbol{\xi}_\mu^\top \boldsymbol{\xi}_\nu\) is the inner product between two patterns and hence a measure of their similarity. For the MNIST sample from Chapter 3 the off-diagonal values lie between 400 and 600 (out of 784) – anything but zero. The inverse \((X^\top X)^{-1}\) corrects exactly this overlap. The correlations are thereby factored out of the weight matrix before it becomes the energy landscape.

The orthogonal special case

One special case makes the relation between the two rules precise. If the patterns are pairwise orthogonal, \(\boldsymbol{\xi}_\mu^\top \boldsymbol{\xi}_\nu = N\,\delta_{\mu\nu}\), then \(X^\top X = N \cdot I_p\) and hence \((X^\top X)^{-1} = \frac{1}{N}\, I_p\). Both constructions yield the same matrix in that case.

That is: the Hebb rule is the special case of the pseudoinverse rule that arises when the stored patterns are already uncorrelated. The additional factor only acts when it has something to do. Outside this ideal case, it is the only means by which the correlation is removed from the result of the construction.
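The special case can be checked with exactly orthogonal \(\pm 1\) patterns. The sketch below takes them from a Hadamard matrix built with the Sylvester construction; this choice of patterns is an assumption made purely so that orthogonality holds exactly.

```python
import numpy as np

def hadamard(k: int) -> np.ndarray:
    """Sylvester construction: a 2^k x 2^k matrix of +/-1 with orthogonal columns."""
    h = np.array([[1.0]])
    for _ in range(k):
        h = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), h)
    return h

n, p = 64, 5
X = hadamard(6)[:, 1:p + 1]               # N x p, pairwise orthogonal +/-1 columns

W_hebb = X @ X.T / n
W_pi = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(X.T @ X, n * np.eye(p)), np.allclose(W_hebb, W_pi))
```

Both checks print `True`: the Gram matrix is \(N \cdot I_p\), and the two constructions coincide.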

The algebraic guarantee

An identity that explains the entire recall behaviour follows directly from the construction. Let \(\boldsymbol{\xi}_\nu\) be one of the stored patterns. Then \(\boldsymbol{\xi}_\nu\) is the \(\nu\)-th column of \(X\), i.e. \(\boldsymbol{\xi}_\nu = X\,\mathbf{e}_\nu\) with the \(\nu\)-th unit vector \(\mathbf{e}_\nu \in \mathbb{R}^p\); hence \(X^\top \boldsymbol{\xi}_\nu\) is the vector of inner products \((X^\top \boldsymbol{\xi}_\nu)_\mu = (X^\top X)_{\mu\nu}\), that is, \(X^\top \boldsymbol{\xi}_\nu = (X^\top X)\,\mathbf{e}_\nu\). It follows:

$$W_{\mathrm{PI}}\,\boldsymbol{\xi}_\nu \;=\; X\,(X^\top X)^{-1}\,X^\top \boldsymbol{\xi}_\nu \;=\; X\,(X^\top X)^{-1}\,(X^\top X)\,\mathbf{e}_\nu \;=\; X\,\mathbf{e}_\nu \;=\; \boldsymbol{\xi}_\nu$$

This means: every stored pattern is mapped to itself by \(W_{\mathrm{PI}}\). \(\boldsymbol{\xi}_\nu\) is an eigenvector with eigenvalue one. Since \(\mathrm{sign}(\boldsymbol{\xi}_\nu) = \boldsymbol{\xi}_\nu\) for a \(\pm 1\) vector, \(\boldsymbol{\xi}_\nu\) is a strict fixed point of the Hopfield iteration. This guarantee holds independently of the correlation among the patterns – it requires only that the patterns are linearly independent, so that \(X^\top X\) is invertible.

For the Hebb rule a comparable statement is only possible under orthogonality of the patterns. For correlated patterns, it leads to cross-talk terms that mix a stored pattern with components of the others.

Interim summary. With a single change – inserting the factor \((X^\top X)^{-1}\) into the construction – a construction with validity assumptions becomes a construction with an algebraic guarantee. The bias-sink question dissolves along with it: without cross-talk there is no amplification of shared pixel tendencies.
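The guarantee can be tested on deliberately correlated patterns. The sketch below assumes synthetic biased patterns (15 % ink) and uses `np.linalg.pinv`, which yields the same matrix as \(X(X^\top X)^{-1}X^\top\) whenever the columns of \(X\) are linearly independent.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 10
X = np.where(rng.random((n, p)) < 0.15, 1.0, -1.0)   # correlated toy patterns as columns

W_hebb = X @ X.T / n
W_pi = X @ np.linalg.pinv(X)      # equals X (X^T X)^{-1} X^T for independent columns

# Per pattern: is it reproduced exactly by one application of sign(W v)?
hebb_fixed = np.all(np.sign(W_hebb @ X) == X, axis=0)
pi_fixed = np.all(np.sign(W_pi @ X) == X, axis=0)

print("Hebb fixed points:", int(hebb_fixed.sum()), "of", p)
print("PI fixed points:  ", int(pi_fixed.sum()), "of", p)
print("W_PI X == X:", np.allclose(W_pi @ X, X))
```

The pseudoinverse column always reports all patterns as fixed points. How many survive under Hebb depends on the data; on strongly biased patterns like these it is typically few or none.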

Watching it: the Hebb–PI transition

The following demo uses a slider \(\alpha \in [0, 1]\) that linearly interpolates between the two weight matrices:

$$W_\alpha \;=\; (1-\alpha)\,W_{\mathrm{Hebb}} + \alpha\,W_{\mathrm{PI}}$$

At \(\alpha = 0\) we have pure Hebb, at \(\alpha = 1\) pure pseudoinverse. Three quantities are read off live: the largest eigenvalue of \(W_\alpha\), the spectral gap \(\lambda_1/\lambda_2\), and the recall final state of a noisy digit.

Try it. Move the slider slowly from left to right. Watch how the top eigenvalue drops from about 6.7 to 1.0 – and how the spectral gap collapses from roughly 10× to 1×. Exactly at the point where the spectral dominance disappears, the recall begins to agree with the stored pattern again.
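Without the interactive slider, the same sweep over \(\alpha\) can be sketched offline. The patterns below are synthetic biased toy data (an assumption), so the starting eigenvalue differs from the MNIST value, but the end point \(\lambda_1 = 1\) with gap \(1\) holds for any linearly independent patterns, because \(W_{\mathrm{PI}}\) is a projection.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 10
X = np.where(rng.random((n, p)) < 0.15, 1.0, -1.0)   # assumed toy patterns (columns)

W_hebb = X @ X.T / n
W_pi = X @ np.linalg.pinv(X)

def top_two(w):
    """Largest and second-largest eigenvalue of a symmetric matrix."""
    eigvals = np.linalg.eigvalsh(w)       # ascending
    return eigvals[-1], eigvals[-2]

for alpha in np.linspace(0.0, 1.0, 5):
    lam1, lam2 = top_two((1 - alpha) * W_hebb + alpha * W_pi)
    print(f"alpha={alpha:.2f}  lambda_1={lam1:.2f}  gap={lam1 / lam2:.2f}")
```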

Connection to the Eigenvalues post

Anyone who has read the Eigenvalues post will recognise the formula \(X(X^\top X)^{-1}X^\top\). There it was introduced as the ridge regression operator, in the general form \(X(X^\top X + \lambda I)^{-1}X^\top\) with regularisation parameter \(\lambda \geq 0\). The pseudoinverse special case used here is the limit \(\lambda \to 0\): no regularisation, exact projection onto the subspace spanned by the patterns.

This connection is not merely formal. It shows that the Hopfield model with the pseudoinverse is mathematically the same as a linear regression procedure without regularisation – applied to the problem of projecting every vector onto the span of the stored patterns. In the Eigenvalues post it was shown that this operation becomes numerically unstable when the column vectors are strongly correlated. Precisely this is the case for MNIST, which is why the pseudoinverse runs into a limit in the next section.
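The limit \(\lambda \to 0\) can be observed numerically. The sketch below assumes random \(\pm 1\) toy patterns and measures the distance between the ridge operator and the pseudoinverse projector for shrinking \(\lambda\); `ridge_operator` is an illustrative helper name.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 8
X = rng.choice([-1.0, 1.0], size=(n, p))      # toy patterns as columns

def ridge_operator(x, lam):
    """X (X^T X + lam I)^{-1} X^T, the ridge regression operator."""
    gram = x.T @ x
    return x @ np.linalg.solve(gram + lam * np.eye(x.shape[1]), x.T)

W_pi = X @ np.linalg.pinv(X)
distances = []
for lam in [10.0, 1.0, 1e-3, 1e-6]:
    distances.append(np.linalg.norm(ridge_operator(X, lam) - W_pi))
    print(f"lambda={lam:g}  ||W_ridge - W_PI|| = {distances[-1]:.2e}")
```

The distance shrinks monotonically with \(\lambda\), which is the statement of the paragraph above in numbers.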

The dynamical capacity limit

With the algebraic identity \(W_{\mathrm{PI}}\boldsymbol{\xi}_\nu = \boldsymbol{\xi}_\nu\) one might assume that arbitrarily many patterns can be stored, as long as \(p \leq N\) and the patterns are linearly independent. Algebraically this is correct: the identity holds for every such set of patterns. Dynamically, however, things look different. In practical recall the network is started from a noisy input, and the question is not only whether \(\boldsymbol{\xi}_\nu\) is a fixed point but how large its basin of attraction is.

A measurable version of this question can be answered empirically on the MNIST example: \(p\) patterns are drawn at random from the training set, each is recalled with 10 % pixel noise, and one observes how often the final state again matches the original.

| \(p\) | Hebb | Pseudoinverse |
|---|---|---|
| 10 | 0 % | 100 % |
| 100 | 0 % | 100 % |
| 150 | 0 % | 97 % |
| 200 | 0 % | 32 % |
| 250 | 0 % | 1 % |
| 300 | 0 % | 0 % |

Between \(p = 150\) and \(p = 250\) a sharp phase transition occurs. Below it, the behaviour lies near the algebraic ideal; above it, the behaviour collapses. Notable is the location of the transition: it occurs well below the theoretical bound \(p = N = 784\), beyond which the patterns can no longer be linearly independent.

The explanation lies in the basins of attraction around the stored patterns. With few patterns they lie far apart, and 10 % pixel noise is not enough to push the recall into the basin of another pattern. As \(p\) grows, the basins shrink, and from a critical pattern count onwards the noise is enough to steer the final state to a neighbouring attractor. The identity \(W_{\mathrm{PI}}\boldsymbol{\xi}_\nu = \boldsymbol{\xi}_\nu\) holds unchanged – only the point \(\boldsymbol{\xi}_\nu\) is no longer reachable from its noisy neighbourhood.
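The experiment can be sketched in a scaled-down form. The code below assumes random unbiased toy patterns with \(N = 100\) instead of MNIST, so the location of the transition will differ from the table; `success_rate` and `recall` are illustrative names, and only the pseudoinverse rule is measured.

```python
import numpy as np

def recall(w, v, rng, max_sweeps=20):
    """Asynchronous sign updates until a sweep produces no flip."""
    v = v.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(v)):
            if v[i] * (w[i] @ v) < 0:
                v[i] = -v[i]
                changed = True
        if not changed:
            break
    return v

def success_rate(p, n=100, noise=0.10, seed=0):
    """Fraction of p stored patterns recovered exactly from a noisy copy."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n, p))       # toy patterns, not MNIST
    w = X @ np.linalg.pinv(X)                      # pseudoinverse rule
    hits = 0
    for mu in range(p):
        noisy = X[:, mu].copy()
        flip = rng.choice(n, size=int(round(noise * n)), replace=False)
        noisy[flip] *= -1
        hits += int(np.array_equal(recall(w, noisy, rng), X[:, mu]))
    return hits / p

for p in (5, 20, 40, 60):
    print(p, success_rate(p))
```

For small \(p\) the recall rate stays close to one; as \(p\) approaches a sizeable fraction of \(N\) it typically drops, mirroring the shrinking basins described above.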

This limit cannot be removed by a cleverer construction within the \(W\) scheme. It is a structural property of the scheme itself: as long as recall is iterated via \(\mathrm{sign}(W\mathbf{v})\), capacity is limited by the geometry of the basins. The next chapter shows that the scheme can be abandoned – and that in doing so an operation arises that has been deployed worldwide since 2017 under another name.

Chapter 5

Second leap – Modern Hopfield = Attention

The previous chapter ended at a barrier that could not be broken within the scheme used so far. The pseudoinverse makes every stored pattern an exact fixed point, but its dynamical capacity remains modest – about 150 patterns for MNIST, far below the algebraic bound of 784. The problem is not the construction of a better \(W\) but the scheme itself: as long as recall is iterated via \(\mathrm{sign}(W\mathbf{v})\), capacity is tied to the properties of a fixed, timeless operator.

The following shows that this scheme can be abandoned. The weight matrix is replaced by an input-dependent activation that aggregates afresh over all stored patterns for every query. Hubert Ramsauer and colleagues formalised this step in 2020 under the title Hopfield Networks Is All You Need – an allusion to Vaswani et al.'s 2017 paper Attention Is All You Need. The title is not accidental: both works describe the same operation.

What Modern Hopfield no longer has

In the classical Hopfield model three components are tied together: a weight matrix \(W\), a quadratic energy function \(E(\mathbf{v}) = -\tfrac{1}{2}\mathbf{v}^\top W \mathbf{v}\), and an update rule \(\mathbf{v} \leftarrow \mathrm{sign}(W\mathbf{v})\) that applies a hard threshold to a fixed linear map. Modern Hopfield breaks with all three at once:

| Component | Classical Hopfield (Chap. 2–4) | Modern Hopfield (Ramsauer 2020) |
|---|---|---|
| Operator | \(W \in \mathbb{R}^{N\times N}\), timeless | no matrix – direct lookup via \(X\) |
| Update | \(\mathrm{sign}(W\mathbf{v})\), hard threshold on a fixed linear map | \(X\cdot\mathrm{softmax}(\beta X^\top \mathbf{v})\), smooth softmax weighting |
| Energy | \(-\tfrac{1}{2}\mathbf{v}^\top W\mathbf{v}\), quadratic | log-sum-exp \(+ \tfrac{1}{2}\lVert\mathbf{v}\rVert^2\) |
| Convergence | iterative, many sweeps | in one step (for sufficiently large \(\beta\)) |
| Capacity | algebraically \(\leq N\), dynamically much less | \(\Omega(\exp(N))\) – exponential |

Three aspects are noteworthy. First: there is no longer a timeless weight matrix into which the stored patterns are burnt in. Instead, for every query \(\mathbf{v}\) it is evaluated afresh which of the stored patterns are relevant at all. Second: the hard sign threshold gives way to a softmax, a smooth non-linearity whose weights depend on the query. Third: capacity scales not linearly but exponentially with \(N\). This puts the theoretical work of Krotov and Hopfield (2016) on dense associative memory into concrete form.

The new update rule, in three steps

The central formula reads:

$$\mathbf{v}_{\mathrm{new}} \;=\; X \cdot \mathrm{softmax}\!\bigl(\beta\,X^\top \mathbf{v}\bigr)$$

Here \(X \in \mathbb{R}^{N \times p}\) is, as before, the matrix of stored patterns (its columns are the \(\boldsymbol{\xi}_\mu\)), and \(\beta > 0\) is an inverse temperature: large \(\beta\) corresponds to a cold, sharply selective system. The operation decomposes into three consecutive steps:

Step 1 – inner products. \(X^\top \mathbf{v}\) describes the similarity of the input \(\mathbf{v}\) to each individual stored pattern. The result is a vector with \(p\) entries, where each entry \(\boldsymbol{\xi}_\mu^\top \mathbf{v}\) measures how strongly \(\mathbf{v}\) and \(\boldsymbol{\xi}_\mu\) point in the same direction. High values correspond to similar patterns, low values to dissimilar ones.

Step 2 – softmax. The softmax operation turns this vector into a probability distribution over the \(p\) patterns:

$$\mathrm{softmax}(\beta\,X^\top\mathbf{v})_\mu \;=\; \frac{e^{\beta\,\boldsymbol{\xi}_\mu^\top \mathbf{v}}}{\sum_\nu e^{\beta\,\boldsymbol{\xi}_\nu^\top \mathbf{v}}}$$

This distribution describes which patterns are currently relevant. For small \(\beta\) it is approximately uniform – all patterns are weighted similarly. For large \(\beta\) it concentrates on the single pattern with the highest inner product – a hard selection, effectively a nearest neighbour. The parameter \(\beta\) therefore interpolates continuously between soft averaging and sharp choice.

Step 3 – weighted average. The final result is a linear combination of the stored patterns, weighted by their softmax probabilities:

$$\mathbf{v}_{\mathrm{new}} \;=\; \sum_{\mu=1}^p \boldsymbol{\xi}_\mu \cdot \frac{e^{\beta\,\boldsymbol{\xi}_\mu^\top \mathbf{v}}}{\sum_\nu e^{\beta\,\boldsymbol{\xi}_\nu^\top \mathbf{v}}}$$

In the limit \(\beta \to \infty\) this becomes exactly \(\boldsymbol{\xi}_{\mu^*}\), where \(\mu^*\) is the index of the pattern with the largest inner product with \(\mathbf{v}\) (assuming this maximum is unique). That is: for large \(\beta\), Modern Hopfield becomes a pure 1-nearest-neighbour operation over the stored patterns. For intermediate \(\beta\) it is a softer variant that considers several similar patterns simultaneously.
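The three steps map one-to-one onto three lines of code. The sketch below assumes random \(\pm 1\) toy patterns; the function names `softmax` and `modern_hopfield_update` are illustrative, and the two values of \(\beta\) are chosen only to show the two extremes.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def modern_hopfield_update(X, v, beta):
    """One step v_new = X softmax(beta X^T v); the columns of X are the stored patterns."""
    scores = X.T @ v                  # step 1: inner products with every pattern
    weights = softmax(beta * scores)  # step 2: distribution over the p patterns
    return X @ weights                # step 3: weighted average of the patterns

rng = np.random.default_rng(9)
X = rng.choice([-1.0, 1.0], size=(64, 10))    # toy patterns as columns
query = X[:, 3].copy()
query[:8] *= -1                               # corrupt 8 of 64 entries

sharp = modern_hopfield_update(X, query, beta=5.0)   # large beta: hard selection
soft = modern_hopfield_update(X, query, beta=0.0)    # beta = 0: uniform average
print(np.allclose(sharp, X[:, 3]), np.allclose(soft, X.mean(axis=1)))
```

With the large \(\beta\) a single step returns the stored pattern; with \(\beta = 0\) the result is the plain mean of all patterns.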

Watching it: what the temperature parameter controls

In the following demo \(\beta\) can be varied interactively. The softmax distribution over the ten stored digits and the resulting recall state become visible. Two extremes mark the scale: for small \(\beta\) the answer merges into an averaged image; for large \(\beta\) it becomes the sharp choice of a single digit.

Try it. Move \(\beta\) upwards from 0.1. Watch how the softmax bars concentrate into a dominant peak – and how the recall state changes from a mushy average digit into a clear image. From \(\beta \geq 5\) the choice is typically unambiguous.

The new energy function

Modern Hopfield possesses – like the classical model – a Lyapunov function, an energy that does not rise under the update step. It is, however, no longer a quadratic form but a log-sum-exp construction:

$$E(\mathbf{v}) \;=\; -\frac{1}{\beta}\,\log\!\sum_{\mu=1}^p e^{\beta\,\boldsymbol{\xi}_\mu^\top \mathbf{v}} \;+\; \tfrac{1}{2}\|\mathbf{v}\|^2 \;+\; C$$

The first term describes a soft maximum function: for large \(\beta\) it approaches the negative maximum \(-\max_\mu \boldsymbol{\xi}_\mu^\top \mathbf{v}\); for small \(\beta\) it approaches, up to the additive offset \(-\tfrac{1}{\beta}\log p\), the negative uniform average \(-\tfrac{1}{p}\sum_\mu \boldsymbol{\xi}_\mu^\top \mathbf{v}\) over all patterns. The second term is a quadratic regulariser that keeps the recall state bounded. The constant \(C\) is relevant only for the theoretical analysis.

This energy has two properties that the classical scheme could not achieve. First: every stored pattern \(\boldsymbol{\xi}_\mu\) is a local minimum – an exponentially sharp one, since the log-sum-exp function creates its own deep dimple around each \(\boldsymbol{\xi}_\mu\). Second: the number of these dimples is not bounded by the eigenvalues of a matrix but solely by the geometry of the stored vectors. This explains the exponential capacity – up to \(\Omega(\exp(N))\) patterns can in principle be stored separately without their dimples merging.
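The Lyapunov property and the two limits of the log-sum-exp term can be checked numerically. The following is a sketch under stated assumptions: the constant \(C\) is dropped, the five random patterns and the noise level are toy choices, and the helper names are hypothetical.

```python
import numpy as np

def lse(beta, z):
    """Stable evaluation of (1/beta) * log(sum(exp(beta * z)))."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(beta * (z - m)))) / beta

def energy(X, v, beta):
    """E(v) = -lse(beta, X^T v) + 0.5 * ||v||^2, with the constant C dropped."""
    return -lse(beta, X.T @ v) + 0.5 * (v @ v)

def update(X, v, beta):
    """Modern Hopfield step v <- X softmax(beta X^T v)."""
    z = beta * (X.T @ v)
    w = np.exp(z - z.max())
    return X @ (w / w.sum())

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(16, 5))      # 5 random patterns (toy data)
v = X[:, 0] + 0.8 * rng.standard_normal(16)    # noisy version of pattern 0

# Track the energy along a few update steps.
energies = [energy(X, v, beta=1.0)]
for _ in range(5):
    v = update(X, v, beta=1.0)
    energies.append(energy(X, v, beta=1.0))

# Limits of the log-sum-exp term on a small example vector of similarities.
z = np.array([3.0, -1.0, 0.5])
gap_large_beta = abs(lse(1000.0, z) - z.max())
gap_small_beta = abs(lse(1e-4, z) - (np.log(z.size) / 1e-4 + z.mean()))
```

The list `energies` is non-increasing, which is the Lyapunov property. Both `gap_large_beta` and `gap_small_beta` are small, matching the maximum for large \(\beta\) and the offset-plus-average behaviour for small \(\beta\).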

Interim summary. Modern Hopfield replaces the quadratic energy by a log-sum-exp form and the linear update by a softmax lookup. Together, the two changes raise capacity from linear to exponential – and typically make iteration unnecessary, because for well-separated patterns a single step already lands on the stored pattern. So far, this is a self-consistent generalisation of the Hopfield model. What follows is the surprise: the same operation has stood since 2017 under a different name at the centre of modern artificial intelligence.

The identity with the Transformer attention mechanism

In the summer of 2017 Ashish Vaswani and colleagues at Google published Attention Is All You Need. It introduced the Transformer – an architecture that today powers essentially every large language model, every vision transformer and every multimodal AI. At the centre of this architecture sits an operation called scaled dot-product attention in the paper:

$$\mathrm{Attention}(Q, K, V) \;=\; V \cdot \mathrm{softmax}\!\bigl(K^\top Q / \sqrt{d_k}\bigr)$$

Three input matrices are processed: \(Q\) (queries, the current requests), \(K\) (keys, the addresses of the stored contents), and \(V\) (values, the stored contents themselves). Dividing by \(\sqrt{d_k}\) normalises the inner products against the dimension. The result is a softmax-weighted linear combination of the value vectors, steered by the similarity between query and keys.

The identity with the Modern Hopfield update rule follows by a single substitution:

$$\mathrm{Hopfield:}\quad \mathbf{v} \;\leftarrow\; X \cdot \mathrm{softmax}(\beta\,X^\top \mathbf{v})$$
$$\mathrm{Attention:}\;\;\;\; V \cdot \mathrm{softmax}(K^\top Q / \sqrt{d_k})$$

Set \(Q = \mathbf{v}\), \(K = X\), \(V = X\), and \(\beta = 1/\sqrt{d_k}\). Both equations become identical. This is not an analogy, not a structural similarity – but the same operation, expressed in two different notations.

In the Transformer case, \(K\) and \(V\) are usually different projections of the input, not the same matrix as in Hopfield. \(Q\) is generated by yet another projection. These variations extend the Modern Hopfield form by additional learnable transformations – they leave the fundamental operation in the middle (softmax-weighted lookup) untouched. Anyone who accepts the simplified special case \(K = V\) and \(Q\) as the raw input has the Modern Hopfield update in front of them.
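The substitution can be verified directly. The sketch below uses the column convention of the formulas above (the original paper writes the same operation with row vectors as \(\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V\)); the random matrices and the function names are assumptions for illustration.

```python
import numpy as np

def softmax_cols(S):
    """Column-wise softmax: each column of S becomes a probability vector."""
    S = S - S.max(axis=0, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, column convention:
    Q is (d_k, n_queries), K is (d_k, n_keys), V is (d_v, n_keys)."""
    d_k = K.shape[0]
    return V @ softmax_cols(K.T @ Q / np.sqrt(d_k))

def hopfield_update(X, v, beta):
    """Modern Hopfield update v <- X softmax(beta X^T v)."""
    return X @ softmax_cols(beta * (X.T @ v))

rng = np.random.default_rng(1)
N, p = 8, 5
X = rng.standard_normal((N, p))     # stored patterns as columns (toy data)
v = rng.standard_normal((N, 1))     # one query as a column vector

out_attention = attention(Q=v, K=X, V=X)                    # K = V = X, Q = v
out_hopfield = hopfield_update(X, v, beta=1.0 / np.sqrt(N)) # beta = 1/sqrt(d_k)

same = np.allclose(out_attention, out_hopfield)
```

Here `same` is `True`: with \(K = V = X\) and \(d_k = N\) the two functions evaluate the same expression. Learned projections would enter by replacing `X` and `v` with projected versions before the call.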

What this identity means in practice

Three observations follow from the identity that would not be possible without it:

  1. Every Transformer is an associative memory. When a language model predicts a token, it asks in each attention layer: which of my earlier tokens are currently relevant? This question is answered by a Modern Hopfield update. In a typical LLM with context length 8000, every generated token triggers a Hopfield update over up to 8000 stored vectors in every attention layer.
  2. The capacity question dissolves differently. Classical Hopfield had a capacity limit that on MNIST already kicked in well below \(N = 784\). Modern Hopfield has exponential capacity – for \(N = 768\) (a typical embedding dimension in Transformers) these are enough patterns to address effectively unlimited context. This is part of the reason the Transformer architecture proved so successful.
  3. Interpretability returns. What was known as a black-box mechanism can now be analysed with the tools of Hopfield theory: convergence guarantees, energy landscapes, metastable states. Ramsauer and colleagues showed in their work that Transformer heads in early layers often perform a global averaging (corresponding to small \(\beta\)), whereas in deeper layers they access individual tokens sharply (large \(\beta\)). This characterisation had not been possible before.

A historical loop

The order of the discoveries is remarkable. Krotov and Hopfield published their dense-associative-memory theory in 2016 – one year before Vaswani's Transformer. A capacity far beyond the classical linear limit was already worked out there; the exponential-capacity variant followed in 2017 (Demircigil and colleagues), in both cases without any visible link to a practical language-model architecture. Vaswani and colleagues in turn arrived at their attention form in 2017 by iteration on concrete translation problems, without reference to Krotov-Hopfield. Only Ramsauer 2020, who also introduced the continuous log-sum-exp energy, recognised: both paths lead to the same operation.

Such an independent rediscovery is not uncommon in mathematics. It is an indication that the underlying structure is not a design decision but a forced consequence of the requirements. What Hopfield formulated in 1982 as a model for biological memory, what Krotov in 2016 formalised as a theoretical generalisation, and what Vaswani in 2017 built as a practical mechanism for language translation – it is a single mathematical operation that appeared as the natural solution in three different contexts.

The next chapter shows that this operation has a further property that classical Hopfield did not possess: under suitable assumptions on the data, it generalises – meaning, it attracts not only stored patterns but also unseen configurations from their neighbourhood. This property is the reason language models do not only memorise but also produce.

Chapter 6

What it tells us about cognition

So far it has been shown that Modern Hopfield recovers stored patterns with virtually arbitrary capacity. The open question that the previous chapter only hinted at is a different one: what happens when the network sees an input that is not a stored pattern? Does it recognise something it never directly learned? Does it generalise?

This question is not academic. A language model that only memorises would be worthless; its usefulness lies precisely in formulating answers to questions no one has posed before. If the attention operation behind these models is mathematically identical to Hopfield, then Hopfield must also be capable of producing something that goes beyond rote learning. The following shows under which conditions this succeeds – and under which it does not.

An honest preliminary diagnosis: on MNIST it does not succeed

First a factual inventory. Handwritten digits from the MNIST training set are stored, and unseen digits from the test set are then used as inputs – how often does the network recognise the correct class?

Our own experimental values (with 100 stored patterns, 500 unseen test images and 5 % pixel noise, documented in detail in the scientific companion notes to this post):

Procedure | Class hit rate on unseen test images
Random classifier (10 classes) | 10.0 %
1-nearest neighbour on the 100 stored patterns | 69.6 %
Hebb Hopfield | 11.6 % (≈ random, bias sink)
Pseudoinverse Hopfield | 65.2 %
Modern Hopfield | 67.2 %

The picture is unambiguous: neither pseudoinverse nor Modern Hopfield beats the trivial 1-nearest-neighbour classifier. They sit close together, slightly below it. The class of procedures studied here therefore produces no genuine generalisation on MNIST – it produces a soft-lookup variant of nearest neighbour. This is an important, sober finding that the rest of the discussion must not forget.
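How such a class hit rate can be computed is sketched below. This is not the protocol from the companion notes: the readout (label of the stored pattern closest to the recalled state), the function names and the two-pattern toy memory are assumptions.

```python
import numpy as np

def one_nn_predict(X, labels, v):
    """Label of the stored pattern (column of X) with the largest inner product with v."""
    return labels[int(np.argmax(X.T @ v))]

def modern_hopfield_predict(X, labels, v, beta):
    """Recall with one softmax step, then read out the label of the nearest stored pattern."""
    z = beta * (X.T @ v)
    w = np.exp(z - z.max())
    recalled = X @ (w / w.sum())
    return one_nn_predict(X, labels, recalled)

def hit_rate(predict, inputs, targets):
    """Fraction of inputs whose predicted class equals the target class."""
    return float(np.mean([predict(v) == t for v, t in zip(inputs, targets)]))

# Toy memory: two stored patterns with classes 0 and 1.
X = np.array([[ 1.0, -1.0],
              [ 1.0, -1.0],
              [-1.0,  1.0],
              [-1.0,  1.0]])
labels = np.array([0, 1])

# Two unseen inputs, each one pixel away from a stored pattern.
inputs = [np.array([1.0, 1.0, -1.0, 1.0]), np.array([-1.0, 1.0, 1.0, 1.0])]
targets = [0, 1]

rate_nn = hit_rate(lambda v: one_nn_predict(X, labels, v), inputs, targets)
rate_mh = hit_rate(lambda v: modern_hopfield_predict(X, labels, v, beta=1.0), inputs, targets)
```

With this readout, a sharp Modern Hopfield lookup reproduces the 1-nearest-neighbour decision by construction, which is consistent with the finding that it behaves like a soft-lookup variant of nearest neighbour.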

Reframing the question

This result could be read as an endpoint – Hopfield is simply not a classifier. It can also be read differently: perhaps the problem lies not with the learning rule but with the data. MNIST digits possess no explicitly accessible structure; each is its own pixel pattern, without the network knowing the underlying geometry (strokes, loops, arcs) as such.

What if the world were built differently? What if every pattern were not a self-contained image but a sparse composition from a manageable number of components that the network could, in principle, identify? Precisely this case was studied by Matteo Negri, Carlo Lucibello and collaborators in a 2024 paper titled Random Features Hopfield Networks generalize retrieval to previously unseen examples. Their finding: under this structural assumption a Hopfield network can genuinely generalise – and a three-stage phase diagram appears.

The setup: patterns as feature mixtures

The construction principle is simple. Let \(F \in \{-1, +1\}^{N \times D}\) be a random feature matrix whose \(D\) columns are the components of the world. Each pattern is built as a sparse mixture of these components: a coefficient vector \(\mathbf{c} \in \mathbb{R}^D\) has exactly \(L\) entries equal to one, the rest are zero. This produces the pattern

$$\boldsymbol{\xi} \;=\; \mathrm{sign}\bigl(F\,\mathbf{c}\bigr) \;\in\; \{-1, +1\}^N$$

\(F\,\mathbf{c}\) describes the superposition of the chosen \(L\) component columns; \(\mathrm{sign}(\cdot)\) reduces the result back to a binary configuration. A pattern in this setup is thus always a binary function of an explicit, small selection of components. The decisive parameter is \(L\): for \(L = 1\) each pattern is exactly one component; for larger \(L\) it is an increasingly dense mixture.
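The construction can be written down directly. In this sketch the sizes \(N = 64\), \(D = 10\), \(L = 3\) are arbitrary toy values, and the tie-break for a zero sum (possible for even \(L\)) is an assumption, since the text does not specify one.

```python
import numpy as np

def make_pattern(F, L, rng):
    """Draw xi = sign(F c), where c has exactly L ones at random positions."""
    N, D = F.shape
    c = np.zeros(D)
    c[rng.choice(D, size=L, replace=False)] = 1.0
    field = F @ c                       # superposition of the L chosen columns
    # For even L the sum can be exactly zero; mapping ties to +1 is an assumption.
    return np.where(field >= 0, 1, -1), c

rng = np.random.default_rng(0)
N, D, L = 64, 10, 3
F = rng.choice([-1, 1], size=(N, D))    # random feature matrix

xi, c = make_pattern(F, L, rng)                 # a mixture of three components
xi_single, c_single = make_pattern(F, 1, rng)   # L = 1: exactly one component
```

For \(L = 1\) the pattern `xi_single` equals the selected column of `F`, as stated above.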

For the following investigation, three different sets of patterns are generated on the same feature matrix \(F\):

  • A training set of \(p\) patterns whose coefficient vectors are chosen at random. These are stored in the network.
  • A features set consisting of the \(D\) individual components \(F_{:,d}\). These are not stored – they are what the network should ideally discover.
  • A test set of further patterns from the same distribution, but with new coefficient vectors that did not appear in training.

For each of these three sets the magnetisation is measured:

$$m(\boldsymbol{\xi}) \;=\; \frac{1}{N}\,\mathrm{sign}(W\boldsymbol{\xi})^\top \boldsymbol{\xi}$$

\(m\) describes how strongly the network throws the vector \(\boldsymbol{\xi}\) back onto itself: a value of one means a hard fixed point (every pixel preserved), zero means random output, negative values mean anti-attractors. A magnetisation of order 0.9 or above is referred to as stable in the literature.
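A minimal sketch of the measurement, here with Hebb weights. Two choices are assumptions: the diagonal of \(W\) is kept, and zeros inside \(\mathrm{sign}(\cdot)\) are mapped to \(+1\).

```python
import numpy as np

def hebb_weights(X):
    """Hebb rule W = (1/N) * X X^T for patterns stored as columns of X."""
    return (X @ X.T) / X.shape[0]

def magnetisation(W, xi):
    """m(xi) = (1/N) * sign(W xi)^T xi; ties in sign() are mapped to +1 here."""
    out = np.where(W @ xi >= 0, 1.0, -1.0)
    return float(out @ xi) / xi.size

# Two orthogonal toy patterns of length 4.
X = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [1.0,  1.0],
              [1.0, -1.0]])
W = hebb_weights(X)

m_first = magnetisation(W, X[:, 0])
m_second = magnetisation(W, X[:, 1])
```

Both stored patterns are hard fixed points in this toy case, so `m_first` and `m_second` equal one.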

Three phases

As the training set grows – expressed by the ratio \(\alpha = p/N\) – three qualitatively different behaviours appear in order:

Storage phase  (\(\alpha\) small). Only the training patterns are stable. The features lie outside the stored attractors, the test patterns likewise. This is the classical Hopfield behaviour: the network has memorised concrete points.

Learning phase  (\(\alpha\) medium). As the pattern set grows, the magnetisation of the individual training patterns falls, while that of the features rises. At a transition point the two curves cross. Here the decisive phase change takes place: the network no longer stores the individual examples but discovers the common components from which they are built. A sparse world becomes visible.

Generalisation phase  (\(\alpha\) large). At even higher values of \(\alpha\) the test magnetisation also becomes positive. In other words: unseen mixtures of the same components likewise become attractors of the network – without ever having been stored. The network has grasped the components so completely that it accepts any reasonable combination of them as a legitimate world configuration.

Try it. Move the slider for the storage load \(\alpha\) and watch the three curves. For small \(\alpha\) only Train (blue) is high; for medium \(\alpha\) Features (green) take over; for large \(\alpha\) Test (red) also begins to stabilise. The Hebb / Pseudoinverse toggle reveals a surprise: the same learning rule that did not get beyond nearest-neighbour level on MNIST generalises perfectly here – as soon as the world has the right structure.
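For readers without the interactive demo, the sweep over \(\alpha\) can be sketched offline. All sizes (\(N = 200\), \(D = 20\), \(L = 3\), 50 test patterns) and the grid of \(\alpha\) values are hypothetical choices, not those of the demo or of the paper, and no claim is made that so small a setting reproduces the full phase diagram.

```python
import numpy as np

def sign_pm(x):
    """Sign with ties mapped to +1 (an assumption; L is odd here, so no ties in patterns)."""
    return np.where(x >= 0, 1.0, -1.0)

def mixtures(F, L, n, rng):
    """n patterns sign(F c), each c with exactly L ones; returned as columns."""
    D = F.shape[1]
    cols = [sign_pm(F[:, rng.choice(D, size=L, replace=False)].sum(axis=1))
            for _ in range(n)]
    return np.stack(cols, axis=1)

def mean_magnetisation(W, P):
    """Average of m(xi) = sign(W xi)^T xi / N over the columns of P."""
    overlaps = np.sum(sign_pm(W @ P) * P, axis=0)
    return float(np.mean(overlaps)) / P.shape[0]

def sweep(N=200, D=20, L=3, alphas=(0.005, 0.1, 0.5, 2.0), rule="hebb", seed=0):
    """Return rows (alpha, m_train, m_features, m_test) for the chosen learning rule."""
    rng = np.random.default_rng(seed)
    F = rng.choice([-1.0, 1.0], size=(N, D))
    test = mixtures(F, L, 50, rng)
    rows = []
    for alpha in alphas:
        p = max(1, int(round(alpha * N)))
        train = mixtures(F, L, p, rng)
        if rule == "hebb":
            W = train @ train.T / N
        else:  # pseudoinverse rule: projector onto the span of the training patterns
            W = train @ np.linalg.pinv(train)
        rows.append((alpha,
                     mean_magnetisation(W, train),
                     mean_magnetisation(W, F),
                     mean_magnetisation(W, test)))
    return rows
```

Calling `sweep(rule="hebb")` and `sweep(rule="pinv")` returns one row per value of \(\alpha\); printing the rows side by side corresponds to flipping the toggle in the demo.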

The surprise: pseudoinverse generalises hard

What was visible as a soft, slowly rising generalisation in pure Hebb becomes a hard jump with the pseudoinverse rule. For sufficiently large \(\alpha\), test patterns are stabilised exactly – their magnetisation reaches one.

This is mathematically tractable. The pseudoinverse projects onto the subspace spanned by the training patterns. As soon as this subspace is large enough to contain all possible feature mixtures of \(F\), every such mixture – whether seen in training or not – becomes an eigenvector of \(W_{\mathrm{PI}}\) with eigenvalue one. \(\mathrm{sign}(W_{\mathrm{PI}}\,\boldsymbol{\xi}_{\mathrm{test}}) = \boldsymbol{\xi}_{\mathrm{test}}\) then follows algebraically by necessity.
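The projection argument can be checked numerically. The sketch assumes the pseudoinverse rule in the form \(W_{\mathrm{PI}} = X X^{+}\) with the stored patterns as columns of \(X\); the sizes and random patterns are toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 30, 6
X = rng.choice([-1.0, 1.0], size=(N, p))     # stored patterns as columns (toy data)

W_pi = X @ np.linalg.pinv(X)                 # orthogonal projector onto span(X)

inside = X @ rng.standard_normal(p)          # arbitrary vector inside the span
outside = rng.standard_normal(N)             # generic vector, not inside the span

# Any vector in the span is an eigenvector with eigenvalue one.
residual_inside = np.linalg.norm(W_pi @ inside - inside)

# Projector property and the fixed-point property of the stored patterns.
is_projector = np.allclose(W_pi @ W_pi, W_pi)
stored_are_fixed = np.array_equal(np.sign(W_pi @ X), X)

# A generic vector loses its component orthogonal to the span.
shrinks_outside = np.linalg.norm(W_pi @ outside) < np.linalg.norm(outside)
```

The same reasoning carries over to a test pattern: it is reproduced exactly whenever it lies in the span of the stored patterns, regardless of whether it was stored itself.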

This is therefore not an empirical success story but a structural result: when the data geometry and the learning-rule geometry fit together, generalisation is a mathematical consequence, not a learning achievement.

What this says about cognition

From this finding follows a fundamental inversion of the usual view of the relation between learning rule and data. In ML practice one often speaks as if learning rules possessed an ability to generalise – some regularise better, some worse, some are more expressive. The random-features-Hopfield setup shows something else: the same learning rule (pseudoinverse) that reached only 1-NN-like performance on MNIST delivers perfect generalisation in the strict sense here. It was not the learning rule. It was the data.

More generally: generalisation is not an internal property of a procedure. It arises from the compatibility of two geometries – the geometry of the world matrix and the geometry of the stored patterns. When the world consists of a few explicitly composable components, a Hopfield network can discover these components and accept every new mixture of them as a legitimate world configuration. When the world does not do that – like MNIST with its implicit, hard-to-access components – even the best learning rule remains a memory lookup.

This statement has a counterpart in epistemology: only that which has a sparse architecture becomes knowable. A world in which every point is a unique, irreducible point cannot be understood, it can only be memorised. A world made of a few deep structures, in which every concrete case is a composition of these structures, lends itself to abstraction – and hence to transfer to unseen cases.

Read from this perspective, the fact that language models generalise so successfully is no proof that their architecture carries any special property. It is rather a hint that natural language has a sparse structure: its concepts, idioms and constructions are a finite set of components from which infinitely many concrete sentences are composed. Language is Hopfield-friendly because it itself consists of features.

The next chapter shows where this observation becomes practically effective – in which industrial applications the Modern Hopfield architecture is actually used, and which property of the respective domain makes this possible.

Chapter 7

Where it runs today

The previous chapters traced the theoretical development: from the spin glass to classical Hopfield, from there via the pseudoinverse to the Modern Hopfield form, and finally to the identity with Transformer attention. Anyone who asks, with this apparatus in hand, where any of this is actually deployed in industry today finds the following situation: the classical Hopfield model has largely become obsolete as a general-purpose ML architecture; the Modern Hopfield procedure is ubiquitous under the name “attention”; and in some specialist domains the Hopfield reading is kept explicit because there the memory view is closest to the task.

The following describes five concrete application fields in which the Hopfield architecture – classical or modern – is actively used today. A summary table follows, together with the academic acknowledgement of this line through the 2024 Nobel Prize.

(1) Drug discovery – few-shot learning for new drug properties

In the pharmaceutical area a problem arises that overwhelms classical deep-learning pipelines: when a new drug class is investigated, often only a few dozen known example molecules are available. Classical classifiers need thousands of training examples to reach acceptable accuracy. Few-shot learning is the only practicable strategy here.

A working group around Sepp Hochreiter and Günter Klambauer at Johannes Kepler University Linz developed, between 2020 and 2023, an architecture named MHNfs in which a Modern Hopfield layer holds a library of more than 100,000 context molecules as memory. For a new drug query, a softmax lookup over this library is computed, and the resulting weighted mixture serves as an enriched representation. MHNfs achieves state of the art on the FS-Mol benchmark for few-shot property prediction. A related application in retrosynthesis prediction (which reaction produces a given target molecule?) also reaches top results and is several orders of magnitude faster than the previously usual methods.

Why does Hopfield win here in particular? The task is not primarily a classification but a memory operation: compare the new molecule with the 100,000 known ones, weight them by similarity, derive a representation from the weighted average. That is mathematically what Modern Hopfield does – and what the attention mechanism in a Transformer does with its context. Without the Hopfield perspective the exponential capacity would be hard to justify.

(2) Immune repertoire classification – COVID antibodies from millions of sequences

A human immune repertoire contains about a million different B-cell receptor sequences. The diagnostically important question is: did this person carry a particular infection – recognisable in a few rare, disease-specific sequences among millions of irrelevant ones? This multi-instance-learning task with a very low witness rate was classically virtually unsolvable.

In 2020 Michael Widrich and colleagues from the same Linz group published the procedure DeepRC (deep repertoire classification). The core: a Modern Hopfield attention layer over the entire repertoire, capable of attending over millions of sequences at once. On simulated and real data on SARS-CoV-2 infection, DeepRC clearly outperformed previous methods. Practical added value: the procedure extracts the sequence motifs associated with a particular disease – direct support for the design of new vaccines and therapeutics.

Here the exponential capacity of Modern Hopfield is not a theoretical luxury but a precondition: a classical attention procedure would no longer be computationally manageable with \(10^6\) input sequences.
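Why a sharp softmax matters at a low witness rate can be illustrated with a synthetic bag. This sketch is not DeepRC's implementation: the bag of 1000 mutually orthogonal instances, the hand-set query and the function name are assumptions, and in the real procedure the query and the sequence embeddings are learned.

```python
import numpy as np

def attention_pool(instances, query, beta):
    """Pool a bag of instance vectors (rows) into one vector using a single query."""
    scores = beta * (instances @ query)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ instances, w

# Synthetic bag: 999 irrelevant instances plus one witness aligned with the query,
# i.e. a witness rate of 1/1000.
n = 1000
instances = np.eye(n)
query = instances[0].copy()               # the witness is instance 0

_, w_soft = attention_pool(instances, query, beta=1.0)
_, w_sharp = attention_pool(instances, query, beta=20.0)

witness_weight_soft = w_soft[0]           # e / (e + 999), well below one percent
witness_weight_sharp = w_sharp[0]         # e^20 / (e^20 + 999), practically one
```

With the soft setting the single witness is drowned out by the 999 irrelevant instances; with the sharp setting almost all of the weight lands on it.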

(3) Combinatorial optimisation on memristor hardware

NP-hard optimisation problems such as VLSI chip layout routing, the travelling-salesman problem in logistics, or graph partitioning in distributed systems have been classical application fields for Hopfield networks ever since Hopfield and Tank popularised the idea in the 1980s. For a long time this idea remained theoretical – Hopfield solvers were slower than specialised heuristics like simulated annealing or genetic algorithms.

A turn came from the hardware side. Memristors are analogue components whose electrical resistance can be programmed by an applied voltage. Connected in a cross-bar arrangement, they form a lattice that naturally implements the Hopfield dynamics: the current through each column computes the inner product of an entire matrix row with the current state, in a single analogue step. What in a digital computer runs as a sequence of multiplications happens here through the electrical properties of the circuit itself.
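The analogue step can be mimicked in software with Ohm's and Kirchhoff's laws. The sketch describes an idealised crossbar without wire resistance or device noise. The differential mapping of signed weights onto two conductance arrays and the values `g_unit` and `v_read` are hypothetical and not taken from any particular chip.

```python
import numpy as np

def crossbar_currents(G, voltages):
    """Column currents of an idealised crossbar.

    G[i, j]  : conductance (siemens) of the device joining row i and column j
    voltages : voltage applied to each row (volts)
    Ohm's law gives the device current G[i, j] * voltages[i]; Kirchhoff's
    current law sums the devices on a column: I_j = sum_i G[i, j] * V_i.
    """
    return G.T @ voltages

def signed_field(W, state, g_unit=1e-6, v_read=0.1):
    """Local fields sum_i W[i, j] * state[i], read out from two crossbars.

    Conductances cannot be negative, so positive and negative weights are
    placed on separate arrays and the column currents are subtracted.
    """
    G_pos = g_unit * np.clip(W, 0.0, None)     # positive weights
    G_neg = g_unit * np.clip(-W, 0.0, None)    # magnitudes of negative weights
    v = v_read * state
    i_diff = crossbar_currents(G_pos, v) - crossbar_currents(G_neg, v)
    return i_diff / (g_unit * v_read)          # rescale currents to weight units

W = np.array([[ 0.0, 0.5, -1.0],
              [ 0.5, 0.0,  0.25],
              [-1.0, 0.25, 0.0]])
state = np.array([1.0, -1.0, 1.0])
fields = signed_field(W, state)
```

For a symmetric \(W\), `fields` agrees with the matrix-vector product `W @ state`, which is the quantity whose sign drives the Hopfield update.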

In 2020 a group at HP Labs demonstrated a memristor Hopfield chip that solves MAX-CUT problems in analogue hardware (published in Nature Electronics). The reported energy efficiency: four orders of magnitude better than digital procedures on comparable problem sizes. In September 2024 a group at Peking University published in Nature Communications a formal proof that a column of memristors is mathematically equivalent to a Hopfield attractor network – not analogous, not approximately, but exactly.
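How MAX-CUT maps onto a Hopfield network can be shown on a four-node cycle. This is a plain software sketch with deterministic updates, not the hardware procedure. The coupling choice \(W = -A\) for an adjacency matrix \(A\) is the standard textbook mapping, and the function names are assumptions.

```python
import numpy as np

def cut_value(A, s):
    """Number of edges whose endpoints carry different spins."""
    return int(np.sum(A * (1 - np.outer(s, s))) // 4)

def energy(A, s):
    """Hopfield energy E = -1/2 s^T W s with couplings W = -A."""
    return 0.5 * float(s @ A @ s)

def hopfield_maxcut(A, s, sweeps=10):
    """Asynchronous sign updates; every accepted flip lowers the energy
    and thereby raises the cut value. Ties keep the current spin."""
    s = s.copy()
    W = -A
    for _ in range(sweeps):
        changed = False
        for i in range(len(s)):
            h = W[i] @ s                      # local field on spin i
            new = s[i] if h == 0 else (1 if h > 0 else -1)
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:                       # fixed point reached
            break
    return s

# Cycle 0-1-2-3-0; its maximum cut contains all four edges.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
start = np.array([1, 1, 1, 1])
final = hopfield_maxcut(A, start)
```

From the all-ones start the dynamics settle in the alternating configuration, which cuts all four edges. On larger graphs the deterministic rule only guarantees a local minimum, so practical solvers add noise or annealing to escape poor fixed points.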

This line is remarkable. What in most of this post was presented as a model – a mathematical abstraction simulated on a computer – is here physically realised. The circuit is the Hopfield network, not a simulation of it.

(4) Transformer attention – everywhere, without being called that

From Chapter 5 follows directly the most important application of all: every Transformer forward pass, every token, every attention head of a modern AI is a Modern Hopfield iteration. When a large language model such as GPT, Claude, Gemini or Llama generates a response, in each of its dozens of attention layers a Hopfield update is computed for every token over up to tens of thousands of context vectors. The analogous statement holds for vision Transformers (ViT, DINOv2, Swin), for multimodal models and for diffusion models with cross-attention.

As to the quantitative order of magnitude, precise numbers are missing because the operators do not publish them. What can be said is this: the Modern Hopfield operation is very likely the most-executed mathematical operation on today's compute infrastructure – carried out in the GPU data centres of OpenAI, Anthropic, Google, Meta and a dozen other providers. Under the name “attention”, not under the name “Hopfield”.

(5) Hopfield layers as a module in PyTorch models

For ML practitioners the Linz group has offered since 2020 a PyTorch library ml-jku/hopfield-layers that wraps the Modern Hopfield operation as a directly usable module. Three variants are distinguished: Hopfield for the association of two sets (query and memory), HopfieldPooling for aggregating operations in place of classical pooling, and HopfieldLayer for trainable memory slots.

These modules can replace components such as LSTM, GRU or simple attention in existing architectures, without having to rebuild the connection topology. The published application areas range from tabular ML on UCI benchmarks through time-series forecasting and reinforcement learning with episodic memory to the drug and immune applications discussed above.
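The three roles can be told apart in a conceptual NumPy sketch. It deliberately does not use the library's API: the function names `associate`, `pool` and `lookup` are hypothetical, and fixed arrays stand in for what would be trained projections, queries and memory slots in the real modules.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def associate(queries, memory, beta):
    """Role of the Hopfield variant: each query row retrieves from the memory rows."""
    return softmax_rows(beta * (queries @ memory.T)) @ memory

def pool(memory, pooling_query, beta):
    """Role of HopfieldPooling: one (normally trained) query summarises a whole set."""
    return associate(pooling_query[None, :], memory, beta)[0]

def lookup(queries, slots, beta):
    """Role of HopfieldLayer: the memory itself is a (normally trained) parameter."""
    return associate(queries, slots, beta)

memory = np.eye(3)                                   # three toy memory vectors
queries = np.array([[0.9, 0.1, 0.0],
                    [0.0, 0.2, 0.8]])

retrieved = associate(queries, memory, beta=100.0)   # sharp retrieval per query
summary = pool(memory, np.array([0.0, 1.0, 0.0]), beta=100.0)
```

All three roles share the same softmax lookup; they differ only in which of the two inputs comes from the data and which is a parameter of the model.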

Synthesis

The five fields together give a clear picture: in every domain in which a concrete memory task has to be solved – and the precondition of exponential capacity is met – Modern Hopfield is today state of the art. Classical Hopfield is reduced to its specialised hardware niches.

Domain | Work / group | Year | Measured or empirical statement
Drug discovery | MHNfs (Klambauer, Hochreiter et al.) | 2023 | SOTA on FS-Mol; 100k+ context molecules as memory
Immune repertoire | DeepRC (Widrich et al.) | 2020 | SARS-CoV-2 classification from 10⁶ sequences; SOTA
Combinatorial optimisation | Memristor Hopfield (HP Labs / Peking U.) | 2020, 2024 | 4 orders of magnitude energy advantage vs. digital
Transformer attention | every LLM, every ViT, every multimodal model | since 2017 | likely the most-executed operation in the world's GPU fleet
PyTorch module | ml-jku/hopfield-layers | since 2020 | drop-in for LSTM, pooling, attention

The confirmation: Nobel Prize 2024

In October 2024 the Nobel Prize in Physics was awarded to John J. Hopfield and Geoffrey E. Hinton, with the citation “for foundational discoveries and inventions that enable machine learning with artificial neural networks”. The choice of category – physics – reflects the historical origin of the Hopfield model in spin-glass theory.

The prize reads like a belated academic confirmation of what the preceding five fields demonstrate technically: Hopfield's 1982 model is not just one historical model among many but an architecture whose mathematical substance enabled the AI revolution of the second half of the 2010s – even though this substance was marketed under another name.

The epilogue shows what this line means for the understanding of memory, attention and cognition in general – and where this post deliberately refrains from going beyond what the mathematics actually supports.

Epilogue

The one operation that carries the century

What was followed across seven chapters can be summed up in one observation: one mathematical operation – an energy function with its monotone update rule – suffices to remember digits, attack NP-hard problems, classify antibodies and model language. The applications differ; the operator does not.

Four domains, the same mathematics

Application | Variant | What is remembered | Where the energy lives
Image memory (1982) | classical (Hebb / PI) | pixel patterns | quadratic over pixel couplings
Combinatorial optimisation | classical, in memristor hardware | spin configuration of the solution | negative cost function of the problem
Multi-instance learning | Modern Hopfield | repertoire elements | log-sum-exp over similarities
LLM token prediction | Transformer attention (= Modern Hopfield) | context tokens | softmax-weighted average

The last column shows where the mathematics changes: not in its substance but in the functional form of the energy. The quadratic form from Chapter 2 is the historically first, simplest case. The log-sum-exp form from Chapter 5 is the continuous generalisation. Both belong to the same family of Lyapunov functions with guaranteed convergence.

What this line shows

When the same mathematical structure turns up in four disparate disciplines – solid-state physics in 1925, pattern recognition in 1982, language processing in 2017 and hardware optimisation in 2020 – this is no longer a design coincidence. It is an indication that the structure is not an invention but a discovery. It waits to be rediscovered in every setting whose task is the same: to relate a finite set of states to one another such that an input-dependent answer falls out.

This is also what the 2024 Nobel Prize implicitly acknowledges. It went to Hopfield and Hinton not for a particular technical application but for the mathematical substrate from which today's AI grew. It is the rare constellation in which an award honours less a single piece of work than an entire tradition.

What this post does not say

Three explicit limits, so that nothing is over-interpreted:

First: Hopfield is not a universal ML tool. For many tasks – image classification on large balanced data sets, generative modelling, audio-to-text – specialised architectures (ConvNets, diffusion, Conformer) are clearly superior. Where the Hopfield architecture wins, it does not win through its own universality but through the fit of its memory view to the task structure.

Second: the generalisation property in Chapter 6 was shown on synthetic data with explicit feature structure. On real data such as MNIST it does not hold without further steps. Anyone who wants to carry it over must first extract a feature basis (PCA, dictionary learning, embeddings). The picture shown here is a theoretical possibility statement, not a direct MNIST trick.

Third: the identity between Modern Hopfield and Transformer attention does not mean that every classical ML algorithm is “in truth” a Hopfield network. It holds very precisely between the Modern Hopfield update rule and the scaled dot-product attention mechanism. Other architectures (diffusion, state-space models, ConvNets) have their own mathematical structures that do not coincide with Hopfield.

Cross-references to other posts

Several places in this post connect to earlier work on this blog:

  • Eigenvalues & AI – the pseudoinverse from Chapter 4 is mathematically identical to ridge regression in the limit \(\lambda \to 0\). Anyone who has read the Eigenvalues post in depth already knows this connection – here it is the bridge from the Hopfield scheme of Chapter 4 to the generalisation in Chapter 6. The exponential kernel from Chapter 5 also appears there in the kernel-trick section in a related form.
  • KRR Chat: under the hood – in the KRR-Chat post a language model is shown as a kernel ridge regression lookup. With the tools of the present post the same lookup can also be read as a Modern Hopfield recall: the stored training tokens as memory, the current input as query, the softmax weighting as activation profile. The two posts describe the same procedure from two different mathematical angles.
  • God as emergence – in the God post the same \(W\) is read philosophically as the world matrix. The point of Chapter 6 here – cognition is a property of the shared geometry of world and cognitive apparatus – is the formal sister of Whitehead's consequent nature. Anyone who has read both posts has the two halves of the same observation.
  • Quantum physics with arrows – in the quantum post the propagator was introduced as a sum over eigenstates. This Mercer-like kernel structure reappears in this post in the log-sum-exp energy. The same mathematics in three disciplines is no coincidence but a recurring structural answer.
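The first cross-reference, the pseudoinverse as the zero-regularisation limit of ridge regression, can be checked numerically. The random patterns, the sizes and the helper name `ridge_inverse` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 5
X = rng.choice([-1.0, 1.0], size=(N, p))       # stored patterns as columns (toy data)

def ridge_inverse(X, lam):
    """Ridge solution operator (X^T X + lam * I)^{-1} X^T."""
    n_cols = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_cols), X.T)

pinv = np.linalg.pinv(X)

# Largest entry-wise deviation from the pseudoinverse for shrinking lambda.
gaps = [float(np.max(np.abs(ridge_inverse(X, lam) - pinv)))
        for lam in (1.0, 1e-3, 1e-6)]
```

The entries of `gaps` shrink as \(\lambda\) decreases, so the ridge operator approaches the pseudoinverse as the regularisation vanishes.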

Closing remark

What Hopfield formulated in 1982 as a model of a biological memory is today the architecture that drives every language model. Anyone who knows the arc sees, in every chatbot answer, an iteration of a spin-glass model originally only able to recognise digits. The past is not closed; it is still running.

FAQ

Frequently asked questions

Why is it called attention and not Hopfield?

The terms were coined independently. Ashish Vaswani and colleagues introduced the attention mechanism in 2017 as a practical solution for machine translation and chose the name in reference to intuitions from the psychology of attention. The mathematical kinship to Hopfield's model was first formalised by Ramsauer and colleagues in 2020 – at a point at which the Transformer name had already been established. In the ML community the name attention has therefore prevailed, even though the name Hopfield layer would be mathematically more appropriate.

Is it worth building Hopfield layers into one's own models?

It depends on the task. For standard classification on large balanced data sets a conventional architecture (ConvNet, simple MLP, gradient boosting) almost always beats Hopfield layers. For memory-centric tasks – few-shot learning, multi-instance learning, episodic memory in reinforcement learning – it is often worth building a Hopfield layer into the architecture, and ready-made PyTorch components exist for it (see sources below). A small rule of thumb: if the question “which of my stored items is currently relevant?” is central to the problem, Hopfield is a natural candidate.

Where is the boundary to diffusion models?

Both architectures use energy functions, but for different tasks. Hopfield stores a finite set of discrete attractors and pulls an input onto the nearest attractor – a memory operation. Diffusion models learn a continuous probability distribution over all possible outputs and sample from it – a generative operation. For image generation diffusion models are clearly better suited; for exact recall of stored contents Modern Hopfield is better. The two can be combined – in practice this is rarely done.

Why did Hopfield receive the physics Nobel and not the one for computer science?

There is no Nobel Prize for computer science (the Turing Award fills this role, without being a Nobel Prize). But regardless of that, the choice of the physics category is substantively consistent: Hopfield's model originates formally from the spin-glass theory of solid-state physics; the Lyapunov stability analysis is a physical standard method; and the memristor hardware line from Chapter 7 is even, in the concrete sense, a physical component. Anyone who classifies Hopfield as computer science overlooks the physical substance; the committee did not.

Does this also work without MNIST – on text, audio, video?

Yes, in two flavours. First: Modern Hopfield works on any vector space in which similarity is definable as an inner product. For text, token embeddings serve as stored patterns; for audio, spectral representations; for video, frame features. That is what every Transformer does anyway. Second: classical Hopfield with a \(\pm 1\) state space is only suitable for discrete tasks like QR-code restoration or MAX-CUT – not directly for continuous modalities.
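To make the first flavour concrete, here is a minimal sketch of a single Modern Hopfield retrieval step on plain vectors. It assumes the usual update rule (a softmax over inner products with the stored patterns, followed by a weighted average of those patterns). The function names, the toy patterns and the value of `beta` are illustrative choices, not taken from any library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_retrieve(patterns, query, beta):
    """One update step: softmax-weighted average of the stored patterns."""
    weights = softmax([beta * dot(p, query) for p in patterns])
    dim = len(query)
    retrieved = [sum(w * p[i] for w, p in zip(weights, patterns)) for i in range(dim)]
    return retrieved, weights

# Toy "embeddings" (hypothetical values); any modality works as long as
# similarity is an inner product.
patterns = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
query = [0.9, 0.2, -0.1, 0.1]  # noisy version of the first pattern

retrieved, weights = hopfield_retrieve(patterns, query, beta=8.0)
print(weights)    # almost all weight lands on the first pattern
print(retrieved)  # close to the first stored pattern
```

With a large `beta` the step behaves like a lookup of the nearest stored pattern; with a small `beta` it blends several patterns, which is the regime attention layers typically operate in.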

What is the difference to Boltzmann machines?

Boltzmann machines are the stochastic sister of Hopfield networks. They have the same energy function, but replace the deterministic sign update with a probabilistic state change with probability \(\propto e^{-\Delta E / T}\) at temperature \(T\). As a consequence, Boltzmann machines learn probability distributions instead of fixed patterns, can sample from these distributions, and are generative models. Hopfield networks are their deterministic simplification. Hinton, who shared the 2024 Nobel Prize with Hopfield, essentially contributed the Boltzmann side of the story.
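The difference between the two update rules fits in a few lines. The sketch below assumes \(\pm 1\) units and uses the heat-bath (Glauber) form of the stochastic rule, which is one common way to realise a Boltzmann-weighted state change. The two-unit network, the temperature and all names are illustrative assumptions.

```python
import math
import random

def local_field(weights, state, i):
    # Input to unit i from all other units.
    return sum(weights[i][j] * state[j] for j in range(len(state)) if j != i)

def hopfield_update(weights, state, i):
    """Deterministic rule: align unit i with its local field."""
    return 1 if local_field(weights, state, i) >= 0 else -1

def boltzmann_update(weights, state, i, temperature, rng):
    """Stochastic rule: unit i becomes +1 with a sigmoid probability.

    For +-1 units the energy gap between the two settings of unit i is 2*h,
    so the Boltzmann weights give P(+1) = 1 / (1 + exp(-2*h / T)).
    """
    h = local_field(weights, state, i)
    p_up = 1.0 / (1.0 + math.exp(-2.0 * h / temperature))
    return 1 if rng.random() < p_up else -1

# Two units with a positive coupling; unit 1 currently disagrees with unit 0.
weights = [[0.0, 1.0], [1.0, 0.0]]
state = [1, -1]
rng = random.Random(0)

deterministic = hopfield_update(weights, state, 1)  # the sign rule always picks +1 here
samples = [boltzmann_update(weights, state, 1, 2.0, rng) for _ in range(10_000)]
fraction_up = samples.count(1) / len(samples)       # close to 1 / (1 + e^-1), about 0.73
print(deterministic, fraction_up)
```

As the temperature is lowered the sigmoid sharpens into a step, and the stochastic rule reproduces the sign update.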

What about the bias sink on MNIST – is that an implementation bug?

No. The behaviour is mathematically expected and well understood. Hebbian learning stores patterns reliably only under two conditions – orthogonal patterns and a zero-mean distribution – and MNIST violates both. In the scientific companion repository for this post this was checked in detail: dtype check, comparison with a hand-computed reference, test with orthogonal synthetic patterns (3/3 perfect), bias elimination by centring. Hebb works as specified; MNIST simply lies outside its range of validity.
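The effect can be reproduced in miniature without MNIST. The following sketch is a toy setup of my own, not the companion repository's test: it stores four orthogonal, zero-mean patterns and four "mostly background" patterns of 16 pixels with the plain Hebb rule and checks which of them survive one sign update. The pattern construction and the helper names are assumptions made for illustration.

```python
def hebb_weights(patterns):
    """Plain Hebb rule: sum of outer products, zero diagonal, scaled by 1/n."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def sign_update(w, state):
    """One synchronous sign update of all units."""
    return [1 if sum(wij * sj for wij, sj in zip(row, state)) >= 0 else -1 for row in w]

def count_stable(patterns):
    """Number of stored patterns that are fixed points of the update."""
    w = hebb_weights(patterns)
    return sum(sign_update(w, p) == p for p in patterns)

n = 16
# Walsh-type patterns: mutually orthogonal, each with mean zero.
orthogonal = [[1 if (i >> k) & 1 == 0 else -1 for i in range(n)] for k in range(4)]
# MNIST-like patterns: two "ink" pixels on a background of -1.
biased = [[1 if i in (2 * k, 2 * k + 1) else -1 for i in range(n)] for k in range(4)]

stable_orthogonal = count_stable(orthogonal)  # 4 of 4 patterns are fixed points
stable_biased = count_stable(biased)          # 0 of 4 patterns are fixed points
sink = sign_update(hebb_weights(biased), biased[0])
print(stable_orthogonal, stable_biased, sink)  # sink is the all-background state
```

The biased patterns overlap so strongly through their shared background that the common component dominates every local field, and each stored pattern collapses onto the all-background state in a single step.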

Sources

Literature

Original works

  • E. Ising. Beitrag zur Theorie des Ferromagnetismus. Z. Phys. 31, 253–258 (1925). Springer
  • D. O. Hebb. The Organization of Behavior: A Neuropsychological Theory. John Wiley & Sons, New York (1949). Internet Archive (full text)
  • J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. PNAS 79(8): 2554–2558 (1982). PNAS
  • L. Personnaz, I. Guyon, G. Dreyfus. Information storage and retrieval in spin-glass like neural networks. J. Phys. Lett. 46, 359–365 (1985). EDP Sciences
  • D. Krotov, J. J. Hopfield. Dense Associative Memory for Pattern Recognition. NeurIPS 2016. arXiv:1606.01164
  • A. Vaswani, N. Shazeer, N. Parmar et al. Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
  • H. Ramsauer, B. Schäfl, J. Lehner et al. Hopfield Networks is All You Need. ICLR 2021. arXiv:2008.02217
  • M. Negri, F. Tudisco, C. Lucibello et al. Random Features Hopfield Networks generalize retrieval to previously unseen examples (2024). arXiv:2407.05658

Applications

  • M. Widrich, B. Schäfl, M. Pavlović et al. Modern Hopfield Networks and Attention for Immune Repertoire Classification. NeurIPS 2020. arXiv:2007.13505
  • F. Cai, S. Kumar, T. Van Vaerenbergh et al. Power-efficient combinatorial optimization using intrinsic noise in memristor Hopfield neural networks. Nature Electronics 3, 409–418 (2020). Nature Electronics
  • J. Schimunek, P. Seidl, L. Friedrich et al. Context-enriched molecule representations improve few-shot drug discovery (2023). arXiv:2305.09481
  • Z. Sun et al. Memristor attractor network model. Nature Communications (September 2024). Peking University press release

Tools and code

  • ml-jku Linz. hopfield-layers – PyTorch implementation of the Modern Hopfield layers. GitHub

Academic recognition

  • Nobel Prize in Physics 2024 – J. J. Hopfield, G. E. Hinton. For foundational discoveries and inventions that enable machine learning with artificial neural networks. nobelprize.org

Scientific companion material for this post

The experiments and findings on which Chapter 3 (bias sink), Chapter 4 (capacity limit of PI) and Chapter 6 (random-features generalisation) rest were carried out and documented independently in preparation for this post. The full code repository with all detailed analyses will eventually be made public; until then, access is available on request.