Blog Post · Mathematics · AI

Eigenvalues & AI

Why artificial intelligence is an eigenvalue problem – and what that has to do with quantum mechanics. A mathematical chain in seven chapters, with interactive visualizations.

KI-Mathias · ~50 min read

Chapter 1

The Glass Bead Game

Every time you run a Google search, a computer solves an eigenvalue problem. It computes which webpage is the most important – and the answer is an eigenvector.

Every time a neural network recognizes an image or writes a text, eigenvalues control what the network learns and what it ignores.

And in quantum mechanics? The allowed energies of an atom – the discrete spectral lines that Niels Bohr could not explain in 1913 – are also eigenvalues. Schrödinger titled his famous 1926 paper simply: “Quantization as an Eigenvalue Problem.”

Three different worlds. The same mathematics. That is no coincidence – and in this post, you will see why.

We build a chain: from the simplest linear regression through kernel methods and Google PageRank to neural networks. Each link grows from the previous one. At the end stands the insight: Artificial intelligence is an eigenvalue problem.

If you have read the quantum post: There you saw the propagator \(K(B,A,t) = \sum_n \psi_n(B)\,\psi_n^*(A)\,e^{-i E_n t/\hbar}\) – a sum over eigenstates. The energies \(E_n\) are eigenvalues. Here you will discover that the kernel in AI has exactly the same structure. Same letter \(K\). Same pattern. More in the epilogue.

Chapter 2

The Shortest Path

A shadow on the ground

Imagine standing in the midday sun. Your shadow falls straight onto the ground – a flat image of your three-dimensional self. That is a projection: the best possible representation of you in a lower dimension.

What does “best possible” mean? The shortest distance from a point to a surface is always perpendicular. That is the foundation of everything that follows.

The dot product: two vectors, one number

To express “perpendicular” mathematically, we need the dot product: \(\langle \mathbf{a}, \mathbf{b} \rangle = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n\). Two vectors are perpendicular when their dot product is zero.

Projection onto a line

Given a vector \(\mathbf{y}\) and a direction \(\mathbf{a}\), the closest point on the line through \(\mathbf{a}\) is:

$$\hat{\mathbf{y}} = \frac{\langle \mathbf{a}, \mathbf{y} \rangle}{\langle \mathbf{a}, \mathbf{a} \rangle} \cdot \mathbf{a}$$

The error \(\mathbf{y} - \hat{\mathbf{y}}\) is perpendicular to \(\mathbf{a}\). That is the whole idea.

(Interactive figure: the line through \(\mathbf{a}\), the projection \(\hat{\mathbf{y}}\) of \(\mathbf{y}\), and the residual \(\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}\).)

The general case: the normal equation

In practice, you project onto a subspace spanned by the columns of a matrix \(X\). The condition “error perpendicular to all columns” becomes the normal equation:

$$X^T X\,\mathbf{c} = X^T\mathbf{y} \quad\Longrightarrow\quad \mathbf{c} = (X^TX)^{-1}\,X^T\mathbf{y}$$

Concrete example: \(X = (1,\; 0.5)^T\), \(\mathbf{y} = (3,\; 4)^T\). Then \(X^TX = 1.25\), \(X^T\mathbf{y} = 5\), so \(c = 4\). The projection is \(\hat{\mathbf{y}} = (4,\; 2)^T\). The error \((-1,\; 2)^T\) is perpendicular to \(X\): \(1 \cdot (-1) + 0.5 \cdot 2 = 0\). ✓
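If you want to check the arithmetic, the whole example fits in a few lines of NumPy (a sketch; the variable names are mine):

```python
import numpy as np

# Design matrix (one column) and target vector from the example
X = np.array([[1.0], [0.5]])
y = np.array([3.0, 4.0])

# Normal equation: c = (X^T X)^{-1} X^T y  -- solve, don't invert
c = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ c        # projection of y onto the column space of X
r = y - y_hat        # residual

print(c)             # [4.]
print(y_hat)         # [4. 2.]
print(X.T @ r)       # [0.]  -- the error is perpendicular to X
```

In real code you would reach for `np.linalg.lstsq`, which solves the same problem without ever forming \(X^TX\).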

Beautiful formula. But at its center sits \((X^TX)^{-1}\) – a matrix inversion. It costs \(O(n^3)\) operations, is numerically unstable, and – as we will see – unnecessary.

Chapter 3

Patience over Genius

What if we never had to compute the inverse? What if all we need to do is correct our own error, again and again?

Project. Error. Repeat.

Step 1: Compute a rough first approximation: \(\hat{\mathbf{y}}_1 = X \cdot X^T \mathbf{y}\).

Step 2: Compute the error (the residual): \(\mathbf{r}_1 = \mathbf{y} - \hat{\mathbf{y}}_1\).

Step 3: Project the error itself and add it: \(\hat{\mathbf{y}}_2 = \hat{\mathbf{y}}_1 + X \cdot X^T \mathbf{r}_1\).

And again. And again. Each iteration brings us closer to the exact solution.

The pattern in the residuals

After \(n\) steps, the residual has the form:

$$\mathbf{r}_n = (I - X \cdot X^T)^n \cdot \mathbf{y}$$

If the matrix \((I - XX^T)\) “shrinks” – all its eigenvalues have magnitude less than 1 – then \(\mathbf{r}_n \to \mathbf{0}\). The error vanishes. (This assumes the data is suitably scaled. In practice, one introduces a learning rate \(\eta\) to ensure this – we keep it simple here.)

The mathematical structure behind this convergence is the Neumann series:

$$\sum_{k=0}^{\infty} A^k = (I - A)^{-1} \quad\text{when all eigenvalues of } A \text{ have magnitude less than 1}$$

This is the matrix version of the geometric series \(\sum q^k = 1/(1-q)\). And it means: the iteration, written in coefficient form, converges to \(\mathbf{c} = (X^TX)^{-1}X^T\mathbf{y}\) – exactly the closed-form solution from Chapter 2.
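You can verify the Neumann series numerically with any small matrix whose eigenvalues lie inside the unit circle (the matrix below is an arbitrary example):

```python
import numpy as np

A = np.array([[0.2, 0.1],
              [0.0, 0.5]])   # eigenvalues 0.2 and 0.5, both |lambda| < 1

# Partial sums of the Neumann series  I + A + A^2 + ...
S = np.zeros_like(A)
P = np.eye(2)
for _ in range(100):
    S = S + P
    P = P @ A

exact = np.linalg.inv(np.eye(2) - A)
print(np.allclose(S, exact))  # True
```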

Concrete example: numbers instead of symbols

For \(X = (1,\; 0.5)^T\) and \(\mathbf{y} = (3,\; 4)^T\) with exact solution \(\hat{\mathbf{y}}^* = (4,\; 2)^T\):

| \(n\) | \(\hat{\mathbf{y}}_n\) | \(\mathbf{r}_n\) | \(\|\mathbf{r}_n\|\) |
|---|---|---|---|
| 0 | (0, 0) | (3, 4) | 5.00 |
| 1 | (3.13, 1.56) | (−0.13, 2.44) | 2.44 |
| 2 | (3.47, 1.73) | (−0.47, 2.27) | 2.32 |
| 5 | (3.84, 1.92) | (−0.84, 2.08) | 2.25 |
| 20 | (3.99, 2.00) | (−0.99, 2.00) | 2.24 |
| ∞ | (4, 2) | (−1, 2) | 2.24 |

The approximation moves closer to (4, 2) with each iteration. And the final error \((-1, 2)\) is perpendicular to \(X\): \(1 \cdot (-1) + 0.5 \cdot 2 = 0\). Exactly as it should be.
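The iteration itself is a five-line loop. The step size \(\eta\) below is my own choice (any \(0 < \eta < 2/\mu_{\max}\) works; the steps behind the table above may have used different ones), but the limit is the same:

```python
import numpy as np

X = np.array([[1.0], [0.5]])
y = np.array([3.0, 4.0])
eta = 0.625          # hypothetical learning rate; needs 0 < eta < 2/1.25

y_hat = np.zeros(2)
for _ in range(200):
    r = y - y_hat                          # current residual
    y_hat = y_hat + eta * (X @ (X.T @ r))  # project the error, add it back

print(y_hat)      # converges to the exact projection (4, 2)
print(y - y_hat)  # final residual (-1, 2), perpendicular to X
```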

Try it yourself

Click “Next Step” and watch how the blue approximation approaches the green exact solution. The red arrow (residual) gets shorter with each step.

Wait. Stop. Read that last sentence again. The iteration does not converge to some approximation. It converges to the exact solution – the same one you get from matrix inversion. Without ever inverting a matrix. Just by patiently correcting your own error.

The elegant proof

At equilibrium, the coefficients no longer change: \(\mathbf{c}' = X^T\mathbf{y} + (I - X^TX)\,\mathbf{c}'\). Rearranging: \(X^TX\,\mathbf{c}' = X^T\mathbf{y}\), hence \(\mathbf{c}' = (X^TX)^{-1}X^T\mathbf{y}\). The normal equation. Q.E.D.

What you now hold: Iteration replaces inversion. Patience replaces genius. But a question remains: why does it converge? What determines the speed?

Chapter 4

What Matrices Really Do

A matrix is a machine: vector in, vector out. Most vectors get both stretched and rotated. But there are specialists – vectors whose direction does not change:

$$A\mathbf{v} = \lambda\,\mathbf{v}$$

Such a \(\mathbf{v}\) is called an eigenvector, the \(\lambda\) its eigenvalue. The matrix merely scales the eigenvector by \(\lambda\) – no rotation, no tilting.

A worked example

Take \(A = \bigl(\begin{smallmatrix} 2 & 1 \\ 0 & 3 \end{smallmatrix}\bigr)\). Characteristic polynomial: \(\det(A - \lambda I) = (2-\lambda)(3-\lambda) = 0\). Eigenvalues: \(\lambda_1 = 2\), \(\lambda_2 = 3\).

Eigenvectors: for \(\lambda_1 = 2\), solve \((A-2I)\mathbf{v} = 0\): \(\mathbf{v}_1 = (1, 0)^T\). For \(\lambda_2 = 3\): \(\mathbf{v}_2 = (1, 1)^T\). Check: \(A \cdot (1,1)^T = (3,3)^T = 3 \cdot (1,1)^T\). ✓
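The same check in code (NumPy's `eig` returns eigenvalues in no guaranteed order, hence the sort):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

lam, V = np.linalg.eig(A)
print(np.sort(lam))   # [2. 3.]

# Verify A v = lambda v for every eigenpair
for i in range(len(lam)):
    assert np.allclose(A @ V[:, i], lam[i] * V[:, i])
```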

Try it yourself

Drag the white vector. When it points along an eigenvector, \(A\mathbf{v}\) changes only the length – not the direction. Try different matrices to compare stretching, shearing, and rotation.

What does this mean concretely?

Apply the matrix 10 times: \(\lambda_1^{10} = 2^{10} = 1024\), but \(\lambda_2^{10} = 3^{10} = 59\,049\). The larger eigenvalue dominates. After 20 applications: \(2^{20} \approx 10^6\), \(3^{20} \approx 3.5 \times 10^9\). The direction of \(\mathbf{v}_2\) overwhelms everything else.
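This dominance is exactly what power iteration exploits: apply \(A\) repeatedly to almost any start vector and the direction settles onto the dominant eigenvector \((1, 1)\). A sketch with the matrix from above (the start vector is arbitrary, as long as it has a component along \(\mathbf{v}_2\)):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
x = np.array([0.0, 1.0])        # arbitrary start vector

for _ in range(20):
    x = A @ x
    x = x / np.linalg.norm(x)   # renormalize to avoid overflow

print(x)  # ~ (0.707, 0.707) -- the direction of the dominant eigenvector
```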

Why eigenvalues determine convergence

The key formula: \(A^n = V\,\Lambda^n\,V^{-1}\), where \(\Lambda\) is the diagonal matrix of eigenvalues.

• If \(|\lambda| < 1\): \(\lambda^n \to 0\) – the component vanishes.
• If \(|\lambda| = 1\): stays constant.
• If \(|\lambda| > 1\): \(\lambda^n \to \infty\) – explosion.

Try it yourself

Slide \(|\lambda|\) and watch what happens to \(\lambda^n\). Below 1: the curve falls to zero. Above 1: it explodes.

Now the connection to Chapter 3: if \(\mu\) is an eigenvalue of \(X^TX\), then \((1 - \mu)\) is an eigenvalue of \((I - X^TX)\). The iteration converges when \(|1 - \mu| < 1\) for all \(\mu\). The eigenvalues of \(X^TX\) determine everything.

Symmetric matrices (and \(X^TX\) is always symmetric) have only real eigenvalues and orthogonal eigenvectors. Large eigenvalues = directions with lots of variation (signal). Small = little variation (noise).

The DNA of a matrix is its eigenvalues. They tell you whether an iterative process converges, how fast, and where to. And they separate signal from noise.

Chapter 5

Stopping Early Pays Off

Imagine a student who memorizes 20 practice problems. In practice: 20 out of 20. On the exam with new problems: 8 out of 20. They learned the problems, not the principles. This is called overfitting.

What this has to do with our iteration

After \(n\) iterations, the component along the \(i\)-th eigenvector is retained by the factor \(1 - (1-\mu_i)^n\). Sounds abstract – but look at the numbers:

| \(\mu_i\) | \(1-\mu_i\) | after 3 iter. | after 10 | after 100 |
|---|---|---|---|---|
| 1.0 (strong) | 0.0 | 100% | 100% | 100% |
| 0.5 | 0.5 | 87.5% | 99.9% | ≈100% |
| 0.1 | 0.9 | 27.1% | 65.1% | ≈100% |
| 0.01 (weak) | 0.99 | 3.0% | 9.6% | 63.4% |

Large eigenvalues (strong signal): immediately at 100%. Small eigenvalues (often noise): need hundreds of iterations. Fewer iterations = only signal is learned. More = noise too. Early stopping is a natural noise filter.

The equivalence to Ridge Regression

The classical remedy for overfitting – Ridge Regression – adds \(\lambda\) to the diagonal: \(\mathbf{c}_\lambda = (X^TX + \lambda I)^{-1}X^T\mathbf{y}\). The filter factor is \(\mu_i/(\mu_i + \lambda)\).

Compare with our iteration factor \(1 - (1-\mu_i)^n\). Both have the same effect: pass large \(\mu_i\), dampen small ones. As a rule of thumb (for moderate \(n\) and small \(\mu_i\)):

$$\lambda \approx \frac{1}{n}$$

More iterations = smaller \(\lambda\) = less regularization. Fewer iterations = larger \(\lambda\) = smoother solution. (The relationship is not exact, but both filter curves have the same shape for practical purposes.)
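You can put both filter factors side by side in code. For small \(\mu_i\), both behave like \(n\,\mu_i\) – that is where the rule of thumb \(\lambda \approx 1/n\) comes from (the numbers below are illustrative):

```python
import numpy as np

n = 10
lam = 1.0 / n                            # rule of thumb: lambda ~ 1/n
mu = np.array([1.0, 0.5, 0.1, 0.01])     # eigenvalues of X^T X

iter_filter  = 1 - (1 - mu) ** n         # early stopping after n steps
ridge_filter = mu / (mu + lam)           # Ridge regression

print(iter_filter)    # large mu pass almost fully, mu=0.01 is damped hard
print(ridge_filter)   # same qualitative shape
```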

Try it yourself

Slide \(n\) and observe: the iteration filter (left, cyan) and the Ridge filter (right, yellow) have the same shape!

Wait. Stop. Every textbook introduces Ridge Regression as an artificial penalty term. But in reality, regularization is the most natural thing in the world: stop when it is good enough. The ridge parameter does not fall from the sky. It is the consequence of a simple decision: how many times do I correct my error?

The formula: \(\lambda \approx 1/n\). Early stopping = Ridge Regression. Both separate signal from noise, controlled by the eigenvalues of \(X^TX\).

Chapter 6

The Kernel Trick

So far everything was linear. But the world is not linear. Do we need a new theory? No. We need a new way of looking at the data.

Expanding features

If data is not linear in \(x\), make it linear in \((1, x, x^2)\). Extend the design matrix through a feature map \(\phi(x)\). In the new space the problem is linear – our entire machinery works again.

The key observation

In the iteration, the Gram matrix \(XX^T\) appears everywhere – the matrix of pairwise dot products between data points. (Its nonzero eigenvalues are the same as those of \(X^TX\).) We never need the features themselves – only their inner products.

A kernel function \(k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle\) computes this inner product directly. The Gaussian kernel \(k(x,z) = \exp(-\|x-z\|^2/(2\sigma^2))\) has an infinite-dimensional feature space. But the kernel matrix \(K\) stays finite: \(n \times n\).

The transition: nothing changes

Replace \(XX^T\) with \(K\). The iteration becomes \(\hat{\mathbf{y}}_{n+1} = \hat{\mathbf{y}}_n + K(\mathbf{y} - \hat{\mathbf{y}}_n)\). Same algorithm. Different language. Infinite expressive power.

Try it yourself

Move the sliders for \(\lambda\) (regularization) and \(\sigma\) (kernel width). Small \(\lambda\) fits the data tightly. Large \(\lambda\) makes it too smooth. Find the sweet spot.

Prediction for a new point: \(f(x_*) = \sum_{i=1}^n \alpha_i\,k(x_i, x_*)\) – each training point “votes”, weighted by similarity.
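A complete kernel ridge regression is barely a dozen lines. Everything below – the data, \(\sigma\), \(\lambda\) – is invented for illustration; only the structure is the point:

```python
import numpy as np

def gauss_kernel(a, b, sigma=0.3):
    """Gaussian kernel matrix between two sets of 1-D points."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

# Toy data: a noisy sine (hypothetical)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

# Fit: alpha = (K + lambda I)^{-1} y
lam = 1e-3
K = gauss_kernel(x, x)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

# Predict: every training point votes, weighted by kernel similarity
x_star = np.linspace(0, 1, 100)
f = gauss_kernel(x_star, x) @ alpha
```

The training data really is the model: \(f\) is nothing but the vector \(\alpha\) plus the kernel.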

Let that sink in: there are no abstract parameters, no weight matrices, no hidden layers. The training data is the model. The function \(f\) is completely determined by the data points and their kernel similarities. This is the Representer Theorem – and it holds for any loss function, not just the squared error.

The core of the kernel trick: \(XX^T \to K\). Nothing else changes. But the feature space becomes infinite-dimensional – and the expressive power universal.

Chapter 7

The Grand Unification

Light bouncing off walls

Imagine a white room with one deep-red wall. A lamp on the ceiling. What do you see? Not just red and white surfaces – but a soft reddish glow on the white floor near the red wall. Color bleeding: light bounces from the red wall to the floor, carrying color with it.

In computer graphics, the radiosity equation describes this: \(\mathbf{B} = \mathbf{E} + \rho F \mathbf{B}\). The solution?

$$\mathbf{B} = (I - \rho F)^{-1}\,\mathbf{E} = \sum_{k=0}^{\infty} (\rho F)^k\,\mathbf{E}$$

The same Neumann series as Chapter 3! Only the meaning changed:

• \(k=0\): Direct light – the lamp shines.
• \(k=1\): First bounce – light reflects once.
• \(k=2\): Second bounce – reflected light reflects again.
• Typically 3–5 bounces suffice – by then about 99% of the energy has been absorbed.
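The bounce-by-bounce picture translates directly into code. A toy scene with two patches (the reflectances and form factors are invented):

```python
import numpy as np

# rho * F for a hypothetical 2-patch scene; rows sum to < 1 (substochastic)
rhoF = np.array([[0.0, 0.4],
                 [0.3, 0.0]])
E = np.array([1.0, 0.0])     # patch 0 is the lamp

# Sum the bounces: B = E + (rhoF)E + (rhoF)^2 E + ...
B = np.zeros(2)
term = E.copy()
for bounce in range(5):
    B = B + term
    term = rhoF @ term

exact = np.linalg.solve(np.eye(2) - rhoF, E)
print(B, exact)   # five bounces already land very close to the exact solve
```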

The matrix \(\rho F\) is substochastic (rows sum to less than 1). Add an “absorber dimension” and it becomes stochastic (rows sum to 1). A stochastic matrix is a Markov chain. Its equilibrium is the dominant eigenvector – eigenvalue 1.

The surfer who never stops

Where in real life is there a massive eigenvalue problem for a stochastic matrix? Every time you use Google.

(Interactive figure: a mini-web of four linked pages A, B, C, D.)

A random surfer starts on a webpage, clicks a random outgoing link at each step. The Google matrix adds a damping factor: with \(d \approx 0.85\) the surfer follows a link; with \(1-d \approx 0.15\) they teleport to a random page:

$$G = d \cdot M + \frac{1-d}{n}\,\mathbf{1}\mathbf{1}^T$$

Power iteration \(\mathbf{r}^{(k+1)} = G\,\mathbf{r}^{(k)}\) converges to the dominant eigenvector – the PageRank. For our mini-web: B ranks highest (34%), followed by A (28%), D (24%), C (14%). Google computes this for billions of pages in ~50 iterations.
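Power iteration for PageRank fits in a few lines. The link structure below is hypothetical – the mini-web's links aren't fully specified here – so its ranks are illustrative, not the 34/28/24/14% from above:

```python
import numpy as np

# Column-stochastic link matrix M: column j = where page j links to.
# Hypothetical mini-web: A->{B,C}, B->D, C->A, D->{A,B}
M = np.array([
    [0.0, 0.0, 1.0, 0.5],   # ...to A
    [0.5, 0.0, 0.0, 0.5],   # ...to B
    [0.5, 0.0, 0.0, 0.0],   # ...to C
    [0.0, 1.0, 0.0, 0.0],   # ...to D
])
n, d = 4, 0.85
G = d * M + (1 - d) / n * np.ones((n, n))   # Google matrix

r = np.full(n, 1.0 / n)        # start uniform
for _ in range(100):
    r = G @ r                  # power iteration

print(r)                                  # PageRank vector, sums to 1
print(np.allclose(G @ r, r, atol=1e-6))   # True: eigenvector, eigenvalue 1
```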

Wait. Stop. Regression (Chapter 3), radiosity, PageRank – three different problems, three times the same algorithm. Iterative matrix multiplication, controlled by eigenvalues. Three times the Neumann series. Three times convergence to the dominant eigenvector.

As powerful as a brain

The Representer Theorem (Kimeldorf & Wahba, 1970): the optimal KRR solution always has the form \(f^*(x) = \sum_{i=1}^n \alpha_i\,k(x_i, x)\).

The Universal Approximation Theorem (Cybenko, 1989): a neural network with one hidden layer can approximate any continuous function arbitrarily well.

The Gaussian kernel is universal (Micchelli et al., 2006): KRR with Gaussian kernel can also approximate any continuous function.

Both function classes are dense in \(C(K)\). They are equally powerful. And the Neural Tangent Kernel (Jacot et al., 2018) shows: in the idealized limit of infinite width and suitable initialization, a neural network literally becomes KRR. The connection is not a metaphor – it is a theorem (albeit a limiting-case theorem).

Of course, modern AI – Transformers, attention, stochastic gradient descent in non-convex landscapes – is more than an eigenvalue problem. But the eigenvalue structure is the mathematical core running through every layer. It does not explain everything – but it explains why the foundation holds.

Regression:  \(\hat{\mathbf{y}} = (X^TX + \lambda I)^{-1}X^T\mathbf{y}\) — eigenvalues control convergence
Radiosity:  \(\mathbf{B} = (I - \rho F)^{-1}\mathbf{E}\) — eigenvalues control light distribution
PageRank:  \(\mathbf{r} = \text{dom. eigenvector of } G\) — eigenvalue 1 determines ranking
KRR ≡ NN:  both dense in \(C(K)\) — NTK eigenvalues control learning

Epilogue

K = K: The Glass Bead Game

At the beginning of this post stood a formula from quantum mechanics. At the end stands a formula from machine learning. Place them side by side:

Propagator (Quantum Mechanics):
\(\displaystyle K(B,A,t) = \sum_n \psi_n(B)\;\psi_n^*(A)\;e^{-i\,E_n\,t/\hbar}\)

Kernel (Machine Learning – Mercer decomposition):
\(\displaystyle K(x,x') = \sum_n \lambda_n\;\psi_n(x)\;\psi_n(x')\)

Same letter \(K\). Same pattern: a sum over eigenfunctions \(\psi_n\), weighted by eigenvalues. The propagator describes how a quantum particle travels from A to B. The kernel describes how similar two data points are. Different stages, same mathematics. (The \(\psi_n\) here are the Mercer eigenfunctions of the kernel – related to, but distinct from, the feature map \(\phi\) in Chapter 6.)
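The Mercer structure is not just an analogy on paper – you can see it numerically. Eigendecompose a kernel matrix and rebuild it from its eigenpairs (the points and kernel width are arbitrary):

```python
import numpy as np

x = np.linspace(0, 1, 8)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.3 ** 2))  # Gaussian kernel

# Symmetric matrix: real eigenvalues, orthonormal eigenvectors
lam, Psi = np.linalg.eigh(K)

# Mercer-style reconstruction: K = sum_n lambda_n psi_n psi_n^T
K_rebuilt = sum(lam[n] * np.outer(Psi[:, n], Psi[:, n]) for n in range(len(x)))
print(np.allclose(K, K_rebuilt))  # True
```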

In the quantum post, you saw how standing waves lead to discrete energy levels – the tones a quantum system likes to play. Here you saw how the eigenvalues of a kernel matrix determine what an algorithm learns and what it ignores. Same principle. Different stage.

Erwin Schrödinger titled his 1926 paper: “Quantization as an Eigenvalue Problem.”

Perhaps we should add: Intelligence as an eigenvalue problem.

Nature is richer than any single description we can give of it. But sometimes different descriptions share the same mathematical core. Eigenvalues are that core.