Blog Post · Mathematics · AI

Eigenvalues & AI

Why artificial intelligence is an eigenvalue problem – and what that has to do with quantum mechanics. A mathematical chain in seven chapters, with interactive visualizations.

KI-Mathias · ~50 min read

Prefer watching over reading?

8 minutes · From stock prices to a language model in the browser

Watch on YouTube

Chapter 1

The Glass Bead Game

Every time you run a Google search, a computer solves an eigenvalue problem. It computes which webpage is the most important – and the answer is an eigenvector.

Every time a neural network recognizes an image or writes a text, eigenvalues control what the network learns and what it ignores.

And in quantum mechanics? The allowed energies of an atom – the discrete spectral lines that Niels Bohr could not explain in 1913 – are also eigenvalues. Schrödinger titled his famous 1926 paper simply: “Quantization as an Eigenvalue Problem.”

Three different worlds. The same mathematics. That is no coincidence – and in this post, you will see why.

We build a chain: from the simplest linear regression through kernel methods and Google PageRank to neural networks. Each link grows from the previous one. At the end stands the insight: Artificial intelligence is an eigenvalue problem.

If you have read the quantum post: There you saw the propagator \(K(B,A,t) = \sum_n \psi_n(B)\,\psi_n^*(A)\,e^{-i E_n t/\hbar}\) – a sum over eigenstates. The energies \(E_n\) are eigenvalues. Here you will discover that the kernel in AI has exactly the same structure. Same letter \(K\). Same pattern. More in the epilogue.

Chapter 2

The Shortest Path

A shadow on the ground

Imagine standing in the midday sun. Your shadow falls straight onto the ground – a flat image of your three-dimensional self. That is a projection: the best possible representation of you in a lower dimension.

What does “best possible” mean? The shortest distance from a point to a surface is always perpendicular. That is the foundation of everything that follows.

The dot product: two vectors, one number

To express “perpendicular” mathematically, we need the dot product: \(\langle \mathbf{a}, \mathbf{b} \rangle = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n\). Two vectors are perpendicular when their dot product is zero.

Projection onto a line

Given a vector \(\mathbf{y}\) and a direction \(\mathbf{a}\), the closest point on the line through \(\mathbf{a}\) is:

$$\hat{y} = \frac{\langle \mathbf{a}, \mathbf{y} \rangle}{\langle \mathbf{a}, \mathbf{a} \rangle} \cdot \mathbf{a}$$

The error \(\mathbf{y} - \hat{y}\) is perpendicular to \(\mathbf{a}\). That is the whole idea.

[Interactive figure: the vector \(\mathbf{y}\), the line through \(\mathbf{a}\), the projection \(\hat{y}\), and the residual \(\mathbf{r} = \mathbf{y} - \hat{y}\).]

The general case: the normal equation

In practice, you project onto a subspace spanned by the columns of a matrix \(X\). The condition “error perpendicular to all columns” becomes the normal equation:

$$X^T X\,\mathbf{c} = X^T\mathbf{y} \quad\Longrightarrow\quad \mathbf{c} = (X^TX)^{-1}\,X^T\mathbf{y}$$

Concrete example: \(X = (1,\; 0.5)^T\), \(\mathbf{y} = (3,\; 4)^T\). Then \(X^TX = 1.25\), \(X^T\mathbf{y} = 5\), so \(c = 4\). The projection is \(\hat{\mathbf{y}} = (4,\; 2)^T\). The error \((-1,\; 2)^T\) is perpendicular to \(X\): \(1 \cdot (-1) + 0.5 \cdot 2 = 0\). ✓
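The worked example takes only a few lines to verify numerically. A minimal NumPy sketch (not part of the post's interactive demos, which run in JavaScript):

```python
import numpy as np

# The example from the text: X = (1, 0.5)^T, y = (3, 4)^T.
X = np.array([[1.0], [0.5]])   # one column, two samples
y = np.array([3.0, 4.0])

# Normal equation: solve X^T X c = X^T y (no explicit inverse needed)
c = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ c                  # projection onto the column space of X
r = y - y_hat                  # residual

print(c)        # [4.]
print(y_hat)    # [4. 2.]
print(X.T @ r)  # [0.] -- the residual is perpendicular to X
```

Note that `np.linalg.solve` factors the system instead of forming \((X^TX)^{-1}\) explicitly, which is both cheaper and numerically safer.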

Beautiful formula. But at its center sits \((X^TX)^{-1}\) – a matrix inversion. It costs \(O(n^3)\) operations, is numerically unstable, and – as we will see – unnecessary.

Chapter 3

Patience over Genius

What if we never had to compute the inverse? What if all we need to do is correct our own error, again and again?

Project. Error. Repeat.

Step 1: Compute a rough first approximation: \(\hat{\mathbf{y}}_1 = X \cdot X^T \mathbf{y}\).

Step 2: Compute the error (the residual): \(\mathbf{r}_1 = \mathbf{y} - \hat{\mathbf{y}}_1\).

Step 3: Project the error itself and add it: \(\hat{\mathbf{y}}_2 = \hat{\mathbf{y}}_1 + X \cdot X^T \mathbf{r}_1\).

And again. And again. Each iteration brings us closer to the exact solution.

The pattern in the residuals

After \(n\) steps, the residual has the form:

$$\mathbf{r}_n = (I - X \cdot X^T)^n \cdot \mathbf{y}$$

If the matrix \((I - XX^T)\) “shrinks” – all its eigenvalues have magnitude less than 1 – then \(\mathbf{r}_n \to \mathbf{0}\). The error vanishes. (This assumes the data is suitably scaled. In practice, one introduces a learning rate \(\eta\) to ensure this – we keep it simple here.)

The mathematical structure behind this convergence is the Neumann series:

$$\sum_{k=0}^{\infty} A^k = (I - A)^{-1} \quad\text{when all eigenvalues of } A \text{ have magnitude less than 1}$$

This is the matrix version of the geometric series \(\sum q^k = 1/(1-q)\). And it means: our iteration converges to the closed-form solution from Chapter 2 – the coefficients \((X^TX)^{-1}X^T\mathbf{y}\), and with them the projection \(X(X^TX)^{-1}X^T\mathbf{y}\).
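The Neumann series can be checked directly. A small sketch with an arbitrary matrix whose eigenvalues lie safely below 1 in magnitude:

```python
import numpy as np

# Neumann series: sum_k A^k approaches (I - A)^{-1}
# when every eigenvalue of A has magnitude below 1.
A = np.array([[0.5, 0.2],
              [0.1, 0.3]])     # spectral radius ~0.57, well below 1
I = np.eye(2)

partial = np.zeros((2, 2))
term = I                       # A^0
for _ in range(100):
    partial += term            # accumulate the partial sum
    term = term @ A            # next power of A

closed_form = np.linalg.inv(I - A)
print(np.max(np.abs(partial - closed_form)))  # essentially zero
```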

Concrete example: numbers instead of symbols

For \(X = (1,\; 0.5)^T\) and \(\mathbf{y} = (3,\; 4)^T\) with exact solution \(\hat{\mathbf{y}}^* = (4,\; 2)^T\):

| \(n\) | \(\hat{\mathbf{y}}_n\) | \(\mathbf{r}_n\) | \(\|\mathbf{r}_n\|\) |
|---|---|---|---|
| 0 | (0, 0) | (3, 4) | 5.00 |
| 1 | (3.13, 1.56) | (−0.13, 2.44) | 2.44 |
| 2 | (3.47, 1.73) | (−0.47, 2.27) | 2.32 |
| 5 | (3.84, 1.92) | (−0.84, 2.08) | 2.25 |
| 20 | (3.99, 2.00) | (−0.99, 2.00) | 2.24 |
| ∞ | (4, 2) | (−1, 2) | 2.24 |

The approximation moves closer to (4, 2) with each iteration. And the final error \((-1, 2)\) is perpendicular to \(X\): \(1 \cdot (-1) + 0.5 \cdot 2 = 0\). Exactly as it should be.
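The project-error-repeat loop is three lines of NumPy. A sketch with a small learning rate \(\eta\) (as hedged above, \(\eta\) keeps the eigenvalues of \(I - \eta XX^T\) below 1 in magnitude; the interactive demo's step size may differ, so intermediate values will not match the table exactly):

```python
import numpy as np

X = np.array([[1.0], [0.5]])
y = np.array([3.0, 4.0])
eta = 0.2                      # illustrative step size

y_hat = np.zeros(2)
for _ in range(100):
    r = y - y_hat                            # Step 2: the residual
    y_hat = y_hat + eta * (X @ (X.T @ r))    # Step 3: project the error, add it

print(y_hat)      # converges to (4, 2), the exact projection
print(y - y_hat)  # converges to (-1, 2), perpendicular to X
```

No matrix was ever inverted: each pass is just matrix-vector products and one addition.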

Try it yourself

Click “Next Step” and watch how the blue approximation approaches the green exact solution. The red arrow (residual) gets shorter with each step.

Wait. Stop. Read that last sentence again. The iteration does not converge to some approximation. It converges to the exact solution – the same one you get from matrix inversion. Without ever inverting a matrix. Just by patiently correcting your own error.

The elegant proof

At equilibrium, the coefficients no longer change: \(\mathbf{c} = X^T\mathbf{y} + (I - X^TX)\,\mathbf{c}\). Rearranging: \(X^TX\,\mathbf{c} = X^T\mathbf{y}\), hence \(\mathbf{c} = (X^TX)^{-1}X^T\mathbf{y}\). The normal equation. Q.E.D.

What you now hold: Iteration replaces inversion. Patience replaces genius. But a question remains: why does it converge? What determines the speed?

Chapter 4

What Matrices Really Do

A matrix is a machine: vector in, vector out. Most vectors get both stretched and rotated. But there are specialists – vectors whose direction does not change:

$$A\mathbf{v} = \lambda\,\mathbf{v}$$

Such a \(\mathbf{v}\) is called an eigenvector, the \(\lambda\) its eigenvalue. The matrix merely scales the eigenvector by \(\lambda\) – no rotation, no tilting.

A worked example

Take \(A = \bigl(\begin{smallmatrix} 2 & 1 \\ 0 & 3 \end{smallmatrix}\bigr)\). Characteristic polynomial: \(\det(A - \lambda I) = (2-\lambda)(3-\lambda) = 0\). Eigenvalues: \(\lambda_1 = 2\), \(\lambda_2 = 3\).

Eigenvectors: for \(\lambda_1 = 2\), solve \((A-2I)\mathbf{v} = 0\): \(\mathbf{v}_1 = (1, 0)^T\). For \(\lambda_2 = 3\): \(\mathbf{v}_2 = (1, 1)^T\). Check: \(A \cdot (1,1)^T = (3,3)^T = 3 \cdot (1,1)^T\). ✓
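The same worked example, checked with NumPy (a sketch; `eig` does the characteristic-polynomial work for us):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
vals, vecs = np.linalg.eig(A)
print(vals)                    # eigenvalues 2 and 3

# Each column of vecs is an eigenvector; verify A v = lambda v
for lam, v in zip(vals, vecs.T):
    print(np.allclose(A @ v, lam * v))   # True for both
```

NumPy normalizes eigenvectors to unit length, so \((1,1)^T\) appears as \((0.707, 0.707)^T\) – same direction, same eigenvalue.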

Try it yourself

Drag the white vector. When it points along an eigenvector, \(A\mathbf{v}\) changes only the length – not the direction. Try different matrices to compare stretching, shearing, and rotation.

What does this mean concretely?

Apply the matrix 10 times: \(\lambda_1^{10} = 2^{10} = 1024\), but \(\lambda_2^{10} = 3^{10} = 59\,049\). The larger eigenvalue dominates. After 20 applications: \(2^{20} \approx 10^6\), \(3^{20} \approx 3.5 \times 10^9\). The direction of \(\mathbf{v}_2\) overwhelms everything else.

Why eigenvalues determine convergence

The key formula: \(A^n = V\,\Lambda^n\,V^{-1}\), where \(\Lambda\) is the diagonal matrix of eigenvalues.

• If \(|\lambda| < 1\): \(\lambda^n \to 0\) – the component vanishes.
• If \(|\lambda| = 1\): stays constant.
• If \(|\lambda| > 1\): \(\lambda^n \to \infty\) – explosion.
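The key formula \(A^n = V\,\Lambda^n\,V^{-1}\) can be verified directly against repeated multiplication, using the example matrix from above:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
vals, V = np.linalg.eig(A)

n = 10
# Spectral route: A^n = V Lambda^n V^{-1} -- only the eigenvalues get powered
An_spectral = V @ np.diag(vals**n) @ np.linalg.inv(V)
# Direct route: multiply A with itself n times
An_direct = np.linalg.matrix_power(A, n)

print(np.allclose(An_spectral, An_direct))  # True
print(vals**n)  # 2^10 = 1024 vs 3^10 = 59049 -- the larger eigenvalue dominates
```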

Try it yourself

Slide \(|\lambda|\) and watch what happens to \(\lambda^n\). Below 1: the curve falls to zero. Above 1: it explodes.

Now the connection to Chapter 3: if \(\mu\) is an eigenvalue of \(X^TX\), then \((1 - \mu)\) is an eigenvalue of \((I - X^TX)\). The iteration converges when \(|1 - \mu| < 1\) for all \(\mu\). The eigenvalues of \(X^TX\) determine everything.

Symmetric matrices (and \(X^TX\) is always symmetric) have only real eigenvalues and orthogonal eigenvectors. Large eigenvalues = directions with lots of variation (signal). Small = little variation (noise).

The DNA of a matrix is its eigenvalues. They tell you whether an iterative process converges, how fast, and where to. And they separate signal from noise.

Chapter 5

Stopping Early Pays Off

Imagine a student who memorizes 20 practice problems. In practice: 20 out of 20. On the exam with new problems: 8 out of 20. They learned the problems, not the principles. This is called overfitting.

What this has to do with our iteration

After \(n\) iterations, the component along the \(i\)-th eigenvector is retained by the factor \(1 - (1-\mu_i)^n\). Sounds abstract – but look at the numbers:

| \(\mu_i\) | \(1-\mu_i\) | after 3 iter. | after 10 | after 100 |
|---|---|---|---|---|
| 1.0 (strong) | 0.0 | 100% | 100% | 100% |
| 0.5 | 0.5 | 87.5% | 99.9% | ≈100% |
| 0.1 | 0.9 | 27.1% | 65.1% | ≈100% |
| 0.01 (weak) | 0.99 | 3.0% | 9.6% | 63.4% |
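The table is one formula applied four times. A quick sketch that reproduces it:

```python
import numpy as np

def iter_filter(mu, n):
    """Fraction of the mu-eigendirection retained after n iterations."""
    return 1.0 - (1.0 - mu) ** n

for mu in (1.0, 0.5, 0.1, 0.01):
    row = [iter_filter(mu, n) for n in (3, 10, 100)]
    print(f"mu={mu:<5}", [f"{f:.1%}" for f in row])
```

Strong directions saturate immediately; the \(\mu = 0.01\) direction is still only 63% learned after 100 iterations.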

Large eigenvalues (strong signal): immediately at 100%. Small eigenvalues (often noise): need hundreds of iterations. Fewer iterations = only signal is learned. More = noise too. Early stopping is a natural noise filter.

The equivalence to Ridge Regression

The classical remedy for overfitting – Ridge Regression – adds \(\lambda\) to the diagonal: \(\mathbf{c}_\lambda = (X^TX + \lambda I)^{-1}X^T\mathbf{y}\). The filter factor is \(\mu_i/(\mu_i + \lambda)\).

Compare with our iteration factor \(1 - (1-\mu_i)^n\). Both have the same effect: pass large \(\mu_i\), dampen small ones. As a rule of thumb (for moderate \(n\) and small \(\mu_i\)):

$$\lambda \approx \frac{1}{n}$$

More iterations = smaller \(\lambda\) = less regularization. Fewer iterations = larger \(\lambda\) = smoother solution. (The relationship is not exact, but both filter curves have the same shape for practical purposes.)

Try it yourself

Slide \(n\) and observe: the iteration filter (left, cyan) and the Ridge filter (right, yellow) have the same shape!

Wait. Stop. Every textbook introduces Ridge Regression as an artificial penalty term. But in reality, regularization is the most natural thing in the world: stop when it is good enough. The ridge parameter does not fall from the sky. It is the consequence of a simple decision: how many times do I correct my error?

The formula: \(\lambda \approx 1/n\). Early stopping = Ridge Regression. Both separate signal from noise, controlled by the eigenvalues of \(X^TX\).

This equivalence can be stated formally. In the language of spectral filtering, Theorem 1 of the Kalle paper (Leonhardt, 2026) reads:

Theorem 1 (Early Stopping ≈ Ridge Regression). The iterative filter \(f_i^{\mathrm{iter}}(n) = 1 - (1-\mu_i)^n\) and the ridge filter \(f_i^{\mathrm{ridge}}(\lambda) = \mu_i/(\mu_i + \lambda)\) produce equivalent spectral filtering at \(\lambda \approx 1/n\). Both separate signal (large \(\mu_i\)) from noise (small \(\mu_i\)) across the eigenvalue spectrum.

Concrete numbers for \(\lambda = 0.01\) and \(n = 10\) (Table 1 of the paper):

| Eigenvalue \(\mu_i\) | Ridge \(\mu_i/(\mu_i+\lambda)\) | Iter. \(1-(1-\mu_i)^{10}\) | Interpretation |
|---|---|---|---|
| 1.0 | 0.990 | 1.000 | Signal: passes through |
| 0.1 | 0.909 | 0.651 | Moderate: partially filtered |
| 0.01 | 0.500 | 0.096 | Noise: heavily suppressed |
| 0.001 | 0.091 | 0.010 | Noise: nearly eliminated |

The two columns are not identical – but their ranking across the spectrum is the same. Large eigenvalues pass through in both; small ones are suppressed in both. Early stopping is therefore not a trick, but implicit regularization – and the ridge parameter is not an auxiliary variable but the closed-form expression of the same spectral filter.
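Both filter columns come straight from the two formulas in the theorem. A sketch that recomputes them side by side:

```python
import numpy as np

lam, n = 0.01, 10   # the setting of Table 1
for mu in (1.0, 0.1, 0.01, 0.001):
    ridge = mu / (mu + lam)                 # closed-form ridge filter
    iterative = 1.0 - (1.0 - mu) ** n       # early-stopping filter
    print(f"mu={mu:<6} ridge={ridge:.3f} iter={iterative:.3f}")
```

The exact values differ, but the ordering across the spectrum is identical: both pass \(\mu = 1\) nearly untouched and crush \(\mu = 0.001\).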

Chapter 6

The Kernel Trick

So far everything was linear. But the world is not linear. Do we need a new theory? No. We need a new way of looking at the data.

Expanding features

If data is not linear in \(x\), make it linear in \((1, x, x^2)\). Extend the design matrix through a feature map \(\phi(x)\). In the new space the problem is linear – our entire machinery works again.

The key observation

In the iteration, \(X^TX\) appears everywhere – the matrix of pairwise dot products. We never need the features themselves – only their inner products.

A kernel function \(k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle\) computes this inner product directly. The Gaussian kernel \(k(x,z) = \exp(-\|x-z\|^2/(2\sigma^2))\) has an infinite-dimensional feature space. But the kernel matrix \(K\) stays finite: \(n \times n\).

The transition: nothing changes

Replace \(X^TX\) with \(K\). The iteration becomes \(\hat{\mathbf{y}}_{n+1} = \hat{\mathbf{y}}_n + K(\mathbf{y} - \hat{\mathbf{y}}_n)\). Same algorithm. Different language. Infinite expressive power.

Try it yourself

Move the sliders for \(\lambda\) (regularization) and \(\sigma\) (kernel width). Small \(\lambda\) fits the data tightly. Large \(\lambda\) makes it too smooth. Find the sweet spot.

Prediction for a new point: \(f(x_*) = \sum_{i=1}^n \alpha_i\,k(x_i, x_*)\) – each training point “votes”, weighted by similarity.
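The whole of Kernel Ridge Regression fits in a dozen lines. A minimal sketch with a Gaussian kernel on toy data – \(\sigma\), \(\lambda\), and the noisy sine are illustrative choices, not tuned values:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(x, z) = exp(-|x - z|^2 / (2 sigma^2)) for 1-D inputs."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
x = np.linspace(0, 6, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(40)   # noisy sine

lam = 0.1
K = gaussian_kernel(x, x)
# Fit: solve (K + lam I) alpha = y
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

# Predict: f(x*) = sum_i alpha_i k(x_i, x*) -- every training point votes
x_new = np.array([1.5, 3.0])
f_new = gaussian_kernel(x_new, x) @ alpha
print(f_new)   # roughly tracks sin at the query points
```

Notice that the "model" is just `(x, alpha)` – the training points and their weights, exactly as the Representer Theorem promises.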

Let that sink in: there are no abstract parameters, no weight matrices, no hidden layers. The training data is the model. The function \(f\) is completely determined by the data points and their kernel similarities. This is the Representer Theorem – and it holds for any loss function, not just the squared error.

The core of the kernel trick: \(X^TX \to K\). Nothing else changes. But the feature space becomes infinite-dimensional – and the expressive power universal.

Chapter 7

The Grand Unification

Light bouncing off walls

Imagine a white room with one deep-red wall. A lamp on the ceiling. What do you see? Not just red and white surfaces – but a soft reddish glow on the white floor near the red wall. Color bleeding: light bounces from the red wall to the floor, carrying color with it.

In computer graphics, the radiosity equation describes this: \(\mathbf{B} = \mathbf{E} + \rho F \mathbf{B}\). The solution?

$$\mathbf{B} = (I - \rho F)^{-1}\,\mathbf{E} = \sum_{k=0}^{\infty} (\rho F)^k\,\mathbf{E}$$

The same Neumann series as Chapter 3! Only the meaning changed:

• \(k=0\): Direct light – the lamp shines.
• \(k=1\): First bounce – light reflects once.
• \(k=2\): Second bounce – reflected light reflects again.
• Typically 3–5 bounces suffice before 99% of energy is absorbed.

The matrix \(\rho F\) is substochastic (rows sum to less than 1). Add an “absorber dimension” and it becomes stochastic (rows sum to 1). A stochastic matrix is a Markov chain. Its equilibrium is the dominant eigenvector – eigenvalue 1.
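The absorber construction is mechanical. A sketch with a made-up substochastic \(3 \times 3\) transport matrix (the entries are illustrative, not a real form-factor matrix):

```python
import numpy as np

# Substochastic matrix: rows sum to less than 1 (energy is absorbed).
T = np.array([[0.0, 0.5, 0.3],
              [0.4, 0.0, 0.4],
              [0.2, 0.3, 0.0]])

# Absorber construction: one extra state collects the missing probability.
leak = 1.0 - T.sum(axis=1)      # per-row absorbed fraction
P = np.zeros((4, 4))
P[:3, :3] = T
P[:3, 3] = leak                 # route leakage into the absorber
P[3, 3] = 1.0                   # the absorber keeps everything

print(P.sum(axis=1))            # [1. 1. 1. 1.] -- now stochastic
print(np.max(np.abs(np.linalg.eigvals(P))))  # 1.0 -- eigenvalue 1 exists
```

Rows now sum to exactly 1, so \(P\) is a Markov chain with dominant eigenvalue 1.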

The surfer who never stops

Where in real life is there a massive eigenvalue problem for a stochastic matrix? Every time you use Google.

[Interactive figure: a mini-web of four pages A, B, C, and D linked to one another.]

A random surfer starts on a webpage, clicks a random outgoing link at each step. The Google matrix adds a damping factor: with \(d \approx 0.85\) the surfer follows a link; with \(1-d \approx 0.15\) they teleport to a random page:

$$G = d \cdot M + \frac{1-d}{n}\,\mathbf{1}\mathbf{1}^T$$

Power iteration \(\mathbf{r}^{(k+1)} = G\,\mathbf{r}^{(k)}\) converges to the dominant eigenvector – the PageRank. For our mini-web: B ranks highest (34%), followed by A (28%), D (24%), C (14%). Google computes this for billions of pages in ~50 iterations.
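Power iteration itself is a handful of lines. The link structure below is a hypothetical four-page web, not necessarily the mini-web from the figure, so the resulting percentages will differ:

```python
import numpy as np

# Hypothetical link graph: page -> list of outgoing links.
links = {0: [1, 2], 1: [0, 3], 2: [1], 3: [1]}
n, d = 4, 0.85

# Column-stochastic link matrix M: column j spreads page j's rank evenly.
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[dst, src] = 1.0 / len(outs)

# Google matrix: follow a link with prob. d, teleport with prob. 1-d.
G = d * M + (1 - d) / n * np.ones((n, n))

r = np.full(n, 1.0 / n)
for _ in range(50):
    r = G @ r                   # one matrix-vector product per step
print(r)                        # the PageRank vector; entries sum to 1
```

After ~50 steps the vector no longer changes: it is the dominant eigenvector of \(G\), with eigenvalue 1.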

Why making the matrix stochastic is not just elegant – it is a performance trick

There is a subtle algorithmic payoff hidden in the absorber construction that is worth naming explicitly. Compare the two ways to solve the same equation:

Naive approach (Neumann summation): compute the series \(\mathbf{x} = \mathbf{b} + T\mathbf{b} + T^2\mathbf{b} + T^3\mathbf{b} + \ldots\) term by term. Each step requires one matrix-vector multiplication and one vector addition, plus an accumulator to hold the running sum. You cannot release the previous terms – you need the whole partial sum until you stop.

Stochastic approach (power iteration): once \(T\) has been turned into a stochastic matrix \(P\) via the absorber construction, the iteration collapses to \(\mathbf{x}_{k+1} = P\mathbf{x}_k\). No accumulator. No additions between steps. Just one matrix-vector multiplication, then overwrite the vector, repeat. Memory stays at \(O(n)\) throughout.

Why does this matter in practice?

• GPU-friendliness: matrix-vector products are the single operation GPUs are optimized for. Power iteration is essentially one BLAS gemv call per step. Neumann summation interleaves gemv with axpy (vector addition) and requires the accumulator to stay resident, increasing memory bandwidth pressure.
• No intermediate storage: in Neumann summation, the partial sum \(\sum_{k=0}^{N-1} T^k\mathbf{b}\) must be kept separate from the next term \(T^N\mathbf{b}\). In power iteration, only the current vector exists. On a billion-page web graph, this matters.
• Convergence via spectral gap: power iteration's convergence rate is directly \(|\lambda_2(P)|^k\). Google's damping factor \(d\) is engineered exactly to guarantee a usable gap: \(|\lambda_2(G)| \leq d = 0.85\). It is not a detail – it is the numerical stabilizer.

This insight carries over to Kernel Ridge Regression: the ridge parameter \(\lambda\) plays the role of the damping factor \(d\). Both stabilize the iteration by creating a spectral gap. The algorithmic consequence – that you can solve \((K + \lambda I)\boldsymbol{\alpha} = \mathbf{y}\) via GPU-friendly matrix-vector iteration instead of an \(O(D^3)\) direct solve – is the subject of Kalle v2, where we formally derive the absorber construction for KRR and benchmark direct solve against Block Preconditioned Conjugate Gradient (Block-PCG).
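The core idea – solving \((K + \lambda I)\boldsymbol{\alpha} = \mathbf{y}\) with nothing but matrix-vector products – can be sketched with plain conjugate gradients (standard CG, not the paper's Block-PCG; the matrix and \(\lambda\) here are random stand-ins):

```python
import numpy as np

def cg(Amul, b, iters=200, tol=1e-10):
    """Plain conjugate gradients for SPD systems, given only v -> A v."""
    x = np.zeros_like(b)
    r = b - Amul(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = Amul(p)
        a = rs / (p @ Ap)
        x += a * p
        r -= a * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(1)
B = rng.standard_normal((50, 50))
K = B @ B.T / 50               # symmetric PSD stand-in for a kernel matrix
lam = 0.5                      # illustrative ridge parameter
y = rng.standard_normal(50)

Amul = lambda v: K @ v + lam * v    # never form (K + lam I) explicitly
alpha_cg = cg(Amul, y)
alpha_direct = np.linalg.solve(K + lam * np.eye(50), y)
print(np.max(np.abs(alpha_cg - alpha_direct)))  # agrees to high precision
```

The ridge \(\lambda\) lifts the smallest eigenvalue away from zero – the same spectral-gap role the damping factor \(d\) plays for PageRank.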

The surprising finding from the benchmarks. On Kalle's real training matrix (\(D = 6144\), \(V = 2977\)), Table 4 of the paper reports:

| Solver | Time | Iterations | Rel. err vs. direct |
|---|---|---|---|
| Direct (NumPy/LAPACK, Float64) | 2097 ms | 1 | reference |
| Block-PCG (NumPy, Float64) | 6731 ms | 14 | \(4.3 \times 10^{-7}\) |
| Block-PCG (PyTorch CPU, Float32) | 1704 ms | 11 | \(1.0 \times 10^{-5}\) |
| Block-PCG (PyTorch MPS / Apple GPU) | 1622 ms | 11 | \(1.0 \times 10^{-5}\) |

The common intuition says Krylov methods only pay off at large \(D\) – that Gaussian elimination stays faster until several thousand dimensions. The reality: with a modern BLAS back-end (PyTorch, Float32), Block-PCG runs about 4× faster than its NumPy Float64 counterpart and already beats the direct solve at \(D = 6144\). The extra win from Apple's GPU is only 5% – at this matrix size the host-to-device transfer latency nearly eats up the compute saving. The GPU becomes dominant only from \(D \geq 20{,}000\) onwards, where direct solve is no longer practical anyway.

The lesson: the lever for scaling KRR is not primarily hardware, it is the algorithmic reformulation – the absorber construction that re-shapes the system into a form modern matrix libraries can solve efficiently. This is the same intellectual move Google made 25 years ago with PageRank: not bigger servers, but re-framing the problem as an eigenvalue problem of a stochastic matrix.

Wait. Stop. Regression (Chapter 3), radiosity, PageRank – three different problems, three times the same algorithm. Iterative matrix multiplication, controlled by eigenvalues. Three times the Neumann series. Three times convergence to the dominant eigenvector.

As powerful as a brain

The Representer Theorem (Kimeldorf & Wahba, 1971): the optimal KRR solution always has the form \(f^*(x) = \sum_{i=1}^n \alpha_i\,k(x_i, x)\).

The Universal Approximation Theorem (Cybenko, 1989): a neural network with one hidden layer can approximate any continuous function arbitrarily well.

The Gaussian kernel is universal (Micchelli et al., 2006): KRR with Gaussian kernel can also approximate any continuous function.

Both function classes are dense in \(C(K)\). They are equally powerful. And the Neural Tangent Kernel (Jacot et al., 2018) shows: in the idealized limit of infinite width and suitable initialization, a neural network literally becomes KRR. The connection is not a metaphor – it is a theorem (albeit a limiting-case theorem).

Of course, modern AI – Transformers, attention, stochastic gradient descent in non-convex landscapes – is more than an eigenvalue problem. But the eigenvalue structure is the mathematical core running through every layer. It does not explain everything – but it explains why the foundation holds.

Regression:  \(\hat{\mathbf{y}} = (X^TX + \lambda I)^{-1}X^T\mathbf{y}\) — eigenvalues control convergence
Radiosity:  \(\mathbf{B} = (I - \rho F)^{-1}\mathbf{E}\) — eigenvalues control light distribution
PageRank:  \(\mathbf{r} = \text{dom. eigenvector of } G\) — eigenvalue 1 determines ranking
KRR ≡ NN:  both dense in \(C(K)\) — NTK eigenvalues control learning

Try it yourself: KRR Chat

We built a didactic proof-of-concept language model to demonstrate this principle at a scale where the mathematics remains visible – not to compete with production language models. It uses Kernel Ridge Regression with Random Fourier Features. No neural network. Just eigenvalues and matrix multiplication.

Try KRR Chat → · Source on GitHub

Under the Hood: How the KRR Chatbot Works

The KRR Chat is not a simulation – it is a real, working language model, deliberately kept minimal to make the mathematical structure visible rather than hiding it in millions of parameters. It unifies the entire chain of this post in a single application: word hashing (\(\phi(x)\)), Random Fourier Features (kernel approximation), ridge regression (regularization) – all in three JavaScript functions.

The model’s output is color-coded: green for words that appear exactly in the training data (memorization), orange for words the model combines in new ways (generalization). Typically 60–80% are green – honestly showing: a KRR model with 505 words primarily memorizes, but generalizes at the seams between learned sentences.

Deep dive: The complete source code, the Float64 lesson, and what the model can (and cannot) do

💬 KRR Chat: Under the Hood →

Epilogue

K = K: The Glass Bead Game

At the beginning of this post stood a formula from quantum mechanics. At the end stands a formula from machine learning. Place them side by side:

Propagator (Quantum Mechanics):
\(\displaystyle K(B,A,t) = \sum_n \psi_n(B)\;\psi_n^*(A)\;e^{-i\,E_n\,t/\hbar}\)

Kernel (Machine Learning – Mercer decomposition):
\(\displaystyle K(x,x') = \sum_n \lambda_n\;\psi_n(x)\;\psi_n(x')\)

Same letter \(K\). Same pattern: a sum over eigenfunctions \(\psi_n\), weighted by eigenvalues. The propagator describes how a quantum particle travels from A to B. The kernel describes how similar two data points are. Different stages, same mathematics. (The \(\psi_n\) here are the Mercer eigenfunctions of the kernel – related to, but distinct from, the feature map \(\phi\) in Chapter 6.)
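On a finite grid of points, the Mercer decomposition is just the eigendecomposition of the kernel matrix. A sketch that rebuilds a Gaussian kernel matrix from its eigenvalues \(\lambda_n\) and discrete eigenfunctions \(\psi_n\) (grid and \(\sigma\) are illustrative):

```python
import numpy as np

# Gaussian kernel matrix on a grid of 20 points.
x = np.linspace(0, 1, 20)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.2**2))

# Symmetric matrix -> real eigenvalues, orthonormal eigenvectors.
lam, psi = np.linalg.eigh(K)

# K(x, x') = sum_n lam_n psi_n(x) psi_n(x') -- the discrete Mercer sum
K_rebuilt = sum(l * np.outer(v, v) for l, v in zip(lam, psi.T))
print(np.max(np.abs(K - K_rebuilt)))  # machine precision
```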

In the quantum post, you saw how standing waves lead to discrete energy levels – the tones a quantum system likes to play. Here you saw how the eigenvalues of a kernel matrix determine what an algorithm learns and what it ignores. Same principle. Different stage.

Erwin Schrödinger titled his 1926 paper: “Quantization as an Eigenvalue Problem.”

Perhaps we should add: Intelligence as an eigenvalue problem.

Nature is richer than any single description we can give of it. But sometimes different descriptions share the same mathematical core. Eigenvalues are that core.

Frequently Asked Questions

What are eigenvalues simply explained?

An eigenvalue describes how much a particular direction is stretched or compressed under a transformation. The stable directions are the eigenvectors, the factor is the eigenvalue. Eigenvalues appear everywhere: in quantum mechanics, in PageRank, in neural networks.

Why are eigenvalues needed in artificial intelligence?

The kernel matrix in machine learning has eigenvalues. Large eigenvalues correspond to signal, small eigenvalues correspond to noise. Regularization dampens the small eigenvalues – this is the mathematical core of avoiding overfitting.

What does PageRank have to do with eigenvalues?

PageRank computes the importance of web pages as the dominant eigenvector of the internet's link matrix. The page with the largest component in that eigenvector is the most important.