Blog Post · Mathematics · AI

Eigenvalues & AI

Why artificial intelligence is an eigenvalue problem – and what that has to do with quantum mechanics. A mathematical chain in seven chapters, with interactive visualizations.

KI-Mathias · ~50 min read

Chapter 1

The Glass Bead Game

Every time you run a Google search, a computer solves an eigenvalue problem. It computes which webpage is the most important – and the answer is an eigenvector.

Every time a neural network recognizes an image or writes a text, eigenvalues control what the network learns and what it ignores.

And in quantum mechanics? The allowed energies of an atom – the discrete spectral lines that Niels Bohr could not explain in 1913 – are also eigenvalues. Schrödinger titled his famous 1926 paper simply: “Quantization as an Eigenvalue Problem.”

Three different worlds. The same mathematics. That is no coincidence – and in this post, you will see why.

We build a chain: from the simplest linear regression through kernel methods and Google PageRank to neural networks. Each link grows from the previous one. At the end stands the insight: Artificial intelligence is an eigenvalue problem.

If you have read the quantum post: There you saw the propagator \(K(B,A,t) = \sum_n \psi_n(B)\,\psi_n^*(A)\,e^{-i E_n t/\hbar}\) – a sum over eigenstates. The energies \(E_n\) are eigenvalues. Here you will discover that the kernel in AI has exactly the same structure. Same letter \(K\). Same pattern. More in the epilogue.

Chapter 2

The Shortest Path

A shadow on the ground

Imagine standing in the midday sun. Your shadow falls straight onto the ground – a flat image of your three-dimensional self. That is a projection: the best possible representation of you in a lower dimension.

What does “best possible” mean? The shortest distance from a point to a surface is always perpendicular. That is the foundation of everything that follows.

The dot product: two vectors, one number

To express “perpendicular” mathematically, we need the dot product: \(\langle \mathbf{a}, \mathbf{b} \rangle = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n\). Two vectors are perpendicular when their dot product is zero.

Projection onto a line

Given a vector \(\mathbf{y}\) and a direction \(\mathbf{a}\), the closest point on the line through \(\mathbf{a}\) is:

$$\hat{\mathbf{y}} = \frac{\langle \mathbf{a}, \mathbf{y} \rangle}{\langle \mathbf{a}, \mathbf{a} \rangle} \cdot \mathbf{a}$$

The error \(\mathbf{y} - \hat{\mathbf{y}}\) is perpendicular to \(\mathbf{a}\). That is the whole idea.

(Interactive figure: the line through \(\mathbf{a}\), the projection \(\hat{\mathbf{y}}\) of \(\mathbf{y}\), and the residual \(\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}\).)

The general case: the normal equation

In practice, you project onto a subspace spanned by the columns of a matrix \(X\). The condition “error perpendicular to all columns” becomes the normal equation:

$$X^T X\,\mathbf{c} = X^T\mathbf{y} \quad\Longrightarrow\quad \mathbf{c} = (X^TX)^{-1}\,X^T\mathbf{y}$$

Concrete example: \(X = (1,\; 0.5)^T\), \(\mathbf{y} = (3,\; 4)^T\). Then \(X^TX = 1.25\), \(X^T\mathbf{y} = 5\), so \(c = 4\). The projection is \(\hat{\mathbf{y}} = (4,\; 2)^T\). The error \((-1,\; 2)^T\) is perpendicular to \(X\): \(1 \cdot (-1) + 0.5 \cdot 2 = 0\). ✓
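If you want to check the arithmetic, the whole example fits in a few lines of NumPy (a sketch; the variable names are mine):

```python
import numpy as np

# Design matrix (one column) and target vector from the example
X = np.array([[1.0], [0.5]])
y = np.array([3.0, 4.0])

# Normal equation: c = (X^T X)^{-1} X^T y  -- solve, don't invert
c = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ c        # projection of y onto the column space of X
r = y - y_hat        # residual

print(c)             # [4.]
print(y_hat)         # [4. 2.]
print(X.T @ r)       # [0.]  -- the error is perpendicular to X
```

In real code you would reach for `np.linalg.lstsq`, which solves the same problem without ever forming \(X^TX\).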

Beautiful formula. But at its center sits \((X^TX)^{-1}\) – a matrix inversion. It costs \(O(n^3)\) operations, is numerically unstable, and – as we will see – unnecessary.

Chapter 3

Patience over Genius

What if we never had to compute the inverse? What if all we need to do is correct our own error, again and again?

Project. Error. Repeat.

Step 1: Compute a rough first approximation: \(\hat{\mathbf{y}}_1 = X \cdot X^T \mathbf{y}\).

Step 2: Compute the error (the residual): \(\mathbf{r}_1 = \mathbf{y} - \hat{\mathbf{y}}_1\).

Step 3: Project the error itself and add it: \(\hat{\mathbf{y}}_2 = \hat{\mathbf{y}}_1 + X \cdot X^T \mathbf{r}_1\).

And again. And again. Each iteration brings us closer to the exact solution.

The pattern in the residuals

After \(n\) steps, the residual has the form:

$$\mathbf{r}_n = (I - X \cdot X^T)^n \cdot \mathbf{y}$$

If the matrix \((I - XX^T)\) “shrinks” – all its eigenvalues have magnitude less than 1 – then \(\mathbf{r}_n \to \mathbf{0}\). The error vanishes. (This assumes the data is suitably scaled. In practice, one introduces a learning rate \(\eta\) to ensure this – we keep it simple here.)

The mathematical structure behind this convergence is the Neumann series:

$$\sum_{k=0}^{\infty} A^k = (I - A)^{-1} \quad\text{when all eigenvalues of } A \text{ have magnitude less than 1}$$

This is the matrix version of the geometric series \(\sum q^k = 1/(1-q)\). And it means: the iteration, written in coefficient form, converges to \(\mathbf{c} = (X^TX)^{-1}X^T\mathbf{y}\) – exactly the closed-form solution from Chapter 2.
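You can verify the Neumann series numerically with any small matrix whose eigenvalues lie inside the unit circle (the matrix below is an arbitrary example):

```python
import numpy as np

A = np.array([[0.2, 0.1],
              [0.0, 0.5]])   # eigenvalues 0.2 and 0.5, both |lambda| < 1

# Partial sums of the Neumann series  I + A + A^2 + ...
S = np.zeros_like(A)
P = np.eye(2)
for _ in range(100):
    S = S + P
    P = P @ A

exact = np.linalg.inv(np.eye(2) - A)
print(np.allclose(S, exact))  # True
```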

Concrete example: numbers instead of symbols

For \(X = (1,\; 0.5)^T\) and \(\mathbf{y} = (3,\; 4)^T\) with exact solution \(\hat{\mathbf{y}}^* = (4,\; 2)^T\):

| \(n\) | \(\hat{\mathbf{y}}_n\) | \(\mathbf{r}_n\) | \(\|\mathbf{r}_n\|\) |
|---|---|---|---|
| 0 | (0, 0) | (3, 4) | 5.00 |
| 1 | (3.13, 1.56) | (−0.13, 2.44) | 2.44 |
| 2 | (3.47, 1.73) | (−0.47, 2.27) | 2.32 |
| 5 | (3.84, 1.92) | (−0.84, 2.08) | 2.25 |
| 20 | (3.99, 2.00) | (−0.99, 2.00) | 2.24 |
| ∞ | (4, 2) | (−1, 2) | 2.24 |

The approximation moves closer to (4, 2) with each iteration. And the final error \((-1, 2)\) is perpendicular to \(X\): \(1 \cdot (-1) + 0.5 \cdot 2 = 0\). Exactly as it should be.
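The iteration itself is a five-line loop. The step size \(\eta\) below is my own choice (any \(0 < \eta < 2/\mu_{\max}\) works; the steps behind the table above may have used different ones), but the limit is the same:

```python
import numpy as np

X = np.array([[1.0], [0.5]])
y = np.array([3.0, 4.0])
eta = 0.625          # hypothetical learning rate; needs 0 < eta < 2/1.25

y_hat = np.zeros(2)
for _ in range(200):
    r = y - y_hat                          # current residual
    y_hat = y_hat + eta * (X @ (X.T @ r))  # project the error, add it back

print(y_hat)      # converges to the exact projection (4, 2)
print(y - y_hat)  # final residual (-1, 2), perpendicular to X
```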

Try it yourself

Click “Next Step” and watch how the blue approximation approaches the green exact solution. The red arrow (residual) gets shorter with each step.

Wait. Stop. Read that last sentence again. The iteration does not converge to some approximation. It converges to the exact solution – the same one you get from matrix inversion. Without ever inverting a matrix. Just by patiently correcting your own error.

The elegant proof

At equilibrium, the coefficients no longer change: \(\mathbf{c}' = X^T\mathbf{y} + (I - X^TX)\,\mathbf{c}'\). Rearranging: \(X^TX\,\mathbf{c}' = X^T\mathbf{y}\), hence \(\mathbf{c}' = (X^TX)^{-1}X^T\mathbf{y}\). The normal equation. Q.E.D.

What you now hold: Iteration replaces inversion. Patience replaces genius. But a question remains: why does it converge? What determines the speed?

Chapter 4

What Matrices Really Do

A matrix is a machine: vector in, vector out. Most vectors get both stretched and rotated. But there are specialists – vectors whose direction does not change:

$$A\mathbf{v} = \lambda\,\mathbf{v}$$

Such a \(\mathbf{v}\) is called an eigenvector, the \(\lambda\) its eigenvalue. The matrix merely scales the eigenvector by \(\lambda\) – no rotation, no tilting.

A worked example

Take \(A = \bigl(\begin{smallmatrix} 2 & 1 \\ 0 & 3 \end{smallmatrix}\bigr)\). Characteristic polynomial: \(\det(A - \lambda I) = (2-\lambda)(3-\lambda) = 0\). Eigenvalues: \(\lambda_1 = 2\), \(\lambda_2 = 3\).

Eigenvectors: for \(\lambda_1 = 2\), solve \((A-2I)\mathbf{v} = 0\): \(\mathbf{v}_1 = (1, 0)^T\). For \(\lambda_2 = 3\): \(\mathbf{v}_2 = (1, 1)^T\). Check: \(A \cdot (1,1)^T = (3,3)^T = 3 \cdot (1,1)^T\). ✓
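The same check in code (NumPy's `eig` returns eigenvalues in no guaranteed order, hence the sort):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

lam, V = np.linalg.eig(A)
print(np.sort(lam))   # [2. 3.]

# Verify A v = lambda v for every eigenpair
for i in range(len(lam)):
    assert np.allclose(A @ V[:, i], lam[i] * V[:, i])
```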

Try it yourself

Drag the white vector. When it points along an eigenvector, \(A\mathbf{v}\) changes only the length – not the direction. Try different matrices to compare stretching, shearing, and rotation.

What does this mean concretely?

Apply the matrix 10 times: \(\lambda_1^{10} = 2^{10} = 1024\), but \(\lambda_2^{10} = 3^{10} = 59\,049\). The larger eigenvalue dominates. After 20 applications: \(2^{20} \approx 10^6\), \(3^{20} \approx 3.5 \times 10^9\). The direction of \(\mathbf{v}_2\) overwhelms everything else.
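This dominance is exactly what power iteration exploits: apply \(A\) repeatedly to almost any start vector and the direction settles onto the dominant eigenvector \((1, 1)\). A sketch with the matrix from above (the start vector is arbitrary, as long as it has a component along \(\mathbf{v}_2\)):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
x = np.array([0.0, 1.0])        # arbitrary start vector

for _ in range(20):
    x = A @ x
    x = x / np.linalg.norm(x)   # renormalize to avoid overflow

print(x)  # ~ (0.707, 0.707) -- the direction of the dominant eigenvector
```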

Why eigenvalues determine convergence

The key formula: \(A^n = V\,\Lambda^n\,V^{-1}\), where \(\Lambda\) is the diagonal matrix of eigenvalues.

• If \(|\lambda| < 1\): \(\lambda^n \to 0\) – the component vanishes.
• If \(|\lambda| = 1\): stays constant.
• If \(|\lambda| > 1\): \(\lambda^n \to \infty\) – explosion.

Try it yourself

Slide \(|\lambda|\) and watch what happens to \(\lambda^n\). Below 1: the curve falls to zero. Above 1: it explodes.

Now the connection to Chapter 3: if \(\mu\) is an eigenvalue of \(X^TX\), then \((1 - \mu)\) is an eigenvalue of \((I - X^TX)\). The iteration converges when \(|1 - \mu| < 1\) for all \(\mu\). The eigenvalues of \(X^TX\) determine everything.

Symmetric matrices (and \(X^TX\) is always symmetric) have only real eigenvalues and orthogonal eigenvectors. Large eigenvalues = directions with lots of variation (signal). Small = little variation (noise).

The DNA of a matrix is its eigenvalues. They tell you whether an iterative process converges, how fast, and where to. And they separate signal from noise.

Chapter 5

Stopping Early Pays Off

Imagine a student who memorizes 20 practice problems. In practice: 20 out of 20. On the exam with new problems: 8 out of 20. They learned the problems, not the principles. This is called overfitting.

What this has to do with our iteration

After \(n\) iterations, the component along the \(i\)-th eigenvector is retained by the factor \(1 - (1-\mu_i)^n\). Sounds abstract – but look at the numbers:

| \(\mu_i\) | \(1-\mu_i\) | after 3 iter. | after 10 | after 100 |
|---|---|---|---|---|
| 1.0 (strong) | 0.0 | 100% | 100% | 100% |
| 0.5 | 0.5 | 87.5% | 99.9% | ≈100% |
| 0.1 | 0.9 | 27.1% | 65.1% | ≈100% |
| 0.01 (weak) | 0.99 | 3.0% | 9.6% | 63.4% |

Large eigenvalues (strong signal): immediately at 100%. Small eigenvalues (often noise): need hundreds of iterations. Fewer iterations = only signal is learned. More = noise too. Early stopping is a natural noise filter.

The equivalence to Ridge Regression

The classical remedy for overfitting – Ridge Regression – adds \(\lambda\) to the diagonal: \(\mathbf{c}_\lambda = (X^TX + \lambda I)^{-1}X^T\mathbf{y}\). The filter factor is \(\mu_i/(\mu_i + \lambda)\).

Compare with our iteration factor \(1 - (1-\mu_i)^n\). Both have the same effect: pass large \(\mu_i\), dampen small ones. As a rule of thumb (for moderate \(n\) and small \(\mu_i\)):

$$\lambda \approx \frac{1}{n}$$

More iterations = smaller \(\lambda\) = less regularization. Fewer iterations = larger \(\lambda\) = smoother solution. (The relationship is not exact, but both filter curves have the same shape for practical purposes.)
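You can put both filter factors side by side in code. For small \(\mu_i\), both behave like \(n\,\mu_i\) – that is where the rule of thumb \(\lambda \approx 1/n\) comes from (the numbers below are illustrative):

```python
import numpy as np

n = 10
lam = 1.0 / n                            # rule of thumb: lambda ~ 1/n
mu = np.array([1.0, 0.5, 0.1, 0.01])     # eigenvalues of X^T X

iter_filter  = 1 - (1 - mu) ** n         # early stopping after n steps
ridge_filter = mu / (mu + lam)           # Ridge regression

print(iter_filter)    # large mu pass almost fully, mu=0.01 is damped hard
print(ridge_filter)   # same qualitative shape
```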

Try it yourself

Slide \(n\) and observe: the iteration filter (left, cyan) and the Ridge filter (right, yellow) have the same shape!

Wait. Stop. Every textbook introduces Ridge Regression as an artificial penalty term. But in reality, regularization is the most natural thing in the world: stop when it is good enough. The ridge parameter does not fall from the sky. It is the consequence of a simple decision: how many times do I correct my error?

The formula: \(\lambda \approx 1/n\). Early stopping = Ridge Regression. Both separate signal from noise, controlled by the eigenvalues of \(X^TX\).

Chapter 6

The Kernel Trick

So far everything was linear. But the world is not linear. Do we need a new theory? No. We need a new way of looking at the data.

Expanding features

If data is not linear in \(x\), make it linear in \((1, x, x^2)\). Extend the design matrix through a feature map \(\phi(x)\). In the new space the problem is linear – our entire machinery works again.

The key observation

In the iteration, the Gram matrix \(XX^T\) appears everywhere – the matrix of pairwise dot products between data points. (Its nonzero eigenvalues are the same as those of \(X^TX\).) We never need the features themselves – only their inner products.

A kernel function \(k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle\) computes this inner product directly. The Gaussian kernel \(k(x,z) = \exp(-\|x-z\|^2/(2\sigma^2))\) has an infinite-dimensional feature space. But the kernel matrix \(K\) stays finite: \(n \times n\).

The transition: nothing changes

Replace \(XX^T\) with \(K\). The iteration becomes \(\hat{\mathbf{y}}_{n+1} = \hat{\mathbf{y}}_n + K(\mathbf{y} - \hat{\mathbf{y}}_n)\). Same algorithm. Different language. Infinite expressive power.

Try it yourself

Move the sliders for \(\lambda\) (regularization) and \(\sigma\) (kernel width). Small \(\lambda\) fits the data tightly. Large \(\lambda\) makes it too smooth. Find the sweet spot.

Prediction for a new point: \(f(x_*) = \sum_{i=1}^n \alpha_i\,k(x_i, x_*)\) – each training point “votes”, weighted by similarity.
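A complete kernel ridge regression is barely a dozen lines. Everything below – the data, \(\sigma\), \(\lambda\) – is invented for illustration; only the structure is the point:

```python
import numpy as np

def gauss_kernel(a, b, sigma=0.3):
    """Gaussian kernel matrix between two sets of 1-D points."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

# Toy data: a noisy sine (hypothetical)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

# Fit: alpha = (K + lambda I)^{-1} y
lam = 1e-3
K = gauss_kernel(x, x)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

# Predict: every training point votes, weighted by kernel similarity
x_star = np.linspace(0, 1, 100)
f = gauss_kernel(x_star, x) @ alpha
```

The training data really is the model: \(f\) is nothing but the vector \(\alpha\) plus the kernel.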

Let that sink in: there are no abstract parameters, no weight matrices, no hidden layers. The training data is the model. The function \(f\) is completely determined by the data points and their kernel similarities. This is the Representer Theorem – and it holds for any loss function, not just the squared error.

The core of the kernel trick: \(XX^T \to K\). Nothing else changes. But the feature space becomes infinite-dimensional – and the expressive power universal.

Chapter 7

The Grand Unification

Light bouncing off walls

Imagine a white room with one deep-red wall. A lamp on the ceiling. What do you see? Not just red and white surfaces – but a soft reddish glow on the white floor near the red wall. Color bleeding: light bounces from the red wall to the floor, carrying color with it.

In computer graphics, the radiosity equation describes this: \(\mathbf{B} = \mathbf{E} + \rho F \mathbf{B}\). The solution?

$$\mathbf{B} = (I - \rho F)^{-1}\,\mathbf{E} = \sum_{k=0}^{\infty} (\rho F)^k\,\mathbf{E}$$

The same Neumann series as Chapter 3! Only the meaning changed:

• \(k=0\): Direct light – the lamp shines.
• \(k=1\): First bounce – light reflects once.
• \(k=2\): Second bounce – reflected light reflects again.
• Typically 3–5 bounces suffice – by then about 99% of the energy has been absorbed.
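The bounce-by-bounce picture translates directly into code. A toy scene with two patches (the reflectances and form factors are invented):

```python
import numpy as np

# rho * F for a hypothetical 2-patch scene; rows sum to < 1 (substochastic)
rhoF = np.array([[0.0, 0.4],
                 [0.3, 0.0]])
E = np.array([1.0, 0.0])     # patch 0 is the lamp

# Sum the bounces: B = E + (rhoF)E + (rhoF)^2 E + ...
B = np.zeros(2)
term = E.copy()
for bounce in range(5):
    B = B + term
    term = rhoF @ term

exact = np.linalg.solve(np.eye(2) - rhoF, E)
print(B, exact)   # five bounces already land very close to the exact solve
```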

The matrix \(\rho F\) is substochastic (rows sum to less than 1). Add an “absorber dimension” and it becomes stochastic (rows sum to 1). A stochastic matrix is a Markov chain. Its equilibrium is the dominant eigenvector – eigenvalue 1.

The surfer who never stops

Where in real life is there a massive eigenvalue problem for a stochastic matrix? Every time you use Google.

(Interactive figure: a mini-web of four linked pages A, B, C, D.)

A random surfer starts on a webpage, clicks a random outgoing link at each step. The Google matrix adds a damping factor: with \(d \approx 0.85\) the surfer follows a link; with \(1-d \approx 0.15\) they teleport to a random page:

$$G = d \cdot M + \frac{1-d}{n}\,\mathbf{1}\mathbf{1}^T$$

Power iteration \(\mathbf{r}^{(k+1)} = G\,\mathbf{r}^{(k)}\) converges to the dominant eigenvector – the PageRank. For our mini-web: B ranks highest (34%), followed by A (28%), D (24%), C (14%). Google computes this for billions of pages in ~50 iterations.
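Power iteration for PageRank fits in a few lines. The link structure below is hypothetical – the mini-web's links aren't fully specified here – so its ranks are illustrative, not the 34/28/24/14% from above:

```python
import numpy as np

# Column-stochastic link matrix M: column j = where page j links to.
# Hypothetical mini-web: A->{B,C}, B->D, C->A, D->{A,B}
M = np.array([
    [0.0, 0.0, 1.0, 0.5],   # ...to A
    [0.5, 0.0, 0.0, 0.5],   # ...to B
    [0.5, 0.0, 0.0, 0.0],   # ...to C
    [0.0, 1.0, 0.0, 0.0],   # ...to D
])
n, d = 4, 0.85
G = d * M + (1 - d) / n * np.ones((n, n))   # Google matrix

r = np.full(n, 1.0 / n)        # start uniform
for _ in range(100):
    r = G @ r                  # power iteration

print(r)                                  # PageRank vector, sums to 1
print(np.allclose(G @ r, r, atol=1e-6))   # True: eigenvector, eigenvalue 1
```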

Wait. Stop. Regression (Chapter 3), radiosity, PageRank – three different problems, three times the same algorithm. Iterative matrix multiplication, controlled by eigenvalues. Three times the Neumann series. Three times convergence to the dominant eigenvector.

As powerful as a brain

The Representer Theorem (Kimeldorf & Wahba, 1970): the optimal KRR solution always has the form \(f^*(x) = \sum_{i=1}^n \alpha_i\,k(x_i, x)\).

The Universal Approximation Theorem (Cybenko, 1989): a neural network with one hidden layer can approximate any continuous function arbitrarily well.

The Gaussian kernel is universal (Micchelli et al., 2006): KRR with Gaussian kernel can also approximate any continuous function.

Both function classes are dense in \(C(K)\). They are equally powerful. And the Neural Tangent Kernel (Jacot et al., 2018) shows: in the idealized limit of infinite width and suitable initialization, a neural network literally becomes KRR. The connection is not a metaphor – it is a theorem (albeit a limiting-case theorem).

Of course, modern AI – Transformers, attention, stochastic gradient descent in non-convex landscapes – is more than an eigenvalue problem. But the eigenvalue structure is the mathematical core running through every layer. It does not explain everything – but it explains why the foundation holds.

Regression:  \(\hat{\mathbf{y}} = (X^TX + \lambda I)^{-1}X^T\mathbf{y}\) — eigenvalues control convergence
Radiosity:  \(\mathbf{B} = (I - \rho F)^{-1}\mathbf{E}\) — eigenvalues control light distribution
PageRank:  \(\mathbf{r} = \text{dom. eigenvector of } G\) — eigenvalue 1 determines ranking
KRR ≡ NN:  both dense in \(C(K)\) — NTK eigenvalues control learning

Epilogue

K = K: The Glass Bead Game

At the beginning of this post stood a formula from quantum mechanics. At the end stands a formula from machine learning. Place them side by side:

Propagator (Quantum Mechanics):
\(\displaystyle K(B,A,t) = \sum_n \psi_n(B)\;\psi_n^*(A)\;e^{-i\,E_n\,t/\hbar}\)

Kernel (Machine Learning – Mercer decomposition):
\(\displaystyle K(x,x') = \sum_n \lambda_n\;\psi_n(x)\;\psi_n(x')\)

Same letter \(K\). Same pattern: a sum over eigenfunctions \(\psi_n\), weighted by eigenvalues. The propagator describes how a quantum particle travels from A to B. The kernel describes how similar two data points are. Different stages, same mathematics. (The \(\psi_n\) here are the Mercer eigenfunctions of the kernel – related to, but distinct from, the feature map \(\phi\) in Chapter 6.)
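The Mercer structure is not just an analogy on paper – you can see it numerically. Eigendecompose a kernel matrix and rebuild it from its eigenpairs (the points and kernel width are arbitrary):

```python
import numpy as np

x = np.linspace(0, 1, 8)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.3 ** 2))  # Gaussian kernel

# Symmetric matrix: real eigenvalues, orthonormal eigenvectors
lam, Psi = np.linalg.eigh(K)

# Mercer-style reconstruction: K = sum_n lambda_n psi_n psi_n^T
K_rebuilt = sum(lam[n] * np.outer(Psi[:, n], Psi[:, n]) for n in range(len(x)))
print(np.allclose(K, K_rebuilt))  # True
```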

In the quantum post, you saw how standing waves lead to discrete energy levels – the tones a quantum system likes to play. Here you saw how the eigenvalues of a kernel matrix determine what an algorithm learns and what it ignores. Same principle. Different stage.

Erwin Schrödinger titled his 1926 paper: “Quantization as an Eigenvalue Problem.”

Perhaps we should add: Intelligence as an eigenvalue problem.

Nature is richer than any single description we can give of it. But sometimes different descriptions share the same mathematical core. Eigenvalues are that core.