In the post Eigenvalues & AI, we built a chain: from linear regression through kernel methods to neural networks – all connected through eigenvalues. The post ended with a live demo: the KRR Chat, a language model that works without a neural network.
This post is the technical deep-dive. We examine every component of the KRR Chat: the architecture, the pipeline, the complete source code – and the honest limitations. If you have read the main post, you will find the theory turned into running code here. If not – no problem: this post also works on its own.
Chapter 1
The Dual Model: Retrieval + Generation
KRR Chat consists of two independent systems that work together – similar to a person who first looks something up in a book and then answers in their own words.
System 1 – Retrieval (keyword search):
A corpus of 805 sentences about eigenvalues, kernel methods, quantum mechanics, and machine learning. For each question, the chat extracts keywords and finds the sentences with the most hits. This is search, not learning.
System 2 – Generation (KRR language model):
A Kernel Ridge Regression model trained on 104 curated sentences (505-word vocabulary). It learns: Given the last 5 words, which word comes next? This is genuine prediction – the same task that GPT-4 solves, just with a kernel instead of a neural network.
Why two systems?
A single KRR model on all 805 sentences fails due to the capacity limit: 1820 distinct words on 256 hash buckets means an average of 7 collisions per bucket. The model cannot distinguish “eigenvalue” from “convergence” when both hash to the same bucket.
The solution: division of labor. The retrieval system provides topically relevant context (broad knowledge, 805 sentences). The generation model produces fluent continuations (deep knowledge, 104 sentences, but with only 4 collisions per bucket). In AI research, this pattern is called RAG – Retrieval-Augmented Generation. Large language models like GPT-4 and Claude also use RAG to incorporate up-to-date knowledge.
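The capacity numbers above are just load factors – average words per hash bucket. A quick check (the function name loadFactor is illustrative, not from the actual source):

```javascript
// Average words per hash bucket (load factor) for both setups.
// A bucket holding k words forces the model to treat k distinct
// words as the same feature.
function loadFactor(vocabularySize, buckets) {
  return vocabularySize / buckets;
}

var fullCorpus = loadFactor(1820, 256); // single-model attempt
var curated = loadFactor(505, 128);     // generation model

console.log(fullCorpus.toFixed(1)); // "7.1" – ~7 words per bucket
console.log(curated.toFixed(1));    // "3.9" – ~4 words per bucket
```

Halving the load factor is what makes the 104-sentence model usable where the 805-sentence model was not.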
Chapter 2
The Pipeline: From Question to Answer
When you ask “What are eigenvalues?”, the following happens:
- Keyword extraction: Stop words (what, are, the, is, ...) are removed. What remains: ["eigenvalues"]
- Retrieval: All 805 sentences are searched for this keyword. Exact match = 5 points, partial match (e.g. “eigenvalue” in “eigenvalues”) = 2 points. Duplicates are removed. The top 2 sentences are displayed in cyan.
- Invisible seeding: The last 5 words of the last retrieval sentence are taken as the context for the generation model. This is the crucial transition: retrieval provides the starting point, generation carries on.
- KRR generation, word by word: From the 5-word context, the model computes:
- Feature vector \(\mathbf{x}\) via word hashing (5 positions × 128 buckets + 3 bigram hashes × 128 buckets = 1024 dimensions)
- Random Fourier Features: \(\mathbf{z} = \sqrt{2/D}\,\cos(\mathbf{x}\boldsymbol{\omega} + \mathbf{b})\) with \(D = 1536\)
- Next word: \(\hat{y} = \arg\max_w\, \mathbf{z}^\top \mathbf{W}_{:,w}\)
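The retrieval step can be sketched in a few lines. This is an illustrative reconstruction of the scoring rule described above – the names extractKeywords, scoreSentence, and the stopWords list are not taken from the actual source:

```javascript
// Sketch of the retrieval scoring: exact match = 5 points,
// partial match = 2 points (short words skipped to avoid noise).
var stopWords = ["what", "are", "the", "is", "a", "of", "and"];

function extractKeywords(question) {
  return question.toLowerCase().replace(/[?.,!]/g, "").split(/\s+/)
    .filter(function (w) { return stopWords.indexOf(w) === -1; });
}

function scoreSentence(sentence, keywords) {
  var words = sentence.toLowerCase().split(/\s+/);
  var score = 0;
  keywords.forEach(function (kw) {
    words.forEach(function (w) {
      if (w === kw) {
        score += 5; // exact match
      } else if (w.length > 3 &&
                 (w.indexOf(kw) !== -1 || kw.indexOf(w) !== -1)) {
        score += 2; // partial match, e.g. "eigenvalue" vs "eigenvalues"
      }
    });
  });
  return score;
}
```

The top-scoring sentences are shown in cyan, and the last 5 words of the final one seed the generator.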
Conversation memory
KRR Chat has a simple memory: for follow-up questions (e.g. “Tell me more” without a new keyword), no new retrieval is triggered. Instead, the model continues generating from the last word of the previous answer. This creates a natural flow – the answer picks up where it left off.
Try it yourself
Open the KRR Chat, ask “What are eigenvalues?” and then “Tell me more”. Watch how the second answer seamlessly continues from the first – without a new keyword search.
Chapter 3
The Three Colors: Memorization vs. Generalization
Every word in the answer is color-coded – and this is where things get interesting. The colors honestly show where each word comes from:
● Cyan – Retrieved. This sentence was found via keyword search in the corpus. No KRR computation needed – pure text search.
● Green – Generated & Verbatim. The KRR model predicted this word, and the word sequence appears exactly like this (as a 3- or 4-gram) in the training data. The model is reproducing learned knowledge.
● Orange – Generated & Novel. The KRR model predicted this word, but this word sequence does not exist in the training data. This is genuine generalization – the model combines learned patterns in a new way.
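The green/orange decision is a plain lookup, not a model computation. A minimal sketch, under the assumption that the check uses a precomputed set of 3-grams from the training sentences (the names buildTrigramSet and isVerbatim are illustrative):

```javascript
// Build the set of all 3-grams occurring in the training sentences.
function buildTrigramSet(sentences) {
  var set = {};
  sentences.forEach(function (s) {
    var w = s.toLowerCase().split(/\s+/);
    for (var i = 0; i + 2 < w.length; i++)
      set[w[i] + " " + w[i + 1] + " " + w[i + 2]] = true;
  });
  return set;
}

// A generated word is "green" (verbatim) if the trigram ending in it
// appears exactly like this in the training data; otherwise "orange".
function isVerbatim(trigrams, context, word) {
  return trigrams[context.join(" ") + " " + word] === true;
}
```

A word that fails this check is colored orange: genuinely predicted, but never seen in this exact sequence.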
What the colors reveal
Typically, 60–80% of generated words are green (verbatim) and 20–40% are orange (generalized). What does that mean?
A KRR model with a 505-word vocabulary primarily memorizes. It has largely learned the 104 training sentences by heart and can reproduce them. But at the seams – where a retrieval sentence ends and generation takes over, or where the model navigates between two learned sentences – orange words appear. These are moments when the kernel space interpolates: the current context lies between two training points, and the model chooses a word that was never trained in this exact combination.
This honesty is intentional. Large language models like GPT-4 generalize elegantly – but you cannot see when they memorize and when they interpolate. KRR Chat makes this distinction visible.
Didactic point: The color coding is an X-ray of the model. With GPT-4, you only see the output. With KRR Chat, you additionally see whether the model is remembering or generalizing – a transparency that would otherwise require elaborate interpretability methods.
Chapter 4
The Source Code: Three Functions
The entire KRR Chat consists of three functions. Here is the core inference code – readable, commented, no tricks. The full source code comprises ~120 lines of JavaScript.
1. encode() – Turning words into numbers
The feature map \(\phi(x)\): each word is mapped to an index via a hash function. The position in the context determines the weight – later words count more because they are closer to the word being predicted.
// Word → hash index (0..127). Each word gets a "fingerprint".
function hash(word) {
var h = 0;
for (var i = 0; i < word.length; i++)
h = (h * 31 + word.charCodeAt(i)) >>> 0;
return h % 128; // 128 buckets
}
// Context (5 words) → feature vector (1024 dimensions)
// Each word occupies a position × hash bucket; the 3 bigram
// features that fill the remaining 3×128 dimensions are not shown here.
// Position weighting: later words count more.
function encode(context) {
var features = new Float32Array(1024); // 5×128 + 3×128
for (var pos = 0; pos < 5; pos++) {
var weight = 0.4 + 0.6 * (pos / 4); // 0.4 → 1.0
features[pos * 128 + hash(context[pos])] += weight;
}
return features;
}
Mathematics: This is \(\phi(x)\) – a feature map that embeds words into a high-dimensional space. Instead of representing each word as a one-hot vector (which would require 505 dimensions per position with a 505-word vocabulary), we compress via hashing to 128 dimensions. Hash collisions are unavoidable (505 words onto 128 buckets ≈ 4 collisions/bucket), but the model still learns – as long as the collision rate is low enough.
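The collision rate is easy to inspect empirically. A small sketch that repeats hash() so it stands alone and histograms bucket occupancy for a word list (bucketHistogram is an illustrative name):

```javascript
// Same hash as above, repeated so this sketch is self-contained.
function hash(word) {
  var h = 0;
  for (var i = 0; i < word.length; i++)
    h = (h * 31 + word.charCodeAt(i)) >>> 0;
  return h % 128; // 128 buckets
}

// Count how many words land in each of the 128 buckets.
// Buckets with count > 1 are exactly the collisions described above.
function bucketHistogram(words) {
  var counts = new Int32Array(128);
  words.forEach(function (w) { counts[hash(w)]++; });
  return counts;
}
```

Run over the full 505-word vocabulary, the histogram averages just under 4 words per bucket – the collision rate the text describes.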
2. rff() – Approximating the kernel
The Random Fourier Features following Rahimi & Recht (2007): a cosine transformation that makes the dot product in the transformed space equal the Gaussian kernel.
// φ(x) → z(x) via Random Fourier Features
// z(x) = √(2/D) · cos(x·ω + b)
// Then: z(x)ᵀz(x') ≈ k(x,x') = exp(-‖x-x'‖²/2σ²)
function rff(features) {
var z = new Float32Array(1536); // D = 1536
var scale = Math.sqrt(2.0 / 1536);
for (var j = 0; j < 1536; j++) {
var dot = 0;
for (var k = 0; k < 1024; k++)
dot += features[k] * omega[k * 1536 + j]; // ω: random, fixed
z[j] = Math.cos(dot + bias[j]) * scale;
}
return z; // kernel-space vector
}
Mathematics: The matrix \(\boldsymbol{\omega}\) contains random values drawn from a normal distribution with scaling \(1/\sigma\). It is drawn once and never modified – it is not a learned parameter. The cosine ensures that \(\mathbf{z}(x)^\top\mathbf{z}(x')\) approximates the Gaussian kernel \(k(x,x') = \exp(-\|x-x'\|^2/2\sigma^2)\). The larger \(D\), the better the approximation – at \(D = 1536\) the accuracy is sufficient for our model.
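\(\boldsymbol{\omega}\) and \(\mathbf{b}\) are generated once and shipped with the model; they are never updated. A minimal sketch of that generation step (Box–Muller for the normal draws; makeRFFParams is an illustrative name, and deterministic seeding is omitted):

```javascript
// Draw one standard normal value via the Box–Muller transform.
function randn() {
  var u = 1 - Math.random(); // avoid log(0)
  var v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// ω: inputDim×D matrix with entries ~ N(0, 1/σ²);
// b: uniform in [0, 2π). Both fixed at creation – never trained.
function makeRFFParams(inputDim, D, sigma) {
  var omega = new Float32Array(inputDim * D);
  var bias = new Float32Array(D);
  for (var i = 0; i < omega.length; i++) omega[i] = randn() / sigma;
  for (var j = 0; j < D; j++) bias[j] = Math.random() * 2 * Math.PI;
  return { omega: omega, bias: bias };
}
```

For the model above this would be called as makeRFFParams(1024, 1536, sigma); the uniform bias is what makes the cosine features unbiased estimates of the kernel.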
3. predict() – Determining the next word
The prediction: a single matrix-vector multiplication. No softmax, no attention, no feedforward network.
// Context → next word
// scores = z(x)ᵀ · W (W was learned offline via KRR)
function predict(contextWords) {
var z = rff(encode(contextWords));
var scores = new Float32Array(505); // one score per word
// Matrix-vector multiplication: scores = zᵀW
for (var j = 0; j < 1536; j++)
for (var v = 0; v < 505; v++)
scores[v] += z[j] * W[j * 505 + v];
return argmax(scores); // word with highest score
}
Mathematics: The weight matrix \(\mathbf{W}\) was computed offline via the ridge regression system: \(\mathbf{W} = (\mathbf{Z}^\top\mathbf{Z} + \lambda\mathbf{I})^{-1}\mathbf{Z}^\top\mathbf{Y}\). Here, \(\mathbf{Z}\) is the matrix of all RFF vectors from the training data, \(\mathbf{Y}\) is the one-hot matrix of target words, and \(\lambda\) is the regularization parameter. In the browser, only the multiplication \(\mathbf{z}^\top\mathbf{W}\) takes place – a single matrix operation per word.
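What the offline NumPy step computes can be shown in miniature. A JavaScript sketch for one output column – naive Gaussian elimination with partial pivoting, adequate for toy sizes but not for \(D = 1536\); ridgeSolve is an illustrative name:

```javascript
// Solve (ZᵀZ + λI) w = Zᵀy by Gaussian elimination with partial
// pivoting. Z: n×d, row-major; y: n targets for one output column.
function ridgeSolve(Z, y, n, d, lambda) {
  // Build A = ZᵀZ + λI (d×d) and b = Zᵀy (d)
  var A = [], b = new Float64Array(d);
  for (var i = 0; i < d; i++) {
    A.push(new Float64Array(d));
    for (var j = 0; j < d; j++)
      for (var r = 0; r < n; r++)
        A[i][j] += Z[r * d + i] * Z[r * d + j];
    A[i][i] += lambda; // ridge term
    for (var r2 = 0; r2 < n; r2++) b[i] += Z[r2 * d + i] * y[r2];
  }
  // Forward elimination (this is the step where Float32 fails)
  for (var col = 0; col < d; col++) {
    var p = col;
    for (var r3 = col + 1; r3 < d; r3++)
      if (Math.abs(A[r3][col]) > Math.abs(A[p][col])) p = r3;
    var tmp = A[col]; A[col] = A[p]; A[p] = tmp;
    var tb = b[col]; b[col] = b[p]; b[p] = tb;
    for (var r4 = col + 1; r4 < d; r4++) {
      var f = A[r4][col] / A[col][col];
      for (var c = col; c < d; c++) A[r4][c] -= f * A[col][c];
      b[r4] -= f * b[col];
    }
  }
  // Back substitution
  var w = new Float64Array(d);
  for (var i2 = d - 1; i2 >= 0; i2--) {
    var s = b[i2];
    for (var c2 = i2 + 1; c2 < d; c2++) s -= A[i2][c2] * w[c2];
    w[i2] = s / A[i2][i2];
  }
  return w;
}
```

In practice one factorizes the matrix once and back-substitutes for all 505 columns of \(\mathbf{Y}\), rather than solving 505 separate systems from scratch.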
How they work together
The three functions form a pipeline: encode() turns the 5-word context into a feature vector, rff() lifts it into kernel space, and predict() picks the next word.
The entire chatbot is this pipeline in a loop – plus a keyword search that determines the starting point. Each iteration runs in under a millisecond; with WebGPU acceleration, even in microseconds.
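As a sketch, this is the loop – predict is taken as a parameter here so the snippet stands alone (in the chat itself it is the predict() function above):

```javascript
// Generation loop: slide a 5-word window over the text,
// predict one word, append it, shift the window, repeat.
function generate(predict, seedWords, maxWords) {
  var context = seedWords.slice(-5); // seeding: last 5 words
  var output = [];
  for (var i = 0; i < maxWords; i++) {
    var next = predict(context);
    output.push(next);
    context = context.slice(1).concat(next); // shift window by one
  }
  return output;
}
```

The real loop additionally checks each word against the stop-word guard described in Chapter 6 and triggers re-seeding when generation collapses.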
Chapter 5
Why Float64? The Precision Lesson
A natural first approach: train a single KRR model on the entire 805-sentence corpus – directly in the browser. We tried this. And it fails instructively.
The browser training disaster
Three problems pile up:
Problem 1: Hash collisions. Vocabulary 1820, hash buckets 256. On average, 7 different words collide per bucket. The model cannot distinguish “eigenvalue” from “convergence” when both hash to the same bucket.
Problem 2: Float32 precision. The Gaussian elimination for the linear system \((\mathbf{Z}^\top\mathbf{Z} + \lambda\mathbf{I})\mathbf{W} = \mathbf{Z}^\top\mathbf{Y}\) requires about 15 significant decimal digits at \(D = 2048\). WebGL delivers only 7 (Float32). The result: “learning learning learning learning” – a single word dominates every context.
Problem 3: Condition number. The hash collisions push the rows of \(\mathbf{Z}^\top\mathbf{Z}\) toward linear dependence, driving up the matrix’s condition number. A high condition number amplifies rounding errors roughly in proportion to its size – a vicious cycle.
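The precision gap can be demonstrated in two lines with Math.fround, which rounds a Float64 value to the nearest Float32:

```javascript
// Float64 resolves ~15-16 significant digits, Float32 only ~7.
var tiny = 1e-8;                  // a pivot correction of this size...
var f64 = 1 + tiny;               // ...survives in Float64
var f32 = Math.fround(1 + tiny);  // ...but is rounded away in Float32

console.log(f64 > 1);   // true  – the correction is retained
console.log(f32 === 1); // true  – the correction has vanished
```

During Gaussian elimination, thousands of such small corrections accumulate; once they all round away, the solution collapses to the dominant word.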
The solution: train offline, predict online
Training happens offline with Float64 (NumPy) – a full 15 decimal digits of precision. The weight matrix \(\mathbf{W}\) is then compressed to Float16 and loaded in the browser.
Inference runs in the browser with Float32 – for a single matrix-vector multiplication, that is perfectly sufficient. The numerical instability arises only when solving the linear system, not when applying the solution.
Connection to large language models: GPT-4, Llama, and other LLMs use exactly the same principle: training in high precision (Float32 or BFloat16), inference in reduced precision (INT8, INT4). This technique is called quantization. KRR Chat demonstrates on a small scale what happens at scale with LLMs – and why it works.
Chapter 6
What the Model Can Do – and What It Cannot
Honesty is one of the most important virtues in AI research. Here is a sober assessment:
What KRR Chat can do
- Reproduce learned sentences: Questions about eigenvalues, kernel methods, PageRank, quantum mechanics – all topics covered in the 104 training sentences are answered fluently and correctly.
- Navigate between topics: Through the retrieval system, the chat can find topically relevant starting points and then continue via generation.
- Show transparency: The color coding makes visible what a language model does – a pedagogical quality that no large LLM offers.
What KRR Chat cannot do
- Generate truly new content: The model interpolates in kernel space between learned points. It cannot invent explanations that are not (at least in fragments) contained in the training data.
- Answer out-of-distribution questions: Ask about “Pythagoras” or “Shakespeare” – retrieval finds nothing, and generation has no meaningful starting point.
- Say “I don’t know”: Instead of admitting ignorance, the model keeps generating – even when the context no longer makes sense.
The stop-word problem
At transitions between topics – when the model has finished a learned sentence and cannot find a suitable follow-up – the output sometimes collapses into stop-word gibberish: “the the of and the is a...”. This happens because stop words are the most frequent words in the vocabulary and receive the highest score in uncertain contexts.
The fix: As soon as the model generates three consecutive stop words, generation is halted and a new retrieval is triggered (re-seeding). This interrupts the gibberish and finds a new topical anchor point.
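The guard itself is a simple counter. A sketch with an illustrative stop-word list (needsReseed is not a name from the actual source):

```javascript
// Detect gibberish: three consecutive stop words means generation
// should halt and a fresh retrieval (re-seeding) should be triggered.
var stopWords = ["the", "a", "of", "and", "is", "to", "in"];

function needsReseed(generatedWords) {
  var run = 0;
  for (var i = 0; i < generatedWords.length; i++) {
    if (stopWords.indexOf(generatedWords[i]) !== -1) {
      run++;
      if (run >= 3) return true; // "the the of..." detected
    } else {
      run = 0; // any content word resets the counter
    }
  }
  return false;
}
```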
Try it yourself
Ask a question outside the training domain, e.g. “What is Pythagoras?”. Watch how the retrieval finds no matching sentences and generation still tries to produce an answer – with noticeably more orange words.
Chapter 7
The Connection Back to Theory
Every component of KRR Chat corresponds to a concept from the Eigenvalues & AI post. Here is the complete mapping:
Word hashing \(\rightarrow\) Feature map \(\phi(x)\) (Chapter 6: Kernel Trick)
Random Fourier Features \(\rightarrow\) Kernel approximation \(k(x,x') \approx \mathbf{z}(x)^\top\mathbf{z}(x')\) (Chapter 6)
Gaussian elimination \(\rightarrow\) Solving the linear system \((\mathbf{Z}^\top\mathbf{Z} + \lambda\mathbf{I})^{-1}\) (Chapter 3: Iteration)
Ridge parameter \(\lambda\) \(\rightarrow\) Regularization = stopping early (Chapter 5: Regularization)
Hash collisions \(\rightarrow\) Condition number of the matrix (Chapter 3)
Float64 vs. Float32 \(\rightarrow\) Numerical stability (Chapter 5)
Word-by-word prediction \(\rightarrow\) Eigenvalues determine learning speed (Chapter 4: Eigenvalues)
Retrieval + Generation \(\rightarrow\) RAG architecture (Chapter 7: Unification)
KRR Chat is therefore not just a demo – it is a living textbook in which every line of code corresponds to a mathematical concept. The feature map is Chapter 6. The regularization is Chapter 5. The numerical instability is Chapter 3. Everything is connected.
And that is the real punchline: whether a model has a 505-word vocabulary or 50,000 – the mathematical structure is the same. Eigenvalues, kernels, regularization. The chain holds.
Frequently Asked Questions
What is the difference between KRR Chat and ChatGPT?
ChatGPT uses a neural network with billions of parameters and attention mechanisms. KRR Chat uses Kernel Ridge Regression with a single weight matrix (1536 × 505 ≈ 776,000 parameters). Both solve the same task – “given context, predict the next word” – but with entirely different mathematical tools. KRR Chat makes visible through color coding what the model memorizes and what it generalizes.
Why does KRR Chat respond in English?
The 104 training sentences and the retrieval corpus of 805 sentences are written in English because the mathematical terminology is predominantly English. The model has no understanding of “language” – it has only learned which English word is likely to follow which context. A German version would simply require German training sentences.
Can KRR Chat be trained with more data?
In principle yes, but the capacity of hash-based encoding sets limits. Beyond ~800 words of vocabulary, hash collisions increase too much, and numerical precision becomes the bottleneck. For larger models, you would need more hash buckets, higher-dimensional Random Fourier Features – or indeed a neural network. That is exactly the point: KRR Chat shows where the boundary lies and why scaling to neural networks becomes necessary.