Deepfakes Explained

From vectors and autoencoders to the face swap – the math behind the illusion

1 The Motivation

In summer 2024, I gave a tech talk at my company. The topic: nonlinear operations in high-dimensional spaces. Sounds abstract. It is – until you realize this is exactly the math behind deepfakes.

The question I wanted to answer: How does a computer transfer one person's face onto another so convincingly? The answer goes through vectors, matrices, dimensionality reduction, the kernel trick, and neural networks – and in the end, it's shockingly simple.

This post follows the arc of my presentation. Each chapter builds on the previous. By the end, you'll understand why a swapped decoder is enough to fake a face.

2 Vectors & Matrices

Let's start at the beginning. A vector is a list of numbers. Two numbers describe a point in the plane, three a point in space:

$$\vec{A} = \begin{pmatrix} 2 \\ 3 \end{pmatrix} \quad \text{(2D)} \qquad \vec{B} = \begin{pmatrix} 1 \\ 4 \\ 2 \end{pmatrix} \quad \text{(3D)}$$

A matrix is a table of numbers that turns one vector into another. A 3×3 matrix can rotate, scale, or project a vector – all with a single multiplication:

$$\vec{v}_{\text{new}} = R \cdot \vec{v}_{\text{old}}$$
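Here's a minimal sketch of that single multiplication in plain Python. The rotation around the z-axis, the quarter-turn angle, and the helper names `rotation_z` and `matvec` are my choices for illustration, not part of the original talk:

```python
import math

def rotation_z(theta):
    """3x3 matrix that rotates a vector by theta radians around the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c,  -s,  0.0],
            [s,   c,  0.0],
            [0.0, 0.0, 1.0]]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

R = rotation_z(math.pi / 2)   # a quarter turn
v_old = [1.0, 0.0, 2.0]
v_new = matvec(R, v_old)      # v_new = R * v_old

print([round(x, 6) for x in v_new])
```

The x-component ends up on the y-axis, and the z-component is untouched because the rotation axis is z.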

3 Orthogonality

A vector in 3D space can be written as a combination of three basis vectors $\vec{i}$, $\vec{j}$, and $\vec{k}$. These are perpendicular to each other – orthogonal. Each describes a completely independent direction.

What does that mean for real data? Take human attributes:

Orthogonal features carry no redundant information. And that's what becomes important later: if we can find and remove redundancy in data, we need fewer dimensions.
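Orthogonality is easy to check numerically: two vectors are orthogonal exactly when their dot product is zero. A small sketch, where the example vectors `v` and `u` are made up for illustration:

```python
def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

# The three basis vectors of 3D space
i, j, k = [1, 0, 0], [0, 1, 0], [0, 0, 1]

# Every pair is orthogonal: all dot products are zero
pairwise = [dot(i, j), dot(i, k), dot(j, k)]

# Because the basis is orthogonal, each coordinate of v can be read off
# independently by projecting onto one basis vector
v = [2, 3, 5]
coords = [dot(v, i), dot(v, j), dot(v, k)]

# A direction that is NOT orthogonal to i shares information with it
u = [1, 1, 0]
overlap = dot(i, u)

print(pairwise, coords, overlap)
```

The nonzero `overlap` is the redundancy the section talks about: knowing the component along `i` already tells you something about the component along `u`.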

4 Blind Source Separation

An example of orthogonality's power: the cocktail party problem. Someone speaks while music plays. Two microphones record the mixture. Mathematically:

$$\mathbf{x} = A \cdot \mathbf{s}$$

where $\mathbf{s}$ are the source signals, $\mathbf{x}$ are the microphone recordings, and $A$ is the mixing matrix. If we can estimate $A$ and invert it:

$$\mathbf{s} = A^{-1} \cdot \mathbf{x}$$

In a scatter plot, the correlated, mixed data forms a slanted parallelogram; after unmixing, the decorrelated, separated signals fill a clean square. The signals are now uncorrelated, which is the statistical counterpart of orthogonal.
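A minimal sketch of the mixing and unmixing step. It assumes the mixing matrix is known, which is the easy half of the problem; in the truly blind case $A$ has to be estimated from the recordings alone (for example with independent component analysis, which I'm naming here as an assumption, not as the method from the talk). The signals and the matrix entries are invented:

```python
import math

def inverse_2x2(A):
    """Invert a 2x2 matrix given as a list of rows."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular, cannot unmix")
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec2(M, x):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [M[0][0] * x[0] + M[0][1] * x[1],
            M[1][0] * x[0] + M[1][1] * x[1]]

n = 200
speech = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]     # sine wave
music = [1.0 if (t // 10) % 2 == 0 else -1.0 for t in range(n)]    # square wave

A = [[0.8, 0.3],    # microphone 1 hears mostly the speaker
     [0.4, 0.7]]    # microphone 2 hears mostly the music

mixed = [matvec2(A, [s1, s2]) for s1, s2 in zip(speech, music)]    # x = A s
A_inv = inverse_2x2(A)
recovered = [matvec2(A_inv, x) for x in mixed]                     # s = A^-1 x

max_error = max(abs(r[0] - s1) + abs(r[1] - s2)
                for r, s1, s2 in zip(recovered, speech, music))
print(max_error)
```

The recovered signals match the sources up to floating-point rounding.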

Figure: Blind source separation by inverting the mixing matrix. Left: correlated mixed signals. Right: separated source signals after decorrelation.

5 High-Dimensional Data

So far we've thought in 2 or 3 dimensions. But mathematically, you can construct any number. A tesseract is a cube in 4D space – you can barely visualize it, but you can compute with it just fine.

What does this have to do with images? A grayscale image with 10×20 pixels has 200 pixel values. You can think of it as a 200-dimensional vector – each pixel is a coordinate.

Are these 200 dimensions orthogonal? No. Neighboring pixels are highly correlated. If one pixel is bright, its neighbor probably is too. There's redundancy in the data.
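To make the redundancy concrete, here's a sketch that builds a synthetic 10×20 image, flattens it into a 200-dimensional vector, and measures how strongly each pixel correlates with its right-hand neighbor. The image is a made-up smooth pattern plus a little noise, not real photo data:

```python
import math
import random

random.seed(0)
ROWS, COLS = 10, 20

# Synthetic smooth image: slow brightness waves plus a little noise
image = [[math.sin(c / 3) + math.cos(r / 2) + random.uniform(-0.1, 0.1)
          for c in range(COLS)]
         for r in range(ROWS)]

# Flatten: the image is now one point in 200-dimensional space
vector = [pixel for row in image for pixel in row]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Pair every pixel with its right-hand neighbor
left = [image[r][c] for r in range(ROWS) for c in range(COLS - 1)]
right = [image[r][c + 1] for r in range(ROWS) for c in range(COLS - 1)]
neighbor_corr = pearson(left, right)

print(len(vector), round(neighbor_corr, 3))
```

A correlation close to 1 means the neighbor's value is largely predictable, so the 200 coordinates carry far fewer than 200 independent pieces of information.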

6 Dimensionality Reduction – PCA

Principal Component Analysis (PCA) finds a new coordinate system whose axes are orthogonal and ordered by how much variance they explain. The first principal component points in the direction of greatest spread; the second is perpendicular to it and captures the greatest remaining spread.
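In two dimensions, PCA can be sketched without any library, because the principal axis of a 2×2 covariance matrix has a closed-form angle. The data below is invented (points scattered around the line y = 0.5x), and the variable names are mine:

```python
import math
import random

random.seed(1)

# Made-up 2D data: points scattered closely around the line y = 0.5 x
points = []
for _ in range(200):
    x = random.gauss(0, 3)
    points.append((x, 0.5 * x + random.gauss(0, 0.3)))

n = len(points)
mean_x = sum(p[0] for p in points) / n
mean_y = sum(p[1] for p in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2x2 covariance matrix
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Closed-form angle of the major axis of a 2x2 covariance matrix
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
pc1 = (math.cos(theta), math.sin(theta))     # direction of greatest spread
pc2 = (-math.sin(theta), math.cos(theta))    # perpendicular to pc1

# Variance of the data along each principal axis
var1 = sum((x * pc1[0] + y * pc1[1]) ** 2 for x, y in centered) / (n - 1)
var2 = sum((x * pc2[0] + y * pc2[1]) ** 2 for x, y in centered) / (n - 1)
explained = var1 / (var1 + var2)

# Dimensionality reduction: keep only the coordinate along pc1
scores = [x * pc1[0] + y * pc1[1] for x, y in centered]

print(round(math.degrees(theta), 1), round(explained, 4))
```

Keeping only `scores` turns every 2D point into a single number while preserving nearly all of the variance. For images the same idea applies in far more dimensions, where you would use an eigenvalue solver instead of the closed-form angle.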

The result for images: 10,000 pixels become 2,500 principal components – and the image looks almost identical. The rest was redundancy. That's dimensionality reduction.

For more depth: the Eigenvalues post explains why PCA is mathematically an eigenvalue decomposition of the covariance matrix – and the Fourier post shows why the DCT (which JPEG uses) does essentially the same thing. The post The Eigenprinciple shows that the same eigen-structure also sits behind vibration, search and probability.

7 The Limits of Linearity

Dimensionality reduction alone isn't enough. If we linearly interpolate between two faces, this happens:

Figure: Linear interpolation between faces. The intermediate frames are ghosts, not faces.

The intermediate images aren't faces – they're overlays. In pixel space, the straight path between two faces is not a face.

It's completely different with nonlinear interpolation: here, the intermediate frames are actually new, plausible faces.
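A toy version of the difference, with a 9-pixel "image" standing in for a face. The `render` function and the idea of describing each image by a single position parameter are my simplification: the position plays the role of a latent variable.

```python
def render(position, width=9):
    """Tiny 1D 'image': a single bright pixel at the given position."""
    return [1.0 if i == position else 0.0 for i in range(width)]

face_a = render(1)   # bright spot on the left
face_b = render(7)   # bright spot on the right

# Linear interpolation in pixel space: blend the pixel values
pixel_mid = [0.5 * a + 0.5 * b for a, b in zip(face_a, face_b)]

# Interpolation in the parameter that generates the image
latent_mid = render((1 + 7) // 2)

print(pixel_mid)    # two faint spots: an overlay, a "ghost"
print(latent_mid)   # one bright spot in the middle: a plausible new image
```

The pixel-space midpoint contains both originals at half brightness. The midpoint in parameter space is a valid image of the same kind as the two endpoints.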

8 The Kernel Trick

And here comes the crucial trick: dimension increase. Data that isn't linearly separable in 2D becomes separable in a higher-dimensional space:

Left: two classes in concentric rings – no straight line can separate them. Right: through the transformation $z = x^2 + y^2$ (lifting into the third dimension), a simple plane becomes the separator.
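The lift can be sketched in a few lines. The ring radii, the jitter, and the threshold of 4 are values I picked for illustration:

```python
import math
import random

random.seed(42)

def ring(radius, n, label):
    """n points on a slightly noisy circle, each tagged with a class label."""
    points = []
    for k in range(n):
        angle = 2 * math.pi * k / n
        r = radius + random.uniform(-0.2, 0.2)
        points.append((r * math.cos(angle), r * math.sin(angle), label))
    return points

# Inner ring is class 0, outer ring is class 1
data = ring(1.0, 50, 0) + ring(3.0, 50, 1)

def lift(x, y):
    """Third coordinate: z = x^2 + y^2."""
    return x * x + y * y

# In the lifted space, the horizontal plane z = 4 separates the classes
THRESHOLD = 4.0
predictions = [1 if lift(x, y) > THRESHOLD else 0 for x, y, _ in data]
accuracy = sum(p == label for p, (_, _, label) in zip(predictions, data)) / len(data)

print(accuracy)
```

Every inner point lands below the plane and every outer point above it, so a single linear cut in 3D does what no straight line could do in 2D.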

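Together, an elegant double-play emerges: dimensionality reduction strips out redundancy, and dimension increase makes tangled data separable.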

A neural network can do both at once.

9 Neural Networks

A single neuron takes multiple inputs $x_i$, multiplies each by a weight $w_i$, sums them up, and passes the result through an activation function $\varphi$:

$$y = \varphi\left(\sum_i w_i \cdot x_i + b\right)$$

The activation function is the key: it's nonlinear. Without it, the entire network would just be one big matrix multiplication – nothing a single matrix couldn't do.
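Both claims fit in a short sketch: the neuron formula itself, and the collapse of stacked layers without an activation. The weights, inputs, and the choice of `tanh` as $\varphi$ are arbitrary example values:

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """y = phi(sum_i w_i * x_i + b)"""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

y = neuron([1.0, 2.0], [0.5, -0.25], 0.1)

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix product A * B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

W1 = [[1, 2], [3, 4]]   # first layer, no bias for simplicity
W2 = [[0, 1], [1, 1]]   # second layer
x = [1, -1]

# Two layers without activation ...
two_linear_layers = matvec(W2, matvec(W1, x))
# ... equal one layer with the product matrix
one_matrix = matvec(matmul(W2, W1), x)

# With a nonlinearity in between, the shortcut no longer works
with_activation = matvec(W2, [math.tanh(h) for h in matvec(W1, x)])

print(y, two_linear_layers, one_matrix, with_activation)
```

The first two results are identical, which is why depth only pays off once a nonlinear activation sits between the layers.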

Stack many neurons in layers – input, hidden, output – and you get a neural network. The hidden layers automatically learn which dimensions to expand and which to compress.

For more on emergent properties of such networks: the Emergence post covers exactly this.

10 Autoencoders

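An autoencoder is a special neural network with an hourglass architecture: an encoder squeezes the input down to a narrow middle layer, and a decoder expands it back to the original size.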

The training objective: the output should match the input as closely as possible. The bottleneck in the middle – the latent space – forces the network to keep only the essential information.
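Written as a formula, the objective might look like this. The symbols $E_\theta$ (encoder), $D_\phi$ (decoder), $z$ (latent code), and the squared-error loss over $N$ training images are notation and a common choice that I'm assuming here; other reconstruction losses work as well:

$$z = E_\theta(x), \qquad \hat{x} = D_\phi(z), \qquad \min_{\theta,\phi} \; \frac{1}{N} \sum_{n=1}^{N} \left\lVert x_n - D_\phi\big(E_\theta(x_n)\big) \right\rVert^2 \quad \text{with} \quad \dim z \ll \dim x$$

The condition $\dim z \ll \dim x$ is the bottleneck: the network cannot simply copy the input through, so it has to find a compact description.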

That's nonlinear dimensionality reduction.

11 Latent-Space Arithmetic

And now the magic happens. In the latent space, you can do arithmetic on faces like vectors:

$$\text{smiling woman} - \text{neutral woman} + \text{neutral man} = \text{smiling man}$$

This works to the extent that the latent space has decomposed faces into roughly orthogonal features: gender, expression, gaze direction, lighting. In a well-organized latent space, moving along certain directions changes one semantic property and leaves the others mostly alone.
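A sketch of the arithmetic with hand-made latent codes. Real latent vectors have many more dimensions and come out of a trained encoder; the two axes here (gender, expression) and their values are hypothetical, chosen so that the features are perfectly orthogonal:

```python
# Hypothetical 2D latent codes: (gender axis, expression axis)
neutral_woman = [1.0, 0.0]
smiling_woman = [1.0, 1.0]
neutral_man = [-1.0, 0.0]

# The difference isolates the "smile" direction
smile_direction = [s - n for s, n in zip(smiling_woman, neutral_woman)]

# Adding that direction to another face transfers the smile
result = [m + d for m, d in zip(neutral_man, smile_direction)]

print(smile_direction)   # [0.0, 1.0]
print(result)            # [-1.0, 1.0], the code for a smiling man
```

In a real model, `result` would then be passed through the decoder to turn the latent code back into an image.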

If this rings a bell: it's the same trick Word2Vec uses for words. King − Man + Woman ≈ Queen works on the same principle. The Eigenvalues post explains why.

12 Deepfakes – The Decoder Swap

And we've arrived. The deepfake trick is shockingly simple:

  1. Train two autoencoders – one for person A, one for person B
  2. Both share the same encoder, but have different decoders
  3. The encoder learns a shared face representation in the latent space
  4. The swap: take an image of person A, run it through the shared encoder – then through the decoder of person B

Figure: Deepfake encoder-decoder swap. An image of A goes through the shared encoder and then the decoder of B, giving the appearance of B with the expression of A.

The result: the expression and head pose of A, but the appearance of B. A deepfake. Not magic – just dimensionality reduction, dimension increase, and a swapped decoder.
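The data flow of the swap can be sketched with hand-written stand-ins for the trained networks. Everything here is a toy assumption: a 4-number "image" whose first two values encode appearance and whose last two encode expression and pose, an encoder that simply keeps the last two, and decoders that prepend a fixed appearance. Real encoders and decoders are learned, and the split is never this clean:

```python
# Hypothetical appearance patterns for the two people
IDENTITY_A = [1.0, 0.0]
IDENTITY_B = [0.0, 1.0]

def shared_encoder(image):
    """Stand-in for the shared encoder: keeps expression/pose, drops identity."""
    return image[2:]

def make_decoder(identity):
    """Stand-in for a person-specific decoder: always draws that person's face."""
    def decoder(latent):
        return identity + list(latent)
    return decoder

decoder_a = make_decoder(IDENTITY_A)
decoder_b = make_decoder(IDENTITY_B)

# An image of person A with some expression and head pose
image_of_a = IDENTITY_A + [0.9, -0.3]

latent = shared_encoder(image_of_a)

reconstruction = decoder_a(latent)   # normal autoencoder path: A comes back
deepfake = decoder_b(latent)         # the swap: B's appearance, A's expression

print(reconstruction)   # [1.0, 0.0, 0.9, -0.3]
print(deepfake)         # [0.0, 1.0, 0.9, -0.3]
```

The only difference between an honest reconstruction and a deepfake is which decoder receives the latent code.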

13 Ethics & Detection

Deepfakes are unsettling. But understanding beats panic. Knowing how they work helps you spot them.

The technology is neutral. It just as readily enables medical simulation, film post-production, and accessibility tools (lip-sync for the deaf). The question isn't whether we should understand it – but whether we can afford not to.

This post is based on a tech talk I gave at P&M Agentur. The original slides and all interactive visualizations are freely available.

Frequently Asked Questions

How does a deepfake work technically?

A deepfake uses two autoencoders with a shared encoder but different decoders. An image of person A is passed through the shared encoder and then reconstructed by the decoder of person B. The result: expression and head pose from A, appearance from B.

What is the difference between PCA and an autoencoder?

PCA is a linear dimensionality reduction – it finds the best orthogonal coordinate system. An autoencoder is the nonlinear generalization: because its activation functions are nonlinear, it can follow curved manifolds in the data that no linear projection can capture.

What is the kernel trick?

The kernel trick lifts data into a higher-dimensional space where it can become linearly separable. Mathematically, you never need the explicit higher dimensions – a kernel function computes the inner products between data points in the higher space directly from the original coordinates, and that is enough.
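A small sketch of that shortcut, using the degree-2 polynomial kernel as an example (my choice; it is not the ring example from section 8). The explicit feature map `phi` goes from 2D to 3D, and the kernel returns the same inner product without ever building the 3D vectors:

```python
import math

def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(p):
    """Explicit lift from 2D to 3D: (x^2, sqrt(2)*x*y, y^2)."""
    x, y = p
    return [x * x, math.sqrt(2) * x * y, y * y]

def poly_kernel(p, q):
    """Degree-2 polynomial kernel: works only with the original 2D points."""
    return dot(p, q) ** 2

p, q = (1.0, 2.0), (3.0, -1.0)

explicit = dot(phi(p), phi(q))   # inner product computed in the lifted space
implicit = poly_kernel(p, q)     # same number, no lifting needed

print(explicit, implicit)
```

For this map the saving is small, but some kernels correspond to very high or even infinite-dimensional spaces, where the explicit route is not available at all.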

Why can we still sometimes detect deepfakes?

Because models trained on insufficient data leave artifacts: unnatural blinking, inconsistent lighting at the hairline or ears, and patterns in the frequency spectrum that don't appear in real images.
