Deepfakes Explained
From vectors and autoencoders to the face swap – the math behind the illusion
The Motivation
In summer 2024, I gave a tech talk at my company. The topic: nonlinear operations in high-dimensional spaces. Sounds abstract. It is – until you realize this is exactly the math behind deepfakes.
The question I wanted to answer: How does a computer transfer one person's face onto another so convincingly? The answer goes through vectors, matrices, dimensionality reduction, the kernel trick, and neural networks – and in the end, it's shockingly simple.
This post follows the arc of my presentation. Each chapter builds on the previous. By the end, you'll understand why a swapped decoder is enough to fake a face.
Vectors & Matrices
Let's start at the beginning. A vector is a list of numbers. Two numbers describe a point in the plane, three a point in space:
$$\vec{A} = \begin{pmatrix} 2 \\ 3 \end{pmatrix} \quad \text{(2D)} \qquad \vec{B} = \begin{pmatrix} 1 \\ 4 \\ 2 \end{pmatrix} \quad \text{(3D)}$$
A matrix is a table of numbers that turns one vector into another. A 3×3 matrix can rotate, scale, or project a vector – all with a single multiplication:
$$\vec{v}_{\text{new}} = R \cdot \vec{v}_{\text{old}}$$
Try it – transform a vector live below:
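The same transformation can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the interactive visualization; the helper name `rotation_z` and the choice of a quarter turn around the z-axis are assumptions made for the example.

```python
import numpy as np

def rotation_z(theta: float) -> np.ndarray:
    """3x3 matrix that rotates a vector by theta radians around the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c,  -s,  0.0],
                     [s,   c,  0.0],
                     [0.0, 0.0, 1.0]])

v_old = np.array([1.0, 4.0, 2.0])   # the 3D example vector B from above
R = rotation_z(np.pi / 2)           # quarter turn around z
v_new = R @ v_old                   # a single multiplication does the whole transform

print(np.round(v_new, 6))           # approximately [-4, 1, 2]
```

A rotation changes the direction of the vector but not its length, which is an easy sanity check for any rotation matrix.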
Orthogonality
A vector in 3D space can be written as a combination of three basis vectors $\vec{i}$, $\vec{j}$, and $\vec{k}$. These are perpendicular to each other – orthogonal. Each describes a completely independent direction.
What does that mean for real data? Take human attributes:
- Shoe size ↔ height: related (correlated)
- Age ↔ weight: partially related
- Eye color ↔ income: completely independent (orthogonal)
Orthogonal features carry no redundant information. And that's what becomes important later: if we can find and remove redundancy in data, we need fewer dimensions.
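Numerically, "orthogonal" just means that the dot product is zero. A small sketch, assuming the standard basis for $\vec{i}$, $\vec{j}$, $\vec{k}$ and reusing the example vector from the first chapter:

```python
import numpy as np

i, j, k = np.eye(3)     # standard basis vectors, taken from the rows of the identity matrix

# Orthogonal: the dot product of two different basis vectors is zero.
dots = (i @ j, i @ k, j @ k)

# Any 3D vector is a weighted sum of the basis vectors;
# each weight is simply the dot product with that basis vector.
v = np.array([1.0, 4.0, 2.0])
rebuilt = (v @ i) * i + (v @ j) * j + (v @ k) * k

print(dots, np.allclose(rebuilt, v))
```

Because the three directions do not overlap, each weight can be read off independently of the other two. That independence is what "no redundant information" means here.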
Blind Source Separation
An example of orthogonality's power: the cocktail party problem. Someone speaks while music plays. Two microphones record the mixture. Mathematically:
$$\mathbf{x} = A \cdot \mathbf{s}$$
where $\mathbf{s}$ are the source signals and $A$ is the mixing matrix. If we can invert $A$:
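$$\mathbf{s} = A^{-1} \cdot \mathbf{x}$$
The correlated, mixed data (a slanted parallelogram) becomes decorrelated, separated signals (a clean square). The signals are now orthogonal to each other. In practice, $A$ is not known in advance and has to be estimated from the recordings themselves, for example with independent component analysis (ICA); that is what makes the separation "blind".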
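Here is a toy version of the two equations, under the simplifying assumption that the mixing matrix is known. The two signals (a sine tone and a square wave) and the entries of `A` are made up for illustration; a real cocktail-party recording would not hand you `A`.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 500)
sources = np.vstack([
    np.sin(2 * np.pi * 5 * t),            # stand-in for the voice
    np.sign(np.sin(2 * np.pi * 3 * t)),   # stand-in for the music
])

A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                # assumed mixing matrix (one row per microphone)

mixed = A @ sources                       # x = A s
recovered = np.linalg.inv(A) @ mixed      # s = A^{-1} x

corr_mixed = np.corrcoef(mixed)[0, 1]          # microphones are strongly correlated
corr_recovered = np.corrcoef(recovered)[0, 1]  # separated signals are close to uncorrelated

print(np.allclose(recovered, sources))    # True
print(corr_mixed, corr_recovered)
```

The correlation between the two microphone channels is high, while the correlation between the recovered signals is close to zero, which is the parallelogram-to-square picture in numbers.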
High-Dimensional Data
So far we've thought in 2 or 3 dimensions. But mathematically, you can construct any number. A tesseract is a cube in 4D space – you can barely visualize it, but you can compute with it just fine:
What does this have to do with images? A black-and-white image with 10×20 pixels has 200 pixel values. You can think of it as a 200-dimensional vector – each pixel is a coordinate.
Are these 200 dimensions orthogonal? No. Neighboring pixels are highly correlated. If one pixel is bright, its neighbor probably is too. There's redundancy in the data.
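To make the redundancy concrete, here is a small sketch with a synthetic 10×20 image (a smooth blob plus a little noise, invented for this example, not real photo data). It flattens the image into a 200-dimensional vector and measures how strongly each pixel correlates with its right-hand neighbor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 10x20 grayscale "image": smooth brightness blob plus a little noise.
rows = np.sin(np.linspace(0.0, np.pi, 10))
cols = np.sin(np.linspace(0.0, np.pi, 20))
image = np.outer(rows, cols) + 0.05 * rng.standard_normal((10, 20))

pixel_vector = image.reshape(-1)     # one point in 200-dimensional space
print(pixel_vector.shape)            # (200,)

# Pair every pixel with its right-hand neighbor and correlate the two lists.
left = image[:, :-1].ravel()
right = image[:, 1:].ravel()
neighbor_corr = np.corrcoef(left, right)[0, 1]
print(neighbor_corr)                 # close to 1 for smooth images
```

A correlation near 1 means that knowing one pixel tells you most of what there is to know about its neighbor, so the 200 coordinates are far from independent.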
Dimensionality Reduction – PCA
Principal Component Analysis (PCA) finds a new coordinate system whose axes are orthogonal and ordered by how much variance they explain. The first principal component points in the direction of greatest spread; the second points in the direction of greatest remaining spread, perpendicular to the first.
Try it – draw points and watch the PCA axes update live:
The result for images: 10,000 pixels become 2,500 principal components – and the image looks almost identical. The rest was redundancy. That's dimensionality reduction.
For more depth: the Eigenvalues post explains why PCA is mathematically an eigenvalue decomposition of the covariance matrix – and the Fourier post shows why the DCT (which JPEG uses) does essentially the same thing.
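As a minimal sketch of PCA as an eigenvalue decomposition of the covariance matrix, the snippet below works on synthetic, strongly correlated 2D points (generated here purely for illustration) and reduces them to one dimension. The variable names are my own; a library such as scikit-learn would hide these steps behind a single call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D data in which the second coordinate mostly follows the first.
base = rng.standard_normal(500)
X = np.column_stack([base, 0.8 * base + 0.3 * rng.standard_normal(500)])

mean = X.mean(axis=0)
Xc = X - mean                                # 1. center the data
cov = np.cov(Xc, rowvar=False)               # 2. 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # 3. eigen-decomposition (ascending order)
order = np.argsort(eigvals)[::-1]            # 4. sort by explained variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()          # share of variance per component
reduced = Xc @ eigvecs[:, :1]                # keep only the first component: 2D -> 1D
restored = reduced @ eigvecs[:, :1].T + mean # map back to 2D for comparison

print(explained)
print(np.mean((restored - X) ** 2))          # small reconstruction error
```

The columns of `eigvecs` are the new orthogonal axes. Dropping the weaker one loses only the small share of variance reported in `explained[1]`.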
The Limits of Linearity
Dimensionality reduction alone isn't enough. If we linearly interpolate between two faces, this happens:
The intermediate images aren't faces – they're overlays. In pixel space, the straight path between two faces is not a face.
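The overlay effect is easy to reproduce with a toy example. Instead of faces, the sketch below uses two tiny made-up images, each containing one bright square, once on the left and once on the right. A plausible "in-between" image would show one square in the middle; linear interpolation in pixel space produces something else.

```python
import numpy as np

def square_image(col: int, size: int = 4, width: int = 16) -> np.ndarray:
    """Black image of shape (size, width) with one white size x size square starting at column col."""
    img = np.zeros((size, width))
    img[:, col:col + size] = 1.0
    return img

img_a = square_image(0)      # square at the left edge
img_b = square_image(12)     # square at the right edge

alpha = 0.5
blend = (1 - alpha) * img_a + alpha * img_b   # linear interpolation, pixel by pixel

print(blend[0])   # two half-bright squares at the edges, nothing in the middle
```

The midpoint contains two faded squares rather than one square halfway along, which is the same ghosting you see when two faces are cross-faded.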
It's completely different with nonlinear interpolation: here, the intermediate frames are actually new, plausible faces.
The Kernel Trick
And here comes the crucial trick: dimension increase. Data that isn't linearly separable in 2D becomes separable in a higher-dimensional space:
Left: two classes in concentric rings – no straight line can separate them. Right: through the transformation $z = x^2 + y^2$ (lifting into the third dimension), a simple plane becomes the separator.
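The lifting step can be checked numerically. This sketch generates two synthetic rings (the radii and the threshold of 4 are assumptions chosen for the example), applies $z = x^2 + y^2$, and tests whether a flat plane at constant height separates the classes.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius_low: float, radius_high: float, n: int) -> np.ndarray:
    """n random 2D points whose distance from the origin lies in [radius_low, radius_high]."""
    r = rng.uniform(radius_low, radius_high, n)
    angle = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.column_stack([r * np.cos(angle), r * np.sin(angle)])

inner = ring(0.8, 1.2, 100)   # class 1: small ring
outer = ring(2.8, 3.2, 100)   # class 2: large ring around it

def lift(points: np.ndarray) -> np.ndarray:
    """Third coordinate z = x^2 + y^2 for every point."""
    return points[:, 0] ** 2 + points[:, 1] ** 2

threshold = 4.0               # the plane z = 4
print((lift(inner) < threshold).all(), (lift(outer) > threshold).all())   # True True
```

In the original plane no straight line works, but after lifting, a single comparison against the plane height classifies every point correctly.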
Together, an elegant double-play emerges:
- Dimensionality reduction – to remove redundancy (PCA, DCT)
- Dimension increase – to make nonlinearity tractable (kernel trick)
A neural network can do both at once.
Neural Networks
A single neuron takes multiple inputs $x_i$, multiplies each by a weight $w_i$, sums them up, and passes the result through an activation function $\varphi$:
$$y = \varphi\left(\sum_i w_i \cdot x_i + b\right)$$
The activation function is the key: it's nonlinear. Without it, the entire network would just be one big matrix multiplication – nothing a single matrix couldn't do.
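Both statements fit into a short sketch: a single neuron as in the formula, and a check that two layers without activation collapse into one matrix. The inputs, weights, and the choice of `tanh` as $\varphi$ are arbitrary assumptions for illustration.

```python
import numpy as np

def neuron(x, w, b, phi=np.tanh):
    """y = phi(sum_i w_i * x_i + b)"""
    return phi(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, -0.5])
b = 0.1
y = neuron(x, w, b)          # tanh(0.4 - 0.2 - 1.0 + 0.1) = tanh(-0.7)

# Two layers WITHOUT activation are the same as one combined matrix.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
two_linear_layers = W2 @ (W1 @ x)
one_matrix = (W2 @ W1) @ x
print(np.allclose(two_linear_layers, one_matrix))   # True

# With a nonlinearity in between, that shortcut no longer exists.
with_activation = W2 @ np.tanh(W1 @ x)
print(np.allclose(with_activation, one_matrix))     # False
```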
Stack many neurons in layers – input, hidden, output – and you get a neural network. The hidden layers automatically learn which dimensions to expand and which to compress.
For more on emergent properties of such networks: the Emergence post covers exactly this.
Autoencoders
An autoencoder is a special neural network with an hourglass architecture:
- Encoder: compresses a high-dimensional image into a low-dimensional latent space
- Decoder: reconstructs the image from the compressed form
The training objective: the output should match the input as closely as possible. The bottleneck in the middle – the latent space – forces the network to keep only the essential information.
That's nonlinear dimensionality reduction. Try it – draw a digit and see how the autoencoder reconstructs it:
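For readers who prefer code to widgets, here is a deliberately tiny autoencoder in plain NumPy. It is a sketch under strong assumptions, not the model behind the demo: the data is synthetic (8 dimensions that really vary along only 2 hidden factors), the encoder is a single `tanh` layer, the decoder is linear, and training is plain gradient descent with hand-written gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 8 dimensions, but only 2 underlying factors.
factors = rng.uniform(-1.0, 1.0, size=(200, 2))
X = 0.5 * factors @ rng.standard_normal((2, 8))

n_in, n_latent = 8, 2                       # hourglass: 8 -> 2 -> 8
W_enc = 0.1 * rng.standard_normal((n_in, n_latent))
b_enc = np.zeros(n_latent)
W_dec = 0.1 * rng.standard_normal((n_latent, n_in))
b_dec = np.zeros(n_in)

def encode(x):
    return np.tanh(x @ W_enc + b_enc)       # compress into the latent space

def decode(z):
    return z @ W_dec + b_dec                # reconstruct from the latent code

def reconstruction_loss(x):
    return float(np.mean((decode(encode(x)) - x) ** 2))

loss_before = reconstruction_loss(X)

learning_rate = 0.2
for _ in range(3000):
    z = encode(X)
    x_hat = decode(z)
    grad_x_hat = 2.0 * (x_hat - X) / X.size             # d(loss) / d(x_hat)
    grad_W_dec = z.T @ grad_x_hat
    grad_b_dec = grad_x_hat.sum(axis=0)
    grad_pre = (grad_x_hat @ W_dec.T) * (1.0 - z ** 2)  # back through tanh
    grad_W_enc = X.T @ grad_pre
    grad_b_enc = grad_pre.sum(axis=0)
    W_dec -= learning_rate * grad_W_dec
    b_dec -= learning_rate * grad_b_dec
    W_enc -= learning_rate * grad_W_enc
    b_enc -= learning_rate * grad_b_enc

loss_after = reconstruction_loss(X)
print(loss_before, loss_after)              # the loss should drop during training
```

The only training signal is "output should equal input". The two-unit bottleneck forces the network to find a compact code on its own. Real image autoencoders use convolutional layers and a framework with automatic differentiation, but the objective is the same.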
Latent-Space Arithmetic
And now the magic happens. In the latent space, you can do arithmetic on faces like vectors:
$$\text{smiling woman} - \text{neutral woman} + \text{neutral man} = \text{smiling man}$$
This works because a well-trained latent space decomposes faces into roughly independent, near-orthogonal features: gender, expression, gaze direction, lighting. Ideally, each of these semantic properties corresponds to its own direction in the latent space.
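The arithmetic itself is ordinary vector addition. In the sketch below, the latent codes are hand-made two-dimensional vectors with a hypothetical axis for gender and one for smile. A real latent space has many more dimensions and its axes are learned, not labeled.

```python
import numpy as np

# Hypothetical latent codes: axis 0 = gender (+1 woman, -1 man), axis 1 = smile (0 neutral, 1 smiling)
smiling_woman = np.array([1.0, 1.0])
neutral_woman = np.array([1.0, 0.0])
neutral_man = np.array([-1.0, 0.0])

smile_direction = smiling_woman - neutral_woman   # isolates the "smile" feature: [0, 1]
smiling_man = neutral_man + smile_direction       # moves the man along that direction

print(smiling_man)   # [-1.  1.]
```

In a real system, the resulting vector would then be passed through the decoder to turn it back into an image.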
Explore the latent space of a digit VAE:
If this rings a bell: it's the same trick Word2Vec uses for words. King − Man + Woman = Queen works on exactly the same principle. The Eigenvalues post explains why.
Deepfakes – The Decoder Swap
And we've arrived. The deepfake trick is shockingly simple:
- Train two autoencoders – one for person A, one for person B
- Both share the same encoder, but have different decoders
- The encoder learns a shared face representation in the latent space
- The swap: take an image of person A, run it through the shared encoder – then through the decoder of person B
The result: the expression and head pose of A, but the appearance of B. A deepfake. Not magic – just dimensionality reduction, dimension increase, and a swapped decoder.
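The data flow of the swap can be sketched as follows. Important caveat: the weights here are random and untrained, the layer sizes are hypothetical, and the "image" is random noise, so the output is meaningless as a picture. The sketch only shows which component is shared and which one is exchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PIXELS, N_LATENT = 64, 8                   # hypothetical sizes, e.g. an 8x8 grayscale crop

def dense(n_in: int, n_out: int) -> np.ndarray:
    """Random weight matrix standing in for a trained layer."""
    return 0.1 * rng.standard_normal((n_in, n_out))

shared_encoder = dense(N_PIXELS, N_LATENT)   # one encoder, trained on both people
decoder_a = dense(N_LATENT, N_PIXELS)        # decoder that only ever reconstructs person A
decoder_b = dense(N_LATENT, N_PIXELS)        # decoder that only ever reconstructs person B

def encode(image: np.ndarray) -> np.ndarray:
    return np.tanh(image @ shared_encoder)

def decode(latent: np.ndarray, decoder: np.ndarray) -> np.ndarray:
    return latent @ decoder

face_a = rng.uniform(0.0, 1.0, N_PIXELS)     # placeholder for a real image of person A

latent = encode(face_a)                      # expression and pose, in the shared code
reconstruction = decode(latent, decoder_a)   # normal autoencoder path: A -> A
swapped = decode(latent, decoder_b)          # the swap: A's latent code, B's decoder

print(latent.shape, swapped.shape)           # (8,) (64,)
```

During training, each decoder would only ever be paired with images of its own person, while the encoder sees both. At inference time the single changed argument in the last call is the entire trick.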
Ethics & Detection
Deepfakes are unsettling. But understanding beats panic. Knowing how they work helps you spot them:
- Artifacts: unnatural transitions at the hairline, ears, teeth
- Consistency: lighting on the face doesn't match the rest of the image
- Blinking: early deepfakes blink too rarely (training-data bias)
- Forensics: frequency analysis reveals GAN-typical patterns in the spectrum
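To give a feeling for the frequency-analysis idea, here is a rough sketch, not a working detector. It measures how much of an image's spectral energy sits at high frequencies. The helper `high_freq_ratio`, the cutoff radius, and the faint checkerboard used as a stand-in for generator artifacts are all assumptions made for this example.

```python
import numpy as np

def high_freq_ratio(img: np.ndarray) -> float:
    """Share of spectral energy outside the low-frequency center of the 2D spectrum."""
    spectrum = np.fft.fftshift(np.fft.fft2(img - img.mean()))   # zero frequency moved to the center
    power = np.abs(spectrum) ** 2
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h // 2, xx - w // 2)                 # distance from the center
    return float(power[radius > min(h, w) / 4].sum() / power.sum())

y, x = np.mgrid[0:64, 0:64]
smooth = np.sin(2 * np.pi * x / 64) + np.cos(2 * np.pi * y / 64)   # stand-in for a natural image
artifact = 0.2 * ((x + y) % 2)                                     # faint pixel-level checkerboard

ratio_clean = high_freq_ratio(smooth)
ratio_suspicious = high_freq_ratio(smooth + artifact)
print(ratio_clean, ratio_suspicious)
```

The checkerboard is nearly invisible next to the smooth pattern, yet it shows up as extra energy at the edge of the spectrum. Real forensic tools look for subtler signatures and are trained on large sets of genuine and generated images.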
The technology is neutral. It serves medical simulation, film post-production, and accessibility (lip-sync for the deaf) just as well. The question isn't whether we should understand it – but whether we can afford not to.
This post is based on a tech talk I gave at P&M Agentur. The original slides and all interactive visualizations are freely available.
Frequently Asked Questions
How does a deepfake work technically?
A deepfake uses two autoencoders with a shared encoder but different decoders. An image of person A is passed through the shared encoder and then reconstructed by the decoder of person B. The result: expression and head pose from A, appearance from B.
What is the difference between PCA and an autoencoder?
PCA is a linear dimensionality reduction – it finds the best orthogonal coordinate system. An autoencoder is the nonlinear generalization: because its activation functions are nonlinear, it can in principle learn curved, complex manifolds that no linear projection can capture.
What is the kernel trick?
The kernel trick lifts data into a higher-dimensional space where it becomes linearly separable. Mathematically, you never need the explicit higher dimensions – computing the inner products between data points in the higher space is enough.
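A small numerical check of that statement, using the degree-2 polynomial kernel $k(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v})^2$ as an example. Note that its feature map `phi` is a different lift than the $z = x^2 + y^2$ used earlier; it is chosen here because its inner product has a simple closed form. The two sample points are arbitrary.

```python
import numpy as np

def phi(p: np.ndarray) -> np.ndarray:
    """Explicit lift of a 2D point into 3D: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = p
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(u: np.ndarray, v: np.ndarray) -> float:
    """Same inner product, computed entirely in the original 2D space."""
    return float(np.dot(u, v) ** 2)

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

explicit = float(np.dot(phi(u), phi(v)))   # lift first, then take the inner product
implicit = poly_kernel(u, v)               # never leaves 2D

print(explicit, implicit)                  # both approximately 1.0
```

Both routes give the same number, but the kernel version never builds the higher-dimensional vectors. For lifts into thousands or infinitely many dimensions, that shortcut is what keeps the computation feasible.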
Why can we still sometimes detect deepfakes?
Because models trained on insufficient data leave artifacts: unnatural blinking, visible seams at the hairline or ears, lighting on the face that doesn't match the rest of the image, and patterns in the frequency spectrum that don't appear in real images.