Blog Post · Music · Bass · Fourier

Tommy the Cat — Bass Cover at 210 BPM + FFT Video Sync

The Primus riff simplified, but playable at something close to the original tempo :-D Plus: a browser practice app, and two camera recordings synced frame-precisely via FFT — using only Python and FFmpeg.

KI-Mathias · ~8 min read

Chapter 1

The Riff

Tommy the Cat by Primus. 210 BPM. Let’s just say Claypool’s original is not something you casually pick up after dinner. I gave it a shot anyway, in my own way.

My interpretation: simplified, but playable at original tempo. No Claypool-style slapping, but flamenco strokes, hammer-ons and groove. First-time listeners hear chaos — in reality it's a precise 16-note pattern that repeats. Every note has a defined technique, a defined string, a defined fret.

The tab below shows my simplified version. The first bar opens with a flamenco down-stroke on fret 7 (A string) and fret 9 (D string) simultaneously — a power chord. Then the up-stroke, followed by two thumb hits on the muted E string. From note 5 things get interesting: a hammer-on from G to G♯ on the E string, a popped hammer-on F to G on the D string (index finger), then alternating slap ghost notes and open G string as pop. The pattern repeats — the second and all subsequent bars begin with a hammer-on instead of the initial flamenco down-stroke.

Anyone learning this riff should start very slowly — 80 BPM, note by note, until the hand position is right. The tempo comes by itself.

G|-------------------------------------0-----------0-------------|-------------------------------------0-----------0-------------|
D|-9---9-------------------3---5h------------------------------0-|-9h--9-------------------3---5h------------------------------0-|
A|-7---7---------------------------------------------------------|-7h--7---------------------------------------------------------|
E|---------X---X---3---4-----------4-------4---3-------X---X-----|---------X---X---3---4-----------4-------4---3-------X---X-----|
   ↓   ↑   T   T   T   H   P   H   T   P   T   T   P   T   T   P   H   ↑   T   T   T   H   P   H   T   P   T   T   P   T   T   P
   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16

Legend:
  ↓  = Flamenco Down-Stroke        ↑  = Flamenco Up-Stroke
  T  = Thumb (Slap)                P  = Pop
  H  = Hammer-On                   h  = Hammer-On in tab notation (e.g. 5h = hammer-on to fret 5)
  X  = Muted Note (Ghost Note)

Bar 1, notes 1-16:
  1) Flamenco Down-Stroke: fret 7 (A) + fret 9 (D) simultaneously — power chord
  2) Flamenco Up-Stroke: power chord, other strings muted (fretting hand)
  3) Thumb on muted E string
  4) Thumb on muted E string
  5) Slapped Hammer-On from G to G# on E string
  6) G# on E string (hammer-on destination)
  7) Popped Hammer-On F to G on D string (index finger)
  8) G on D string (hammer-on destination)
  9) Slap G# (E string, fret 4)
 10) Pop open G string
 11) Slap muted E string
 12) Slap G (E string, fret 3)
 13) Pop open G string
 14) Slap muted E string
 15) Slap muted E string
 16) Pop D string (open)

From bar 2: note 1 becomes a hammer-on to fret 7 (A) + fret 9 (D),
followed by flamenco up-stroke. Everything else identical. Repeat.

Practice App:

For interactive practice there's a browser app at pmmathias.github.io/TommyTheCat — adjustable tempo, drums and bass individually mutable.

Honest assessment: My current version sits at 210 BPM. The actual Primus original is probably closer to ~230 BPM. I’m still practicing — and once I can play it cleaner and closer to the original tempo, I’ll likely re-record it. The practice app therefore covers the range 180–240 BPM: from comfortable practice all the way past the original.

Chapter 2

The Practice App

Anyone learning a riff at 210 BPM doesn't start at 210 BPM. The problem: a drum loop slowed to 60 % tempo by simple resampling sounds wrong, because the pitches scale down along with it. A drum kit slowed from 210 BPM to 100 BPM this way doesn't sound like a slower drum kit; it sounds like a different instrument entirely.
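A minimal numpy sketch shows the effect (synthetic 440 Hz tone, assumed 8000 Hz sample rate; naive resampling stands in for "slow motion"):

```python
import numpy as np

sr = 8000                                   # assumed sample rate for the demo
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)          # 1 s of a 440 Hz tone

# Naive 2x speed-up: keep every second sample
fast = tone[::2]

def dominant_hz(x, sr):
    # frequency of the strongest bin in the magnitude spectrum
    return np.argmax(np.abs(np.fft.rfft(x))) * sr / len(x)

print(dominant_hz(tone, sr))   # ~440 Hz
print(dominant_hz(fast, sr))   # ~880 Hz: changing the tempo dragged the pitch along
```

The same thing happens in reverse when slowing down: the pitch drops with the tempo, which is exactly what WSOLA avoids.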

The practice app solves this with the WSOLA algorithm (Waveform Similarity Overlap-Add), implemented via the SoundTouch library. WSOLA changes the tempo while keeping the pitch constant: a snare hit at 60 BPM has exactly the same pitch as one at 210 BPM, just slower.

The app runs entirely in the browser, requires no installation and works on mobile. Recommended workflow: drums on, bass muted, tempo down to 180 BPM. Play the riff on your own bass along with the drums until your fingers know the sequence. Then step up gradually — 190, 200, 210, and if you’re feeling brave: 230+.

WSOLA Technical Background:

WSOLA divides an audio signal into short overlapping windows and reassembles them with optimal cross-correlation alignment — producing a longer or shorter signal without pitch shift. The algorithm belongs to the family of Time-Scale Modification methods and is standard in professional playback systems.
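The core idea can be sketched in a few lines of Python. This is a simplified WSOLA with assumed window and seek sizes, not the SoundTouch implementation the app actually uses:

```python
import numpy as np

def wsola_stretch(x, rate, sr, win_ms=40, seek_ms=10):
    """Time-stretch x by `rate` (>1 = faster) without changing its pitch."""
    win  = int(sr * win_ms / 1000)           # analysis window
    hop  = win // 2                          # output hop: 50% overlap (Hann sums to 1)
    seek = int(sr * seek_ms / 1000)          # search radius around the nominal position
    w    = np.hanning(win)
    n_frames = int((len(x) - win - seek - hop) / (hop * rate))
    out = np.zeros(n_frames * hop + win)
    expected = x[:win]                       # the "natural continuation" to match
    for k in range(n_frames):
        p = int(k * hop * rate)              # nominal read position for this frame
        lo = max(p - seek, 0)
        # waveform similarity: pick the candidate best aligned with the expected signal
        scores = [np.dot(x[d:d + win], expected) for d in range(lo, p + seek + 1)]
        best = lo + int(np.argmax(scores))
        out[k * hop : k * hop + win] += x[best:best + win] * w   # overlap-add
        expected = x[best + hop : best + hop + win]              # what should follow
    return out
```

Calling `wsola_stretch(x, 0.5, sr)` roughly doubles the duration while the dominant frequency stays put; production implementations add transient handling and a smarter, faster similarity search.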

Chapter 3

The Tech Stack — FFT Cross-Correlation

This video consists of TWO separate camera recordings. I recorded drums and bass separately, each with its own camera, and the recordings were not synchronized. How do you bring two videos that started at different times into sample-precise alignment?

The Recording Setup

GarageBand recorded two audio tracks simultaneously via an interface box: e-drums and bass. But two separate cameras ran independently — one aimed at the drum kit, one at the bass. No clapperboard, no shared trigger. The videos have an unknown time offset somewhere between zero and several seconds.

Manual sync via waveform in a video editor? Possible — but error-prone, non-reproducible and tedious. The alternative: Python + scipy + FFmpeg.

Step 1: Extract Audio Tracks

Each video file contains an audio track holding the recorded instrument (plus ambient sound). FFmpeg extracts them as WAV files. The challenge: bass and drums have very different frequency spectra. A Butterworth bandpass filter separates the relevant signals:

  • Drums: 150–6000 Hz (snare transient, hi-hat, kick harmonics)
  • Bass: 40–300 Hz (fundamental and first overtones)

The Butterworth filter is a signal-processing standard: maximally flat in the passband, with smooth monotonic attenuation outside the band. scipy.signal.butter + sosfilt handles this in two lines.
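A quick sanity check of those two lines, using synthetic tones and the drum band from above:

```python
import numpy as np
from scipy.signal import butter, sosfilt

sr = 44100
t = np.arange(sr) / sr
hum   = np.sin(2 * np.pi * 60 * t)     # bass-range tone, below the drum band
snare = np.sin(2 * np.pi * 1000 * t)   # tone inside the drum band

sos = butter(4, [150, 6000], btype='band', fs=sr, output='sos')
filtered = sosfilt(sos, hum + snare)

# bins are 1 Hz apart for a 1 s signal, so spec[60] is the 60 Hz component
spec = np.abs(np.fft.rfft(filtered))
print(spec[60] / spec[1000])           # far below 1: the 60 Hz tone is gone
```

The 1000 Hz component passes almost unchanged while the 60 Hz component is attenuated by roughly 30 dB, which is exactly the separation the sync step needs.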

Step 2: Cross-Correlation in the Frequency Domain

The cross-correlation of two signals measures how well they match at a given time offset. The peak of the cross-correlation sits exactly at the offset where the signals align best. Instead of computing it in the time domain (slow, O(n²)), we use the convolution theorem:

corr(f, g) = IFFT( FFT(f) · conj(FFT(g)) )

The exact same Fourier transform that generates ocean waves in our Bird Simulator post synchronizes bass and drums here. In the frequency domain, an expensive convolution becomes a simple elementwise multiplication — from O(n²) to O(n log n).
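The equivalence is easy to verify with numpy, using random test signals and np.correlate as the O(n²) reference:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(256)
g = rng.standard_normal(256)

# Reference: direct time-domain cross-correlation, O(n^2)
direct = np.correlate(f, g, mode='full')

# Convolution theorem: multiply spectra, transform back, O(n log n)
n = len(f) + len(g) - 1
fast = np.fft.irfft(np.fft.rfft(f, n) * np.conj(np.fft.rfft(g, n)), n)

# irfft returns circular lags 0..n-1; rotate so negative lags come first
fast = np.concatenate([fast[-(len(g) - 1):], fast[:len(f)]])

print(np.allclose(direct, fast))   # True
```

The rotation at the end is the same trick the sync code uses: indices above n/2 in the circular result are negative lags.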

The Python core snippet (the actual sync code):

import numpy as np
from scipy.signal import butter, sosfilt

def bandpass(signal, lo, hi, sr):
    sos = butter(4, [lo, hi], btype='band', fs=sr, output='sos')
    return sosfilt(sos, signal)

def find_offset(ref, query, sr):
    # FFT-based cross-correlation
    n = len(ref) + len(query) - 1
    R = np.fft.rfft(ref,   n=n)
    Q = np.fft.rfft(query, n=n)
    corr = np.fft.irfft(R * np.conj(Q), n=n)
    peak = np.argmax(corr)
    if peak > n // 2:
        peak -= n   # negative lag
    # Sub-sample precision via parabolic interpolation
    if 1 <= abs(peak) < len(corr) - 1:
        y0, y1, y2 = corr[peak-1], corr[peak], corr[peak+1]
        peak += (y2 - y0) / (2 * (2*y1 - y0 - y2))
    return peak / sr   # offset in seconds
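A self-contained sanity check of the approach, with the correlation re-implemented inline so the block runs standalone (the helper name `offset_seconds` and the synthetic signals are illustrative):

```python
import numpy as np

def offset_seconds(ref, query, sr):
    # FFT cross-correlation, peak index -> lag in seconds (as in the snippet above,
    # without the sub-sample interpolation)
    n = len(ref) + len(query) - 1
    corr = np.fft.irfft(np.fft.rfft(ref, n) * np.conj(np.fft.rfft(query, n)), n)
    peak = int(np.argmax(corr))
    return (peak - n if peak > n // 2 else peak) / sr

sr = 44100
rng = np.random.default_rng(1)
room = rng.standard_normal(sr)                     # 1 s of shared "room" sound
shift = int(0.25 * sr)                             # ground truth: 250 ms
ref   = np.concatenate([room, np.zeros(shift)])    # track 1 hears it first
query = np.concatenate([np.zeros(shift), room])    # track 2 hears it 250 ms later

print(offset_seconds(ref, query, sr))              # ~ -0.25 with this sign convention
```

With this convention a negative result means the shared content appears later in `query`; only the magnitude and a consistent sign convention matter for the render step.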

Step 3: Parabolic Sub-Sample Interpolation

The naive approach simply takes the index of the correlation peak and converts it to seconds. At a 44,100 Hz sample rate that yields a resolution of about 23 microseconds, while one frame at 25 fps spans 40,000 microseconds. For 25 fps that is more than sufficient.

But for 60-fps video, parabolic interpolation pays off: we fit a parabola through the peak and its two neighbours and compute the exact sub-sample maximum. The formula is in the snippet above. The result: sub-sample-precise offset determination, well below one frame even at 60 fps.
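The same three-point formula, checked on synthetic data with a known maximum at a hypothetical lag of 10.3 samples:

```python
import numpy as np

true_peak = 10.3                       # ground truth between samples 10 and 11
xs = np.array([9.0, 10.0, 11.0])       # integer lags around the discrete maximum
y0, y1, y2 = -(xs - true_peak) ** 2    # parabola values at those lags

# Parabolic interpolation, same formula as in the sync snippet
peak = 10 + (y2 - y0) / (2 * (2 * y1 - y0 - y2))
print(peak)   # ~10.3: the sub-sample maximum is recovered
```

On real correlation data the peak is only approximately parabolic, but near a sharp maximum the fit is excellent, which is why this one-liner buys sub-sample precision almost for free.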

Step 4: FFmpeg Render

With the computed offset value (in seconds), FFmpeg renders the final video. An input-level seek (-ss before the first -i) trims the earlier-starting recording so both streams line up:

# offset_s = computed offset in seconds
# positive: bass.mp4 started later, so seek that far into drums.mp4;
# for a negative offset, apply -ss (with the absolute value) to bass.mp4 instead
ffmpeg -ss {offset_s} -i drums.mp4 \
       -i bass.mp4   \
       -filter_complex "[0:v][1:v]hstack=inputs=2[v]" \
       -map "[v]" -map 0:a \
       output.mp4

No Premiere Pro needed. No manual waveform alignment. Just Python + FFmpeg + an AI agent that wrote the script. The entire render takes — depending on video quality — under two minutes.

Tech Stack Summary:

GarageBand (recording) → FFmpeg (extract audio tracks) → scipy Butterworth filter (frequency separation) → numpy FFT cross-correlation (compute offset) → parabolic interpolation (sub-sample precision) → FFmpeg (final render). Fully automated, reproducible, frame-precise.

Chapter 4

For Content Creators — How to Do This Yourself

The method isn't limited to bass covers. Anyone filming with multiple cameras — interviews, concerts, tutorials — can use the same pipeline as long as the audio tracks share a common reference signal (which they almost always do: room reverb, voice, music).

The Recipe in Five Steps

  1. Record audio tracks separately — one track per instrument or source, via an audio interface or directly into your DAW. No click track required, but it doesn't hurt.
  2. Record videos separately — each camera runs independently. No sync signal, no shared trigger. The Python script finds the offset automatically.
  3. Bandpass filter per instrument — Drums: 150–6000 Hz, Bass: 40–300 Hz, Guitar: 80–2000 Hz, Voice: 100–8000 Hz. Adjust per source.
  4. Cross-correlation in the frequency domain — numpy + scipy, the snippet from above. Returns the time offset in seconds. Typical runtime under one second for 10 minutes of audio.
  5. FFmpeg render — with the computed offset. Side by side (hstack), stacked (vstack), picture-in-picture (overlay) — all possible.

The Advantage Over Manual Sync

Manual waveform alignment in a video editor takes — depending on the material's length — between one minute and one hour. It's eye-guided estimation. Anyone who has uploaded a poorly synced recording knows: you see it immediately. A guitar attack that's visible 80 milliseconds before you hear it disturbs viewers more than you might expect.

The FFT method is frame-precise, automatic and reproducible. If the material changes — new takes, different cuts — simply re-run the script. No manual readjustment. No Premiere Pro. No subscription.

The complete Python script (including FFmpeg call, progress display and error handling) can be generated with an AI agent in under ten minutes — input: the description above. Output: working code. That's exactly what happened here.

Dependencies:

pip install numpy scipy soundfile — plus an FFmpeg installation on your PATH. No further dependencies. The script runs on macOS, Linux and Windows.

Frequently Asked Questions

Is this Les Claypool's original version?

No. Claypool's original Tommy the Cat is one of the most technically demanding bass parts in rock history — slap, pop, ghost notes and a percussive style that reflects decades of development. This version is a simplified interpretation: no Claypool-style slapping, instead flamenco strokes and hammer-ons. The tempo is close to the original (210 BPM here; the original is likely a bit faster), the technique is more accessible.

What is WSOLA?

WSOLA stands for Waveform Similarity Overlap-Add. It is an algorithm for time-scaling audio signals without changing their pitch (pitch preservation). The signal is divided into short, overlapping windows; the windows are optimally aligned using cross-correlation and reassembled — longer or shorter, depending on the desired tempo. The result: a drum loop at 60 BPM has exactly the same pitch as at 210 BPM — just slower.

How does FFT-based video synchronization work?

Cross-correlation of two audio signals measures their agreement at different time offsets. The peak of the cross-correlation lies at the optimal offset. Instead of computing the correlation in the time domain (O(n²)), the convolution theorem uses the Fourier transform: in the frequency domain, convolution becomes elementwise multiplication (O(n log n)). With parabolic interpolation around the peak, sub-sample precision is achieved — better than a single video frame at 60 fps.