How We Added Automatic Transcription With whisper.cpp
Async video has a discoverability problem. You record a 3-minute walkthrough explaining how the new API works, share the link, and a week later nobody can find it. There’s no text to search. The knowledge is locked inside the video, invisible to Ctrl+F, invisible to your team’s search tool.
Transcription fixes this. Every video gets a text representation — searchable, skimmable, linkable. But most transcription services are cloud APIs that send your audio to someone else’s servers. For an EU-hosted, privacy-first tool, that’s not an option.
Here’s how we built automatic transcription that runs entirely on our own server.
Why whisper.cpp
OpenAI’s Whisper is the obvious starting point for speech-to-text. The original Python model works well but requires a full Python environment, PyTorch, and a GPU for reasonable performance. That’s a lot of infrastructure for a feature that runs a few times a day.
whisper.cpp is a C/C++ port of Whisper that runs on CPU. No Python, no PyTorch, no GPU required. A single static binary plus a model file. It’s fast enough on a 4-vCPU server and produces the same quality output as the original.
For our Docker image, we build whisper.cpp from source in a multi-stage build:
FROM alpine:3.21 AS whisper
RUN apk add --no-cache build-base cmake git
WORKDIR /build
RUN git clone --depth 1 https://github.com/ggerganov/whisper.cpp.git . && \
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF && \
    cmake --build build --target whisper-cli -j$(nproc)
The key flag is -DBUILD_SHARED_LIBS=OFF. Without it, whisper-cli dynamically links against libwhisper.so and libggml.so, which don’t exist in the final Alpine image. Static linking produces a single self-contained binary that we copy into the runtime stage.
The model file (~466 MB for ggml-small) is mounted as a Docker volume, not baked into the image. This keeps the image small and lets operators swap models without rebuilding.
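In the runtime stage, only the static binary needs to come across. A sketch of that stage, assuming the build layout above (stage name `whisper`, source checked out in `/build`, cmake output in `build/bin`):

```dockerfile
# Runtime stage: the statically linked whisper-cli is all we need from the
# build stage. /models stays empty in the image; the model is a volume mount.
FROM alpine:3.21
COPY --from=whisper /build/build/bin/whisper-cli /usr/local/bin/whisper-cli
```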
The transcription pipeline
Transcription runs automatically after every video upload, chained after thumbnail generation in the same async goroutine:
go func() {
	GenerateThumbnail(ctx, db, storage, videoID, fileKey, thumbnailKey)
	TranscribeVideo(ctx, db, storage, videoID, fileKey, userID, shareToken)
}()
The pipeline has five steps:
1. Extract audio. whisper.cpp expects 16kHz mono WAV input. We extract it with ffmpeg:
ffmpeg -i video.webm -ar 16000 -ac 1 -c:a pcm_s16le audio.wav
2. Run whisper. The CLI handles language detection automatically with -l auto:
whisper-cli -m /models/ggml-small.bin -f audio.wav \
    --output-vtt --output-json -of output -t 2 -l auto
This produces two files: a VTT subtitle file and a JSON file with detailed timestamps per segment. We use -t 2 to limit whisper to 2 threads — more on why later.
3. Parse the output. The whisper JSON contains timestamp pairs and text for each segment:
type TranscriptSegment struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
}
We parse the JSON, trim whitespace, and skip empty segments.
4. Upload VTT to S3. The VTT file is stored alongside the video in object storage. The watch page loads it as a <track> element for native browser subtitles.
5. Store segments in the database. The parsed segments go into a JSONB column. This avoids an extra S3 fetch when rendering the transcript panel and makes future full-text search straightforward.
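The first three steps can be sketched in Go roughly as follows. This is illustrative rather than the actual SendRec implementation: the function names are invented, and the `whisperOutput` struct reflects whisper.cpp's JSON layout as I understand it (per-segment offsets in milliseconds), which you should verify against your whisper.cpp version.

```go
package main

import (
	"context"
	"encoding/json"
	"os"
	"os/exec"
	"strings"
)

type TranscriptSegment struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
}

// whisperOutput mirrors the parts of whisper.cpp's --output-json file we
// need; offsets are reported in milliseconds. (Layout assumed from the CLI's
// JSON output, not taken from the SendRec source.)
type whisperOutput struct {
	Transcription []struct {
		Offsets struct {
			From int64 `json:"from"`
			To   int64 `json:"to"`
		} `json:"offsets"`
		Text string `json:"text"`
	} `json:"transcription"`
}

// parseWhisperJSON converts millisecond offsets to seconds, trims
// whitespace, and skips empty segments (step 3).
func parseWhisperJSON(raw []byte) ([]TranscriptSegment, error) {
	var out whisperOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	var segs []TranscriptSegment
	for _, s := range out.Transcription {
		text := strings.TrimSpace(s.Text)
		if text == "" {
			continue
		}
		segs = append(segs, TranscriptSegment{
			Start: float64(s.Offsets.From) / 1000,
			End:   float64(s.Offsets.To) / 1000,
			Text:  text,
		})
	}
	return segs, nil
}

// transcribe runs steps 1-3; steps 4 and 5 (S3 upload, JSONB insert) would
// follow with the returned segments.
func transcribe(ctx context.Context, videoPath string) ([]TranscriptSegment, error) {
	// Step 1: extract 16 kHz mono WAV audio.
	if err := exec.CommandContext(ctx, "ffmpeg", "-i", videoPath,
		"-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", "audio.wav").Run(); err != nil {
		return nil, err
	}
	// Step 2: run whisper with 2 threads and automatic language detection.
	if err := exec.CommandContext(ctx, "whisper-cli",
		"-m", "/models/ggml-small.bin", "-f", "audio.wav",
		"--output-vtt", "--output-json", "-of", "output", "-t", "2", "-l", "auto",
	).Run(); err != nil {
		return nil, err
	}
	raw, err := os.ReadFile("output.json")
	if err != nil {
		return nil, err
	}
	return parseWhisperJSON(raw)
}
```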
Concurrency control
whisper.cpp is CPU-intensive. On our 4-vCPU Hetzner server, a single transcription pins 2 cores for 30-60 seconds. Running two concurrently would saturate the server and degrade everything else — video uploads, page loads, compositing.
A simple semaphore limits concurrent transcriptions to one:
var transcriptionSemaphore = make(chan struct{}, 1)

func TranscribeVideo(ctx context.Context, ...) {
	select {
	case transcriptionSemaphore <- struct{}{}:
		defer func() { <-transcriptionSemaphore }()
	case <-ctx.Done():
		return
	}
	// ... transcription pipeline
}
If a second video finishes uploading while transcription is running, it waits in the queue. The context cancellation check prevents goroutines from piling up if the server is shutting down.
This is also why we pass -t 2 to whisper — 2 threads leaves 2 cores free for the rest of the application.
The watch page experience
Transcription produces two things the viewer can interact with: subtitles and a transcript panel.
Subtitles are the simplest part. When a VTT file exists, we add a <track> to the video element:
<track kind="subtitles" src="{{.TranscriptURL}}" srclang="en" label="Subtitles" default>
The browser handles rendering. No JavaScript needed.
The transcript panel sits below the video and shows every segment as a clickable row with a timestamp. Click a timestamp, and the video seeks to that moment:
panel.addEventListener('click', function (e) {
  var seg = e.target.closest('.transcript-segment');
  if (seg) {
    player.currentTime = parseFloat(seg.dataset.start);
    player.play();
  }
});
As the video plays, the active segment highlights automatically. A timeupdate listener compares the current playback time against each segment’s start and end times and toggles an .active class. This makes the transcript feel like a live document — you can follow along as the video plays, or jump ahead by clicking.
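The highlight logic boils down to "which segment contains the current time". A sketch, with the same assumed `player`, `panel`, and `segments` variables as the click handler above, factored so the lookup is a pure function:

```javascript
// Pure helper: index of the segment containing time t, or -1 if none does.
function activeSegmentIndex(t, segments) {
  for (var i = 0; i < segments.length; i++) {
    if (t >= segments[i].start && t < segments[i].end) return i;
  }
  return -1;
}

// Wire the highlight to the <video> element and the transcript panel.
function wireHighlight(player, panel, segments) {
  player.addEventListener('timeupdate', function () {
    var active = activeSegmentIndex(player.currentTime, segments);
    panel.querySelectorAll('.transcript-segment').forEach(function (row, i) {
      row.classList.toggle('active', i === active);
    });
  });
}
```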
Automatic language detection
whisper.cpp’s -l auto flag detects the spoken language from the first 30 seconds of audio. This works well for monolingual recordings — English, German, Romanian, French — without requiring the user to select a language.
The ggml-small model (466 MB) handles most European languages well. It’s a good balance between accuracy and speed. The tiny model (75 MB) is faster but struggles with accented English and smaller languages. The medium model (1.5 GB) improves accuracy for difficult audio but doubles processing time.
Graceful degradation
Transcription is optional at every level:
No whisper binary? isTranscriptionAvailable() checks for the binary, the model file, and the TRANSCRIPTION_ENABLED environment variable. If any are missing, transcription is silently skipped. The video works normally — you just don’t get subtitles.
Transcription fails? The status is set to “failed” and the video remains fully functional. The library shows a “Retry transcript” button so the user can try again.
Self-hosted without the model? The Docker image includes whisper-cli but not the model (it would add 466 MB to every image pull). Self-hosters who don’t need transcription don’t pay the download cost. Those who do can mount the model as a volume and set TRANSCRIPTION_ENABLED=true.
services:
  sendrec:
    volumes:
      - ./models:/models:ro
    environment:
      - TRANSCRIPTION_ENABLED=true
This approach — build the capability into the binary, but make activation explicit — means self-hosters get a working product out of the box with no extra dependencies, and can opt into transcription when they’re ready.
The processing UX
Transcription takes 30-60 seconds. The user shouldn’t have to wait or wonder what’s happening.
When transcription starts, the video’s transcript_status goes to “processing.” The library UI shows “Transcribing…” with a polling loop that refetches every 5 seconds. When the status changes to “ready,” the transcript appears automatically.
On the watch page, if a viewer arrives while transcription is still running, they see “Transcription in progress…” where the transcript panel will be. A 10-second poll checks the API and reloads when ready. No page refresh needed.
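Both polling loops can hit the same small status endpoint. A sketch — `lookupStatus`, the route, and the response shape are all assumptions, not the actual SendRec API:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// lookupStatus stands in for the real database query (hypothetical helper).
var lookupStatus = func(videoID string) string { return "processing" }

// statusPayload builds the JSON body the polling loops consume.
func statusPayload(status string) ([]byte, error) {
	return json.Marshal(map[string]string{"transcript_status": status})
}

// transcriptStatus is the endpoint both polling loops would hit.
func transcriptStatus(w http.ResponseWriter, r *http.Request) {
	body, err := statusPayload(lookupStatus(r.URL.Query().Get("video")))
	if err != nil {
		http.Error(w, "encode failed", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(body)
}
```

When the payload reports "ready", the client swaps the placeholder for the transcript panel.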
What we’d do differently
VTT language tag. We hardcode srclang="en" on the track element even though whisper detects the language. The detected language is present in the whisper output; we just don’t propagate it to the VTT track yet. In practice this is mostly harmless — browsers render the cues the same way either way — but srclang feeds default-track selection and accessibility tooling, so it’s still wrong for non-English videos.
Segment granularity. whisper.cpp’s default segmentation sometimes produces long segments (30+ seconds) for continuous speech. Shorter segments would make the clickable transcript more useful. The --max-len flag can limit segment length, but it sometimes splits mid-sentence.
Try it
SendRec is open source (AGPL-3.0) and self-hostable. Automatic transcription is live at app.sendrec.eu. The transcription code is in transcribe.go if you want to see the full implementation.