Bring Your Own Transcription Provider

May 16, 2026

Through v1.88, SendRec transcribed every video with whisper.cpp running on the app container. That works great when you have CPU to spare. It does not work great on a 2-vCPU VPS, where a 10-minute video can take 5 minutes to transcribe and pin both cores at 100% the entire time.

The #1 complaint from self-hosters has been: my container is small, my hardware is cheap, can I just point this at a real transcription API?

As of v1.89, yes.

Two new providers

TRANSCRIPTION_PROVIDER now accepts three values:

local (default) — whisper-cli on the app container. Nothing changes. Full privacy, no third-party calls, zero per-minute cost, but CPU-bound.
openai — any OpenAI-compatible /v1/audio/transcriptions endpoint. This includes the real OpenAI Whisper API, but also Groq (much faster), Scaleway Speech-to-Text (EU-hosted), and any self-hosted Faster-Whisper server.
deepgram — Deepgram /v1/listen. Different API shape but excellent latency and a generous free credit on signup.
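For the openai provider, the request is a standard multipart upload. The sketch below is a minimal illustration, not SendRec’s actual code. The type openAITranscriber, its fields, and the TranscriptSegment field names (Start, End, Text) are assumptions. The form fields file, model, language, and response_format=verbose_json follow OpenAI’s documented transcription API, whose verbose_json response carries per-segment start and end timestamps. Other OpenAI-compatible servers may support only a subset of these options. The sketch also assumes TRANSCRIPTION_API_URL holds the base URL without the /v1 path, as in the configuration example.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
	"path/filepath"
	"strings"
)

// TranscriptSegment: field names are assumed for this sketch.
type TranscriptSegment struct {
	Start, End float64 // seconds from the start of the audio
	Text       string
}

// openAITranscriber is a hypothetical provider for any
// OpenAI-compatible /v1/audio/transcriptions endpoint.
type openAITranscriber struct {
	baseURL, apiKey, model string // e.g. "https://api.openai.com", key, "whisper-1"
}

func (o *openAITranscriber) Name() string    { return "openai" }
func (o *openAITranscriber) Available() bool { return o.apiKey != "" }

func (o *openAITranscriber) Transcribe(ctx context.Context, audioPath, language string) ([]TranscriptSegment, error) {
	audio, err := os.ReadFile(audioPath)
	if err != nil {
		return nil, err
	}
	// Build the multipart form; writes go to an in-memory buffer.
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("file", filepath.Base(audioPath))
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(audio); err != nil {
		return nil, err
	}
	w.WriteField("model", o.model)
	w.WriteField("response_format", "verbose_json") // ask for per-segment timestamps
	if language != "" {
		w.WriteField("language", language) // ISO-639-1 hint, e.g. "en"
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	endpoint := strings.TrimRight(o.baseURL, "/") + "/v1/audio/transcriptions"
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+o.apiKey)
	req.Header.Set("Content-Type", w.FormDataContentType())
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("%s returned %d: %s", endpoint, resp.StatusCode, raw)
	}
	return parseVerboseJSON(raw)
}

// parseVerboseJSON maps a verbose_json body onto TranscriptSegments.
func parseVerboseJSON(raw []byte) ([]TranscriptSegment, error) {
	var out struct {
		Segments []struct {
			Start float64 `json:"start"`
			End   float64 `json:"end"`
			Text  string  `json:"text"`
		} `json:"segments"`
	}
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	segs := make([]TranscriptSegment, 0, len(out.Segments))
	for _, s := range out.Segments {
		segs = append(segs, TranscriptSegment{Start: s.Start, End: s.End, Text: strings.TrimSpace(s.Text)})
	}
	return segs, nil
}
```

The request relies only on the multipart form and a bearer token. In principle, switching between OpenAI, Groq, Scaleway, or a self-hosted server should then mean changing only TRANSCRIPTION_API_URL, TRANSCRIPTION_API_KEY, and TRANSCRIPTION_MODEL.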

The provider is a single env var per deployment. No per-user setting, no UI picker — set it once on the container and every upload uses it.

Configuration

TRANSCRIPTION_ENABLED=true
TRANSCRIPTION_PROVIDER=deepgram         # or openai
TRANSCRIPTION_API_KEY=<your-key>
TRANSCRIPTION_MODEL=nova-3              # optional override
TRANSCRIPTION_API_URL=https://api.openai.com  # only for openai
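Below is a minimal sketch of how these variables might be read and validated at startup. It applies the defaults described later in this post (TRANSCRIPTION_ENABLED=false, TRANSCRIPTION_PROVIDER=local). The Config type, loadConfig, and the rule that cloud providers need an API key when transcription is enabled are illustrative assumptions, not SendRec’s code. Passing getenv in as a parameter lets tests inject values. In production you would pass os.Getenv.

```go
package main

import (
	"fmt"
	"strconv"
)

// Config mirrors the TRANSCRIPTION_* variables (type name is hypothetical).
type Config struct {
	Enabled  bool
	Provider string // "local", "openai", or "deepgram"
	APIKey   string
	Model    string // empty means "use the provider's default" (assumed)
	APIURL   string // only consulted by the openai provider
}

// loadConfig reads settings via getenv and applies the documented defaults.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		Provider: getenv("TRANSCRIPTION_PROVIDER"),
		APIKey:   getenv("TRANSCRIPTION_API_KEY"),
		Model:    getenv("TRANSCRIPTION_MODEL"),
		APIURL:   getenv("TRANSCRIPTION_API_URL"),
	}
	if v := getenv("TRANSCRIPTION_ENABLED"); v != "" {
		enabled, err := strconv.ParseBool(v)
		if err != nil {
			return Config{}, fmt.Errorf("TRANSCRIPTION_ENABLED: %w", err)
		}
		cfg.Enabled = enabled
	} // unset: Enabled stays false
	if cfg.Provider == "" {
		cfg.Provider = "local"
	}
	switch cfg.Provider {
	case "local":
		// whisper-cli on the container: no key or URL needed.
	case "openai", "deepgram":
		if cfg.Enabled && cfg.APIKey == "" {
			return Config{}, fmt.Errorf("TRANSCRIPTION_API_KEY is required for provider %q", cfg.Provider)
		}
	default:
		return Config{}, fmt.Errorf("unknown TRANSCRIPTION_PROVIDER %q", cfg.Provider)
	}
	return cfg, nil
}
```

Calling loadConfig(os.Getenv) once at startup fits the one-provider-per-deployment model: changing providers means changing the environment and redeploying.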

For Helm, the same variables live under sendrec.env.transcription* and sendrec.secrets.transcriptionApiKey. The chart’s whisper-model PVC and model-download init container only render when the provider is local, so cloud deployments don’t waste a 1 GB volume and don’t block startup on a Hugging Face download.

What it looked like in practice

We tested the rollout on a preview environment with Deepgram pointed at a 15-second clip. The local whisper path on the same VPS takes ~25 seconds for that clip. Deepgram’s nova-3 returned in 3 seconds. That’s roughly an 8x improvement, and the transcript has better punctuation and sentence boundaries thanks to Deepgram’s smart formatting.

If you care about privacy or air-gapped deployments, keep local. If you care about latency, throughput, or just not pegging your VPS every time someone uploads a video, point it at Groq or Deepgram.

Why an interface, not separate workers

The provider abstraction is a small Go interface:

type Transcriber interface {
    Name() string
    Available() bool
    Transcribe(ctx context.Context, audioPath, language string) ([]TranscriptSegment, error)
}

The shared code extracts audio from the uploaded video with ffmpeg, hands the resulting 16 kHz mono WAV file to the transcriber, and turns the returned segments into a VTT file. Each provider only handles the part it differs on — HTTP shape, response parsing, language hints.
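The shared half of that pipeline can be sketched as follows. The sketch assumes ffmpeg is on PATH and that TranscriptSegment carries Start and End in seconds plus Text (field names assumed). The helper names ffmpegArgs, extractAudio, vttTimestamp, and toVTT are hypothetical. The ffmpeg flags are standard: -ac 1 for mono, -ar 16000 for 16 kHz, and pcm_s16le for 16-bit PCM WAV.

```go
package main

import (
	"context"
	"fmt"
	"math"
	"os/exec"
	"strings"
)

type TranscriptSegment struct {
	Start, End float64 // seconds
	Text       string
}

// ffmpegArgs extracts the audio track as 16 kHz, mono, 16-bit PCM WAV.
func ffmpegArgs(videoPath, wavPath string) []string {
	return []string{
		"-y",            // overwrite wavPath if it exists
		"-i", videoPath, // input video
		"-vn",           // drop the video stream
		"-ac", "1",      // mono
		"-ar", "16000",  // 16 kHz sample rate
		"-c:a", "pcm_s16le",
		wavPath,
	}
}

// extractAudio runs ffmpeg; it assumes ffmpeg is on PATH.
func extractAudio(ctx context.Context, videoPath, wavPath string) error {
	out, err := exec.CommandContext(ctx, "ffmpeg", ffmpegArgs(videoPath, wavPath)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ffmpeg failed: %w\n%s", err, out)
	}
	return nil
}

// vttTimestamp formats seconds as HH:MM:SS.mmm.
func vttTimestamp(sec float64) string {
	ms := int64(math.Round(sec * 1000))
	return fmt.Sprintf("%02d:%02d:%02d.%03d", ms/3_600_000, ms/60_000%60, ms/1000%60, ms%1000)
}

// toVTT renders segments as a WebVTT document, one cue per segment.
func toVTT(segs []TranscriptSegment) string {
	var b strings.Builder
	b.WriteString("WEBVTT\n\n")
	for _, s := range segs {
		fmt.Fprintf(&b, "%s --> %s\n%s\n\n", vttTimestamp(s.Start), vttTimestamp(s.End), s.Text)
	}
	return b.String()
}
```

This sketch does not escape cue text. A fuller version would escape & and < and guard against the sequence -->, none of which WebVTT allows raw inside a cue payload.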

This means adding a new provider is one new file. We expect to add Gladia (EU-hosted, generous free tier) as the next built-in provider once we’ve tested it against real-world recordings.
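Here is one possible shape for such a file, using the Deepgram provider as the example. This is a hedged sketch, not SendRec’s implementation. The deepgramTranscriber type, its helpers, and the TranscriptSegment fields are assumptions. The Token authorization scheme and the model, smart_format, utterances, and language query parameters come from Deepgram’s /v1/listen API. Mapping results.utterances onto segments is one reasonable choice among several.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

type TranscriptSegment struct {
	Start, End float64 // seconds
	Text       string
}

// deepgramTranscriber is a hypothetical provider for Deepgram's /v1/listen.
type deepgramTranscriber struct {
	apiKey, model string // e.g. model "nova-3"
}

func (d *deepgramTranscriber) Name() string    { return "deepgram" }
func (d *deepgramTranscriber) Available() bool { return d.apiKey != "" }

// deepgramURL encodes the options as query parameters (Encode sorts keys).
func deepgramURL(model, language string) string {
	q := url.Values{}
	q.Set("model", model)
	q.Set("smart_format", "true") // punctuation and readable formatting
	q.Set("utterances", "true")   // timed utterances we can map to segments
	if language != "" {
		q.Set("language", language)
	}
	return "https://api.deepgram.com/v1/listen?" + q.Encode()
}

func (d *deepgramTranscriber) Transcribe(ctx context.Context, audioPath, language string) ([]TranscriptSegment, error) {
	audio, err := os.ReadFile(audioPath)
	if err != nil {
		return nil, err
	}
	// Unlike the OpenAI shape, Deepgram takes the raw audio as the request body.
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, deepgramURL(d.model, language), bytes.NewReader(audio))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Token "+d.apiKey)
	req.Header.Set("Content-Type", "audio/wav")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("deepgram returned %d: %s", resp.StatusCode, raw)
	}
	return parseDeepgram(raw)
}

// parseDeepgram reads results.utterances from a /v1/listen response.
func parseDeepgram(raw []byte) ([]TranscriptSegment, error) {
	var out struct {
		Results struct {
			Utterances []struct {
				Start      float64 `json:"start"`
				End        float64 `json:"end"`
				Transcript string  `json:"transcript"`
			} `json:"utterances"`
		} `json:"results"`
	}
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	segs := make([]TranscriptSegment, 0, len(out.Results.Utterances))
	for _, u := range out.Results.Utterances {
		segs = append(segs, TranscriptSegment{Start: u.Start, End: u.End, Text: u.Transcript})
	}
	return segs, nil
}
```

Everything provider-specific (endpoint, auth header, body format, response parsing) stays in this one file. Audio extraction and VTT generation remain in the shared code.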

Defaults

TRANSCRIPTION_PROVIDER defaults to local. TRANSCRIPTION_ENABLED defaults to false. Existing self-hosters who didn’t touch transcription will see exactly the behavior they had before v1.89. If you had TRANSCRIPTION_ENABLED=true set, your whisper-cli flow is unchanged.

If you want to switch, set the env var, redeploy, and the next upload will use the new provider. No migration, no rebuild.

Self-host: github.com/sendrec/sendrec. Helm chart: helm install sendrec sendrec/sendrec.