# How we benchmark transcription models in Thoth
2026-05-10
## Why benchmark?
Every transcription model ships with a word error rate (WER) measured on some clean studio dataset. Real meetings are messier: accented speakers, background noise, domain-specific vocabulary. We wanted numbers we could actually trust.
## The setup
Three test scripts, each designed to stress a different scenario:
- Script A — English with a French accent, casual meeting cadence
- Script B — Native French speaker, code-switching between languages
- Script C — Clean audio (Simon Sinek TED talk), re-transcription only
We recorded Scripts A and B ourselves. For Script C we used the official TED human-reviewed subtitles as the reference transcript.
WER is computed as (substitutions + deletions + insertions) / reference word count, with the error counts obtained from a dynamic-programming (Levenshtein) word alignment between hypothesis and reference.
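The formula above can be sketched as a standard Levenshtein alignment over word tokens. This is an illustrative implementation, not Thoth's actual code; the function name and structure are our own.

```python
# Word error rate via dynamic-programming edit distance.
# dp[i][j] = minimum edits (sub/del/ins) to turn the first i
# reference words into the first j hypothesis words.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]     # exact match, no edit
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j - 1],  # substitution
                    dp[i - 1][j],      # deletion
                    dp[i][j - 1],      # insertion
                )
    # Normalize by reference length, as in the formula above.
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the cat sat down")` is one insertion over a three-word reference, i.e. about 0.33. Note that WER can exceed 100% when the hypothesis has many insertions, which is why some live-transcription numbers look so extreme.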
## Results
### Re-transcription
All figures are WER (lower is better).

| Model | Script A (accented EN) | Script B (native FR) | Script C (clean EN) |
|---|---|---|---|
| Whisper Base | 51.1% | 54.3% | 9.2% |
| Whisper Small | 45.9% | 47.7% | 9.2% |
| Whisper Medium | 39.7% | 45.3% | 36.8% |
| Whisper Large V3 Turbo | 32.3% | 38.3% | 7.8% |
| Parakeet TDT v3 | 38.9% | 48.7% | 8.7% |
### Live transcription (Script A)
| Engine | WER | Latency |
|---|---|---|
| WhisperKit Base+Small | 65.9% | ~12 s |
| Parakeet Sliding Window | 56.8% | ~11 s |
| Parakeet EOU 120M | 38.4% | ~160 ms |
## Key findings
Whisper Large V3 Turbo was the most consistent performer across all conditions: best on French-accented English (32.3%), best on native French (38.3%), best on clean audio (7.8%). If you care about accuracy, this is the one to use.
Parakeet TDT v3 is competitive on clean professional audio (8.7%, close to Large's 7.8%) but degrades significantly under accent. It also code-switched to English mid-recording on the French script. Not its fault: it covers 25 languages but was not trained on heavy accent scenarios.
Whisper Medium was a surprise. Despite sitting between Small and Large in the lineup, it posted 36.8% WER on clean audio vs. Large's 7.8%. We traced it to a silent-skip bug specific to the Medium CoreML conversion: the error profile was 245 deletions with near-zero insertions, meaning the model drops entire sections silently rather than mis-hearing words. We flag this in Settings.
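The way we caught this is worth spelling out: a raw WER number hides *what kind* of errors dominate, but backtracking the same alignment splits the total into substitutions, deletions, and insertions. A deletion-heavy profile points at dropped audio; a substitution-heavy one at mis-heard words. A minimal sketch of that breakdown (our own illustrative code, not Thoth's tooling):

```python
# Break edit distance down into substitution / deletion / insertion
# counts by backtracking the Levenshtein DP table.

def error_counts(reference: str, hypothesis: str) -> dict:
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1])
    # Walk back from the corner, attributing each step to S, D, or I.
    subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # match, no edit
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1           # substitution
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1; i -= 1                        # deletion
        else:
            ins += 1; j -= 1                         # insertion
    return {"sub": subs, "del": dels, "ins": ins}
```

Running this on a reference with a missing stretch (e.g. `error_counts("a b c d", "a c")`) reports two deletions and nothing else, which is exactly the signature that flagged the Medium conversion.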
Parakeet EOU 120M (the live streaming engine) showed 38.4% WER on Script A. That sounds high until you remember it produces word-by-word output in real time at ~160 ms latency. A completely different trade-off.
## Takeaway
The LibriSpeech numbers in Settings are directionally correct, but all models perform 10-30x worse on accented or foreign-language speech. Accent is the dominant factor: clean native audio brings most models close to their published benchmarks. If accuracy matters, use Whisper Large V3 Turbo.