# How we benchmark transcription models in Thoth
2026-05-10
## Why benchmark?
Every transcription model ships with a word error rate (WER) measured on some clean studio dataset. Real meetings are messier: accented speakers, background noise, domain-specific vocabulary. We wanted numbers we could actually trust.
## The setup
Three test scripts, each designed to stress a different scenario:
- Script A — English with a French accent, casual meeting cadence
- Script B — Native French speaker, code-switching between languages
- Script C — Clean audio (Simon Sinek TED talk), re-transcription only
We recorded Scripts A and B ourselves. For Script C we used the official TED human-reviewed subtitles as the reference transcript.
WER is computed as (substitutions + deletions + insertions) / reference word count, with the error counts obtained from a dynamic-programming (Levenshtein) word alignment between hypothesis and reference.
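The formula above can be sketched as a standard Levenshtein alignment over word tokens. This is an illustrative implementation, not Thoth's actual code; the function name and structure are our own.

```python
# Word error rate via dynamic-programming edit distance.
# dp[i][j] = minimum edits (sub/del/ins) to turn the first i
# reference words into the first j hypothesis words.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]     # exact match, no edit
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j - 1],  # substitution
                    dp[i - 1][j],      # deletion
                    dp[i][j - 1],      # insertion
                )
    # Normalize by reference length, as in the formula above.
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the cat sat down")` is one insertion over a three-word reference, i.e. about 0.33. Note that WER can exceed 100% when the hypothesis has many insertions, which is why some live-transcription numbers look so extreme.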
## Results
### Re-transcription
All figures are WER (lower is better).

| Model | Script A (accented EN) | Script B (native FR) | Script C (clean EN) |
|---|---|---|---|
| Whisper Base | 51.1% | 54.3% | 9.2% |
| Whisper Small | 45.9% | 47.7% | 9.2% |
| Whisper Medium | 39.7% | 45.3% | 36.8% |
| Whisper Large V3 Turbo | 32.3% | 38.3% | 7.8% |
| Parakeet TDT v3 | 38.9% | 48.7% | 8.7% |
### Live transcription (Script A)
| Engine | WER | Latency |
|---|---|---|
| WhisperKit Base+Small | 65.9% | ~12 s |
| Parakeet Sliding Window | 56.8% | ~11 s |
| Parakeet EOU 120M | 38.4% | ~160 ms |
## Key findings
Whisper Large V3 Turbo was the most consistent performer across all conditions: best on French-accented English (32.3%), best on native French (38.3%), best on clean audio (7.8%). If you care about accuracy, this is the one to use.
Parakeet TDT v3 is competitive on clean professional audio (8.7%, close to Large's 7.8%) but degrades significantly under accent. It also code-switched to English mid-recording on the French script. Not its fault: it covers 25 languages but was not trained on heavy accent scenarios.
Whisper Medium was a surprise. Despite sitting between Small and Large in the lineup, it posted 36.8% WER on clean audio vs. Large's 7.8%. We traced it to a silent-skip bug specific to the Medium CoreML conversion: the error profile was 245 deletions with near-zero insertions, meaning the model drops entire sections silently rather than mis-hearing words. We flag this in Settings.
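The way we caught this is worth spelling out: a raw WER number hides *what kind* of errors dominate, but backtracking the same alignment splits the total into substitutions, deletions, and insertions. A deletion-heavy profile points at dropped audio; a substitution-heavy one at mis-heard words. A minimal sketch of that breakdown (our own illustrative code, not Thoth's tooling):

```python
# Break edit distance down into substitution / deletion / insertion
# counts by backtracking the Levenshtein DP table.

def error_counts(reference: str, hypothesis: str) -> dict:
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1])
    # Walk back from the corner, attributing each step to S, D, or I.
    subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # match, no edit
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1           # substitution
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1; i -= 1                        # deletion
        else:
            ins += 1; j -= 1                         # insertion
    return {"sub": subs, "del": dels, "ins": ins}
```

Running this on a reference with a missing stretch (e.g. `error_counts("a b c d", "a c")`) reports two deletions and nothing else, which is exactly the signature that flagged the Medium conversion.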
Parakeet EOU 120M (the live streaming engine) showed 38.4% WER on Script A. That sounds high until you remember it produces word-by-word output in real time at ~160 ms latency. A completely different trade-off.
## Takeaway
The LibriSpeech numbers in Settings are directionally correct, but all models perform 10-30x worse on accented or foreign-language speech. Accent is the dominant factor: clean native audio brings most models close to their published benchmarks. If accuracy matters, use Whisper Large V3 Turbo.