Thoth

How we benchmark transcription models in Thoth

2026-05-10

Why benchmark?

Every transcription model ships with a WER number on some clean studio dataset. Real meetings are messier: accented speakers, background noise, domain vocabulary. We wanted numbers we could actually trust.

The setup

Three test scripts, each designed to stress a different scenario:

  • Script A — English with a French accent, casual meeting cadence
  • Script B — Native French speaker, code-switching between languages
  • Script C — Clean audio (Simon Sinek TED talk), re-transcription only

We recorded Scripts A and B ourselves. For Script C we used the official TED human-reviewed subtitles as the reference transcript.

WER is computed as (substitutions + deletions + insertions) / reference word count, with the word-level alignment found via dynamic-programming (Levenshtein) edit distance.
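The computation above can be sketched in a few lines. This is a minimal illustration of the standard word-level edit-distance WER, not Thoth's actual scoring code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein edit distance (dynamic programming)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = min edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat down", "the cat sat")` is 0.25: one deletion against a four-word reference. Real evaluation pipelines also normalize text (casing, punctuation, number formats) before scoring; that step is omitted here.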

Results

Re-transcription

Model                   EN accented   FR native   EN clean
Whisper Base            51.1%         54.3%       9.2%
Whisper Small           45.9%         47.7%       9.2%
Whisper Medium          39.7%         45.3%       36.8%
Whisper Large V3 Turbo  32.3%         38.3%       7.8%
Parakeet TDT v3         38.9%         48.7%       8.7%

Live transcription (Script A)

Engine                   WER     Latency
WhisperKit Base+Small    65.9%   ~12 s
Parakeet Sliding Window  56.8%   ~11 s
Parakeet EOU 120M        38.4%   ~160 ms

Key findings

Whisper Large V3 Turbo was the most consistent performer across all conditions: best on French-accented English (32.3%), best on native French (38.3%), best on clean audio (7.8%). If you care about accuracy, this is the one to use.

Parakeet TDT v3 is competitive on clean professional audio (8.7%, close to Large's 7.8%) but degrades significantly under accent. It also code-switched to English mid-recording on the French script. Not its fault: it covers 25 languages but was not trained on heavy accent scenarios.

Whisper Medium was a surprise. Despite sitting between Small and Large in the lineup, it posted 36.8% WER on clean audio vs. Large's 7.8%. We traced it to a silent-skip bug specific to the Medium CoreML conversion: 245 deletions, near-zero insertions. The model is dropping entire sections silently. We flag this in Settings.
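The error breakdown is what makes this kind of bug diagnosable: a model that mishears produces substitutions, while a model that drops audio produces deletions. A rough way to get that breakdown is to classify the alignment opcodes, sketched here with Python's stdlib `difflib` (which matches longest common blocks, so the counts are approximate rather than a strict minimum edit distance):

```python
import difflib

def error_profile(reference: str, hypothesis: str) -> dict:
    """Approximate substitution/deletion/insertion counts
    from a word-level alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    subs = dels = ins = 0
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op == "replace":
            # a replace of unequal spans is part substitution,
            # part deletion or insertion
            n = min(i2 - i1, j2 - j1)
            subs += n
            dels += (i2 - i1) - n
            ins += (j2 - j1) - n
        elif op == "delete":
            dels += i2 - i1
        elif op == "insert":
            ins += j2 - j1
    return {"sub": subs, "del": dels, "ins": ins}
```

A profile dominated by deletions with near-zero insertions, like Medium's 245 deletions here, points at skipped audio rather than misrecognition.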

Parakeet EOU 120M (the live streaming engine) showed 38.4% WER on Script A. That sounds high until you remember it produces word-by-word output in real time at ~160ms latency. A completely different trade-off.

Takeaway

The LibriSpeech numbers in Settings are directionally correct, but all models perform 10-30x worse on accented or foreign-language speech. Accent is the dominant factor: clean native audio brings most models close to their published benchmarks. If accuracy matters, use Whisper Large V3 Turbo.

