2025 Edge Speech-to-Text Model Benchmark
The Real Problem with Benchmarks
With so many speech-to-text (STT) models and APIs on the market today, you’d expect a clear, unbiased answer to the question: “Which one should I use?” But the truth is, it's far more complicated.
Academic research typically compares models, not production‑grade APIs. While these papers are useful for understanding model architectures, they rarely reflect real-world performance. On the flip side, many API benchmarks found online are created by the providers themselves. Take Deepgram’s benchmark, for example: it places Deepgram first across all metrics. While that makes sense from a marketing standpoint, it’s not very useful for developers who just want an objective comparison.
Why We Did This Benchmark
We needed something practical and neutral. We focused on real-world transcription challenges, especially in high-stakes environments like hospitals where a single error can have serious consequences. This benchmark is designed to help developers choose the right model for real-world speech, not just the one that looks best in a lab.
Methodology
Dataset & Relevance
- Total Clips: 58 (5 to 40 seconds per clip)
- Clean Speech: Quiet, echo-free environments
- Noisy Speech: Crowd chatter, reverb, static
- Format: WAV, 16 kHz, mono-channel
- Speakers: Diverse accents & styles
- Source: Human-recorded via Ionio workers
You can check out and use our dataset here: Speech to Text Benchmark.
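If you want to reproduce the setup with your own recordings, they only need to match the format above. Here is a minimal preprocessing sketch, assuming Python with the librosa and soundfile packages and a hypothetical clips/ input folder (not part of our benchmark tooling):

```python
# Minimal preprocessing sketch: convert arbitrary recordings to the
# benchmark format (WAV, 16 kHz, mono). Paths and package choices are
# assumptions, not the original benchmark pipeline.
from pathlib import Path

import librosa        # audio loading + resampling
import soundfile as sf

TARGET_SR = 16_000    # 16 kHz, as used in the dataset

def to_benchmark_format(src: Path, dst: Path) -> None:
    # librosa loads as float32, downmixes to mono, and resamples in one call
    audio, _ = librosa.load(src, sr=TARGET_SR, mono=True)
    sf.write(dst, audio, TARGET_SR, subtype="PCM_16")

if __name__ == "__main__":
    out_dir = Path("clips_16k")
    out_dir.mkdir(exist_ok=True)
    for src in Path("clips").glob("*.wav"):   # hypothetical input folder
        to_benchmark_format(src, out_dir / src.name)
```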
Evaluated Models
- Whisper (OpenAI): High accuracy, multilingual.
- Deepgram: Real-time, low latency.
- AssemblyAI: Strong formatting capabilities.
- Wav2Vec2: Versatile, Hugging Face integration.
- Distil-Whisper: 6x faster, 49% smaller.
- NVIDIA Parakeet: 0.6B parameter research model.
- IBM Granite-Speech: 8B-parameter open-weight model.
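For a sense of how the open-weight checkpoints above can be exercised on such clips, here is a minimal transcription sketch using the Hugging Face transformers ASR pipeline. The model ID, chunking option, and file path are illustrative assumptions, not our exact benchmark harness:

```python
# Illustrative sketch: transcribe one benchmark clip with an open-weight
# model via the Hugging Face transformers pipeline. The checkpoint shown
# ("distil-whisper/distil-large-v2") is one example; swap in any model
# you want to compare. This is not the exact benchmark harness.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    chunk_length_s=30,   # handle clips longer than the model's window
)

result = asr("clips_16k/sample_001.wav")   # hypothetical clip path
print(result["text"])
```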
Evaluation Metrics
Word Error Rate (WER) was our primary metric, computed as (S + D + I) / N, where N is the number of words in the reference transcript. We also broke errors down into substitutions, deletions, and insertions, since each failure mode carries a different real-world risk:
- Substitution (S): Replacing "dose" with "those" (dangerous in medicine).
- Deletion (D): Omitting "not" in "Do not resuscitate" (critical failure).
- Insertion (I): Adding "no" to "The test was positive" (ambiguity).
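To make the metric concrete, here is a minimal, self-contained sketch of how WER and its S/D/I breakdown can be computed for a single clip. The function and example strings are illustrative only; this is not our exact scoring script.

```python
# Sketch: compute WER = (S + D + I) / N by aligning reference and
# hypothesis words with Levenshtein dynamic programming, then
# backtracking to count substitutions, deletions, and insertions.

def wer_breakdown(reference: str, hypothesis: str) -> dict:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = i                       # delete all reference words
    for j in range(1, len(hyp) + 1):
        dp[0][j] = j                       # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost,   # match / substitution
                dp[i - 1][j] + 1,              # deletion
                dp[i][j - 1] + 1,              # insertion
            )

    # Backtrack along one optimal alignment to count each error type.
    s = d = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)):
            if ref[i - 1] != hyp[j - 1]:
                s += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    n = max(len(ref), 1)
    return {"wer": (s + d + ins) / n, "S": s, "D": d, "I": ins}


if __name__ == "__main__":
    ref = "do not resuscitate the patient"
    hyp = "do resuscitate the patient now"
    # One deletion ("not") plus one insertion ("now") -> WER = 2/5 = 0.4
    print(wer_breakdown(ref, hyp))
```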
Benchmark Analysis Results
Clean Speech Evaluation
In clean conditions, Granite-Speech-3.3 emerged as the standout performer with an exceptional 8.18% WER, significantly outpacing competitors. Distil-Whisper secured second place (14.93% WER). Wav2Vec2 struggled considerably, posting a disappointing 37.04% WER.
Noisy Speech Evaluation
Granite-Speech-3.3 maintained its lead in noise with 15.72% WER, only a 7.54-point degradation from clean conditions. Distil-Whisper showed strong resilience (21.26%), standard Whisper degraded sharply to 29.80%, and Wav2Vec2 remained the weakest at 54.69%.
Error Pattern Analysis
Insertion Errors
Insertion errors occur when a model adds phantom words. Granite-Speech-3.3 produced the fewest insertions (0.828 in clean conditions), while Whisper's insertions rose sharply under noise (2.362), which could lead to disruptive false positives in voice assistants (e.g., "Call John" becoming "Call John now").
Deletion Errors
Parakeet led the field here with a deletion rate of just 0.414 in clean conditions, making it ideal for legal transcription where completeness is non-negotiable. Wav2Vec2 had a deletion rate of 8.897 in noise, suggesting a tendency to drop entire phrases.
Substitution Errors
Granite-Speech-3.3 led with the lowest substitution rates (2.276 clean). Wav2Vec2 ballooned to 13.879 in noise, creating a high risk of misreporting key terms like "profit" vs "loss."
Performance in Noisy Conditions
Granite-Speech-3.3's modest WER increase under noise (from 8.18% to 15.72%) points to a noise-robust architecture, while Wav2Vec2's surge (from 37.04% to 54.69%) suggests a training bias toward clean audio. Parakeet's unusual decrease in insertion errors under noise is notable, whereas Whisper's insertion spike highlights its weakness in noisy settings.
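The degradation figures above are simply the gap between a model's average WER on the noisy clips and on the clean clips. A minimal sketch of that aggregation, assuming per-clip scores are already collected in a results list (the numbers below are placeholders, not our actual data), could look like this:

```python
# Sketch: measure noise robustness as the gap between average WER on
# noisy clips and average WER on clean clips. The `results` entries
# are hypothetical per-clip scores for illustration only.
from collections import defaultdict
from statistics import mean

results = [
    # (model, condition, wer) for each clip
    ("granite-speech-3.3", "clean", 0.08),
    ("granite-speech-3.3", "noisy", 0.16),
    ("wav2vec2", "clean", 0.37),
    ("wav2vec2", "noisy", 0.55),
]

by_model = defaultdict(lambda: defaultdict(list))
for model, condition, wer in results:
    by_model[model][condition].append(wer)

for model, conds in by_model.items():
    clean = mean(conds["clean"])
    noisy = mean(conds["noisy"])
    # degradation reported in percentage points
    print(f"{model}: clean={clean:.2%} noisy={noisy:.2%} "
          f"degradation={100 * (noisy - clean):.1f} pts")
```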
Key Insights & Recommendations
- Lowest WER (0.1195, Granite-Speech-3.3): best for mission-critical tasks in medical, legal, and financial services.
- Lowest CER (0.1014): best for educational content, podcasts, and media subtitling.
- Lowest deletion errors (Parakeet): best for accessibility and medical transcription, where missing words are fatal.
- Moderate performance: best for general interviews and meeting notes, where context tolerates some error.
- Low insertions in noise: excellent for live captioning and voice commands.
- Highest WER and error rates (Wav2Vec2): not recommended for production use.
What's Next?
Still unsure which speech-to-text model best fits your needs? Book a call with us to dive deeper into our benchmarking insights and identify the perfect model, whether it's Granite-Speech-3.3 for mission-critical accuracy or a proprietary alternative like AssemblyAI.
Ready to scale? Get a personalized strategy session with our AI experts.
Book a Call