Overview
SkyScribe uses OpenAI’s Whisper, a state-of-the-art automatic speech recognition (ASR) system, to power our audio-to-text transcription service. Whisper delivers industry-leading accuracy across multiple languages and handles varied audio conditions with exceptional robustness.

What is Whisper?
Whisper is an automatic speech recognition system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. This extensive training enables:
- Improved robustness to accents, background noise, and technical language
- Multilingual transcription in dozens of languages
- Translation capabilities from multiple languages into English
- High accuracy across diverse audio conditions
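SkyScribe’s hosted pipeline isn’t exposed directly, but the model itself is open source. As a rough illustration of the capability, here is a minimal sketch using the open-source openai-whisper Python package; the model size and file name are placeholder choices:

```python
import whisper

# Load a pretrained Whisper checkpoint; "base" trades accuracy for speed,
# while larger checkpoints ("medium", "large") are slower but more accurate.
model = whisper.load_model("base")

# Transcribe an audio file; the spoken language is auto-detected by default.
result = model.transcribe("meeting.mp3")  # placeholder file name

print(result["text"])      # full transcript
print(result["language"])  # detected language code, e.g. "en"
```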
Model Architecture
Whisper uses a simple end-to-end approach, implemented as an encoder-decoder Transformer:

1. Audio Processing: Input audio is split into 30-second chunks and converted into a log-Mel spectrogram
2. Encoding: The spectrogram is passed through an encoder that processes the audio features
3. Decoding: A decoder predicts the corresponding text caption, along with special tokens for:
- Language identification
- Phrase-level timestamps
- Multilingual speech transcription
- Speech translation to English
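For readers who want to see these stages individually, the open-source whisper package exposes the same pipeline piece by piece. The sketch below mirrors the three steps above, assuming a local audio file (the file name is a placeholder):

```python
import whisper

model = whisper.load_model("base")

# Step 1: load the audio and pad/trim it to a 30-second window,
# then convert the waveform into a log-Mel spectrogram.
audio = whisper.load_audio("audio.mp3")  # placeholder file name
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The decoder's language-identification token can be inspected directly.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Steps 2-3: whisper.decode runs the encoder over the spectrogram and
# the decoder over the resulting features to produce the text caption.
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```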
Language Support
Whisper supports transcription in 99 languages and translation to English. The model was trained on 680,000 hours of multilingual data, with about one-third being non-English content.

View Language Support
See the complete list of supported languages, performance details, and best practices for multilingual transcription.
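In the open-source package, the supported language codes are listed in whisper.tokenizer.LANGUAGES, and the language parameter of transcribe pins the language instead of relying on auto-detection. A small sketch (the file name is a placeholder):

```python
import whisper
from whisper.tokenizer import LANGUAGES  # mapping of language code -> name

print(len(LANGUAGES))   # number of supported language codes
print(LANGUAGES["ja"])  # "japanese"

model = whisper.load_model("base")

# Pinning the language helps with short clips or strong dialects
# where auto-detection can be unreliable.
result = model.transcribe("interview.mp3", language="ja")  # placeholder file
print(result["text"])
```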
Accuracy & Performance
Superior Robustness
Whisper’s training on large and diverse datasets results in exceptional performance:
- 50% fewer errors compared to specialized models when tested across diverse datasets
- Handles background noise effectively due to real-world training data
- Recognizes technical language and domain-specific terminology
- Works with various accents without additional fine-tuning
Translation Capabilities
Whisper excels at speech-to-text translation:
- Transcribes audio in the original language
- Translates to English with high accuracy
- Supports translation from all 99 supported languages
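In the open-source package, this is controlled by the task parameter of transcribe: the default keeps the original language, while task="translate" produces English output. A brief sketch with a placeholder French recording:

```python
import whisper

model = whisper.load_model("base")

# Same audio, two tasks: a French transcript and an English translation.
transcript = model.transcribe("discours.mp3", language="fr")      # French text
translation = model.transcribe("discours.mp3", task="translate")  # English text

print(transcript["text"])
print(translation["text"])
```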
Zero-Shot Performance
Unlike models fine-tuned for specific datasets, Whisper performs exceptionally well “zero-shot” (without dataset-specific fine-tuning) across:
- Various audio quality levels
- Different recording environments
- Multiple accents and dialects
- Technical and specialized content
Transcription Speed
Processing time varies based on several factors:

Typical Speed
1-3 minutes for a 30-minute audio file under normal conditions
Factors Affecting Speed
- Audio/video length
- File format and quality (video files may take longer due to audio extraction)
- Current queue size (processing time increases during peak usage)
- Optional features enabled (speaker diarization requires additional processing to identify speakers)
How SkyScribe Uses Whisper
SkyScribe leverages Whisper’s capabilities to provide:

High Accuracy Transcription
Industry-leading transcription quality across 99 languages
Robust Processing
Reliable performance even with background noise or varying audio quality
Multilingual Support
Seamless transcription and translation across dozens of languages
Technical Content
Accurate recognition of technical terms, jargon, and specialized vocabulary
Performance and Limitations
While Whisper exhibits state-of-the-art performance across many benchmarks, it’s important to understand its strengths and limitations.

Strengths
Accent Robustness
Improved robustness to diverse accents compared to many existing ASR systems
Noise Handling
Better performance in environments with background noise and challenging audio conditions
Technical Language
Strong recognition of technical terminology and specialized vocabulary
Known Limitations
1. Hallucinations
The model may occasionally include text that wasn’t actually spoken in the audio input. This occurs because:
- The training data is large-scale, weakly supervised, and therefore noisy
- The model combines predicting the next word with transcribing the audio
- The model uses its general language knowledge, which can sometimes lead to inference beyond what was said
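When using the open-source package directly, transcribe exposes thresholds that reject degenerate decodes and skip probable silence, which mitigates (but does not eliminate) hallucinations. The values below are the package defaults, written out explicitly; condition_on_previous_text=False reduces repetition loops at the cost of some cross-segment consistency:

```python
import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "noisy_call.mp3",                  # placeholder file name
    compression_ratio_threshold=2.4,   # reject highly repetitive segments
    logprob_threshold=-1.0,            # reject low-confidence decodes
    no_speech_threshold=0.6,           # treat likely-silent segments as empty
    condition_on_previous_text=False,  # don't let earlier text steer later segments
)
print(result["text"])
```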
2. Performance Variation Across Languages
The model’s performance varies across different languages based on the amount of training data available for each language.

3. Accent and Dialect Variations
The model exhibits disparate performance across:
- Different accents and dialects of the same language
- Speakers of different genders, races, and ages
- Various demographic groups
Word error rates may be higher for certain demographic groups. We continuously work to improve fairness and accuracy across all user groups.
Best Practices
To get optimal results with SkyScribe:
- Use high-quality audio when possible (clear speech, minimal background noise)
- Specify the language manually if auto-detect isn’t working well for your dialect
- Review transcripts for critical applications to ensure accuracy; a sketch for flagging low-confidence segments follows this list
- Report issues to our support team to help us improve
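For the review step, per-segment decoding statistics can help prioritize a human pass. In the open-source package, each segment returned by transcribe carries avg_logprob (decoder confidence) and no_speech_prob (likelihood the audio was silence); the cutoffs below are illustrative, not tuned values:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("earnings_call.mp3")  # placeholder file name

# Flag segments where the decoder was unsure or the audio may have
# been silence, so a reviewer can focus on the risky passages first.
for seg in result["segments"]:
    if seg["avg_logprob"] < -1.0 or seg["no_speech_prob"] > 0.6:
        print(f"[review] {seg['start']:.1f}-{seg['end']:.1f}s: {seg['text']}")
```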
Learn More
Want to dive deeper into Whisper’s capabilities and limitations?
- Read the Whisper research paper
- View the Whisper model card
- Explore the open-source code
What’s Next?
You’ve learned about the Whisper model powering SkyScribe’s transcription and translation. Want to learn more?
- Check out language support for details on specific languages
- Learn how to edit transcripts to improve accuracy

