Overview

SkyScribe uses OpenAI’s Whisper, a state-of-the-art automatic speech recognition (ASR) system, to power our audio-to-text transcription service. Whisper delivers industry-leading accuracy across multiple languages and handles various audio conditions with exceptional robustness.

What is Whisper?

Whisper is an automatic speech recognition system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. This extensive training enables:
  • Improved robustness to accents, background noise, and technical language
  • Multilingual transcription in 99 languages
  • Translation capabilities from multiple languages into English
  • High accuracy across diverse audio conditions
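
You don't need to run Whisper yourself; SkyScribe handles it server-side. For readers who want to experiment locally, here is a minimal sketch using the open-source whisper Python package (the "base" checkpoint and the file name audio.mp3 are placeholder choices):

```python
import whisper

# Load a checkpoint; larger checkpoints ("small", "medium", "large")
# trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe an audio file; the language is auto-detected by default.
result = model.transcribe("audio.mp3")
print(result["language"])  # detected language code, e.g. "en"
print(result["text"])      # the full transcript
```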

Model Architecture

Whisper uses a simple end-to-end approach, implemented as an encoder-decoder Transformer:
  1. Audio Processing: Input audio is split into 30-second chunks and converted into a log-Mel spectrogram.
  2. Encoding: The spectrogram is passed through an encoder that processes the audio features.
  3. Decoding: A decoder predicts the corresponding text caption, along with special tokens for:
     • Language identification
     • Phrase-level timestamps
     • Multilingual speech transcription
     • Speech translation to English
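
For illustration, these stages are visible in the lower-level API of the open-source whisper package; this sketch follows the package's README example, with speech.wav as a placeholder file name:

```python
import whisper

model = whisper.load_model("base")

# Stage 1: load audio, pad/trim it to a 30-second window, and compute
# the log-Mel spectrogram the encoder expects.
audio = whisper.load_audio("speech.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The decoder's special tokens include language identification.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Stages 2-3: encode the spectrogram and decode it into text.
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```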

Language Support

Whisper supports transcription in 99 languages and translation to English. The model was trained on 680,000 hours of multilingual data, with about one-third being non-English content.

View Language Support

See the complete list of supported languages, performance details, and best practices for multilingual transcription.
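
As a local illustration with the open-source whisper package, the language can either be auto-detected or pinned explicitly; the language code "es" and the file name below are example values:

```python
import whisper

model = whisper.load_model("base")

# Passing language= skips auto-detection and decodes directly
# in the given language (here Spanish).
result = model.transcribe("interview.mp3", language="es")
print(result["text"])
```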

Accuracy & Performance

Superior Robustness

Whisper’s training on large and diverse datasets results in exceptional performance:
  • Makes about 50% fewer errors than specialized models when tested across diverse datasets
  • Handles background noise effectively due to real-world training data
  • Recognizes technical language and domain-specific terminology
  • Works with various accents without additional fine-tuning

Translation Capabilities

Whisper excels at speech-to-text translation:
  • Transcribes audio in the original language
  • Translates to English with high accuracy
  • Supports translation into English from any supported language
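
As a sketch with the open-source whisper package, translation is requested with task="translate"; the file name is a placeholder:

```python
import whisper

model = whisper.load_model("base")

# task="translate" transcribes the speech and renders it in English,
# whatever the source language.
result = model.transcribe("podcast_fr.mp3", task="translate")
print(result["text"])  # English text
```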

Zero-Shot Performance

Unlike models fine-tuned for specific datasets, Whisper performs exceptionally well “zero-shot” (without specific training) across:
  • Various audio quality levels
  • Different recording environments
  • Multiple accents and dialects
  • Technical and specialized content

Transcription Speed

Processing time varies based on several factors:

Typical Speed

1-3 minutes for a 30-minute audio file under normal conditions

Factors Affecting Speed

  • Audio/video length: longer recordings take proportionally longer
  • File format and quality: video files may take longer due to audio extraction
  • Current queue size: processing time increases during peak usage
  • Optional features: speaker diarization requires additional processing for speaker identification

For typical use cases, expect processing to take roughly 1/10th of the audio's duration or less.
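
As a back-of-envelope illustration only (the 1/10th ratio reflects the typical figures above, not a guaranteed rate):

```python
def estimate_processing_minutes(audio_minutes: float, rtf: float = 0.1) -> float:
    """Rough estimate of processing time from a real-time factor (rtf)."""
    return audio_minutes * rtf

# A 30-minute recording at ~1/10th real time: about 3 minutes.
print(estimate_processing_minutes(30))
```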

How SkyScribe Uses Whisper

SkyScribe leverages Whisper’s capabilities to provide:

High Accuracy Transcription

Industry-leading transcription quality across 99 languages

Robust Processing

Reliable performance even with background noise or varying audio quality

Multilingual Support

Seamless transcription and translation across all 99 supported languages

Technical Content

Accurate recognition of technical terms, jargon, and specialized vocabulary

Performance and Limitations

While Whisper exhibits state-of-the-art performance across many benchmarks, it’s important to understand its strengths and limitations.

Strengths

Accent Robustness

Improved robustness to diverse accents compared to many existing ASR systems

Noise Handling

Better performance in environments with background noise and challenging audio conditions

Technical Language

Strong recognition of technical terminology and specialized vocabulary

Known Limitations

Important: Understanding these limitations helps you get the best results from SkyScribe.

1. Hallucinations

The model may occasionally include text that wasn’t actually spoken in the audio input. This occurs because:
  • Training data includes weakly supervised, large-scale noisy data
  • The model combines predicting the next word with transcribing the audio
  • The model uses its general language knowledge, which can sometimes lead to inference beyond what was said
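
There is no foolproof hallucination detector, but the segment metadata returned by the open-source whisper package can help flag spans worth reviewing. The thresholds below mirror the package's own transcription defaults (logprob_threshold=-1.0, no_speech_threshold=0.6, compression_ratio_threshold=2.4); the file name is a placeholder:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("noisy_call.mp3")

# Hallucinated or repeated text often coincides with a low average
# log-probability, a high no-speech probability, or a high
# compression ratio in a segment.
for seg in result["segments"]:
    if (seg["avg_logprob"] < -1.0
            or seg["no_speech_prob"] > 0.6
            or seg["compression_ratio"] > 2.4):
        print(f"[{seg['start']:.1f}-{seg['end']:.1f}s] review: {seg['text']}")
```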

2. Performance Variation Across Languages

The model’s performance varies across different languages based on the amount of training data available for each language.
See our Language Support guide for more details. We recommend testing with your specific language to ensure it meets your accuracy requirements.

3. Accent and Dialect Variations

The model exhibits disparate performance across:
  • Different accents and dialects of the same language
  • Speakers of different genders, races, and ages
  • Various demographic groups

Word error rates may be higher for certain demographic groups. We continuously work to improve fairness and accuracy across all user groups.

For critical use cases: We recommend reviewing transcripts to ensure they meet your accuracy requirements.

Best Practices

To get optimal results with SkyScribe:
  1. Use high-quality audio when possible (clear speech, minimal background noise); see the preprocessing sketch after this list
  2. Specify the language manually if auto-detect isn’t working well for your dialect
  3. Review transcripts for critical applications to ensure accuracy
  4. Report issues to our support team to help us improve
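
As one illustration of tip 1, a common preprocessing step is converting audio to 16 kHz mono (the sample rate Whisper consumes internally) before upload. This sketch shells out to ffmpeg, which is assumed to be installed; the file names are placeholders:

```python
import subprocess

# -ar sets the output sample rate, -ac the channel count.
subprocess.run(
    ["ffmpeg", "-i", "meeting.mp4", "-ar", "16000", "-ac", "1", "meeting.wav"],
    check=True,
)
```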

What's Next?

You've learned about the Whisper model powering SkyScribe's transcription and translation. To dive deeper into Whisper's capabilities and limitations, see our Language Support guide and the best practices above.

Need Help?

If you have questions about language support, transcription accuracy, or limitations, contact our support team at [email protected].