Whisper: Multilingual speech recognition model, robust, versatile, open-source.
Model Name: Whisper
Developer/Creator: OpenAI
Release Date: September 2022 (original series), December 2022 (large-v2), and November 2023 (large-v3)
Model Type: Sequence-to-sequence ASR (automatic speech recognition) and speech translation model
Versions: tiny, base, small, medium, large, large-v2, and large-v3; English-only variants (e.g., base.en) are available for the tiny through medium sizes.
The Whisper models are intended primarily for AI research on model robustness, generalization, and bias, and are also effective for English speech recognition. Using them to transcribe recordings made without consent, or in high-risk decision-making contexts, is strongly discouraged due to potential inaccuracies and ethical concerns.
They are also intended for developers and researchers who want to add speech-to-text capabilities to applications, support accessibility features, or conduct linguistic research.
The model uses an encoder-decoder Transformer architecture trained end-to-end on large-scale, weakly supervised audio-transcript pairs.
The models are trained on 680,000 hours of audio and paired transcripts collected from the internet: 65% English audio with English transcripts, 18% non-English audio with English transcripts, and 17% non-English audio with matching non-English transcripts, covering 98 languages in total.
Research indicates that these models outperform many existing ASR systems. They show enhanced robustness to accents, background noise, and technical language, and provide zero-shot translation from many languages into English with near state-of-the-art accuracy.
Performance varies across languages, with notably lower accuracy for low-resource or less commonly studied languages, and accuracy also varies across accents, dialects, and demographic groups. The models may generate repetitive text, a failure mode partly addressable through beam search and temperature scheduling.
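The repetition mitigations above can be sketched with the open-source `openai-whisper` package, whose `transcribe()` function accepts a beam size, a tuple of fallback temperatures, and quality thresholds that trigger re-decoding. The model size, file path, and threshold values below are illustrative choices, not recommendations from this document.

```python
"""Sketch: decoding settings in openai-whisper (pip install openai-whisper)
that reduce repetitive output via beam search plus temperature fallback."""

# Temperatures tried in order; decoding falls back to the next value when
# the output fails the quality checks below.
TEMPERATURE_SCHEDULE = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)

def robust_transcribe(path: str) -> str:
    # Imported lazily so the sketch can be read without the package installed.
    import whisper

    model = whisper.load_model("base")  # illustrative size choice
    result = model.transcribe(
        path,
        beam_size=5,                         # beam search at temperature 0
        temperature=TEMPERATURE_SCHEDULE,    # fallback schedule
        compression_ratio_threshold=2.4,     # flags highly repetitive text
        logprob_threshold=-1.0,              # flags low-confidence decodes
    )
    return result["text"]
```

When the zero-temperature beam search produces gibberish or loops, the sampled higher-temperature retries usually break the repetition at the cost of some determinism.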
The training data does not include audio or text collected after mid-2022.
Code Samples/SDK:
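A minimal usage sketch with the open-source `openai-whisper` package (pip install openai-whisper): load a model checkpoint, transcribe an audio file, or translate non-English speech into English via the `task` parameter. The file path and model size are placeholders.

```python
"""Sketch: transcription and translation with openai-whisper."""

def transcribe_file(path: str, translate: bool = False) -> str:
    # Imported lazily so the sketch can be read without the package installed.
    import whisper

    model = whisper.load_model("base")  # sizes: tiny, base, small, medium, large
    task = "translate" if translate else "transcribe"
    # Language is auto-detected from the first 30 seconds unless specified.
    result = model.transcribe(path, task=task)
    return result["text"]

if __name__ == "__main__":
    print(transcribe_file("audio.mp3"))                  # same-language transcript
    print(transcribe_file("audio.mp3", translate=True))  # English translation
```

The same checkpoints serve both tasks; only the task token passed to the decoder changes.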
Tutorials: Speech-to-text Multimodal Experience in NodeJS
File Size: The maximum audio file size is 2 GB.
Issues and contributions can be made directly through the GitHub repository.