Whisper: Leading the Charge to Revolutionize Converting Speech to Text


July 19, 2023


In September 2022, OpenAI released Whisper, a speech-to-text artificial intelligence (AI) technology that is leading a new wave of open-source automatic speech recognition (ASR) models. What makes Whisper truly revolutionary is that it is the first commercial-grade ASR model that is available for free. It also detects the spoken language directly from audio, which is a first-of-its-kind capability (to our knowledge!), and it can translate speech in other languages into English text. All this has allowed Whisper to advance beyond any other open-source ASR model on the market.

Range of Offerings

Whisper comes in several model sizes, ranging from tiny (39 million parameters) to large (1.55 billion parameters), which suit different use cases. The tiny model is a good fit for near-real-time scenarios where response time is critical, such as phone or meeting transcription. However, it tends to make more mistakes than its larger counterparts, and its transcription quality drops off steeply on specialized or domain-specific content. The large model requires more investment in hardware to be practical in production, but it is the best option for longer phrases and sentences and for more specialized content, which makes it suitable for subtitling.
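To make the size trade-off concrete, here is a minimal sketch of choosing a model by parameter budget. The tiny and large counts come from the paragraph above; the base, small, and medium counts are the intermediate sizes as listed in the Whisper model card. The `pick_model` helper and the idea of using parameter count as a stand-in for hardware cost are our own illustrative assumptions, not part of Whisper.

```python
# Approximate parameter counts, in millions, for the Whisper model sizes.
WHISPER_SIZES_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def pick_model(max_params_m: int) -> str:
    """Return the largest model whose parameter count fits the budget.

    Hypothetical helper: parameter count is only a rough proxy for
    latency and memory use on real hardware.
    """
    fitting = [name for name, size in WHISPER_SIZES_M.items() if size <= max_params_m]
    if not fitting:
        raise ValueError(f"no Whisper model fits within {max_params_m}M parameters")
    return max(fitting, key=WHISPER_SIZES_M.__getitem__)


if __name__ == "__main__":
    print(pick_model(100))   # a tight budget, e.g. live call transcription
    print(pick_model(2000))  # a generous budget, e.g. offline subtitling
```

In practice the right choice also depends on measured latency and accuracy on your own audio, so a lookup like this is only a starting point.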

Fine-Tuning Results

If your use case involves fairly consistent input content, fine-tuning can be a worthwhile investment. All that is needed is a training set of audio-transcription pairs, with the audio converted to a sampling rate of 16 kHz. Any of the Whisper speech-to-text models can then be fine-tuned on your content with just a few lines of code.
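The data preparation step can be sketched as follows. This is a deliberately naive, dependency-free illustration that resamples by linear interpolation and pairs the result with its transcription. The function names and the dictionary layout are hypothetical, and a production pipeline would normally use a dedicated audio library that applies a proper anti-aliasing filter when downsampling.

```python
TARGET_RATE = 16_000  # sampling rate (Hz) mentioned in the text above


def resample_linear(samples, source_rate, target_rate=TARGET_RATE):
    """Resample a mono signal by linear interpolation (naive sketch)."""
    if source_rate <= 0 or target_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if not samples:
        return []
    if source_rate == target_rate:
        return list(samples)

    n_out = max(1, round(len(samples) * target_rate / source_rate))
    step = source_rate / target_rate  # source samples per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Interpolate between neighbours; at the end of the signal both
        # indices are clamped, so the final sample is simply repeated.
        out.append(samples[left] + (samples[right] - samples[left]) * frac)
    return out


def make_training_pair(samples, source_rate, transcription):
    """Build one audio-transcription pair at 16 kHz (hypothetical layout)."""
    return {
        "audio": resample_linear(samples, source_rate),
        "sampling_rate": TARGET_RATE,
        "text": transcription.strip(),
    }


if __name__ == "__main__":
    one_second_at_48k = [0.0] * 48_000
    pair = make_training_pair(one_second_at_48k, 48_000, " hello world ")
    print(len(pair["audio"]), pair["sampling_rate"], repr(pair["text"]))
```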

Language Detection from Audio

One of the most innovative features offered by these models is language detection from audio. While language detection from text is not a fully "solved" problem, solutions based on character n-grams have been available for years. However, to our knowledge, no tool before Whisper offered language detection from raw audio. This feature was made possible by adding a special token, representing the language of the transcription, at the beginning of each transcript in the training data. As a result, Whisper can predict the language of an audio clip at inference time, even when the language is not specified in advance.
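The language-token idea can be illustrated with a small, self-contained sketch. It does not call Whisper itself: the scores below are made up, the helper names are hypothetical, and the sequence format is simplified (real Whisper sequences contain other special tokens as well). It only shows the principle that a language token such as `<|en|>` is attached to each training transcript, and that detection then amounts to picking the most probable language token.

```python
import math


def make_training_target(language_code: str, transcript: str) -> str:
    """Prefix a transcript with its language token (simplified format)."""
    return f"<|{language_code}|>{transcript}"


def detect_language(language_token_scores: dict) -> tuple:
    """Turn raw scores for language tokens into probabilities via softmax.

    Returns the most probable token and the full probability table.
    The input scores stand in for what a model would predict for the
    language-token position; here they are supplied by the caller.
    """
    if not language_token_scores:
        raise ValueError("at least one language token score is required")
    top = max(language_token_scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {tok: math.exp(s - top) for tok, s in language_token_scores.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


if __name__ == "__main__":
    print(make_training_target("fr", " Bonjour tout le monde."))
    made_up_scores = {"<|en|>": 1.0, "<|es|>": 3.0, "<|fr|>": 0.5}
    best, probs = detect_language(made_up_scores)
    print(best, round(probs[best], 3))
```

Because the prediction is a probability distribution over language tokens, a low top probability can serve as a signal that the audio is ambiguous or contains more than one language.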

Limitations

As for known limitations, in our experience Whisper does not do well in code-switching scenarios (i.e., when several languages are mixed within a single audio recording). In addition, it occasionally produces output translated into a different language even when translation was not requested.

Conclusion

Recently, Meta (née Facebook) released its own open-source automatic speech recognition models, and surely more will come in the near future. But Whisper, as the first in this new generation of ASR technologies, has proved to be the benchmark for quality speech-to-text models. With its cutting-edge features like language detection and a variety of model sizes for different use cases, Whisper will continue to revolutionize the industry for years to come.