Introducing Cohere-transcribe: state-of-the-art speech recognition
In English, Cohere-transcribe outperforms both proprietary and open-source competitors, taking the #1 spot on the Hugging Face Open ASR Leaderboard. Across the remaining 13 supported languages, our model matches or beats all existing open-source models.
Figure 1: Cohere-transcribe has a better throughput (RTFx) vs accuracy (WER) tradeoff than other 1B+ size models. RTFx (real-time factor multiple) measures how fast an audio model processes its input relative to real time.
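As a quick illustration (with made-up numbers, not values from the figure), RTFx is simply the duration of the audio divided by the wall-clock time the model takes to process it:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor multiple: seconds of audio transcribed
    per second of wall-clock compute. Higher is faster."""
    return audio_seconds / processing_seconds

# Transcribing 60 s of audio in 0.5 s of compute gives an RTFx of 120,
# i.e. the model runs 120x faster than real time.
print(rtfx(60.0, 0.5))
```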
Cohere-transcribe was designed and built with production use in mind. That means state-of-the-art accuracy in a model that can be served efficiently. To that end, we collaborated with vLLM to enable production serving of our model with an open-source stack [see merged PR].
Cohere-transcribe is Cohere’s first audio model. From a modelling perspective, our aim with this release was to take a simple recipe and scale it methodically. For us this meant paying particular attention to fundamentals: a strong multilingual tokenizer, the optimization regime, and, of course, our data mix. The result is a model that outperforms competitors in human evaluation as well as on benchmarks. In this post we detail some of the key design decisions we made during model development.
Supported languages: The model has been trained on 14 languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Chinese (Mandarin), Japanese and Korean.
Try it now in our Hugging Face Space.
Architecture
Cohere-transcribe is a 2B-parameter encoder-decoder transformer with cross-attention and a Fast-Conformer encoder [1], trained with a cross-entropy loss. Following Distil-Whisper and others [2-4], we dedicate more than 90% of our total parameters to the encoder and keep the decoder lightweight. This asymmetry keeps autoregressive inference compute to a minimum while maintaining accuracy, and the model's serving efficiency is a direct result of this decision. In contrast, other recent models such as Qwen-1.7B-ASR and ibm-granite/granite-4.0-1b-speech build on pre-trained text LLMs and add audio understanding to that autoregressive backbone. This makes the ASR model cheaper to train, but comes at the expense of inference speed and serving cost.
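A rough back-of-the-envelope sketch shows why the encoder-heavy split helps. The encoder runs once over the whole audio, while the decoder runs once per output token; with the common "~2 FLOPs per parameter per token" estimate, a small decoder keeps the sequential, per-token cost tiny. All numbers below are illustrative assumptions, not the actual Cohere-transcribe configuration:

```python
def forward_flops(params: float, tokens: int) -> float:
    # Rough rule of thumb: ~2 FLOPs per parameter per token processed.
    return 2.0 * params * tokens

# Illustrative parameter split (assumed: ~90% of a 2B model in the encoder).
encoder_params = 1.8e9
decoder_params = 0.2e9

audio_frames = 1500    # assumed encoder sequence length for a 30 s clip
output_tokens = 100    # assumed transcript length in tokens

# The encoder cost is paid once, in parallel over all frames; the decoder
# cost is paid token by token, sequentially, during autoregressive decoding.
encoder_cost = forward_flops(encoder_params, audio_frames)
decoder_cost = forward_flops(decoder_params, output_tokens)

print(f"sequential decoder cost is {decoder_cost / encoder_cost:.1%} "
      f"of the one-shot encoder cost")
```

Under these assumptions the sequential decoding work is well under 1% of the total, which is what makes the model cheap to serve despite its size.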
Training data
We chose a conventional, well-tested architecture and dedicated the bulk of our model development cycles to data work. cohere-transcribe-03-2026 was trained on 0.5M hours of curated audio-transcript pairs. Following rounds of error analysis, we augmented this with synthetic data. To produce the final mix, we filtered subsets of the data with an internal cleaning pipeline. We balanced the mix using proprietary methods and ran audio decontamination checks for train/test overlap.
We used a 16k-vocabulary multilingual BPE tokenizer with byte fallback, trained on data sampled in-distribution. During ASR training, we applied non-speech background noise augmentation with signal-to-noise ratios (SNRs) in the range 0 to 30 dB. Following Canary [5], we make punctuation customizable in the prompt. This enabled us to train on open datasets for which no cased or punctuated reference transcripts exist (e.g. Multilingual LibriSpeech). By default, we punctuate all transcripts at inference time.
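The noise augmentation above can be sketched with the standard SNR mixing formula: scale the noise so that 10·log10(P_speech / P_noise) equals the target SNR. This is a minimal NumPy sketch of the general technique, not Cohere's internal pipeline:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix background noise into speech at a target SNR in dB.

    The noise is scaled so that
    10 * log10(P_speech / P_scaled_noise) == snr_db.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Tile or trim the noise so it covers the full speech signal.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # 1 s of stand-in "speech" at 16 kHz
noise = rng.standard_normal(16000)   # stand-in background noise
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```

At 0 dB the noise is as loud as the speech; at 30 dB it is barely audible, so sampling SNRs across that range exposes the model to both mildly and heavily corrupted audio.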
