Strategy for Accurate Multilingual Transcription for TTS Data
High-Accuracy Transcription Pipeline
1. Multi-ASR Ensemble and Tiered Models: To achieve high transcription accuracy on ~150k hours across 12 languages, a multi-model approach is ideal. Run multiple ASR systems in parallel and fuse their outputs. For example, Alibaba’s CosyVoice 3 pipeline passes audio through several recognizers and uses cross-validation to produce a reliable transcript. Research on pseudo-labeling shows that combining three ASR models (e.g. Whisper, NeMo, Icefall) and taking a word-level consensus yields near-human accuracy. In practice, you can adopt a tiered strategy: use a fast model (like a “Gemini-3-Flash” or Whisper-large) for all data, then route difficult segments (low confidence or high disagreement) to a stronger model (“Gemini-3-Pro”) for re-transcription. This hybrid approach reserves the expensive model for the hardest cases, maximizing efficiency.

2. Confidence Checking and Forced Alignment: Implement a validation step to flag uncertain transcripts. Many ASR models provide a confidence score per word or utterance. Additionally, you can run a forced aligner (e.g. the Montreal Forced Aligner or a Wav2Vec-based aligner) to align the transcript to the audio. If alignment fails or many words cannot be aligned, that indicates potential errors. Forced aligners are typically language-specific, so you may need acoustic models for each language (MFA offers pre-trained models for several languages, or you can train your own). Flag low-confidence or poorly aligned segments for review or re-processing with a more powerful model. This two-pass validation (ASR confidence + alignment) ensures that only high-quality transcriptions enter your dataset.

3. Cross-Validation with Multiple Models: Extend the ensemble idea by comparing outputs from different ASRs. If two transcripts agree exactly (zero CER difference), the text is likely correct. If they disagree, identify the “confusion regions” (words with mismatches) and handle them specially. One simple approach is majority voting at the word level when three models are used (choose the word that two of the three agree on). For more complex reconciliation, consider using an LLM in text-only mode: feed it the candidate transcripts with markers highlighting differences, and prompt it to choose the most plausible combination. (In fact, one study found that fine-tuning an LLM on ASR outputs lets it resolve these disagreements effectively.) By leveraging multi-ASR fusion, you can reduce the hallucinations and errors that any single model might produce, achieving transcripts close to human quality.

4. Strict Prompting to Prevent Hallucination: When using a large language model (LLM) for transcription, constrain its behavior through the prompt and decoding settings. Clearly instruct the model not to generate anything other than the exact speech content. For example, a system prompt might say: “Transcribe the audio exactly as spoken, without paraphrasing or adding extra commentary. Use the appropriate script for each language, and include markers like [laugh] only if truly present.” Providing a few examples of correct transcriptions vs. incorrect (hallucinated) ones can reinforce this guideline. Use a low sampling temperature (0.0 or near 0) to minimize randomness. Also, request output in a structured format (like JSON) – this naturally limits free-form creativity. For instance, asking the LLM to output {"verbatim": "...", "roman_normalized": "...", "verbatim_emotions": "..."} leaves little room for imaginative tangents. Big labs often rely on such prompt engineering, and even slight fine-tuning of LLMs on transcription tasks, to keep them factual. If the model is still too “chatty,” consider fine-tuning it briefly on a small supervised set of audio and accurate transcript pairs to bias it toward literal transcription (essentially teaching it the task before scaling to all data).
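As a concrete illustration of the word-level vote, here is a minimal Python sketch using the standard library’s difflib. It treats the first transcript as the alignment backbone (so words missing from that transcript are not recovered) – a simplification of full ROVER-style multi-alignment, and the function names are ours, not from any of the cited systems:

```python
import difflib
from collections import Counter

def align_to_ref(ref, hyp):
    """Map each index in the reference word list to the aligned hypothesis
    word (substitutions included; insertions/deletions are skipped)."""
    mapping = {}
    sm = difflib.SequenceMatcher(a=ref, b=hyp)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            for k in range(i2 - i1):
                mapping[i1 + k] = hyp[j1 + k]
        elif tag == "replace":
            # pair up substituted words position by position
            for k in range(min(i2 - i1, j2 - j1)):
                mapping[i1 + k] = hyp[j1 + k]
    return mapping

def majority_vote(t1, t2, t3):
    """Word-level consensus of three transcripts, using t1 as the backbone."""
    a, b, c = t1.split(), t2.split(), t3.split()
    mb, mc = align_to_ref(a, b), align_to_ref(a, c)
    out = []
    for i, word in enumerate(a):
        # missing alignments default to the backbone word (a mild bias)
        votes = Counter([word, mb.get(i, word), mc.get(i, word)])
        out.append(votes.most_common(1)[0][0])
    return " ".join(out)
```

When all three models disagree at a position, Counter’s insertion-order tie-breaking keeps the backbone word; in a production pipeline you would instead flag such positions as confusion regions and hand them to the LLM reconciliation step described above.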
Handling Multilingual and Code-Switched Speech
1. Language Identification and Script Selection: A key challenge is code-switching – many of your segments mix languages (e.g. Hinglish: “Are bhai, kya kar rahe ho?”). The transcription system should output each portion in the correct script for that language. The best practice is to identify the spoken language at the word or phrase level and use that language’s writing system. For example, if a Hindi speaker says “अरे भाई, क्या कर रहे हो?”, the transcript should be in Devanagari, even if the same sounds could be written in English letters. Likewise, Telugu speech should be in Telugu script. Most multilingual ASR models (like Whisper) will automatically output the native script of a language if they were trained on it. For code-switched utterances, you might need to post-process or instruct the model explicitly: e.g., “Use Devanagari for Hindi words, Tamil script for Tamil, and Latin script for English.” If the ASR can’t handle this natively, a fallback is a language-tagging approach – insert markers like <lang hi>...</lang> around Hindi text, then convert to Devanagari with a transliteration tool later. However, direct script output is preferable to avoid an extra conversion step (which can introduce errors). Notably, CosyVoice 3 emphasized robust text normalization across languages and text format diversity, ensuring the model saw various scripts and styles during training. This means your TTS model will be more robust if it’s trained on transcripts that faithfully represent each language in its own script, even within a single sentence.

2. Transliteration for Romanized Segments: In some cases, speakers might spell things out or use English for local terms. If your transcripts end up with romanized native-language text (e.g. “arey anna, em chestunnav?” in Latin letters), consider transliterating it to the native script (“అరే అన్నా, ఏమి చేస్తున్నావ్?” in Telugu) for consistency. There are libraries and models for Indic transliteration which you can use in an automated pipeline. An alternative strategy is to include a special notation for such cases (the “te_EN” approach you mentioned, to denote Telugu in English letters). However, adding such tags to the training data could complicate the model’s input format. It’s usually cleaner to have unified transcripts in one script per language. If an English word is truly being used (like a name or technical term), you can leave it in English. But if it’s just a phonetic Hindi phrase typed in English letters, convert it to Devanagari. The goal is to preserve the language identity of each word so the TTS model knows how to pronounce it correctly. Big multilingual TTS systems often rely on either the script or explicit language tokens to handle code-switching; for example, some use inline tokens like <|hi|> to force Hindi pronunciation for following Latin-script text. In your case, since you’re generating transcripts from scratch, it’s optimal to output the correct script and avoid the need for such tokens. You can always fall back to language tags for ambiguous cases, but try to minimize mixed-script scenarios by resolving them in the transcript generation step.

3. Text Normalization (Numbers, Acronyms, etc.): Don’t forget to normalize non-speech content in transcripts. This includes expanding numbers, dates, acronyms, and formatting.
A phrase like “I have 12,345 rupees” should be transcribed as “I have twelve thousand three hundred forty-five rupees” (or the equivalent in the target language) so the TTS model doesn’t have to learn how to say numeric digits. You already noted you can handle numeric normalization with regex or rules – that’s a good idea. Also consider common ITN (inverse text normalization) cases: expanding abbreviations (“Dr.” to “Doctor”), units, etc., in whatever language context they appear. Consistent normalization is important so that your training data has a uniform, speech-like text form (no stray symbols or untranslated English tokens unless intended). As CosyVoice 3 noted, incorporating varied text normalization and formats during training improved robustness – so you might even include both a “verbatim” version and a “normalized” version of transcripts in training (though primarily you’ll train on normalized transcripts for TTS). In summary, ensure each transcript line reads like natural, fully spoken text in that language’s orthography.
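To make the normalization step concrete, here is a minimal, English-only sketch of regex-driven number expansion (integers up to the millions). A production pipeline would swap in per-language rules or an existing ITN toolkit; the helper names here are illustrative:

```python
import re

ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def _under_1000(n):
    """Spell out 1..999 in words."""
    parts = []
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        word = TENS[n // 10]
        if n % 10:
            word += "-" + ONES[n % 10]
        parts.append(word)
    elif n:
        parts.append(ONES[n])
    return " ".join(parts)

def number_to_words(n):
    """Spell out a non-negative integer below one billion."""
    if n == 0:
        return "zero"
    chunks, words = [], []
    for scale in ("", " thousand", " million"):
        chunks.append((n % 1000, scale))
        n //= 1000
    for value, scale in reversed(chunks):
        if value:
            words.append(_under_1000(value) + scale)
    return " ".join(words)

def normalize_numbers(text):
    """Expand digit strings (with optional comma grouping) into words."""
    return re.sub(r"\d[\d,]*",
                  lambda m: number_to_words(int(m.group().replace(",", ""))),
                  text)
```

The same regex-dispatch pattern extends naturally to dates, units, and abbreviations – each category gets its own matcher and expansion rule per language.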
Incorporating Emotion and Non-Speech Markers
1. Annotating Laughter and Other Events: To create a truly human-like TTS, you may want to include phenomena like laughter, coughs, and emotional tone indicators in the training data. This means inserting special tokens like [laugh], [cough], or emotion tags (e.g. [sad], [excited]) at the appropriate places in transcripts. Many state-of-the-art models have done exactly this. For example, CosyVoice 3’s training data included fine-grained inline tags such as “[laughter]” to indicate non-verbal sounds. Inworld’s TTS-1 similarly introduced a set of markup tokens for speaking style and non-verbal vocalizations during a later fine-tuning stage. These tags let the model learn to produce those cues in audio. To get these annotations, you have a couple of options:
Automatic detection: Use an audio-based classifier to detect laughter, applause, music, etc. You already use PANN for music detection; consider using a pre-trained sound event detection model for laughter or crowd noises. If the audio segment triggers a high confidence for laughter, insert a [laugh] token at that point in the transcript. Some ASR models (or Whisper with certain settings) might also insert tags like “[LAUGHTER]” – if so, leverage that output.
LLM-based annotation: Since you plan to use a powerful model for transcription, you could instruct it to include tags for audible non-speech events. For instance: “Transcribe the speech, and if the speaker laughs or coughs, insert [laugh] or [cough] at that position.” A well-trained model should do this reasonably, but be cautious: the model might hallucinate an event tag if it thinks something was funny or if there’s an ambiguous noise. To mitigate false positives, cross-verify with audio: e.g., if the transcript has a [laugh] tag, ensure the audio segment indeed contains laughter (perhaps by a simple energy burst check or a secondary classifier). You can filter out tags that aren’t corroborated by audio evidence.
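The cross-verification idea can be sketched as a simple filter. Here `detected_events` stands in for whatever label set your audio event classifier returns for the segment – a hypothetical interface, since the actual output depends on the detector you choose:

```python
import re

# non-verbal tags that must be backed by an audio-event detection
VERIFIABLE_TAGS = {"laugh": "laughter", "cough": "cough", "applause": "applause"}

def filter_unverified_tags(transcript, detected_events):
    """Drop non-verbal tags (e.g. [laugh]) that the audio classifier did not
    corroborate. Emotion-tone tags ([sad], [angry], ...) are left alone here;
    they need a separate emotion-recognition check."""
    def keep(match):
        tag = match.group(1)
        needed = VERIFIABLE_TAGS.get(tag)
        if needed is None or needed in detected_events:
            return match.group(0)   # emotion tag or verified event: keep
        return ""                   # unverified non-verbal event: remove
    cleaned = re.sub(r"\[(\w+)\]", keep, transcript)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

This errs on the side of under-tagging, which – as noted below – is safer than training on hallucinated events.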
2. Defining a Controlled Tag Set: Limit the set of emotion and event tokens to a defined list – you mentioned about 10 stable tags (e.g. [happy], [sad], [angry], [laugh], [pause], etc.). Consistency is key: the model should see these tags frequently and in correct contexts. Big labs handle this by collecting or generating paired examples. For instance, Inworld’s team constructed a special dataset of ~100k paired samples where each neutral utterance was paired with a stylized version containing emotion tags (like an angry version with an [angry] tag). This taught their model how tags correspond to delivery. In your case, since you already have a large amount of real speech, you can incorporate the tags in transcripts wherever applicable. The “verbatim_emotions” field you proposed would contain these annotations. During model training, you can decide to either train the base model with these tags included from the start, or introduce them in a fine-tuning stage:
Immediate integration: If your automated tagging is accurate, you can include tags in the transcripts from the get-go. The model will learn them as part of its language modeling. CosyVoice 3, for example, had a portion of its pretraining data (~5,000 hours) with instructions and emotions, meaning the base model already saw emotion tokens in context.
Post-training fine-tune: Alternatively, you train the TTS model on plain speech first, then perform a smaller supervised fine-tuning (SFT) with a focused dataset that includes emotion/style tags. This is what Inworld did – after main training, they fine-tuned with LoRA on markup-enriched data to inject fine-grained control. This approach can sometimes yield more stable base performance (since the model isn’t trying to learn tags from noisy data), with controllability added later.
Given your inclination to “get data at once”, a practical plan is to generate multiple transcript versions now: one pure verbatim, and one with emotion tags. You could even create an augmented training set where, say, 20% of samples have the emotion tags inserted (to teach the model the concept), and the rest are normal (to preserve base quality). This mirrors Inworld’s strategy of mixing neutral and marked-up examples. It’s wise to keep the tag vocabulary limited and standardized (no free-form descriptions – just [laugh], not [laugh_hysterically], unless you plan to use a wide range consistently). Simplicity will make it easier for the model to learn the association.

3. Quality Control for Tags: As you suspected, an LLM might occasionally hallucinate an emotion tag (e.g., insert [sad] because the content seems sad, even if the speaker’s tone was neutral). To control this, implement a filtering rule: non-verbal tags like [laugh] or [cough] should only be present if the audio truly contains that sound. For emotional tone tags like [angry] or [sarcastic], it’s harder to verify algorithmically, but you can base the check on an emotion recognition model that classifies the speaker’s tone; if its confidence is below a threshold, remove or replace the tag. Essentially, treat LLM-generated emotion tags as suggestions that need verification. It’s better to have slightly under-tagged data than to train the model on hallucinated events.
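The 20% mixing idea sketched above could look like this, assuming each record carries the proposed verbatim and verbatim_emotions fields (the function itself is illustrative, not from any cited system):

```python
import random

def build_training_texts(records, tagged_fraction=0.2, seed=0):
    """For each record (a dict with 'verbatim' and, optionally,
    'verbatim_emotions' fields per the proposed schema), pick the
    tag-enriched transcript for roughly `tagged_fraction` of samples
    and the plain verbatim transcript otherwise."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    texts = []
    for rec in records:
        use_tags = rec.get("verbatim_emotions") and rng.random() < tagged_fraction
        texts.append(rec["verbatim_emotions"] if use_tags else rec["verbatim"])
    return texts
```

Seeding the RNG keeps the neutral/marked-up split reproducible across training runs, so you can re-generate the exact same mixture when iterating on the tag set.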
Generating and Using Multi-Form Transcripts
You proposed outputting transcripts in multiple forms – verbatim native script, romanized, and with emotion tags. This is an excellent idea for flexibility. Here’s how to tackle it:
Verbatim (Native Script): This is the core transcription in the language’s own script, with basic punctuation. Use this for primary TTS and ASR training. Ensure it’s as accurate as possible and normalized. All the strategies above (ensemble ASR, alignment, etc.) apply chiefly to this output.
Romanized (Latin Script): Having a romanized version of each transcript can be useful for certain applications (for instance, training an ASR that accepts Latin characters for all languages, or simply helping non-native readers). To get this, you can either prompt the LLM to produce it (by asking it to fill the "roman_normalized" field) or generate it after the fact with a transliteration library. A large multilingual model will often know how to phonetically spell words in Latin script, but it might not follow a strict standard – you may get spelling or punctuation variations (e.g. “kaam kar rahe ho” vs. “kam kar rahe ho”). A rule-based transliterator (like ITRANS or ISO 15919 for Indic languages) would give consistent results, but those schemes can be awkward for non-technical use. A compromise is to use the LLM’s output but spot-check or lightly normalize it (for example, strip all diacritics if you want plain ASCII). This romanized text is mostly for reference – it likely won’t be used to train the final TTS model unless you plan to allow Latin-script input for languages like Hindi (which is uncommon for a production TTS). If code-mixed content includes true English words, they will appear unchanged here, which is fine.
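For the diacritic-stripping normalization mentioned above, the standard library suffices – decompose to NFD and drop the combining marks:

```python
import unicodedata

def to_plain_ascii(romanized):
    """Strip diacritics from an ISO 15919-style romanization so the
    roman_normalized field is plain ASCII (e.g. 'kyā' -> 'kya')."""
    decomposed = unicodedata.normalize("NFD", romanized)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This only handles characters that decompose into a base letter plus combining marks; precomposed characters without such a decomposition would need an explicit mapping table.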
Verbatim with Emotions (or “Rich Transcript”): This version would mirror the native script transcript but include the special tokens for laughs, emotions, background noises, etc. It might also include enhanced punctuation or casing if you decide to add that (for instance, maybe in the verbatim version you keep minimal punctuation, but in the enriched version you could add “…” for pauses or capitalize proper nouns – it depends on what your model is expected to handle). To generate this, the LLM can be prompted to include those markers (as discussed above). Another approach is a two-step process: get the clean verbatim transcript first, then run a second pass where you insert tags based on external detectors (e.g., insert [laugh] at 3.2s because a laughter was detected there, or prepend [angry] at the start if an emotion model says the speaker’s tone is angry throughout the segment). Doing it in two steps can actually be safer – it separates speech-to-text from paralinguistic tagging, potentially reducing transcription errors. However, a capable multimodal model could do both in one go. Choose the approach that gives the highest precision for tags.
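The second pass of the two-step process – inserting a detected event into the clean transcript – can be sketched like this, assuming you have word-level timings from a forced aligner (the `(word, start, end)` tuple format is an assumption, not a standard aligner output):

```python
def insert_event_tag(aligned_words, tag, event_time):
    """Insert an inline tag (e.g. '[laugh]') into a transcript, given
    word-level timings [(word, start_s, end_s)] and the detected event's
    time in seconds. The tag goes before the first word starting at or
    after the event."""
    words = []
    inserted = False
    for word, start, end in aligned_words:
        if not inserted and start >= event_time:
            words.append(tag)
            inserted = True
        words.append(word)
    if not inserted:
        words.append(tag)  # event occurred after the last word
    return " ".join(words)
```

Because this pass only touches tag placement, a bug here cannot corrupt the underlying speech-to-text output – one of the safety benefits of separating the two steps.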
When you have these multiple forms, you can store them in your metadata (e.g., in Supabase or alongside the audio in the .json). They’re all derived from the same audio, so you can reference them as needed. For TTS training, you’ll likely start with the plain verbatim transcripts (no tags) for the bulk of pretraining. Later, you might use the verbatim_emotions version in a specialized fine-tune to teach the model those tags and emotional style control. The romanized text might not directly go into model training, but it’s a nice-to-have for analysis or for training an auxiliary model (such as a language identifier, or as a fallback ASR input).
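A sketch of bundling the three forms into one JSON metadata record (the field names follow the schema proposed above; `segment_id` is an illustrative addition):

```python
import json

def transcript_record(segment_id, verbatim, roman_normalized=None,
                      verbatim_emotions=None):
    """Bundle the transcript variants for one audio segment into the
    metadata dict stored alongside the audio. If no tag-enriched version
    exists, verbatim_emotions falls back to the plain verbatim text."""
    return {
        "segment_id": segment_id,
        "verbatim": verbatim,
        "roman_normalized": roman_normalized or "",
        "verbatim_emotions": verbatim_emotions or verbatim,
    }

record = transcript_record(
    "seg_000123",
    "अरे भाई, क्या कर रहे हो?",
    roman_normalized="are bhai, kya kar rahe ho?",
    verbatim_emotions="अरे भाई, क्या कर रहे हो? [laugh]",
)
# ensure_ascii=False keeps native scripts readable in the stored JSON
json_line = json.dumps(record, ensure_ascii=False)
```

Storing all variants per segment lets you switch training targets later (plain vs. tag-enriched) without re-running transcription.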
Best-Practice Takeaways
To summarize the strategy with some best practices from industry research:
Use multiple ASR engines and cross-validation to maximize transcription quality. This is a proven approach in CosyVoice 3, which used “cross-validated ASR transcription using multiple models” to create its 1M-hour training corpus. Recent research goes further with ensemble decoding and LLM-based error correction, showing that an ensemble+LLM can match human transcription quality on large datasets. Implement as much of this as is feasible – even a simpler two-model vote can boost accuracy.
Rigorous filtering and QA: After transcription, apply filtering similar to large TTS projects. Inworld’s TTS-1, for instance, filtered out segments with likely transcription errors (e.g. extremely high or low speaking rates, or punctuation-only transcripts). You should discard or re-transcribe any segments that look suspicious (e.g., a 10-second audio that got only “[music]” as text, or an obviously wrong language). Using metadata like the audio-to-text length ratio is helpful – if an audio segment is 5 seconds but the transcript has 500 characters, something is off. This kind of filtering will ensure your final 150k hours are truly usable.
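A minimal version of the ratio check might look like the following; the characters-per-second thresholds are illustrative and should be tuned per language and corpus:

```python
import re

def looks_suspicious(duration_s, transcript, min_cps=2.0, max_cps=30.0):
    """Flag segments whose characters-per-second ratio is implausible,
    or whose transcript is only bracketed tags / punctuation."""
    text = re.sub(r"\[[^\]]*\]", "", transcript)  # drop event tags
    text = re.sub(r"[^\w]", "", text)             # keep letters/digits only
    if not text:
        return True   # e.g. a segment transcribed as just "[music]"
    cps = len(text) / max(duration_s, 1e-6)
    return not (min_cps <= cps <= max_cps)
```

Flagged segments go back into the re-transcription queue (ideally through the stronger model tier) rather than being silently dropped, so you can distinguish bad audio from bad transcripts.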
Handle code-mixing robustly: Make a decision and stick to it regarding script usage. Most likely, this means all Hindi in Devanagari, all Tamil in Tamil script, etc., even if the original speaker peppered in English words – keep those English words in Latin script, since they are English. If a speaker says a full English sentence inside a Hindi segment, just transcribe that sentence in English. Your TTS model (if truly multilingual) should then naturally switch to the English “voice” for that bit. Consistency in how you represent mixed-language content will be important for the model to learn code-switching behavior.
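One cheap consistency check for that script policy is to classify each word’s script from Unicode character names and flag Latin-script words inside a native-script transcript for review (the script list here is trimmed to the languages mentioned; the function names are illustrative):

```python
import unicodedata

def word_script(word):
    """Rough per-word script label based on the first alphabetic
    character's Unicode name."""
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("DEVANAGARI"):
                return "Devanagari"
            if name.startswith("TELUGU"):
                return "Telugu"
            if name.startswith("TAMIL"):
                return "Tamil"
            if "LATIN" in name:
                return "Latin"
    return "Other"

def flag_romanized_words(transcript, expected="Devanagari"):
    """Return Latin-script words inside a transcript expected to be in a
    native script - candidates for transliteration or English review."""
    return [w for w in transcript.split()
            if word_script(w) == "Latin" and expected != "Latin"]
```

Flagged words are either legitimate English (keep as-is, per the policy above) or romanized native-language text that should be transliterated – a quick human or LLM pass can decide which.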
Leverage emotion markers as a controllability feature: Including things like [laugh] or <happy> in training data can significantly enhance the expressiveness of the TTS model. CosyVoice 3 and others explicitly added such markup to gain fine-grained control. Plan how you’ll use these: if you include them in training from the start, ensure they are correct and not too sparse. Alternatively, create a smaller high-quality set of data with these annotations for a later fine-tune. You might even do both: a bit in pretraining, plus a focused fine-tune, as a belt-and-suspenders approach for the model to really learn it.
Finally, continuously evaluate WER/CER on a validation subset across languages as you iterate on the transcription process. If, say, Telugu transcripts come out at 15% WER against a known test set while Hindi is at 5%, you might route more Telugu data through the “Pro” model or add an extra QC pass for Telugu. This feedback loop will guide where to invest more effort (as you noted, perhaps using the larger model on “less feasible” languages). The end goal is a massive collection of transcripts that are as close to ground truth as possible, in the right script, with optional metadata tags – a solid foundation for training a high-fidelity, multilingual, expressive TTS LLM.

Sources:
Alibaba (Tongyi Lab) – CosyVoice 3 technical summary (1M hours, multi-model ASR pipeline, emotion tags).
Inworld AI – TTS-1 Technical Report (200k-hour filtered dataset, markup fine-tuning for emotions).
Y. Chen et al. – Multi-ASR Fusion for Pseudo-Labeling (ensemble of Whisper/NeMo/Kaldi with LLM refinement to approach human transcription quality).
Montreal Forced Aligner docs (usage of forced alignment for various languages).
General ASR best practices in handling code-switching and text normalization.