Architectural Blueprint for a Generative Indic TTS System: Leveraging Large-Scale Podcast Data and Multimodal LLMs1. Introduction: The Paradigm Shift in Speech SynthesisThe domain of Text-to-Speech (TTS) synthesis is currently undergoing a fundamental paradigm shift, moving from statistical parametric synthesis and end-to-end neural pipelines (such as Tacotron and FastSpeech) toward generative Speech Language Models (SpeechLMs). This transition is driven by the recognition that human speech is not merely an acoustic signal reconstruction task but a semantic modeling challenge that requires deep understanding of context, paralinguistics, and conversational dynamics. For Indic languages—characterized by complex code-switching (e.g., Hinglish, Tanglish), rich morphological structures, and diverse prosodic patterns—this shift presents both a formidable challenge and an unprecedented opportunity.Traditional acoustic models have historically struggled with "in-the-wild" data, requiring clean, studio-recorded corpora to produce intelligible speech. However, the emergence of architectures such as Inworld TTS-1 1, CosyVoice 3 1, and FireRedTTS-2 1 demonstrates that scaling training compute and data volume allows models to learn directly from noisy, unstructured audio, provided the semantic conditioning is sufficiently robust. These models treat speech synthesis as a next-token prediction task, effectively bridging the gap between Large Language Models (LLMs) and audio generation.This report articulates a comprehensive strategy for developing a human-like Indic TTS system using uncurated YouTube podcast data. Podcasts represent a rich repository of naturalistic speech, containing the disfluencies, interruptions, laughter, and code-switching that characterize real human communication. However, utilizing this data requires a sophisticated curation pipeline. We propose leveraging the multimodal reasoning capabilities of Gemini 3.0 Pro and Flash to generate high-fidelity, verbatim transcriptions enriched with audio event tags. Furthermore, we delineate a hybrid training curriculum that synthesizes the best practices from state-of-the-art research—specifically, the use of Supervised Multi-task Training (SMT) for tokenization, interleaved text-speech formatting for context maintenance, and Low-Rank Adaptation (LoRA) for stylistic control—to determine the optimal phase for introducing code-switching and paralinguistic features.2. The Data Curation Pipeline: Multimodal Transcription with Gemini 3.0The quality of an autoregressive TTS model is strictly bounded by the fidelity of its training data. In the context of podcast data, "fidelity" refers not just to audio quality, but to the precision of the textual transcription and its alignment with the acoustic reality. Standard Automatic Speech Recognition (ASR) systems typically normalize speech, removing "umms," "ahs," and stuttering to produce readable text. For a generative TTS aiming to sound human, this normalization is destructive; it erases the very features the model needs to learn.2.1. The Role of Gemini 3.0: Verbatim Stenography vs. Semantic ReasoningGemini 3.0, with its massive context window (up to 1 million tokens in Flash and Pro) and multimodal native architecture, offers a distinct advantage over traditional ASR models.2 It can process audio and text simultaneously, allowing for "context-aware transcription" where the model uses semantic cues to resolve acoustic ambiguities—a frequent occurrence in code-mixed Indic speech.2.1.1. Model Selection: Gemini 3.0 Pro vs. Flash for ASRThe choice between Gemini 3.0 Pro and Gemini 3.0 Flash is critical and depends on the specific stage of the data pipeline.Gemini 3.0 Pro: This model excels in complex reasoning and instruction following.4 Benchmarks indicate it significantly outperforms previous iterations in multimodal understanding and long-context retrieval.5 For the initial generation of "ground truth" transcripts, Gemini 3.0 Pro is the recommended engine. Its "Deep Think" capabilities allow it to better adhere to complex formatting constraints, such as identifying specific speakers in a multi-party podcast or distinguishing between a cough and a laugh based on context.6Gemini 3.0 Flash: While significantly faster and more cost-effective (processed at ~218 tokens/second), evaluations suggest that Flash models can suffer from "timestamp hallucination" and compression issues.7 Users have reported that Gemini 3.0 Flash may compress timestamps (e.g., mapping a 5-minute audio to a 3-minute timeline) or round them to the nearest second, lacking the centisecond precision required for TTS alignment.7 Therefore, Flash should be reserved for secondary metadata tasks—such as topic classification, sentiment analysis, or filtering non-speech segments—rather than the primary verbatim transcription task where temporal precision is paramount.2.1.2. Prompt Engineering for Verbatim Fidelity and Audio EventsTo extract a training-ready corpus from podcasts, the system prompt must override the LLM's natural tendency to summarize or clean up text. The goal is to generate a "screenplay" of the audio.System Instruction Design:The prompt must enforce a strict schema that captures three parallel streams of information: linguistic content, paralinguistic events, and speaker diarization.Verbatim Directive: "You are an expert acoustic phonetician. Transcribe the provided audio file with absolute verbatim accuracy. Do not summarize. Do not correct grammar. Include all disfluencies, filler words (e.g., 'umm', 'uh', 'like'), repetitions, and stutters exactly as spoken. If a speaker trails off, mark it with '...'." 8Audio Event Tagging: The prompt must explicitly define the set of allowed audio tags to prevent vocabulary explosion in the TTS model. Based on Inworld TTS-1's methodology, valid tags should include [laugh], [sigh], [breath], [cough], [throat_clear], and [yawn].1 The instruction should specify: "Insert non-verbal tags exactly where the sound occurs in the temporal flow, even if it interrupts a word. Do not place them at the end of the sentence unless the sound occurs there." 10Speaker Diarization: "Identify distinct speakers and label them as Speaker A, Speaker B, etc. If the speaker changes mid-sentence, insert the new speaker tag immediately." 11This rigorous prompting transforms the LLM from a passive transcriber into an active annotator, creating a dataset where every acoustic event has a corresponding textual token.2.2. The Code-Switching Conundrum: Script Strategies for Hinglish/TanglishA defining characteristic of the target data is code-mixing—the fluid alternation between English and Indic languages. Handling this textually is one of the most significant engineering decisions in the pipeline.2.2.1. The Failure of Pure RomanizationA common approach is to transliterate all Indic text into the Roman (Latin) script (e.g., writing Hindi as "Main ghar ja raha hoon"). While this simplifies the character set, research indicates it is detrimental to TTS quality.Phonetic Ambiguity: The Roman script lacks the graphemic density to represent Indic phonemes accurately. For instance, the dental 't' (त) and retroflex 't' (ट) in Hindi are distinct phonemes but are both mapped to 't' in Roman transliteration. A model trained on this ambiguous data often suffers from "accent drift," producing anglicized pronunciations of Indic words.12LID Confusion: Language Identification (LID) models struggle to distinguish between Romanized Hindi, Romanized Urdu, and English, leading to poor routing in multilingual systems.122.2.2. The Hybrid Script RecommendationTo achieve human-like prosody and accurate pronunciation, this report recommends a Hybrid Script Strategy:English words are transcribed in the Roman script.Indic words are transcribed in their Native script (Devanagari for Hindi, Tamil script for Tamil).Prompting Implementation:The system instruction to Gemini 3.0 must explicitly enforce this: "Transcribe English words in the Roman script. Transcribe Hindi/Tamil words in their native script. Do not transliterate Indic words into Roman script unless the speaker explicitly spells them out." 12Why this works:This strategy effectively provides the TTS model with an explicit, token-level Language ID. When the model encounters Devanagari tokens, it switches its internal acoustic priors to the Indic phoneme space; when it encounters Roman tokens, it switches to English. This preserves the native pronunciation of Indic content while maintaining the correct stress patterns for English loanwords, directly addressing the "schwa deletion" and stress mapping challenges inherent in Hinglish TTS.142.3. Temporal Alignment: The Two-Pass ProtocolWhile Gemini 3.0 provides text and approximate timestamps, its precision is often insufficient for cutting training data, which requires alignment at the phoneme or frame level (<20ms). "Flash" models, in particular, have shown regression in timestamp accuracy.7 Therefore, a Two-Pass Alignment Protocol is necessary.2.3.1. Pass 1: Semantic Transcription (Gemini 3.0 Pro)Gemini generates the verbatim, code-mixed, event-tagged transcript. This provides the "Ground Truth" text sequence.2.3.2. Pass 2: Forced Alignment (MFA / Wav2Vec2)The Gemini transcript and the raw audio are fed into a Forced Alignment tool to generate precise start and end times for every phoneme and word.Montreal Forced Aligner (MFA): This is the industry standard for alignment. For code-mixed data, MFA must be initialized with a merged pronunciation dictionary that combines the CMU Dict (for English) with an Indic Lexicon generated via Grapheme-to-Phoneme (G2P) rules.15 MFA uses GMM-HMM acoustic models, which are highly robust for boundary detection in clean speech.Wav2Vec2-based Alignment: For podcast data with significant background noise or music where MFA might fail, a fine-tuned Wav2Vec2 model (like IndicWav2Vec 16) using CTC-segmentation is recommended.17 These models are more robust to noise and can align text even in challenging acoustic environments.This two-pass approach leverages the reasoning of Gemini for what was said and the signal processing of MFA/Wav2Vec2 for when it was said, creating a dataset with high semantic and temporal fidelity.ComponentRoleRecommended ToolKey ConstraintTranscriptionText GenerationGemini 3.0 Pro"Screenplay" Prompt; Hybrid ScriptDiarizationSpeaker IDGemini 3.0 ProExplicit Turn-Taking TagsEvent DetectionParalinguisticsGemini 3.0 Pro + Audio ClassifierVerify tags with YAMNet/BEATsAlignmentTemporal SlicingMFA / Wav2Vec2Merged EN/Indic Dictionary3. Tokenizer Architecture: The Foundation of ExpressivityIn the modern "SpeechLM" paradigm, the continuous audio waveform is discretized into a sequence of tokens using a Neural Audio Codec. The quality and semantic richness of these tokens define the upper bound of the TTS model's performance. If the tokenizer compresses a "sigh" into generic noise tokens, the LLM will never learn to generate a sigh, regardless of how well it is prompted.3.1. Limitations of Standard TokenizersStandard neural codecs like EnCodec or SoundStream are typically trained with a reconstruction objective (minimizing Mean Squared Error and perceptual loss). While effective for compression, this approach is "semantically blind." It treats paralinguistic events (laughter, breath) and background noise equivalently, often resulting in poor representation of the emotional content essential for human-like speech.3.2. Supervised Multi-task Training (The CosyVoice 3 Methodology)CosyVoice 3 introduces a critical innovation: Supervised Multi-task Training (SMT) for the tokenizer.1 Instead of training solely on reconstruction, the tokenizer's encoder is trained with auxiliary prediction heads for semantic tasks.Recommended Tokenizer Architecture:Base: A quantized autoencoder (e.g., X-codec2 or HiFi-Codec) using Residual Vector Quantization (RVQ) or Finite Scalar Quantization (FSQ).1Auxiliary Losses: The encoder representations should be fed into auxiliary heads to predict:Automatic Speech Recognition (ASR): Forces the latent tokens to retain phonetic/linguistic content.Speech Emotion Recognition (SER): Forces the latent tokens to encode emotional states (e.g., Happy, Sad, Angry).1Audio Event Detection (AED): Forces specific tokens to represent non-verbal events like laughter or breathing.Speaker Verification: Preserves speaker identity information in the tokens.Impact:This training regime produces "semantic audio tokens." A specific token sequence becomes strongly correlated with "laughter" or "questioning intonation" because the encoder was explicitly penalized for losing that information. When the downstream LLM generates these tokens, it is invoking rich, pre-learned acoustic concepts rather than just raw waveforms. This is the "bridge" that allows text tags like [laugh] to translate effectively into acoustic laughter.4. Training Curriculum: Staging Capability and ControlThe creation of a robust, controllable TTS system is not a monolithic process but a staged curriculum. Drawing from the methodologies of Inworld TTS-1, FireRedTTS-2, and CosyVoice 3, we propose a three-phase training strategy: Pre-training, Post-training, and Alignment.4.1. Phase 1: Pre-training (Capacity Building)Objective: Build a robust "World Model" of speech acoustics, text-audio alignment, and multilingual phonetics.Data: The full 1M+ hour corpus derived from YouTube podcasts and other sources.1Filtering: Unlike traditional TTS which requires studio-quality data, this phase should utilize the noisy, natural podcast data. Crucially, do not filter out breathing, hesitation, or minor background noise. The model must learn the statistical distribution of real human speech, not just "reading voice".1Input/Output: The model (a Decoder-only Transformer, e.g., 1B-3B parameters) is trained on the Next-Token Prediction task.Input: Text Tokens (Hybrid Script) + Speaker Prompts.Target: Audio Tokens (from the SMT tokenizer).Role of Audio Tags: In this phase, implicit learning is prioritized. While audio tags ([laugh]) can be present in the text, the primary goal is for the model to learn the correlation between text semantics and acoustic prosody (e.g., that the text "That's hilarious!" often precedes the acoustic tokens for laughter).4.2. Phase 2: Post-Training / Instruction Tuning (The Control Layer)This phase answers the user's specific query about when to introduce features. The consensus from Inworld TTS-1 1 and FireRedTTS-2 1 is that explicit control mechanisms should be refined during Post-training or Supervised Fine-Tuning (SFT).4.2.1. Handling Dialogue: The Interleaved Format (FireRedTTS-2)To model the conversational flow of a podcast (turn-taking, interruptions), the data must be structured to preserve context. FireRedTTS-2 proposes the Text-Speech Interleaved Format 1:Format: