So, I am on a mission to create a human-quality TTS LLM. And for that we need data.
My main (and only) source of data is YouTube. I've segregated YouTube videos into podcasts and non-podcasts, per Indic language: Telugu, Hindi, Kannada, Tamil, etc. I assumed all podcasts are recorded on decent microphones and that at least 50% of them are natively recorded at 44 kHz or higher, possibly upsampled to 48 kHz. So assumption 1 is: podcasts = quality.
Hence I started by filtering YouTube channels for podcasts, then pulled the videos from those channels, and ran each channel's videos through Gemini to accurately identify which are actually usable podcast videos, since channels also carry plenty of unrelated uploads. Gemini-3-flash filtered out most of the noise, and now I have the filtered set of usable podcast videos.
The Indic audio files are saved in R2 as YoutubeID.webm. How did they get there? When I tried running downloads at scale in automation jobs, YouTube detected it and blocked me from pulling audio with yt-dlp. So here's what I did: I bought residential proxies, spun up CPU VPS machines, routed yt-dlp through the proxies, and extracted best audio with the format selector "bestaudio[ext=webm]/bestaudio[ext=opus]/bestaudio". Each resulting webm file was then pushed to Cloudflare R2 in my account, which is why everything sits there as videoID.webm.
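A minimal sketch of that download step, assuming the yt-dlp CLI; the proxy URL, output directory, and key scheme are placeholders, and the real automation jobs are more involved:

```python
# Sketch only: builds the yt-dlp invocation and the R2 object key used for
# one video. Proxy URL / paths are hypothetical placeholders.
def build_ytdlp_cmd(video_id: str, proxy_url: str, out_dir: str) -> list[str]:
    """yt-dlp command pulling best-quality webm/opus audio through a proxy."""
    return [
        "yt-dlp",
        "--proxy", proxy_url,  # residential proxy to avoid scale blocking
        "-f", "bestaudio[ext=webm]/bestaudio[ext=opus]/bestaudio",
        "-o", f"{out_dir}/%(id)s.%(ext)s",  # saves as <videoID>.webm
        f"https://www.youtube.com/watch?v={video_id}",
    ]

def r2_key(video_id: str, ext: str = "webm") -> str:
    """Object key used in the raw R2 bucket: <videoID>.webm."""
    return f"{video_id}.{ext}"
```

The command list would be handed to `subprocess.run`, and the resulting file uploaded to R2 via any S3-compatible client.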
Now I have the data to get started. Of course this podcast data isn't directly usable; we'll have to build a data-cleaning pipeline - Github Readme.md
This pipeline is heavily inspired by Kimi-Audio. - KimiAudio
So, workers will pick each YouTube ID.webm from Cloudflare, run the full single-speaker segmentation pipeline on GPU, and finally save the cleaned data as WebDataset .tar shards with the FLAC codec.
Each videoID.webm in the R2 test bucket will be orchestrated, cleaned, and pushed back as .tar files of FLAC segments. Buckets at - https://dash.cloudflare.com/cb908ed13329eb7b186e06ab51bda190/r2/overview
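A sketch of writing one videoID.tar with metadata plus FLAC segments using only the stdlib; the FLAC bytes are assumed to be encoded upstream (e.g. by ffmpeg), and the internal layout here is one possible convention:

```python
import io
import json
import tarfile

def write_video_tar(tar_path: str, video_id: str,
                    segments: list, metadata: dict) -> None:
    """Write one videoID.tar: metadata.json plus segments/<seg_id>.flac.
    `segments` is a list of (segment_id, flac_bytes) pairs; FLAC encoding
    is assumed to happen before this step."""
    with tarfile.open(tar_path, "w") as tar:
        meta = json.dumps(metadata, ensure_ascii=False).encode("utf-8")
        info = tarfile.TarInfo(name=f"{video_id}/metadata.json")
        info.size = len(meta)
        tar.addfile(info, io.BytesIO(meta))
        for seg_id, flac_bytes in segments:
            info = tarfile.TarInfo(name=f"{video_id}/segments/{seg_id}.flac")
            info.size = len(flac_bytes)
            tar.addfile(info, io.BytesIO(flac_bytes))
```

A real WebDataset reader only cares that files sharing a key prefix sit next to each other in the tar, so this layout stays compatible with `webdataset`-style loaders.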
PANN CNNs for music detection. Verified with a UI; should ideally be rechecked when generating the WebDatasets.
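A hedged sketch of what that WebDataset-time recheck could look like, assuming per-frame "Music" class probabilities from PANN (e.g. via `panns_inference`) are already stored in each segment's metadata; the thresholds are placeholders:

```python
def music_ratio(frame_scores: list, threshold: float = 0.5) -> float:
    """Fraction of frames whose PANN 'Music' probability exceeds the
    threshold. Segments above some ratio (e.g. 0.2) would be flagged
    or dropped during shard generation. Thresholds are assumptions."""
    if not frame_scores:
        return 0.0
    return sum(s > threshold for s in frame_scores) / len(frame_scores)
```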
Time to choose codecs: compare their architectural advantages and capacities, and decide which are suitable at our data scale and trainable on Indic data for good reconstruction. We will choose between codecs like XCodec 2.0, Mimi, SNAC, BiCodec, EnCodec, and similar families of codecs/vocoders, or, in simple words, audio tokenizers.
Cloudflare Bucket for cleaned single speaker segmented - https://dash.cloudflare.com/cb908ed13329eb7b186e06ab51bda190/r2/default/buckets/1-cleaned-data
Supabase holds the metadata for over 500k videos across 12 languages.
language breakdown with usable hours
| Language | Videos | Audio Hours | Usable Hours |
|---|---:|---:|---:|
| Telugu | 71,619 | 31,001.31 | 19,712.28 |
| Malayalam | 65,273 | 29,108.54 | 18,183.70 |
| English | 59,657 | 26,784.17 | 17,904.33 |
| Hindi | 58,824 | 29,646.26 | 17,709.61 |
| Punjabi | 55,761 | 27,515.76 | 18,064.84 |
| Tamil | 50,199 | 18,085.98 | 11,257.98 |
| Kannada | 45,689 | 21,857.12 | 13,578.46 |
| Gujarati | 40,950 | 25,087.45 | 14,759.11 |
| Bengali | 20,121 | 11,077.01 | 6,205.66 |
| Odia | 19,280 | 8,943.51 | 5,009.52 |
| Marathi | 16,221 | 7,980.74 | 5,130.47 |
| Assamese | 3,793 | 2,136.10 | 1,369.66 |

The dataset is stored as one videoID.tar per video inside the Cloudflare bucket named 1-cleaned-data; a few videos sit under a prefix in the same bucket.
Now it's time for one last round of validation checks (e.g. computing any missing metadata) and then creating shards of these datasets, while still retaining all pre-computed metadata right inside the buckets.
Repos to double-check for pending validations, requirements, or safety checks before packaging shards, uploading them back to R2, and pulling them down right before training:
https://github.com/zhenye234/X-Codec-2.0
https://github.com/inworld-ai/tts
Quick overview of data sourcing/prep so far
Summary until prep: Started with YouTube as the only source, classified channels/videos by language into podcast vs non-podcast, and used Gemini-3-flash to keep likely high-quality podcast audio under the assumption that podcasts are natively 44 kHz+.
At scale you download bestaudio through residential proxies/VPS, store raw videoID.webm in R2, and track metadata in Supabase.
GPU segmentation then produces single-speaker FLAC segments plus metadata.json (including music_stats from PANN) and stores each video as a tar in the cleaned-data R2 bucket.
Distributed shard workers claim video tasks with row-level locks and leases, download per-video tars, and generate deterministic shard IDs for idempotent retries.
Each segment is quality-scored and filtered (duration/clipping), optionally chunked with silence-aware boundaries, and resampled into 16/24/32/44 kHz buckets before writing WebDataset shards (.tar + .jsonl) with train/val manifests.
Shards are uploaded back to the destination R2 bucket, Supabase is updated with shard/worker heartbeats and stale-lease recovery, and you run final validation/metadata checks before packaging shards for codec selection and training.
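The deterministic shard IDs mentioned above can be as simple as hashing the sorted member list; a sketch (the naming scheme here is my assumption, not the pipeline's actual one):

```python
import hashlib

def shard_id(video_ids: list, sr_bucket: int, shard_index: int) -> str:
    """Deterministic shard name from sorted member video IDs, sample-rate
    bucket, and shard index, so a retried worker reproduces the same
    object key and re-uploads stay idempotent."""
    digest = hashlib.sha256("|".join(sorted(video_ids)).encode()).hexdigest()[:12]
    return f"{sr_bucket}khz-{shard_index:06d}-{digest}.tar"
```

Because the ID depends only on the sorted inputs, a worker that dies mid-upload and gets its lease reclaimed will regenerate the exact same key, and the overwrite is harmless.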
The codec data is now prepped as follows: drop all <3 s segments; pad all 3-6 s segments during training; pass 6-12 s segments straight through, with the pipeline picking a random 6 s window from that bucket; and split anything over 12 s so that every final cut still yields at least a full 6 s segment. I didn't want to waste the long segments: since we only use 6 s from any audio anyway, we might as well harvest the extra 6 s windows, which also makes up for the dropped <3 s segments.
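The duration policy above as a small pure function; the bucket edges come from the text, and the chunk-count rule for >12 s is my reading of "at least 6 s in the final cuts":

```python
def plan_segment(duration_s: float, target_s: float = 6.0,
                 min_s: float = 3.0, max_s: float = 12.0) -> dict:
    """Duration policy: drop <3s, pad 3-6s, random 6s window for 6-12s,
    and split >12s into equal chunks that each still contain a 6s crop."""
    if duration_s < min_s:
        return {"action": "drop"}
    if duration_s < target_s:
        return {"action": "pad"}
    if duration_s <= max_s:
        return {"action": "random_window"}
    # e.g. 20s -> 3 chunks of ~6.67s, each long enough for a 6s crop
    n_chunks = int(duration_s // target_s)
    return {"action": "split", "n_chunks": n_chunks}
```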
The segments are also split into buckets by effective (Nyquist-based) sample rate. Everything gets downsampled to 16 kHz for the 16 kHz bucket, and the rest are sorted by where at least 85% of their spectral energy sits, aggregated downward: the 24 kHz bucket contains downsampled 32/44 kHz audio, the 32 kHz bucket contains downsampled 44 kHz (or higher) audio, and the 44 kHz bucket holds native 44 kHz only. They are laid out in R2 that way.
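One way to implement the 85%-energy sort is a spectral-rolloff estimate of the effective sample rate; this is a sketch of the heuristic, and the bucket set and thresholds are assumptions:

```python
import numpy as np

def effective_sample_rate(audio: np.ndarray, sr: int,
                          energy_frac: float = 0.85) -> float:
    """Estimate the 'real' rate of possibly-upsampled audio: find the
    frequency below which `energy_frac` of the spectral energy sits,
    and treat 2x that (Nyquist) as the effective sample rate."""
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    cum = np.cumsum(spectrum) / np.sum(spectrum)
    rolloff = freqs[np.searchsorted(cum, energy_frac)]
    return 2.0 * rolloff

def assign_bucket(eff_sr: float, buckets=(16000, 24000, 32000, 44100)) -> int:
    """Highest bucket the audio genuinely supports; every clip also gets
    downsampled copies in the lower buckets (e.g. everything feeds 16k)."""
    eligible = [b for b in buckets if b <= eff_sr]
    return max(eligible) if eligible else buckets[0]
```

In practice you would compute the rolloff per segment (or averaged over frames) and write the result into metadata.json so shard workers can route segments without re-reading the audio.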
While I have specific strategies for codec training, I want to address the lack of transcripts, and not just for the codec-prep segments above. I'll go back to the original videoID.tar stage and get transcripts for all samples, even the 1 s segments. So let's plan for transcripts now. My data in R2 is videoID.tar; inside each tar is a metadata.json and a segments/ folder holding all the cleaned, VAD-aware-cut, music-removed, single-speaker segments, ranging from about 1 s up to 60 s or so.
Now let's discuss strategies and dos and don'ts for transcribing these audio segments. So far I've tried open-source models like whisper-large-v3, indic-whisper, IndicConformer, and others, and even LLMs like Gemma3n-4B and Voxtral-24B, but none gave me <5 WER, or even <10 WER, consistently across all languages. So I've decided to use gemini-2.5-flash or gemini-3-flash (falling back to gemini-3-pro when confidence in the flash output is low), combined with a forced aligner; if the aligner's confidence score is low, we ask pro to decide. That way we can transcribe all the audio. Now I need your suggestions, inputs, and strategy on getting this done. Are forced aligners designed per language? Should I train my own Indic forced aligners? Or is there any other way to validate whether gemini-3-flash transcripts are accurate? What can we put in the prompt or instructions to stop the big LLMs from hallucinating? I can set temperature to 0.0, but that doesn't make the model non-creative, right? It still tends to be creative, and creative transcripts are suitable for neither an ASR model nor a TTS model.
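One validation path that needs no gold transcripts: run an independent ASR (e.g. whisper-large-v3) on the same segment and gate on cross-model agreement; segments where the two disagree get escalated to the pro model or dropped. A self-contained WER helper and gate (the 10% agreement budget is a placeholder):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein edit distance over whitespace tokens."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

def agreement_gate(gemini_txt: str, asr_txt: str, max_wer: float = 0.10) -> str:
    """'accept' when two independent transcribers agree within budget,
    else 'escalate' (e.g. to gemini-3-pro or a human spot-check)."""
    return "accept" if wer(gemini_txt, asr_txt) <= max_wer else "escalate"
```

Note that disagreement doesn't tell you which model is wrong, only that the segment is risky; for Indic scripts a character-level version (CER) of the same distance is usually a better gate than word-level WER.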
So now let's strategize on how to stop these big LLMs from being creative and make them transcribe verbatim, so the output suits both a TTS and an ASR system. I've also been thinking about which forms of transcripts to request from these superior models. For example, should I get Devanagari script where applicable, and Dravidian scripts for languages like Tamil and Telugu? If I do, how do I support text like Hinglish in some places? "are bhaai, kya kar rhe ho?" is not English but Hinglish (code-mixed), so how do I manage that? And is there a way my speech LLM can understand the semantics of "are bhaai, kya kar rhe ho?" versus "arey annai, em chestunnav?", which should be spoken in Telugu? Is there a way to specify these languages to the model? If the model can't decide on its own that those sentences aren't English, can we at least pass language codes like "te_en" for something like "arey annai, em chestunnav?"? Fine if so, but what about scenarios like "are ala kadu, denni hindi lo 'मैं सेब खाता हूँ' antaru", or even "అరే అలా కాదు, దీన్ని హిందీలో 'मैं सेब खाता हूँ' అంటారు."? How do I handle code-mixing across native scripts too, not just Romanized text?
I want answers and justifications for all the above scenarios, and how the big labs handle them effectively. It can't just be a pre-processing script, because these conversions can't be done with regex. I can write a very complex regex to convert "12,345" into "twelve thousand three hundred and forty five", but I can't do Roman-to-Devanagari and similar conversions with regex without changing the semantics of the sentence. I want the model to be as flexible and robust as possible, leaving only the easily handleable regex-style cases to pre-processing.
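The "12,345" case doesn't actually need regex gymnastics; a tiny expander covers it (English-only sketch, and lakh/crore phrasing plus Indic-script variants would need per-language rules):

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n: int) -> str:
    if n < 20:
        return ONES[n]
    return (TENS[n // 10] + (" " + ONES[n % 10] if n % 10 else "")).strip()

def number_to_words(n: int) -> str:
    """English expansion for 0..99,999, enough for the '12,345' example."""
    if n == 0:
        return "zero"
    parts = []
    if n >= 1000:
        parts.append(two_digits(n // 1000) + " thousand")
        n %= 1000
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n:
        parts.append(("and " if parts else "") + two_digits(n))
    return " ".join(parts)
```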
There's also a special feature I'd like to train in: audio events and emotions directly in the model, like [laugh], [laugh_harder], [cough], [giggle], [sad], [happy], etc. A limited set, but one that lets me control audio events and makes the model more semantically correct. This is very complex, but since we're getting transcripts anyway, I'd like to capture this format too; ideas on this welcome. There will be false positives, where the model emits an audio tag with no actual event, since hallucination is an issue. Do you suggest filtering those later, and for now, since we're transcribing anyway, just collecting transcriptions in all possible formats? I want your strategy below: how do I get these transcriptions in a way that maximizes the experiments, outcomes, and results I can explore?
I need solid solutions: tell me how to get the transcriptions and which forms I need. I can ask these big LLMs for different structured output forms, like:
{
  "verbatim": everything in native script, except numbers, symbols, etc.,
  "roman_normalized": everything in Roman, normalized,
  "verbatim_emotions": verbatim plus emotional audio tags; limit these to ~10 stable tags so the model only picks from that set and isn't vague, and keep the tags in English, since per-language tags would multiply and reduce controllability
}
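To make the three-field format enforceable, a sketch of a record validator with a closed audio-tag vocabulary; the tag set below is illustrative, not the final ~10 tags:

```python
import re

# Illustrative closed vocabulary -- the final stable tag set is a design choice.
ALLOWED_TAGS = {"laugh", "giggle", "cough", "sigh", "cry", "breath",
                "pause", "hum", "clear_throat", "whisper"}

TAG_RE = re.compile(r"\[([a-z_]+)\]")

def validate_record(rec: dict) -> list:
    """Return a list of problems with one structured transcript record;
    an empty list means the record passes. Rejecting out-of-vocabulary
    tags keeps hallucinated audio events out of the dataset."""
    problems = []
    for key in ("verbatim", "roman_normalized", "verbatim_emotions"):
        if not rec.get(key, "").strip():
            problems.append(f"missing field: {key}")
    for tag in TAG_RE.findall(rec.get("verbatim_emotions", "")):
        if tag not in ALLOWED_TAGS:
            problems.append(f"unknown audio tag: [{tag}]")
    return problems
```

Running this over every LLM response (and retrying or escalating failures) gives a cheap first line of defense before any forced-aligner or cross-ASR check.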
I need your proper planning and strategy for infusing the above capabilities into the model. Give opinions on all my questions about the transcription strategy, and on when to fuse each capability: will the model be robust enough to pick up emotion understanding in post-training, or is it better to seed it in the SFT data and then post-train with emotion-specific data? I'm open to experiments, so I'd like to collect the data all at once, and the script-handling scenarios above have to be robust too.
Here are a few models that inspired what I want to train, so you can see what kind of TTS I'm building and which training architectures I'll follow.
| Feature | CosyVoice 3 | Inworld TTS-1 | FireRedTTS-2 |
|---|---|---|---|
| Total Audio Data | ~1 million hours | ~1 million hours | ~1.4 million hours |
| Pre-training Data | • 1M hours of multilingual data (9 languages + 18 Chinese dialects)<br>• Tokenizer data: 530k hours supervised multi-task (ASR, LID, SER, AED, SA) | • 1M hours raw audio + 30k hours non-speech/environmental<br>• Text data: 10% mix (RedPajama-v2) + instruction data (LAION OIG)<br>• Bootstrap: ~15k hours high-quality subset mixed in | • 1.1M hours of monologue speech for foundational text-to-speech ability<br>• Tokenizer data: 500k hours (Stage 1) + 60k hours high-fidelity (Stage 2) |
| SFT Data | • "Selected data" for specific tasks<br>• Instruction data: 5,000 hours of high-quality emotion, style, and role-play data<br>• Multilingual: studio-quality monolingual data with language instructions | • 200,000 hours of high-quality, transcribed audio-text pairs<br>• Filtering: top 80% DNSMOS, filtered for talking speed (CPS)<br>• Markup data: 100k examples (~180 hrs) of neutral/stylized pairs for LoRA | Task-specific (minimal):<br>• Podcast: ~50 hours of 2-speaker dialogue<br>• Emotional chat: ~15 hours of expressive speech |
| Post-training / Dialogue Data | • Continual pretraining: emotional, instructed, and multilingual data<br>• Reinforcement learning: "selected data" for DiffRO to maximize rewards | • RL alignment: 1,000-hour English subset used for Group Relative Policy Optimization (GRPO) experiments | • 300,000 hours of multi-speaker dialogue (2–5 speakers per session) used in post-training to enable conversational ability |
| Training Strategies | • DiffRO: Differentiable Reward Optimization uses Gumbel-Softmax to optimize tokens against rewards (ASR, SER)<br>• Pronunciation inpainting: mixes phonemes/words to fix polyphonic errors<br>• Supervised tokenizer: explicitly trained on 5 semantic tasks | • Three-stage pipeline: pre-training → SFT (from audio-pretrained checkpoint) → RL alignment<br>• RL alignment: optimizes a composite reward (WER + similarity + DNSMOS) via GRPO<br>• RMS loss added to codec training for volume consistency | • Curriculum learning: monologue pre-training → dialogue post-training → task SFT<br>• Interleaved format: concatenates [Speaker] Text → Audio sequences<br>• Dual-transformer: large backbone predicts layer 1; small decoder predicts layers 2–16 |
| Training Stages | 1. Tokenizer training (supervised)<br>2. Large-scale pretraining (all data)<br>3. Post-training (DiffRO)<br>4. Continual pretraining (capability transfer)<br>5. Speaker fine-tuning (SFT) | 1. Pre-training (next-token prediction)<br>2. Supervised fine-tuning (SFT)<br>3. RL alignment (GRPO)<br>4. LoRA fine-tuning (audio markups) | 1. Pretraining (2 epochs, monologue)<br>2. Post-training (5 epochs, dialogue)<br>3. SFT (target-speaker adaptation) |
| Hyperparameters | • Model: 1.5B LLM, 300M CFM (DiT)<br>• Tokenizer: 25 Hz frame rate<br>• Quantization: Finite Scalar Quantization (FSQ) | • Model: TTS-1 (1.6B), TTS-1-Max (8.8B)<br>• Pre-train: LR 1.5e-4, batch ~2M tokens, AdamW<br>• SFT: LR 1.5e-5, cosine decay<br>• RL: α = β = γ = 1.0 (reward weights)<br>• Codec: 50 Hz, 65,536 vocab | • Model: Qwen2.5-based dual-transformer<br>• Tokenizer: 12.5 Hz frame rate, 16 RVQ layers (2048 entries)<br>• Loss weights: λ_text = 0.01, λ_decoder = 0.6<br>• Latency: first packet < 100 ms |
| Key Innovation | DiffRO (Differentiable Reward Optimization): directly optimizes discrete speech tokens for specific objectives without traditional RL loops | RL alignment (GRPO): adapts LLM alignment techniques to speech using composite rewards | 12.5 Hz tokenizer: extremely low frame rate compresses context, allowing stable long-form dialogue modeling |