

# Strategic Plan for Building a Human-Like TTS LLM from YouTube Data

## 1. Transcription Strategy: Overcoming the Quality Bottleneck at Scale

The foundational challenge in developing a high-fidelity Text-to-Speech (TTS) model, particularly one based on a Large Language Model (LLM) architecture, is the acquisition of a massive, high-quality, and accurately transcribed audio-text dataset. The project's initial strategy relied on sourcing this data from YouTube, a vast and diverse repository of speech. However, the subsequent attempt to transcribe this "in-the-wild" audio using a variety of state-of-the-art Automatic Speech Recognition (ASR) models and Large Language Models (LLMs) has proven to be a significant bottleneck. The core issue lies in the inability of these models to consistently achieve the required transcription accuracy, specifically a **Word Error Rate (WER) below 10%**, which is a critical threshold for producing training data suitable for a human-like TTS system. This section details the systematic failures of conventional transcription approaches and proposes a novel, multi-pronged strategy to overcome this obstacle. The proposed solution pivots from a direct transcription paradigm to a more robust, quality-centric approach that leverages Generative Speech Enhancement (GSE) models for both audio cleaning and, crucially, for generating confidence scores that can be used to filter the dataset. This method allows for the curation of a high-quality subset of audio data even in the absence of perfect ground-truth transcripts. Furthermore, this section explores alternative methods for acquiring transcriptions, such as leveraging pre-existing YouTube Closed Captions (CCs), and outlines a comprehensive strategy for transcript normalization, including the handling of code-mixed languages, multiple scripts, and the integration of semantic audio events.

### 1.1. Core Challenge: Inadequate Accuracy from Standard ASR Models

The initial phase of the project involved a systematic evaluation of various ASR and LLM-based transcription models to process the vast corpus of audio data extracted from YouTube. The goal was to generate accurate text-audio pairs for the subsequent TTS model training. Despite the sophistication of the models tested, none were able to meet the stringent accuracy requirements necessary for high-quality TTS synthesis. The failure was consistent across different model families, including open-source ASR systems, proprietary large language models, and models specifically fine-tuned for Indic languages. This widespread underperformance highlights the inherent difficulty of transcribing "in-the-wild" audio, which is characterized by a wide range of acoustic conditions, background noise, speaker accents, and spontaneous speech patterns that are not well-represented in the training data of most standard ASR systems. The inability to secure reliable transcriptions at scale has necessitated a fundamental rethinking of the data curation strategy, moving away from a reliance on perfect transcripts and towards a methodology that can identify and select high-quality audio based on intrinsic, model-derived quality metrics. This pivot is critical for unlocking the value of the massive YouTube dataset and ensuring that the TTS model is trained on data that is not only large in volume but also high in fidelity.

#### 1.1.1. Failure of Open-Source Models (Whisper, Indic-Whisper, IndicConformer)

The first line of attack in the transcription effort involved a suite of powerful open-source ASR models. The team evaluated **Whisper-large-v3**, a state-of-the-art multilingual model known for its robustness. However, when applied to the diverse and often noisy YouTube audio, its performance fell short of the required **<10% WER benchmark**. Recognizing the specific linguistic challenges of the target dataset, the team then turned to specialized models. **Indic-Whisper**, a variant of Whisper specifically fine-tuned on a variety of Indic languages, was tested with the expectation that its specialized training would yield better results. Similarly, **IndicConformer**, another model designed for the Indian linguistic landscape, was also put through its paces. Despite their specialized nature, both models exhibited the same fundamental limitation: they were unable to consistently and accurately transcribe the long-tail of acoustic phenomena present in the YouTube data. The performance degradation was particularly noticeable in segments with background music, overlapping speech, or non-native accents, which are common in podcast-style content. The consistent failure of these models, even those tailored for the target languages, underscores the gap between the controlled conditions of academic benchmarks and the chaotic reality of "in-the-wild" internet audio. This experience demonstrated that simply selecting a better open-source model was not a viable path forward and that a more fundamental change in approach was required.

#### 1.1.2. Failure of Large Language Models (Gemini, Gemma, Voxtral)

With the limitations of traditional ASR models becoming apparent, the project explored the potential of Large Language Models (LLMs) for the transcription task. The hypothesis was that the vast world knowledge and powerful reasoning capabilities of modern LLMs could be leveraged to "understand" the context of speech and produce more accurate transcriptions, especially in challenging acoustic scenarios. The team experimented with prompting strategies for models like **Gemini-2.5-flash**, **Gemini-3-flash**, and **Gemini-3-pro**, providing detailed instructions to guide the model towards verbatim transcription. However, even with temperature set to 0.0 to minimize creativity, the models consistently failed to produce the required level of accuracy. The LLMs exhibited a tendency to **"hallucinate"** or creatively interpret the audio, adding, removing, or altering words in a way that, while often semantically plausible, was not a faithful transcription of the source audio. This behavior is particularly problematic for TTS training, where the alignment between the audio and the exact text is paramount. The models also struggled with code-mixed language (e.g., Hinglish), often failing to recognize the language switch or transcribing it incorrectly. The failure of these powerful LLMs revealed a critical insight: their training objectives, which prioritize fluent and coherent text generation, are fundamentally misaligned with the task of verbatim transcription. This misalignment makes them unsuitable as a primary tool for generating the high-fidelity text-audio pairs needed for TTS model training, forcing the project to seek alternative methods that are more robust to the challenges of "in-the-wild" audio.

#### 1.1.3. Failure of Specialized Models (wav2vec 2.0, w2vindia)

The final category of models evaluated comprised those based on the **wav2vec 2.0** architecture, which has set new standards in self-supervised speech representation learning. The **`w2vindia`** model, a variant specifically pre-trained and fine-tuned on a large corpus of Indian language speech data, was a key candidate. The expectation was that its specialized training would make it the most suitable model for the task. However, even `w2vindia` failed to consistently achieve a WER below 10%. The model's performance, while better than some of the other candidates on certain language subsets, was still not reliable enough for the rigorous demands of TTS dataset creation. The high error rate suggests that the model's training data may not have fully captured the acoustic and linguistic variability present in the YouTube corpus, or that the fine-tuning process was not optimized for the specific characteristics of the audio being processed. Research on similar models, such as `IndicWav2Vec`, has shown that while they can achieve state-of-the-art results on specific benchmarks, their performance can vary significantly across different languages and domains. The failure of even this highly specialized model underscores the extreme difficulty of the transcription task and highlights the need for a fundamentally different approach. It became clear that relying on a single, off-the-shelf ASR model to transcribe the entire 50,000-hour dataset with the required accuracy was not a viable path forward.

### 1.2. Proposed Solution: Confidence-Based Filtering with Generative Speech Enhancement (GSE)

Faced with the inability of standard ASR and LLM models to provide accurate transcriptions, the project has identified a promising alternative strategy centered on **Generative Speech Enhancement (GSE)**. This approach represents a paradigm shift from the traditional "transcribe everything" methodology to a more nuanced **"curate for quality"** philosophy. The core idea is to leverage a GSE model not just as a tool for cleaning noisy audio, but as a sophisticated quality assessment engine. GSE models, which are trained to generate clean speech from noisy inputs, are known to be prone to "hallucination" errors—subtle but critical mistakes where the model omits phonemes, alters speaker characteristics, or introduces artifacts. A 2026 paper on confidence-based filtering for speech dataset curation proposes a method to detect these errors by using the model's own internal confidence scores. This technique allows for the non-intrusive filtering of a large "in-the-wild" dataset, effectively identifying and retaining only the audio segments that the GSE model processes with high confidence. By applying this method, the project can create a high-quality subset of the YouTube data that is free from the most egregious enhancement artifacts, providing a solid foundation for training the TTS model, even without perfect transcriptions for every segment. This strategy is particularly well-suited for the project's goal of training a neural codec, as the codec can be trained on this clean, high-confidence audio, and the subsequent TTS model can be trained to generate the discrete tokens produced by this codec.

#### 1.2.1. Concept: Leveraging GSE for Both Enhancement and Quality Scoring

The proposed solution is built on the dual-purpose utility of Generative Speech Enhancement (GSE) models. These models are typically used for their primary function: to take a noisy audio signal, such as a recording from a YouTube video, and generate a cleaner, higher-quality version of the speech. This is particularly valuable for the project's "in-the-wild" dataset, which contains a wide variety of background noises, reverberations, and other acoustic distortions. However, the key innovation lies in the second, more subtle application of these models: as a quality scoring mechanism. The 2026 paper "Confidence-based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens" details a method that uses the internal state of a GSE model to assess the quality of its own output. The core insight is that when a GSE model is presented with a particularly challenging or noisy input, it may produce a clean-sounding output that is nonetheless flawed in terms of its content. These **"hallucination" errors**, such as phoneme omissions or speaker inconsistencies, are often missed by conventional non-intrusive quality metrics like DNSMOS or UTMOS, which focus more on the overall acoustic quality rather than the semantic fidelity of the speech. By analyzing the model's internal confidence, it becomes possible to detect these subtle but critical errors, providing a much more reliable signal for filtering the dataset. This dual approach allows the project to both clean the audio and simultaneously assess the trustworthiness of the cleaned output, creating a curated dataset that is not only acoustically clean but also semantically reliable.

#### 1.2.2. Implementation: Using Log-Probabilities of Discrete Tokens as Confidence Scores

The technical implementation of the confidence-based filtering strategy relies on the use of discrete token-based GSE models, such as the **Genhancer** model mentioned in the research. These models operate by first encoding the audio into a sequence of discrete tokens using a neural audio codec (like the Descript Audio Codec, DAC). The GSE model then processes these tokens to generate a new, enhanced sequence of tokens, which is finally decoded back into a clean audio waveform. The key to the quality scoring mechanism lies in the probabilities associated with the generation of these discrete tokens. The proposed method defines a token-level confidence score, `s_t`, for each time step `t` as the log-probability of the generated token from the first quantizer layer of the codec. This layer is chosen because it has the greatest perceptual impact on the final audio quality. The formula for this token-level confidence is given as:

`s_t = log p(x_t,1 = x̂_t,1 | c; θ)`

where `x̂_t,1` is the token the model actually generated at step `t`, `c` is the conditioning information (the noisy audio), and `θ` represents the model's parameters.

To obtain a single quality score for an entire utterance, these token-level scores are averaged over the sequence length `T` to produce an utterance-level confidence score, `S_utt`:

`S_utt = (1/T) * Σ(s_t)`

This final score, `S_utt`, serves as a non-intrusive quality metric. A high score indicates that the model generated the enhanced speech with high confidence, suggesting a successful enhancement. Conversely, a low score implies that the model struggled with the input, flagging the output as a potential enhancement failure or hallucination. This score can then be used to filter the dataset, retaining only the utterances that exceed a certain confidence threshold, thereby curating a high-quality corpus for TTS training.
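
As a minimal sketch of this scoring scheme, assuming access to the per-step logits of the GSE model's first-quantizer token predictions (real codec-LM implementations expose these in different ways), the two formulas above can be computed as:

```python
import math

def utterance_confidence(logits_per_step, generated_tokens):
    """Compute token-level scores s_t and the utterance-level score S_utt.

    logits_per_step: list of per-step logit lists over the codec vocabulary
                     (hypothetical output of a discrete-token GSE model).
    generated_tokens: index of the token the model emitted at each step.
    Each s_t = log p(x_t,1 = x̂_t,1 | c; θ), i.e. the log-softmax value of
    the chosen token; S_utt is the mean of s_t over the T steps.
    """
    token_scores = []
    for logits, tok in zip(logits_per_step, generated_tokens):
        # log-softmax: log p = logit - log(sum(exp(logits))), with the usual
        # max-subtraction for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        token_scores.append(logits[tok] - log_z)
    s_utt = sum(token_scores) / len(token_scores)  # average over T
    return token_scores, s_utt
```

With a uniform two-way distribution at every step, each `s_t` is `log(0.5)`, so `S_utt` is `log(0.5)` as well, which matches the averaging definition above.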

#### 1.2.3. Application: Filtering the "In-the-Wild" YouTube Dataset to Create a High-Quality Subset

The application of this confidence-based filtering method to the project's massive YouTube dataset offers a clear and actionable path forward. The process would involve a two-stage pipeline. In the first stage, a pre-trained GSE model, such as Genhancer, would be applied to every audio segment in the cleaned dataset. For each segment, the model would produce an enhanced audio file and its corresponding utterance-level confidence score, `S_utt`. In the second stage, a filtering threshold, `τ`, would be determined. This threshold could be set empirically, for example, by retaining the top N% of utterances based on the distribution of confidence scores across the entire dataset. All enhanced audio segments with a confidence score `S_utt` below this threshold `τ` would be discarded. The result would be a curated dataset, `D_curated`, containing only the high-confidence, high-quality audio:

`D_curated = {w_enhanced ∈ D_enhanced | S_utt ≥ τ}`

The research demonstrates the practical utility of this method. In their experiments, they curated an "in-the-wild" TTS dataset (TITW-hard, sourced from VoxCeleb) using this confidence-based filtering and showed that it improved the performance of a subsequently trained TTS model (Matcha-TTS). The confidence score was shown to have a strong correlation with a suite of intrusive speech enhancement metrics, and it was particularly effective at identifying hallucination errors that were missed by other non-intrusive metrics like UTMOS. For instance, in one example, an enhanced speech sample with a high UTMOS score of 4.01 (top 39%) was found to have significant content corruption. In contrast, the proposed confidence score and the intrusive LPS metric both gave low scores, placing the sample in the top 93% and 81% respectively, correctly identifying it as a failure case. This demonstrates that the confidence-based filtering can effectively and non-intrusively identify and remove low-quality data, making it an ideal solution for curating the project's large-scale YouTube dataset.
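
The thresholding step described above can be sketched as follows; the record layout and the `keep_fraction` parameter are illustrative, and in practice `τ` would be tuned against held-out quality metrics rather than fixed a priori:

```python
def curate_by_confidence(scored_utterances, keep_fraction=0.5):
    """Retain the top fraction of utterances by S_utt.

    scored_utterances: list of dicts, each carrying an "s_utt" score
    (field name is illustrative). The empirical threshold τ is set so
    that roughly keep_fraction of the data survives, implementing
    D_curated = {w ∈ D_enhanced | S_utt ≥ τ}.
    """
    ranked = sorted(scored_utterances, key=lambda u: u["s_utt"], reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    tau = ranked[k - 1]["s_utt"]  # score of the k-th best utterance
    curated = [u for u in ranked if u["s_utt"] >= tau]
    return curated, tau
```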

### 1.3. Alternative Transcription Sources

While the primary strategy for data curation has shifted towards confidence-based filtering of audio, the project also recognizes the value of exploring alternative sources for acquiring accurate transcriptions. The ideal scenario for training a high-quality TTS model remains the availability of large volumes of perfectly aligned, high-fidelity audio-text pairs. Although the project's own ASR efforts have fallen short, there are other potential avenues for obtaining reliable text transcripts that should be investigated. These alternative methods could serve as a valuable supplement to the confidence-filtered audio, providing a smaller but potentially higher-quality set of transcribed data for specific use cases, such as supervised fine-tuning (SFT) or for training a more traditional ASR model to eventually bootstrap the transcription process. The two most promising avenues are the direct utilization of pre-existing YouTube Closed Captions (CCs) and the alignment of the audio with external text corpora, such as Wikipedia articles. These methods, while not without their own challenges, offer a different path to acquiring the necessary text data and could be particularly effective for a subset of the videos that meet specific criteria.

#### 1.3.1. Utilizing Pre-Existing YouTube Closed Captions (CCs)

One of the most direct methods for obtaining transcriptions is to leverage the Closed Captions (CCs) that are already available on many YouTube videos. These captions come in two main forms: **manually uploaded subtitles**, which are often of high quality, and **auto-generated captions**, which are created by YouTube's own ASR system. While the quality of auto-generated captions can be variable, they can still serve as a useful starting point, especially if they can be further processed or filtered. The project `YTTTS` (YouTube Text-To-Speech dataset) provides a clear precedent for this approach, demonstrating a pipeline that downloads both audio and captions from YouTube videos and then aligns them by parsing the `.srt` subtitle files. This method has the significant advantage of providing a direct, time-aligned text-audio pair, bypassing the need for a separate, error-prone transcription step.

However, this approach is not without its challenges. Firstly, not all videos have captions, and the availability of captions in the target Indic languages may be limited. Secondly, the accuracy of auto-generated captions, particularly for the long-tail of accents and dialects present in the dataset, is likely to be insufficient for direct use in TTS training. A 2021 article on preparing speech recognition datasets from YouTube notes that while using speech-to-text APIs can reduce the workload, manual editing is still required to produce a high-quality dataset. Despite these limitations, targeting videos with high-quality, manually created captions could provide a valuable, albeit smaller, subset of pristine data. Furthermore, even the imperfect auto-generated captions could be used as a weak supervision signal or as a starting point for a human-in-the-loop correction process. The project `yt-tts` also emphasizes the importance of choosing videos with subtitles, further validating this as a viable, if partial, solution. Therefore, a key part of the data strategy should be to identify and prioritize the download of videos that have reliable captions, using them to build a "gold standard" subset of the training data.
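
Since `.srt` files follow a simple, well-specified layout (a cue index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then the caption text), the caption-alignment extraction used by pipelines like `YTTTS` can be sketched with the standard library alone:

```python
import re

def parse_srt(srt_text):
    """Parse .srt caption text into (start_sec, end_sec, text) segments."""
    ts = r"(\d{2}):(\d{2}):(\d{2}),(\d{3})"
    segments = []
    # cues are separated by blank lines
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # need index, timing line, and at least one text line
        m = re.match(ts + r"\s*-->\s*" + ts, lines[1])
        if not m:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        segments.append((start, end, " ".join(lines[2:])))
    return segments
```

Each returned tuple gives the time window that the caption text covers, which is exactly the alignment needed to slice audio into training segments.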

#### 1.3.2. Aligning Audio with External Text Corpora (e.g., Wikipedia)

A more advanced and potentially more fruitful strategy is to align the YouTube audio with external, high-quality text corpora. The central hypothesis is that a large number of YouTube videos, especially those in the "educational" or "informative" podcast category, are based on or directly read from existing written content. **Wikipedia**, with its vast repository of articles in multiple Indic languages, is a prime candidate for this source text. The process of aligning audio with a text corpus like Wikipedia is complex but has been successfully demonstrated in academic research. The first step would be to identify the most likely source article for a given video. This could be done by extracting keywords from the video's title and description, and then using a search engine or a direct query against a Wikipedia dump to find the most relevant articles. Once a candidate article is identified, the next step is to perform a large-scale forced alignment between the audio and the text. This would involve using a phonetic recognizer, such as one built with the **Montreal Forced Aligner (MFA)** or a similar tool, to find the best possible match between the audio and the text. This process would need to be robust enough to handle minor deviations, paraphrasing, and omissions that are common when a speaker is not reading verbatim. Research from the IIIT-H Indic Speech Databases project has shown that it is feasible to create high-quality speech datasets by having speakers read Wikipedia articles in a studio environment. The proposed strategy is to reverse this process: given the audio, find the text that the speaker is most likely reading. This approach, if successful, would provide a rich source of high-quality, verified transcripts, significantly improving the quality of the training data.
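
As a rough stand-in for the article-retrieval step, candidate selection might look like the sketch below. The `SequenceMatcher`-based scoring is a deliberately crude similarity proxy; a production system would use proper information retrieval over a Wikipedia dump and then hand the winner to a forced aligner such as MFA for the fine alignment:

```python
from difflib import SequenceMatcher

def best_source_article(rough_transcript, candidate_articles):
    """Rank candidate articles by similarity to a rough first-pass transcript.

    rough_transcript: noisy ASR output for the video (errors are tolerable,
    since we only need a relative ranking). candidate_articles: list of dicts
    with "title" and "text" keys (field names are illustrative).
    """
    def score(article):
        return SequenceMatcher(
            None, rough_transcript.lower(), article["text"].lower()
        ).ratio()
    return max(candidate_articles, key=score)
```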

### 1.4. Transcript Format and Normalization Strategy

A critical, yet often overlooked, aspect of building a large-scale TTS system is the standardization and normalization of the text transcripts. For a model trained on data from a linguistically diverse region like India, this is not a trivial task. The transcripts must be able to accurately represent a wide range of linguistic phenomena, including code-mixed language (e.g., Hinglish), the use of multiple scripts (e.g., Devanagari for Hindi, Dravidian scripts for Tamil and Telugu, and Roman script for English), and the presence of non-speech audio events like laughter or coughing. The user's query highlights the complexity of these challenges, questioning how to handle sentences that mix languages and scripts, such as "are ala kadu, denni hindi lo ‘मैं सेब खाता हूँ’ antaru" (a Telugu sentence that embeds the Hindi phrase for "I eat an apple"). A robust strategy for transcript normalization is essential for ensuring that the TTS model can learn to speak with the natural fluency and expressiveness of a human, including the ability to handle these complex linguistic scenarios. This involves making deliberate decisions about the target output format, the handling of different scripts, and the inclusion of semantic audio tags.

#### 1.4.1. Handling Code-Mixed and Code-Switched Language (e.g., Hinglish)

Code-mixing and code-switching are ubiquitous features of spoken language in multilingual societies, and India is a prime example. A TTS model intended for this market must be able to handle utterances that seamlessly blend words and phrases from multiple languages, such as the **Hinglish** phrase "are bhaai, kya kar rhe ho?" ("hey brother, what are you doing?"). The challenge for the data pipeline is to create transcripts that accurately represent this linguistic reality without forcing the model into an unnatural, monolingual mode of speech. The user's query raises the crucial question of how to represent these mixed-language utterances in the training data. One approach is to transcribe the speech as-is, preserving the code-mixed nature of the text. This would require the model to learn the phonetics of both languages and the patterns of switching between them. Another approach is to normalize the text to a single target language, but this risks losing the natural flavor and cultural context of the original speech.

A more sophisticated strategy would be to develop a multi-layered transcript format. For example, the primary transcript could be a verbatim representation in the native script(s), preserving the code-mixing. A secondary layer could provide a normalized or translated version, which could be useful for certain downstream tasks or for training a more controlled version of the model. The user's suggestion of using language codes like "te_en" to specify the language of a particular phrase is a step in the right direction, providing explicit metadata about the linguistic content of the transcript. Ultimately, the goal is to provide the TTS model with enough information to understand the semantic intent of a mixed-language sentence and produce speech that sounds natural and appropriate. This might involve training the model on a diverse set of code-mixed examples and providing it with the tools to handle the phonetic and prosodic shifts that occur during language switching.
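
One possible shape for such a multi-layered record is sketched below; every field name is illustrative rather than a committed schema, and the `hi_en` code mirrors the language-code idea described above:

```python
# Hypothetical multi-layered transcript record for one audio segment.
record = {
    "video_id": "abc123",                       # placeholder YouTube ID
    "verbatim": "are bhaai, kya kar rhe ho?",   # as-spoken layer, native/mixed script
    "roman_normalized": "are bhai, kya kar rahe ho?",  # standardized Roman layer
    "lang_spans": [
        # explicit per-span language metadata, in the spirit of "te_en"
        {"start_word": 0, "end_word": 6, "lang": "hi_en"},
    ],
    "audio_events": [],                         # e.g. ["laugh"], if present
}
```

Keeping the verbatim and normalized layers side by side lets downstream stages choose the representation they need without lossy regex conversions between scripts.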

#### 1.4.2. Script Management: Native (Devanagari, Dravidian) vs. Romanized

The choice of script for the transcripts is another critical decision that will have a significant impact on the final TTS model. The user's query correctly identifies the need to support native scripts like **Devanagari** for Hindi and **Dravidian scripts** for Tamil and Telugu, as this is essential for cultural authenticity and user acceptance. However, the prevalence of **Romanized script** (i.e., writing Indic languages using the English alphabet) in informal digital communication, such as in the Hinglish example, presents a challenge. A robust TTS system should ideally be able to handle both. The proposed solution of generating transcripts in multiple formats is a sound one. The primary format should be the "verbatim" transcript, which uses the native script(s) as appropriate. This ensures that the model learns the correct phonetic and prosodic patterns associated with each script.

A secondary format, "roman_normalized," could be created to provide a standardized Romanized version of the text. This would be particularly useful for handling code-mixed sentences where switching between scripts would be cumbersome. The user's concern about the difficulty of converting between scripts using simple regex is well-founded; such conversions are fraught with peril and can easily alter the semantics of a sentence. Therefore, the responsibility for generating these different script formats should lie with the powerful LLMs used for transcription. By providing the LLM with clear instructions and examples, it can be prompted to generate the text in the desired script, leveraging its deep understanding of the relationship between the spoken word and its written representation in different scripts. This multi-script approach would provide the flexibility needed to train a TTS model that is both culturally authentic and robust to the diverse ways in which language is written in the digital age.

#### 1.4.3. Incorporating Semantic Audio Events (Emotions, Non-Speech Sounds)

To achieve a truly human-like quality, a TTS model must be able to convey not just the words, but also the emotional and paralinguistic content of the speech. The user's query highlights the desire to include features like emotions directly into the model, using audio event tags such as **`[laugh]`**, **`[cough]`**, **`[giggle]`**, and **`[sad]`**. This is an advanced feature that can significantly enhance the expressiveness and controllability of the TTS system. The proposed strategy of including these tags in the transcript format is a promising approach. By annotating the text with these semantic audio events, the model can be trained to associate specific textual cues with corresponding acoustic behaviors. For example, the tag `[laugh]` could be used to trigger a laughter sound, while `[sad]` could influence the prosody of the speech to sound more melancholic.

The user's concern about the potential for false positives and model hallucination is valid. The model might incorrectly identify an emotion or add a tag where none exists. To mitigate this, it is crucial to **limit the set of allowed tags to a small, stable set** of the most common and unambiguous audio events. This reduces the complexity of the task and makes it easier for the model to learn the correct associations. The user's suggestion to keep the tags in English, regardless of the language of the speech, is also a practical choice that simplifies the implementation. The decision of when to introduce this capability is also important. The user asks whether it is better to include it in the initial Supervised Fine-Tuning (SFT) data or to add it in a later post-training stage. Given the complexity of this feature, a multi-stage approach might be most effective. The initial SFT could focus on learning the basic text-to-speech mapping, while a later stage of post-training could be dedicated to fine-tuning the model's ability to recognize and produce these nuanced audio events. This would allow for a more controlled and experimental approach to developing this advanced capability.
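
A simple guard that enforces the small, stable tag set might look like the sketch below; the whitelist contents and the bracket syntax are illustrative:

```python
import re

# Small, stable whitelist of audio-event tags (illustrative choice).
ALLOWED_TAGS = {"laugh", "cough", "giggle", "sad"}

def sanitize_event_tags(transcript):
    """Drop any bracketed audio-event tag not in the allowed set.

    This guards against hallucinated annotations: only whitelisted
    events survive into the training transcripts, and surviving tags
    are lower-cased for consistency.
    """
    def keep_or_drop(match):
        tag = match.group(1).lower()
        return f"[{tag}]" if tag in ALLOWED_TAGS else ""
    cleaned = re.sub(r"\[([A-Za-z_]+)\]", keep_or_drop, transcript)
    return re.sub(r"\s{2,}", " ", cleaned).strip()  # tidy leftover spacing
```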

## 2. Data Pipeline and Architecture

The foundation of this ambitious TTS project is a robust and scalable data pipeline designed to ingest, process, and manage a massive volume of audio data from YouTube. The architecture has been meticulously planned to handle the challenges of sourcing, cleaning, and preparing data at a scale of over **500,000 videos** and **50,000 hours** of audio across **12 Indic languages**. The pipeline is built on a distributed, cloud-native architecture, leveraging services like Cloudflare R2 for storage and Supabase for metadata management. This section provides a detailed overview of the entire data pipeline, from the initial sourcing of YouTube content to the final creation of sharded, ready-to-train datasets. The design prioritizes scalability, reliability, and the preservation of data quality at every stage of the process.

### 2.1. Data Sourcing and Ingestion

The first stage of the pipeline is responsible for identifying, downloading, and storing the raw audio data from YouTube. This process was designed to be both targeted and scalable, ensuring that the collected data is of the highest possible quality while also being able to handle the massive volume required for training a state-of-the-art TTS model. The strategy focused on a specific niche of YouTube content—podcasts—which were assumed to have higher audio quality than general user-generated content. The following subsections detail the specific techniques and technologies used to source and ingest this data, including the targeting strategy, the use of residential proxies to overcome download restrictions, and the cloud-based storage architecture.

#### 2.1.1. Targeting YouTube Podcasts for High-Quality Audio

The initial and most critical decision in the data sourcing phase was to focus exclusively on YouTube videos that could be classified as **"podcasts."** This strategic choice was based on the assumption that podcast content is more likely to be produced with a higher degree of professionalism, including the use of quality microphones and recording equipment. This, in turn, was expected to result in audio with a higher native sample rate (ideally **44.1 kHz or 48 kHz**) and a better signal-to-noise ratio compared to the vast majority of user-generated content on the platform. To implement this strategy, a multi-stage filtering process was developed. First, YouTube channels were broadly categorized based on their content, with a focus on identifying channels that primarily produce podcast-style content. Next, individual videos from these channels were further scrutinized. To automate this filtering process at scale, **Google's Gemini-3-flash model** was employed. The model was prompted to analyze the video titles, descriptions, and thumbnails to determine if a given video was a genuine podcast episode or some other type of content (e.g., a vlog, a music video, or a live stream). This AI-powered filtering was crucial for efficiently sifting through thousands of videos and building a curated dataset of high-potential audio files. This targeted approach was designed to maximize the quality of the raw data, which is a prerequisite for training a high-quality TTS model.

#### 2.1.2. Scalable Downloading via Residential Proxies and VPS

Downloading a massive number of videos from YouTube at scale presents significant technical challenges, as the platform has sophisticated mechanisms in place to detect and block automated download attempts. To circumvent these restrictions, a distributed and anonymized downloading infrastructure was implemented. The core of this infrastructure is the use of **residential proxies**. Unlike data center proxies, which are easily identifiable and blocked, residential proxies route traffic through real user devices, making the download requests appear as if they are coming from individual users. This significantly reduces the likelihood of being detected and blocked by YouTube. The downloading process is orchestrated through a fleet of **Virtual Private Servers (VPS)**. Each VPS is configured to use a residential proxy and runs a script that utilizes **`yt-dlp`**, a powerful command-line tool for downloading videos from YouTube. The `yt-dlp` invocation was specifically configured to download the best available audio stream, with a preference for the `webm` or `opus` formats, which often offer the best quality. The format selector used was `bestaudio[ext=webm]/bestaudio[ext=opus]/bestaudio`. This distributed approach, with multiple VPS instances each using a different residential proxy, allows for a high degree of parallelism, enabling the efficient download of the entire target dataset without triggering YouTube's anti-bot measures.
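
A minimal sketch of how each VPS worker could assemble the `yt-dlp` invocation is shown below. The proxy URL and output directory are placeholders; only the format selector comes from the pipeline described above.

```python
# Build the yt-dlp argv used by a download worker. The --proxy value and
# output directory are placeholders for the worker's configured residential
# proxy and local staging area.

def build_ytdlp_cmd(video_id: str, proxy_url: str, out_dir: str) -> list[str]:
    """Return the yt-dlp command for downloading the best audio stream."""
    return [
        "yt-dlp",
        "--proxy", proxy_url,  # residential proxy endpoint for this worker
        "-f", "bestaudio[ext=webm]/bestaudio[ext=opus]/bestaudio",
        "-o", f"{out_dir}/%(id)s.%(ext)s",  # e.g. <out_dir>/VIDEOID.webm
        f"https://www.youtube.com/watch?v={video_id}",
    ]
```

The resulting list can be passed directly to `subprocess.run`, keeping the worker script a thin wrapper around `yt-dlp` itself.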

#### 2.1.3. Storage Architecture: Raw `.webm` Files in Cloudflare R2

Once downloaded, the raw audio files are stored in **Cloudflare R2**, a highly scalable and cost-effective object storage service. The choice of R2 was driven by its generous free tier, its compatibility with the S3 API, and its global network of edge locations, which ensures fast and reliable access to the data from anywhere in the world. The audio files are stored in their original `.webm` format, preserving the exact quality as downloaded from YouTube. Each file is named using its unique YouTube video ID (e.g., `YoutubeID.webm`), which allows for easy correlation with the metadata stored in the project's database. The use of a cloud-based object storage service like R2 provides virtually unlimited scalability, ensuring that the storage infrastructure can handle the massive volume of data without any performance bottlenecks. It also provides a high degree of durability and availability, protecting the valuable dataset from data loss. The raw `.webm` files in the R2 bucket serve as the immutable source of truth for the entire pipeline, and all subsequent processing steps will work with copies of this data, ensuring that the original files are always preserved.
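
Because R2 exposes an S3-compatible API, the upload path can be sketched with a standard S3 client. The bucket name, account ID, and credentials below are placeholders, and the `boto3` import is kept inside the function so the naming helper runs without the library installed.

```python
# Hedged sketch of the upload path into Cloudflare R2. R2 speaks the S3 API,
# so boto3's standard S3 client works when pointed at the account endpoint.
# Account ID, credentials, and the bucket name are placeholders.

def r2_client(account_id: str, key_id: str, secret: str):
    import boto3  # imported lazily so the sketch loads without boto3 installed
    return boto3.client(
        "s3",
        endpoint_url=f"https://{account_id}.r2.cloudflarestorage.com",
        aws_access_key_id=key_id,
        aws_secret_access_key=secret,
    )

def object_key(video_id: str) -> str:
    """Raw files are named by YouTube video ID, e.g. 'dQw4w9WgXcQ.webm'."""
    return f"{video_id}.webm"

# Usage (bucket name is a placeholder):
# client = r2_client(account_id, key_id, secret)
# client.upload_file(local_path, "raw-audio", object_key(video_id))
```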

### 2.2. Data Cleaning and Pre-processing Pipeline

The data cleaning and pre-processing pipeline is a critical, multi-stage process designed to transform raw, noisy audio downloaded from YouTube into a structured, high-quality dataset ready for model training. This pipeline is executed by distributed workers that process each video file stored in Cloudflare R2, applying a series of computationally intensive operations to isolate and clean the speech content. The entire workflow is orchestrated to handle the massive scale of the dataset, which encompasses over **500,000 videos** across **12 Indic languages**, totaling more than **200,000 hours** of audio. The pipeline's design is inspired by best practices in large-scale audio processing, such as those used for the KimiAudio project, and incorporates state-of-the-art machine learning models for tasks like speaker segmentation and music detection. The final output is a set of WebDataset shards, which are optimized for efficient loading during distributed training. This systematic approach ensures that the data fed into the TTS model is not only clean and consistent but also well-annotated with metadata that can be used for further filtering and analysis.

#### 2.2.1. GPU-Based Single-Speaker Segmentation

A fundamental step in preparing the dataset is the isolation of **single-speaker segments**. TTS models, especially those designed for high-fidelity voice cloning, typically perform best when trained on clean, single-speaker audio. The raw YouTube videos, however, often contain multiple speakers, background noise, music, and other non-speech audio. To address this, the pipeline employs a **GPU-accelerated single-speaker segmentation** process. This process takes the raw audio from a `videoID.webm` file and uses advanced audio diarization techniques to identify and segment continuous stretches of audio belonging to a single speaker. This is a computationally demanding task that benefits significantly from GPU acceleration, allowing for efficient processing of the vast amount of audio data. The segmentation algorithm is designed to be robust, handling variations in speaker volume, speaking style, and background conditions. It intelligently identifies speaker change points and creates distinct segments, each containing audio from only one speaker. This step is crucial for ensuring that the subsequent TTS model learns to generate coherent and consistent speech for a single voice, rather than being confused by overlapping speakers or conversational turn-taking, which are better handled in a dedicated dialogue modeling stage. The output of this stage is a collection of single-speaker audio segments for each video, which are then passed to the next stage of the pipeline for further cleaning and quality assessment.
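
The diarization model itself is assumed here; the sketch below shows only the post-processing that turns its output into single-speaker segments: consecutive turns from the same speaker are merged, and regions where two speakers' turns overlap are dropped. The `min_gap` threshold is an illustrative assumption, not a value from the pipeline.

```python
# Post-process diarization turns into single-speaker segments. The upstream
# diarizer is assumed to emit (start, end, speaker_id) tuples sorted by start
# time; min_gap (seconds) is an assumed merge tolerance.

def single_speaker_segments(turns, min_gap=0.5):
    """Merge same-speaker turns and drop overlapping-speaker regions."""
    segments = []
    for start, end, spk in turns:
        if segments and segments[-1][2] == spk and start - segments[-1][1] <= min_gap:
            s, _, _ = segments[-1]
            segments[-1] = (s, end, spk)  # extend the running segment
        else:
            segments.append((start, end, spk))
    # discard any segment that overlaps a segment from a different speaker
    clean = []
    for i, (s, e, spk) in enumerate(segments):
        overlapped = any(
            o_spk != spk and o_s < e and s < o_e
            for j, (o_s, o_e, o_spk) in enumerate(segments) if j != i
        )
        if not overlapped:
            clean.append((s, e, spk))
    return clean
```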

#### 2.2.2. Music and Noise Detection (PANN CNNs)

Even after single-speaker segmentation, the audio segments may contain significant amounts of background music, ambient noise, or other non-speech sounds that are detrimental to TTS model training. To filter these out, the pipeline incorporates a music and noise detection module based on **PANNs (Large-Scale Pretrained Audio Neural Networks)**. PANN models are state-of-the-art audio classification models that have been pre-trained on massive datasets like AudioSet, giving them a strong ability to recognize a wide variety of audio events and scenes. In this pipeline, a PANN-based classifier is applied to each single-speaker segment to detect the presence of music or other undesirable background noise. The model provides a confidence score or a set of statistics (e.g., `music_stats`) indicating the likelihood that a segment is contaminated. This information is stored in the segment's metadata, allowing for downstream filtering. For example, segments with a high probability of containing music can be automatically discarded or flagged for manual review. This automated detection is essential for maintaining the quality of the dataset at scale, as manually listening to and annotating hundreds of thousands of hours of audio would be infeasible. The use of a pre-trained model like PANN ensures high accuracy and robustness across the diverse range of acoustic conditions present in the YouTube data, from studio-quality podcasts to noisy street interviews. This step is a key component in the data cleaning process, ensuring that the final dataset is composed primarily of clean, intelligible speech.
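
The downstream filtering can be sketched as a simple threshold over the stored metadata. The field name `music_prob` and the threshold value are illustrative assumptions; the pipeline's actual metadata uses PANN-derived statistics such as `music_stats`.

```python
# Threshold-based filtering over PANN-derived metadata. The 'music_prob'
# field name and the 0.3 threshold are assumptions for illustration.

def filter_music_segments(segments, music_threshold=0.3):
    """Split segments into (kept, flagged) by their music probability."""
    kept, flagged = [], []
    for seg in segments:
        if seg["music_prob"] >= music_threshold:
            flagged.append(seg)  # likely contaminated: discard or review
        else:
            kept.append(seg)
    return kept, flagged
```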

#### 2.2.3. Resampling and Quality Bucketing (16/24/32/44.1 kHz)

The raw audio downloaded from YouTube can have a variety of different sample rates, depending on the original upload and the encoding settings used by YouTube. To ensure consistency and to optimize the training process, all audio segments are resampled to a set of standard sample rates. The strategy here is to create multiple **"buckets" of data**, each with a different sample rate, to cater to different stages of the training process. The primary buckets are **16 kHz, 24 kHz, 32 kHz, and 44.1 kHz**. The 16 kHz bucket is used for initial experiments and for training models that are less sensitive to high-frequency content. The higher sample rate buckets (24, 32, and 44.1 kHz) are used for training the final, high-fidelity TTS model, as they contain more detailed acoustic information. The decision of which bucket to place a segment in is based on its original sample rate and its quality. Segments with a native sample rate of 44.1 kHz are placed in the 44.1 kHz bucket. Segments with a lower native sample rate are placed in the highest possible bucket without upsampling, to avoid introducing artifacts. This bucketing strategy allows for a flexible and efficient training pipeline, where different models can be trained on data with different levels of fidelity.
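
The bucket-assignment rule (highest bucket that does not require upsampling) can be sketched as a pure function. The handling of audio below 16 kHz is an assumption here; the plan above does not specify it, so the sketch returns `None` for such segments.

```python
# Assign a segment to the highest sample-rate bucket that avoids upsampling.
# Returning None for sub-16 kHz audio is an assumption (the plan does not
# define a bucket for it).

BUCKETS = [16000, 24000, 32000, 44100]

def target_bucket(native_rate: int):
    """Return the target bucket rate for a segment's native sample rate."""
    eligible = [b for b in BUCKETS if b <= native_rate]
    return max(eligible) if eligible else None
```

For example, a 48 kHz podcast recording lands in the 44.1 kHz bucket, while a 22.05 kHz upload drops all the way to 16 kHz rather than being upsampled to 24 kHz.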

#### 2.2.4. Packaging into WebDatasets (`.tar` and FLAC)

Once the audio segments have been cleaned, segmented, and quality-scored, they need to be packaged into a format that is efficient for large-scale, distributed training. The chosen format is **WebDataset**, which stores data in `.tar` archives. This format is highly scalable and allows for efficient streaming of data to training workers without the need for a centralized database or file system. Each `videoID.tar` file contains all the cleaned audio segments for a particular YouTube video, along with a `metadata.json` file that stores all the pre-computed information about those segments, such as their duration, quality scores, and results from the music/noise detection. The audio segments themselves are stored using the **FLAC (Free Lossless Audio Codec)** codec. FLAC is chosen because it is a lossless compression format, meaning it preserves the original audio quality without any degradation, which is crucial for training a high-fidelity TTS model. It also provides good compression, reducing storage costs and speeding up data transfer. The use of WebDatasets simplifies the data loading process during training, as the training script can simply read the `.tar` files sequentially, treating them as a stream of data samples. This architecture is well-suited for the planned training infrastructure, which will likely involve multiple GPUs processing data in parallel. The final packaged data, consisting of thousands of these `.tar` files, is then uploaded back to a designated Cloudflare R2 bucket, ready to be pulled by the training jobs.
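
The packaging step can be sketched with the standard library alone. FLAC encoding is assumed to have happened upstream (the function receives encoded bytes); the on-disk layout (`videoID/segment.flac` plus `videoID/metadata.json`) follows the description above, and the fixed `TarInfo` defaults plus sorted iteration keep the output byte-deterministic, which complements the idempotent-retry design.

```python
import io
import json
import tarfile

def pack_video_shard(out_path, video_id, segments, metadata):
    """Assemble a WebDataset-style tar for one video.

    segments: {segment_name: flac_bytes}; FLAC encoding is assumed upstream.
    Sorted iteration and default TarInfo fields keep the archive deterministic.
    """
    with tarfile.open(out_path, "w") as tar:
        def add(name, payload: bytes):
            info = tarfile.TarInfo(name=name)  # default mtime/uid/gid: deterministic
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
        for seg_name, flac_bytes in sorted(segments.items()):
            add(f"{video_id}/{seg_name}.flac", flac_bytes)
        add(f"{video_id}/metadata.json", json.dumps(metadata, sort_keys=True).encode("utf-8"))
```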

### 2.3. Metadata Management and Orchestration

Effective metadata management and orchestration are crucial for handling a dataset of this magnitude and complexity. The system must be able to track the state of each video and audio segment through the entire pipeline, from raw download to final packaged shard, and manage the distributed workers that perform the processing. The architecture uses a combination of a centralized database (Supabase) for metadata storage and a distributed worker system with robust task management to ensure the pipeline runs smoothly and efficiently. This setup allows for parallel processing of videos, fault tolerance in case of worker failures, and the ability to monitor the progress of the entire operation. The metadata stored in Supabase is not just for bookkeeping; it is an integral part of the data curation process, containing the quality scores and other annotations that will be used to filter the dataset and create the final training splits.

#### 2.3.1. Supabase for Centralized Metadata Storage

**Supabase**, an open-source Firebase alternative, is used as the central repository for all metadata associated with the dataset. For each of the over **500,000 YouTube videos**, a corresponding entry in a Supabase table stores a wealth of information. This includes the video ID, language, source channel, and the initial classification as a podcast or non-podcast. As the video progresses through the pipeline, the metadata is continuously updated. After the single-speaker segmentation and cleaning, the metadata for each video will include a list of its constituent audio segments, along with their durations, file paths in Cloudflare R2, and the results of the quality analysis (e.g., confidence scores from the GSE model, music detection scores from PANN). This centralized metadata store is essential for several reasons. First, it provides a single source of truth for the state of the entire dataset, allowing for easy querying and analysis. For example, one could quickly determine the total hours of high-quality, single-speaker audio available for a specific language. Second, it enables the distributed worker system by providing a way to assign tasks and track their completion. Workers can query Supabase for the next available video to process and update the status once they are done. Finally, the metadata itself becomes a valuable asset for the final stages of data preparation, as it contains all the information needed to create filtered datasets and training/validation splits based on various quality criteria.

#### 2.3.2. Distributed Worker System with Row-Level Locks

To process the massive dataset in a reasonable amount of time, the data cleaning and packaging pipeline is designed to be executed by a distributed system of workers. These workers run on GPU-enabled machines and are responsible for picking up raw video files from Cloudflare R2, running the entire cleaning pipeline, and uploading the resulting `.tar` shards back to R2. To manage this distributed process and avoid race conditions where multiple workers might try to process the same video, the system uses **row-level locks in Supabase**. When a worker is ready to process a new video, it queries the Supabase database for a video with a "pending" status. It then attempts to acquire a lease on that video's row by updating its status to "processing" and adding a timestamp. This operation is atomic, ensuring that only one worker can successfully acquire the lease. The worker then has a set amount of time (the lease duration) to complete the processing. If the worker fails or crashes, the lease will expire, and another worker can pick up the task. This mechanism provides fault tolerance and ensures that all videos in the dataset are eventually processed. The system also uses **deterministic shard IDs** for idempotent retries, meaning that if a worker fails and the task is retried, the same output will be generated, preventing duplicate or inconsistent data. This distributed, lock-based orchestration is a robust and scalable way to manage the complex, long-running tasks involved in preparing the dataset for training.
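
The atomic claim pattern can be sketched with a conditional `UPDATE` whose affected-row count tells the worker whether it won the race. Supabase runs on Postgres; the sketch below uses the standard library's `sqlite3` purely so the pattern is runnable, and the table schema and column names are illustrative assumptions.

```python
import sqlite3
import time

def claim_next_video(conn, worker_id, lease_seconds=3600):
    """Atomically claim one 'pending' video row.

    The conditional UPDATE only succeeds while the row is still pending, so
    two workers cannot claim the same video. (Supabase is Postgres-backed;
    sqlite3 is used here only to make the pattern runnable with the stdlib.)
    """
    row = conn.execute(
        "SELECT video_id FROM videos WHERE status = 'pending' LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # nothing left to process
    cur = conn.execute(
        "UPDATE videos SET status = 'processing', worker = ?, lease_expiry = ? "
        "WHERE video_id = ? AND status = 'pending'",
        (worker_id, time.time() + lease_seconds, row[0]),
    )
    conn.commit()
    # rowcount == 0 means another worker claimed the row first
    return row[0] if cur.rowcount == 1 else None
```

A lease-reaper query (resetting rows whose `lease_expiry` has passed back to `pending`) completes the fault-tolerance story described above.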

#### 2.3.3. Final Validation and Shard Creation

The final stage of the data pipeline is to perform a last round of validation checks on the cleaned and processed data and to create the final sharded datasets for training. This involves several steps. First, a validation script is run to check for any missing metadata or inconsistencies in the data. This script verifies that every audio segment has a corresponding entry in the metadata file and that all the required fields are present. Next, the data is sharded into smaller, more manageable chunks. This is done to facilitate distributed training, as each training process can work on its own shard of the data. The sharding process is deterministic and idempotent, meaning that the same shard will always be created from the same input data, which is important for ensuring reproducibility. The final sharded datasets, in WebDataset format, are then uploaded back to a dedicated R2 bucket, ready to be pulled by the training jobs. This final validation and sharding step is crucial for ensuring that the data is in a clean, consistent, and ready-to-use state before it is fed into the TTS model.
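
The deterministic shard assignment can be sketched as a hash of the video ID. The shard count below is an illustrative assumption; the key property is that the same input always maps to the same shard, which is what makes retries idempotent.

```python
import hashlib

def shard_id(video_id: str, num_shards: int = 1024) -> int:
    """Deterministic shard assignment: the same video always lands in the
    same shard, so a retried packaging job overwrites rather than duplicates.
    num_shards is an assumed value for illustration."""
    digest = hashlib.sha256(video_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```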

## 3. Model Architecture and Training Strategy

The development of a human-like TTS LLM requires a sophisticated model architecture and a multi-stage training strategy that can effectively leverage the massive, curated dataset. The proposed approach is heavily inspired by the successful methodologies of state-of-the-art models like **Inworld TTS-1** and **FireRedTTS-2**, which have demonstrated the power of large-scale pre-training followed by targeted fine-tuning and reinforcement learning. The core of the architecture will be a powerful LLM backbone that learns to generate discrete audio tokens from a neural codec. The training process is designed to be a curriculum, starting with learning the fundamental properties of speech from the entire filtered dataset, then specializing on high-quality, transcribed data, and finally, being aligned with human preferences for naturalness and quality. This section outlines the key components of this strategy, including the selection and training of the audio codec, the multi-stage TTS LLM training pipeline, and the incorporation of advanced features like multi-speaker dialogue and emotional control.

### 3.1. Audio Codec Selection and Training

The choice of audio codec is a foundational decision that will impact the entire TTS system. The codec is responsible for compressing the high-dimensional audio waveform into a low-dimensional sequence of discrete tokens, which can then be modeled by the LLM. The quality of the codec directly affects the upper bound of the fidelity of the synthesized speech. Therefore, a careful evaluation of available codecs is necessary to select the one that offers the best trade-off between compression rate, reconstruction quality, and computational efficiency for the project's specific needs.

#### 3.1.1. Candidate Codecs: X-Codec 2.0, Mimi, SNAC, BiCodec, Encodec

The project will evaluate a range of state-of-the-art neural audio codecs that are suitable for TTS applications. The primary candidates include:

*   **X-Codec 2.0**: A recent and highly performant codec known for its high-quality reconstruction and efficiency, which has shown strong results in TTS benchmarks.
*   **Mimi**: A codec developed by Kyutai, designed for high-fidelity audio generation with a focus on efficiency and low latency.
*   **SNAC (Multi-Scale Neural Audio Codec)**: A codec specifically designed for speech, offering a good balance between quality and compression.
*   **BiCodec**: A bidirectional codec that can be used for both encoding and decoding, potentially offering more flexibility in the training process.
*   **Encodec**: A widely used and well-established codec from Meta, known for its high-quality reconstruction, though it can be computationally intensive.

The evaluation will be based on several criteria, including the **reconstruction quality** (measured by metrics like PESQ and STOI), the **token rate** (lower is better for LLM context length), the **codebook size**, and the **robustness** to the diverse acoustic conditions present in the Indic language dataset. The goal is to select a codec that can faithfully reconstruct the nuances of the human voice, including its emotional and prosodic variations, while keeping the token sequence short enough to be efficiently processed by the LLM.

#### 3.1.2. Training Strategy on Filtered, High-Confidence Audio

Once a codec is selected, it will be trained on the curated, high-confidence audio dataset created through the GSE-based filtering process. Training the codec on clean, high-quality data is crucial for ensuring that it learns a robust and accurate representation of speech. The training process will involve feeding the audio segments into the codec and optimizing its parameters to minimize the reconstruction loss. This will be a large-scale training job, requiring significant GPU resources to process the tens of thousands of hours of audio data. The resulting trained codec will serve as the "vocabulary" for the TTS LLM, translating between the continuous audio space and the discrete token space of the language model. The quality of this codec will be a direct determinant of the final TTS model's performance, making this a critical step in the overall pipeline.

#### 3.1.3. Segment Length Strategy for Training (6-second windows)

To manage the computational load and to create a consistent training experience, a specific strategy for handling audio segments of varying lengths has been devised. The strategy is to standardize on a **6-second window** for all training samples. This is achieved through a multi-bucket approach:

*   **Segments < 3 seconds**: These are discarded as they are too short to contain meaningful prosodic information.
*   **Segments 3-6 seconds**: These are padded to a 6-second duration during the training process.
*   **Segments 6-12 seconds**: A random 6-second window is extracted from these segments for each training step.
*   **Segments > 12 seconds**: These are split into multiple 6-second segments to ensure that no audio data is wasted.

This strategy ensures that the model is always trained on fixed-length inputs, which simplifies the training process and allows for efficient batching. The choice of a 6-second window is a trade-off between providing enough context for the model to learn long-range prosodic dependencies and keeping the computational requirements manageable.
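
The four-way rule above can be sketched as a pure function that maps a segment's duration to the list of training windows drawn from it. The padding itself is assumed to happen at batch time, so sub-6-second segments are returned at their natural length here.

```python
import random

def plan_training_windows(duration, window=6.0, rng=None):
    """Map a segment duration (seconds) to its training windows.

    Mirrors the bucketing rule above: discard < 3 s, pad 3-6 s (padding is
    assumed to happen at batch time), random 6 s crop for 6-12 s, and
    consecutive 6 s splits beyond 12 s.
    """
    rng = rng or random.Random(0)
    if duration < 3.0:
        return []                                   # too short: discard
    if duration <= window:
        return [(0.0, duration)]                    # padded to 6 s at batch time
    if duration <= 2 * window:
        start = rng.uniform(0.0, duration - window)
        return [(start, start + window)]            # one random 6 s crop
    n = int(duration // window)                     # > 12 s: consecutive splits
    return [(i * window, (i + 1) * window) for i in range(n)]
```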

### 3.2. TTS LLM Training Pipeline (Inspired by Inworld TTS & FireRedTTS)

The training of the TTS LLM will follow a multi-stage pipeline, inspired by the successful strategies of models like Inworld TTS-1 and FireRedTTS-2. This approach allows the model to first learn the fundamental properties of speech from a massive amount of unlabeled data, then to specialize on the task of text-to-speech with high-quality labeled data, and finally to be fine-tuned for human preference. This staged approach is crucial for building a robust and high-quality TTS system.

| Stage | Name | Data | Objective | Key Techniques |
| :--- | :--- | :--- | :--- | :--- |
| **1** | **Pre-training** | Large-scale, filtered audio (codec tokens) | Learn fundamental speech representations | Next-token prediction on audio tokens |
| **2** | **Supervised Fine-Tuning (SFT)** | High-quality, transcribed audio-text pairs | Learn the text-to-speech mapping | Cross-entropy loss on audio tokens, conditioned on text |
| **3** | **Reinforcement Learning (RL) Alignment** | Smaller, high-quality dataset with human feedback | Optimize for human preference (naturalness, quality) | Group Relative Policy Optimization (GRPO), composite rewards |

*Table 1: A summary of the three-stage TTS LLM training pipeline, outlining the data, objectives, and key techniques for each stage.*

#### 3.2.1. Stage 1: Pre-training on Large-Scale, Filtered Audio Data

The first stage is **pre-training**, where the LLM is trained on the massive, curated dataset of high-confidence audio segments. In this stage, the model learns to predict the next audio token in a sequence, without any text conditioning. This is an unsupervised learning task that allows the model to develop a deep understanding of the statistical properties of human speech, including phonetics, prosody, and speaker characteristics. This stage is analogous to the pre-training of a text-based LLM on a large corpus of text. The goal is to create a powerful "speech foundation model" that has a strong prior on what constitutes natural-sounding audio. This pre-trained model will then serve as a robust starting point for the subsequent, more specialized training stages.

#### 3.2.2. Stage 2: Supervised Fine-Tuning (SFT) on High-Quality Transcribed Data

The second stage is **Supervised Fine-Tuning (SFT)**, where the pre-trained model is trained on a smaller, high-quality dataset of audio-text pairs. This is where the model learns the core task of text-to-speech synthesis. The model is conditioned on the input text (the transcript) and is trained to generate the corresponding sequence of audio tokens. This stage requires a very clean and accurate transcribed dataset, which can be obtained through the alternative transcription methods discussed earlier (e.g., YouTube CCs, Wikipedia alignment). The SFT process will fine-tune the model's parameters to align the generated speech with the semantic content of the text, teaching it correct pronunciation, intonation, and rhythm. The quality of the SFT data is paramount, as any errors in the transcripts will be learned by the model.

#### 3.2.3. Stage 3: Reinforcement Learning (RL) Alignment for Quality and Human Preference

The final stage of the training pipeline is **Reinforcement Learning (RL) alignment**. In this stage, the SFT model is further fine-tuned to optimize for human preference. This is achieved by training the model with a reward model that has been trained on human feedback. The model generates multiple speech samples for a given text, and the reward model assigns a score to each sample based on its perceived quality and naturalness. The model is then updated to maximize this reward, using a technique like **Group Relative Policy Optimization (GRPO)**, which has been successfully adapted for speech applications. This RL stage is crucial for pushing the quality of the TTS model from "good" to "human-like," as it allows the model to learn the subtle nuances of speech that are difficult to capture with a simple cross-entropy loss function.

### 3.3. Incorporating Advanced Features

To create a truly state-of-the-art TTS system, the model must be capable of handling a range of advanced features that go beyond simple text-to-speech synthesis. These features include the ability to model multiple speakers in a dialogue, to control the emotional expression of the speech, and to correct for common pronunciation errors. The proposed architecture and training strategy are designed to accommodate these advanced capabilities.

#### 3.3.1. Multi-Speaker and Dialogue Modeling

To enable the model to generate conversational speech, a post-training stage will be dedicated to **multi-speaker and dialogue modeling**. This will involve training the model on a dataset of multi-speaker dialogues, where the input text is annotated with speaker IDs. The model will learn to generate speech for each speaker in the conversation, capturing the natural turn-taking and interaction patterns of human dialogue. This is a key feature for creating more engaging and interactive AI systems, such as virtual assistants or characters in video games. The training data for this stage can be sourced from podcast episodes with multiple guests or from other conversational datasets.

#### 3.3.2. Fine-Grained Emotional Control and Audio Markups

The model will be trained to support **fine-grained emotional control** and **audio markups**, as discussed in the transcript normalization strategy. The transcripts will include tags for emotions (e.g., `[sad]`, `[happy]`) and other audio events (e.g., `[laugh]`, `[cough]`). During the SFT and RL stages, the model will learn to associate these tags with the corresponding acoustic features, allowing it to generate speech with the desired emotional coloring. This will provide users with a high degree of control over the expressiveness of the synthesized speech, making it more versatile and engaging. The inclusion of these markups in the training data is a key step towards creating a more semantically aware TTS system.

#### 3.3.3. Pronunciation Inpainting and Error Correction

To address the common problem of pronunciation errors, especially for rare words or names, the model will incorporate a **pronunciation inpainting** mechanism. This technique, inspired by the methods used in Inworld TTS-1, involves mixing phonemes and words to correct for polyphone errors. The model will be trained to recognize when a word is likely to be mispronounced and to "inpaint" the correct phonetic sequence. This can be achieved by providing the model with phonetic transcriptions as an additional input or by training it to perform error correction on its own output. This feature is crucial for improving the overall accuracy and reliability of the TTS system, especially for applications that require precise pronunciation, such as language learning or accessibility tools.

## 4. Evaluation and Benchmarking

A comprehensive evaluation and benchmarking strategy is essential for measuring the progress and success of the TTS model development. This involves using a combination of objective metrics to assess the technical quality of the audio and transcriptions, as well as subjective human evaluations to gauge the naturalness and preference of the synthesized speech. The evaluation will also include a detailed analysis of the curated dataset itself, to validate the effectiveness of the data curation pipeline.

### 4.1. Quality Metrics for Audio and Transcriptions

A suite of standard and specialized metrics will be used to evaluate the quality of the TTS model and the underlying data. These metrics provide a quantitative measure of various aspects of the system's performance, from the accuracy of the transcriptions to the fidelity of the synthesized audio.

#### 4.1.1. Word Error Rate (WER) and Character Error Rate (CER)

**Word Error Rate (WER)** and **Character Error Rate (CER)** are the primary metrics for evaluating the accuracy of the transcriptions. WER measures the fraction of word-level substitutions, deletions, and insertions relative to the number of words in the reference, while CER applies the same edit-distance calculation at the character level. A low WER is a critical requirement for the training data, as it directly impacts the quality of the TTS model. The goal is to achieve a WER of **<5%** for the SFT data. These metrics will also be used to evaluate the performance of the final TTS model by comparing its generated speech to the ground-truth text using a high-quality ASR model.
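
WER is the standard word-level edit distance normalized by reference length, which can be computed directly:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via standard Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[-1][-1] / max(len(ref), 1)
```

CER is the same computation with `list(text)` in place of `text.split()`.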

#### 4.1.2. Speaker Similarity (SIM) and DNSMOS Scores

To evaluate the quality of the synthesized speech, several perceptual metrics will be used. **Speaker Similarity (SIM)** measures how closely the synthesized voice matches a target speaker's voice. This is crucial for voice cloning applications. **DNSMOS (Deep Noise Suppression Mean Opinion Score)** is a non-intrusive metric that predicts the perceived quality of speech, taking into account factors like noise, distortion, and overall naturalness. A high DNSMOS score is a strong indicator of a high-quality TTS model. These metrics will be used to track the progress of the model through the different training stages and to compare its performance against baseline models.

#### 4.1.3. Confidence Score Correlation with Intrusive Metrics

A key part of the evaluation will be to validate the effectiveness of the **confidence-based filtering** strategy. This will be done by correlating the GSE model's confidence scores with a suite of **intrusive quality metrics**, such as PESQ (Perceptual Evaluation of Speech Quality) and STOI (Short-Time Objective Intelligibility). A strong positive correlation would confirm that the confidence score is a reliable proxy for audio quality, providing further evidence for the validity of the data curation pipeline. This analysis will also help to fine-tune the confidence threshold used for filtering the dataset.
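
The correlation analysis reduces to computing the Pearson coefficient between two paired score lists, one of GSE confidence scores and one of intrusive-metric values (e.g. PESQ) over the same segments. A stdlib-only sketch:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between paired score lists, e.g. GSE confidence
    scores vs. PESQ values over the same set of audio segments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A coefficient near +1 over a held-out set would support using the confidence score as a filtering proxy; the threshold can then be swept against the intrusive metrics to pick the filtering cutoff.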

### 4.2. Benchmarking Against Existing Models

To contextualize the performance of the new TTS model, it will be benchmarked against several existing state-of-the-art models. This will provide a clear point of comparison and will help to identify areas for further improvement.

#### 4.2.1. Comparison with Inworld TTS-1, FireRedTTS-2, and CosyVoice 3

The new model will be compared against **Inworld TTS-1**, **FireRedTTS-2**, and **CosyVoice 3**, which are the primary inspirations for this project. The comparison will be made on a common set of evaluation metrics, including WER, SIM, and DNSMOS. This will provide a direct measure of how the new model stacks up against the current leaders in the field. The comparison will also consider factors like model size, training data size, and inference latency, to provide a more holistic view of the trade-offs involved.

#### 4.2.2. Internal TTS Arena for Human Preference Evaluation

While objective metrics are important, they do not always capture the full picture of perceived quality. Therefore, a crucial part of the evaluation will be a **human preference evaluation**. This will be conducted using an internal **TTS Arena**, similar to the Chatbot Arena used for evaluating LLMs. In this setup, human listeners will be presented with pairs of speech samples from different models (including the new model and the baseline models) and will be asked to choose the one they prefer. This will provide a direct measure of the subjective quality and naturalness of the synthesized speech, which is the ultimate goal of the project.

### 4.3. Dataset Quality Validation

The final part of the evaluation will be a thorough analysis of the curated dataset itself. This is to ensure that the data curation pipeline has been successful in creating a high-quality, diverse, and representative dataset for training the TTS model.

#### 4.3.1. Assessing the Impact of Confidence-Based Filtering

The impact of the **confidence-based filtering** strategy will be assessed by comparing the quality metrics of the filtered dataset to the original, unfiltered dataset. This will involve measuring the average DNSMOS score, the percentage of segments with background music, and other quality indicators. The goal is to demonstrate that the filtering process has successfully removed low-quality data and has resulted in a significant improvement in the overall quality of the dataset.

#### 4.3.2. Analyzing the Final Curated Dataset Composition

A detailed analysis of the final curated dataset will be performed to understand its composition. This will include analyzing the distribution of languages, the distribution of segment lengths, the diversity of speakers, and the prevalence of different acoustic conditions. This analysis will help to ensure that the dataset is well-balanced and representative of the target domain, and it will provide valuable insights for the future development of the TTS model. The results of this analysis will be used to create a comprehensive data card for the dataset, documenting its characteristics and the methods used for its creation.