# KaniTTS-2 Data Pipeline

[![](https://dcbadge.limes.pink/api/server/https://discord.gg/NzP3rjB4SB?style=flat)](https://discord.gg/NzP3rjB4SB) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

```
===============================================
          N I N E N I N E S I X  😼
===============================================

          /\_/\
         ( -.- )───┐
          > ^ <    │
===============================================
```

## What Does This Do?

This pipeline takes audio datasets from HuggingFace and converts them into **tokenized neural codec representations** ready for training **KaniTTS-2**. It is built specifically around KaniTTS-2's codec and speaker embedding architecture — do not expect compatibility with other models unless they use the same NanoCodec tokenizer and WavLM-based speaker embedder.

The pipeline does two major things in sequence:

1. **Speaker Embedding** — extracts per-sample speaker identity vectors using WavLM, optionally runs unsupervised clustering to assign speaker IDs, and optionally averages embeddings per speaker group for more robust speaker representations.
2. **Audio Tokenization** — encodes audio into 4-layer NanoCodec tokens using NVIDIA NeMo, shards the output into compressed JSONL files, and assembles a final HuggingFace dataset.

Both stages are controlled entirely through `config.yaml`.

---

## Quick Start

### Prerequisites

- Linux (Ubuntu/Debian recommended)
- One or more NVIDIA GPUs (RTX 4090, 5080, 5090, etc.)
- Python 3.10+
- `make` installed

### Step 1 — Install

```bash
make install
```

Creates a `venv/` and installs all dependencies. Takes ~5 minutes on first run.

### Step 2 — Authenticate with HuggingFace

```bash
make login
```

Paste your token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). Use a token with **Write** permission if you plan to upload datasets.

### Step 3 — Configure

```bash
nano config.yaml
```

See the **Configuration Reference** section below for a full explanation of every setting.

### Step 4 — Run

```bash
make run
```

### Available Make Commands

| Command | Description |
|---------|-------------|
| `make install` | Create venv and install all dependencies |
| `make login` | Authenticate with Hugging Face |
| `make run` | Run `main.py` via venv |
| `make clean` | Remove the venv directory |
| `make help` | Show command reference |

---

## Pipeline Architecture

```
config.yaml
    │
    ▼
PipelineManager.validate()
  ├─ check clustering column conflicts
  └─ check group_sp_emb column source feasibility
    │
    ▼ (for each dataset in hf_datasets)
DatasetProcessor.load_dataset()
    │
    ▼ (if add_speaker_emb: true)
SpeakerEmbeddingProcessor.process()
  ├─ Step 1-4  Audio resampling (22050 Hz main + 16 kHz copy for WavLM)
  ├─ Step 5    Filter samples < 0.7 s
  ├─ Step 6    WavLM embedding inference (single or multi-GPU)
  ├─ Step 7    Filter NaN embeddings (mandatory)
  ├─ Step 8    [optional] UMAP + HDBSCAN clustering → speaker_column
  └─ Step 9    [optional] Per-speaker embedding averaging → grouped_sp_emb
    │
    ▼ (if add_speaker_emb: false, group_sp_emb.do_this: true)
SpeakerEmbeddingGrouper.group()     ← standalone, uses existing embeddings
    │
    ▼
NanoCodec Tokenization (single or multi-GPU workers)
  ├─ Reader workers shard and stream samples into queue
  └─ GPU workers encode audio → 4-layer tokens → JSONL.gz shards
    │
    ▼
assemble_and_save_final_dataset()
  ├─ [optional] save_to_disk (local)
  └─ [optional] push_to_hub (HuggingFace)
```

---

## Configuration Reference

Everything is controlled by `config.yaml`. Here is a fully annotated example:

```yaml
# ─── Core tokenization settings ───────────────────────────────────────────────
base_settings:
  audio_codec: nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps
  num_readers: 8
  qsize: 100000
  OUT_DIR: shards
  gzip_level: 1
  buffer_size: 16777216
  lines_per_file: 50000
  load_dataset_num_proc: 20

# ─── Speaker embedding settings ───────────────────────────────────────────────
speaker_embedding_settings:
  add_speaker_emb: true

  model:
    name: "nineninesix/speaker-emb-tbr"
    embedding_column: "wavlm_embedding"

  audio:
    target_sample_rate: 16000
    max_duration_sec: 30.0

  processing:
    batch_size: 2
    use_multiprocessing: true

  clustering:
    do_clusters: true
    speaker_column: label
    UMAP:
      n_neighbors: 15
      min_dist: 0.1
      metric: cosine
      random_state: 42
      n_components: 5
    HDBSCAN:
      min_cluster_size: 15
      min_samples: 3
      metric: euclidean
      cluster_selection_method: eom

  group_sp_emb:
    do_this: true
    group_by_column_name: label
    grouped_embedding_column: grouped_sp_emb

# ─── Datasets ─────────────────────────────────────────────────────────────────
hf_datasets:
  - name: your-username/dataset-name
    sub_name: null
    split: train
    text_column_name: text
    audio_column_name: audio
    speaker_column_name: null
    add_constant:
      - key: speaker_id
        value: alice

# ─── Output ───────────────────────────────────────────────────────────────────
save_settings:
  local: train_dataset
  hf_upload: null
```

---

### `base_settings`

| Key | Description | Default |
|-----|-------------|---------|
| `audio_codec` | NeMo NanoCodec model from HuggingFace Hub | `nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps` |
| `num_readers` | Number of CPU reader processes that stream data into the tokenization queue | `8` |
| `qsize` | Max items held in the shared queue between readers and GPU workers | `100000` |
| `OUT_DIR` | Directory for intermediate JSONL.gz shards | `shards` |
| `gzip_level` | Gzip compression level (1 = fastest, 9 = smallest) | `1` |
| `buffer_size` | Write buffer in bytes (16 MB recommended) | `16777216` |
| `lines_per_file` | How many samples per output shard file | `50000` |
| `load_dataset_num_proc` | Parallel processes used when loading datasets from HF | `20` |

---

### `speaker_embedding_settings`

#### `add_speaker_emb`

Master switch. Set to `true` to run the full speaker embedding pipeline (WavLM inference + optional clustering + optional grouping). Set to `false` to skip embedding entirely — unless `group_sp_emb.do_this: true`, in which case grouping runs standalone on embeddings that already exist in the dataset.

#### `model`

| Key | Description |
|-----|-------------|
| `name` | HuggingFace model repo for the WavLM speaker embedder |
| `embedding_column` | Name of the output column added to the dataset (e.g. `wavlm_embedding`) |

The model used is [Orange/Speaker-wavLM-tbr](https://huggingface.co/Orange/Speaker-wavLM-tbr) (also available as `nineninesix/speaker-emb-tbr`). It outputs **128-dimensional L2-normalized** embedding vectors.

#### `audio`

| Key | Description |
|-----|-------------|
| `target_sample_rate` | WavLM operates at 16000 Hz — do not change unless you switch models |
| `max_duration_sec` | Audio clips longer than this are truncated before embedding. 20–30 s is recommended. |

Samples shorter than **0.7 seconds** are automatically removed before embedding — they produce unreliable speaker representations.
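The cutoff can be expressed as a simple duration check on the raw waveform. This is a sketch of the idea, not the pipeline's actual filter code:

```python
import numpy as np

MIN_DURATION_SEC = 0.7
SAMPLE_RATE = 16_000  # WavLM input rate

def is_long_enough(audio: np.ndarray, sr: int = SAMPLE_RATE) -> bool:
    """True if the clip meets the minimum duration for a reliable embedding."""
    return len(audio) / sr >= MIN_DURATION_SEC

# a 0.5 s clip is dropped, a 1.0 s clip is kept
clips = [np.zeros(int(0.5 * SAMPLE_RATE)), np.zeros(int(1.0 * SAMPLE_RATE))]
kept = [c for c in clips if is_long_enough(c)]
```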

#### `processing`

| Key | Description |
|-----|-------------|
| `batch_size` | Samples per batch per GPU during WavLM inference |
| `use_multiprocessing` | Enable multi-GPU inference (one process per GPU); set `false` for single-GPU |

With multiple GPUs throughput scales near-linearly (e.g. 4 GPUs ≈ 4× speed). Single GPU works fine too.
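The near-linear scaling comes from each GPU process handling its own contiguous shard of the dataset. A minimal sketch of such a split (the function name and scheme are illustrative, not the pipeline's actual internals):

```python
def shard_bounds(n_samples: int, n_gpus: int):
    """Return one (start, end) half-open index range per GPU process."""
    base, rem = divmod(n_samples, n_gpus)
    bounds, start = [], 0
    for gpu in range(n_gpus):
        # spread the remainder over the first `rem` GPUs
        end = start + base + (1 if gpu < rem else 0)
        bounds.append((start, end))
        start = end
    return bounds

# 10 samples over 4 GPUs → shards of sizes 3, 3, 2, 2
print(shard_bounds(10, 4))
```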

---

### `clustering`

Unsupervised speaker discovery using UMAP + HDBSCAN. Useful when the source dataset has no speaker labels, or when you want to automatically discover speaker clusters across a mixed dataset.

| Key | Description |
|-----|-------------|
| `do_clusters` | Enable clustering. If `false`, all keys below are ignored |
| `speaker_column` | Name of the integer cluster ID column added to the dataset. Default: `speaker_id`. Must not conflict with any `add_constant` key — the pipeline will warn you at startup if it does. |

#### `UMAP` parameters

Reduces the 128-dim WavLM embeddings to a lower-dimensional space before clustering. Standard UMAP parameters apply.

| Key | Meaning |
|-----|---------|
| `n_components` | Target dimensionality (5 is a good default) |
| `n_neighbors` | Balances local vs global structure |
| `min_dist` | Minimum distance between points in low-dim space |
| `metric` | Distance metric (`cosine` works well for normalized embeddings) |
| `random_state` | Seed for reproducibility |

#### `HDBSCAN` parameters

Density-based hierarchical clustering. Samples not assigned to any cluster (noise, label = -1) are **removed** from the dataset.

| Key | Meaning |
|-----|---------|
| `min_cluster_size` | Minimum number of samples to form a cluster |
| `min_samples` | Controls how conservative the clustering is |
| `metric` | Distance metric for clustering |
| `cluster_selection_method` | `eom` (Excess of Mass) or `leaf` |

**Note:** If many samples become noise (-1), try reducing `min_cluster_size` or `min_samples`.

---

### `group_sp_emb`

Averages all per-sample WavLM embeddings for each unique speaker value, and writes the average back as a new column. This gives the model a stable, generalized speaker representation rather than the noisy per-utterance vector.

| Key | Description |
|-----|-------------|
| `do_this` | Enable per-speaker embedding averaging |
| `group_by_column_name` | Column used to group samples by speaker (e.g. the clustering output column, or a constant like `speaker_id`) |
| `grouped_embedding_column` | Name of the new averaged embedding column added to the dataset |

The resulting column (`grouped_sp_emb` by default) is automatically **preserved through tokenization** and appears in the final dataset alongside the codec tokens.

#### Feasibility check at startup

Before running, the pipeline checks that `group_by_column_name` will actually exist in the dataset. It looks for the column in three places:

1. `add_constant` keys in any dataset config
2. Clustering output (`speaker_column`, when `do_clusters: true`)
3. `speaker_column_name` of any dataset config

If none of these sources can provide the column, a warning is shown and you are offered the option to disable grouping for the current run. If you are sure the column already exists in the source dataset, you can proceed by answering `n`.

#### Standalone mode

If `add_speaker_emb: false` but `group_sp_emb.do_this: true`, the grouper runs independently — it assumes the embedding column already exists in the loaded dataset (e.g. from a previous pipeline run). This is useful to re-run only the grouping step without re-computing embeddings.

---

### `hf_datasets`

List of datasets to process. All datasets are processed sequentially and merged into a single final dataset.

| Key | Description |
|-----|-------------|
| `name` | HuggingFace dataset repo (e.g. `openslr/librispeech_asr`) |
| `sub_name` | Dataset configuration / subset (e.g. `clean`). Set to `null` if none. |
| `split` | Split to load: `train`, `test`, `validation`, or a slice like `train[:500]` |
| `text_column_name` | Column containing the text transcription |
| `audio_column_name` | Column containing the audio |
| `speaker_column_name` | Column containing speaker identity (if present in the source dataset). Set to `null` if absent. |
| `add_constant` | List of `{key, value}` pairs added as a constant column to every sample. Useful for tagging language, speaker, dataset source, etc. |

**Note on `add_constant` and `speaker_column_name`:** These are the two ways to inject a speaker column when the source dataset has no speaker labels. If you are running clustering, the pipeline produces its own speaker column — set `speaker_column_name: null` in that case.

---

### `save_settings`

| Key | Description |
|-----|-------------|
| `local` | Path to save the final assembled dataset to disk (HuggingFace Arrow format). Set to `null` to skip. |
| `hf_upload` | HuggingFace repo to push the final dataset to (e.g. `your-username/dataset-name`). Set to `null` to skip. Uploaded as **private** by default. |

---

## Output Format

### Intermediate shards

During processing, each GPU worker writes compressed shards to `OUT_DIR`:

```
shards/
  dataset-name-worker00-00000.jsonl.gz
  dataset-name-worker00-00001.jsonl.gz
  dataset-name-worker01-00000.jsonl.gz
  ...
```
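Each shard is plain gzipped JSON Lines, so it can be inspected or produced without the pipeline. A minimal reader/writer sketch (field names are illustrative):

```python
import gzip
import json

def write_shard(path, samples, gzip_level=1):
    """Write samples as gzipped JSON Lines (level 1 = fastest, as in base_settings)."""
    with gzip.open(path, "wt", encoding="utf-8", compresslevel=gzip_level) as f:
        for s in samples:
            f.write(json.dumps(s) + "\n")

def read_shard(path):
    """Yield one sample dict per line from a .jsonl.gz shard."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```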

### Final dataset schema

After assembly, each sample contains:

```json
{
  "text": "Hello world",
  "nano_layer_1": [123, 456, 789, ...],
  "nano_layer_2": [234, 567, 890, ...],
  "nano_layer_3": [345, 678, 901, ...],
  "nano_layer_4": [456, 789, 012, ...],
  "encoded_len": 150,
  "wavlm_embedding": [0.021, -0.043, ...],
  "label": 7,
  "grouped_sp_emb": [0.019, -0.038, ...]
}
```

Which columns appear depends on your config:
- `nano_layer_*`, `encoded_len` — always present (core tokenization output)
- `wavlm_embedding` — present when `add_speaker_emb: true`
- `label` (or your `speaker_column`) — present when `do_clusters: true`
- `grouped_sp_emb` (or your `grouped_embedding_column`) — present when `group_sp_emb.do_this: true`
- Any `add_constant` keys and `speaker_column_name` values — passed through
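A quick consistency check against the schema above: the four token layers are parallel codebook streams, so (assuming `encoded_len` records the frame count) each layer should have exactly `encoded_len` entries. A sketch:

```python
def check_sample(sample: dict) -> bool:
    """True if all four NanoCodec layers match the recorded encoded length."""
    layers = [sample[f"nano_layer_{i}"] for i in range(1, 5)]
    return all(len(layer) == sample["encoded_len"] for layer in layers)

sample = {
    "nano_layer_1": [1, 2, 3], "nano_layer_2": [4, 5, 6],
    "nano_layer_3": [7, 8, 9], "nano_layer_4": [0, 1, 2],
    "encoded_len": 3,
}
```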

---

## Troubleshooting

### No CUDA devices found

```
RuntimeError: ❌ ERROR: No CUDA devices found!
```

Check drivers and CUDA toolkit with `nvidia-smi`.

### HuggingFace login fails

Run `make login` and paste a token with Write permission from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

### Pipeline runs slowly

- Increase `num_readers` (more CPU reader processes)
- Increase `load_dataset_num_proc` (faster initial dataset loading)
- Check GPU utilization: `nvidia-smi`

### Out of memory during embedding

- Decrease `processing.batch_size` in `speaker_embedding_settings`

### Out of memory during tokenization

- Decrease `qsize`
- Decrease `num_readers`

### Many clustering noise samples (label = -1)

Lower `min_cluster_size` and `min_samples` in HDBSCAN settings. Noise samples are removed from the dataset before tokenization.

### Warning: clustering column conflict

At startup you may see:

```
⚠️  WARNING: Clustering column naming conflict detected!
```

This means `speaker_column` in `clustering` matches a key in `add_constant`. The pipeline will prompt you to enter a different column name. To avoid this, change `speaker_column` in `config.yaml` before running.

---

## Appendix A — Speaker Embedding Deep Dive

### The WavLM model

The embedder uses [Orange/Speaker-wavLM-tbr](https://huggingface.co/Orange/Speaker-wavLM-tbr), a WavLM encoder fine-tuned for speaker recognition with a stats-pooling + projection head. It outputs **128-dimensional L2-normalized** vectors.

Key properties:
- Input: 16 kHz audio, max 20–30 s (longer is truncated)
- Output: one 128-dim vector per audio clip
- L2 norm of every output vector = 1.0
- Captures speaker timbre/voice characteristics, not content

The pipeline maintains **two audio channels** to avoid quality loss:
- The original 22050 Hz column is kept intact for tokenization
- A separate `audio_spk` column is decoded at 16 kHz only for embedding
- Both are produced from the same raw bytes to avoid double re-encoding

### Audio preprocessing

Before embedding:
- Clips shorter than **0.7 s** are filtered out (too short for reliable speaker identity)
- Clips longer than `max_duration_sec` are **truncated** (not padded)
- After embedding, any samples with **NaN embeddings** are removed (can happen on corrupted audio)

NaN filtering happens before clustering and grouping — both downstream steps receive clean data.

### Per-utterance vs per-speaker embeddings

By default, `wavlm_embedding` contains one vector per utterance. This is noisy — different recordings of the same speaker vary due to room acoustics, microphone, emotion, speaking rate.

`group_sp_emb` solves this by averaging all utterance embeddings for each unique speaker:

```
Speaker A:  emb_1, emb_2, emb_3, ...  →  mean(emb_1..N)  →  grouped_sp_emb
Speaker B:  emb_1, emb_2, emb_3, ...  →  mean(emb_1..N)  →  grouped_sp_emb
```

Every sample of Speaker A receives the **same** averaged vector. This forces the model to learn a stable speaker representation that doesn't overfit to individual recording artifacts.

**Why this matters for TTS training:** If the model sees a different speaker vector for every utterance of the same speaker, it may learn to associate voice timbre with specific recording conditions rather than the actual speaker. Averaged embeddings remove this confound and lead to better speaker generalization.
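The averaging step can be sketched in a few lines of NumPy. Re-normalizing the mean back to unit length is an assumption here (the pipeline may simply average); everything else follows the description above:

```python
from collections import defaultdict

import numpy as np

def group_embeddings(labels, embeddings):
    """Average utterance embeddings per speaker label, re-normalized to unit length."""
    by_speaker = defaultdict(list)
    for lab, emb in zip(labels, embeddings):
        by_speaker[lab].append(np.asarray(emb, dtype=np.float32))
    mean = {lab: np.mean(e, axis=0) for lab, e in by_speaker.items()}
    mean = {lab: m / np.linalg.norm(m) for lab, m in mean.items()}
    # every sample receives its speaker's shared averaged vector
    return [mean[lab] for lab in labels]
```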

---

## Appendix B — Unsupervised Speaker Clustering

### When to use it

Use clustering when:
- Your source dataset has no speaker labels at all
- You are mixing multiple single-speaker datasets and want automatic speaker discovery
- You want to split a dataset into perceptually distinct voice groups without manual annotation

### How it works

1. **UMAP** reduces 128-dim WavLM embeddings to `n_components` dimensions (default: 5)
2. **L2 normalization** on the reduced embeddings
3. **HDBSCAN** clusters the normalized embeddings
4. Samples with label `-1` (HDBSCAN noise) are **dropped**
5. The cluster integer ID is written to `speaker_column`

Cluster IDs are integers (0, 1, 2, ...). They have no fixed meaning across runs: the UMAP projection is non-deterministic unless `random_state` is set, so the same data can yield different clusters (and different IDs) on each run.

### Tuning guidance

**Too many noise samples (removed):**
- Lower `min_cluster_size` → makes clusters easier to form
- Lower `min_samples` → less conservative noise detection

**Too few clusters (everything merged):**
- Lower `min_cluster_size` → forces finer clusters
- Increase `n_components` in UMAP → preserves more local structure

**Clusters do not correspond to real speakers:**
- Increase `n_neighbors` in UMAP → more global structure preserved
- Try `metric: cosine` in UMAP (best for normalized embeddings)

### Combining clustering with group_sp_emb

The most common and recommended setup:

```yaml
clustering:
  do_clusters: true
  speaker_column: label

group_sp_emb:
  do_this: true
  group_by_column_name: label       # group by cluster ID
  grouped_embedding_column: grouped_sp_emb
```

The pipeline runs clustering first, assigns integer cluster IDs, then computes the average embedding per cluster. The result: every sample in the same speaker cluster receives the same stable averaged vector.

---

## Appendix C — Multi-Dataset Processing

All datasets in `hf_datasets` are processed sequentially and merged into one final dataset. Each dataset can have:
- A different text/audio column name
- Its own `add_constant` fields (e.g. language, dataset source tag)
- A mix of datasets with and without `speaker_column_name`

Example: mixing a labeled multi-speaker dataset with a single-speaker dataset:

```yaml
hf_datasets:
  - name: openslr/librispeech_asr
    sub_name: clean
    split: train
    text_column_name: text
    audio_column_name: audio
    speaker_column_name: speaker_id     # already has speaker labels
    add_constant:
      - key: lang
        value: en

  - name: your-org/single-speaker-dataset
    sub_name: null
    split: train
    text_column_name: sentence
    audio_column_name: audio
    speaker_column_name: null
    add_constant:
      - key: speaker_id                 # inject a constant speaker label
        value: alice
      - key: lang
        value: en
```

When `add_speaker_emb: true` and `do_clusters: true`, clustering runs independently per dataset. Each dataset's speaker IDs are independent integers — they do not share a global speaker space across datasets.

---

## Appendix D — Recommended Configs by Use Case

### Single-speaker dataset, no labels

```yaml
speaker_embedding_settings:
  add_speaker_emb: true
  clustering:
    do_clusters: false
  group_sp_emb:
    do_this: false

hf_datasets:
  - name: your-org/dataset
    speaker_column_name: null
    add_constant:
      - key: speaker
        value: alice
```

### Multi-speaker dataset, labels already present

```yaml
speaker_embedding_settings:
  add_speaker_emb: true
  clustering:
    do_clusters: false
  group_sp_emb:
    do_this: true
    group_by_column_name: speaker_id    # use existing labels
    grouped_embedding_column: grouped_sp_emb

hf_datasets:
  - name: your-org/dataset
    speaker_column_name: speaker_id     # column already in dataset
```

### Multi-speaker dataset, no labels — full unsupervised pipeline

```yaml
speaker_embedding_settings:
  add_speaker_emb: true
  clustering:
    do_clusters: true
    speaker_column: label
    UMAP:
      n_components: 5
      metric: cosine
    HDBSCAN:
      min_cluster_size: 15
      min_samples: 3
  group_sp_emb:
    do_this: true
    group_by_column_name: label
    grouped_embedding_column: grouped_sp_emb

hf_datasets:
  - name: your-org/unlabeled-dataset
    speaker_column_name: null
```

### Skip embedding, only tokenize

```yaml
speaker_embedding_settings:
  add_speaker_emb: false
  group_sp_emb:
    do_this: false
```

### Re-run grouping on an already-embedded dataset

```yaml
speaker_embedding_settings:
  add_speaker_emb: false             # skip re-embedding
  model:
    embedding_column: wavlm_embedding
  group_sp_emb:
    do_this: true
    group_by_column_name: speaker_id
    grouped_embedding_column: grouped_sp_emb
```

The grouper will run standalone using the `wavlm_embedding` column that already exists in the loaded dataset.

---

## Appendix E — Cosine Similarity Filtering and Regrouping

### Why cosine similarity matters for data quality

After computing `grouped_sp_emb` (the per-speaker averaged embedding), you can measure how similar each utterance's own `wavlm_embedding` is to its speaker's averaged vector. This cosine similarity score is a powerful signal for audio quality:

- **Score close to 1.0** — the utterance sounds like the speaker it was grouped under. Clean, consistent recording.
- **Score around 0.3–0.6** — moderate deviation. Could be a different speaking style, noisy conditions, or minor acoustic mismatch.
- **Score close to 0 or negative** — the utterance is very unlike the speaker average. Almost certainly a bad sample: background music, overlapping voices, corrupted audio, or a speaker mix-up.

Filtering with a threshold around **0.6** removes the worst samples while keeping natural voice variation. You will typically lose 2–10% of samples depending on dataset quality — this is normal and desirable.

An example cosine similarity distribution from one of our datasets:

<img width="558" height="409" alt="image" src="https://github.com/user-attachments/assets/2237fa2c-d167-48f9-9765-b1596de8a80b" />



### Computing cosine similarity

Both `wavlm_embedding` and `grouped_sp_emb` are L2-normalized 128-dim vectors. Cosine similarity between two L2-normalized vectors is simply their dot product:

```python
import numpy as np
from datasets import load_from_disk

dataset = load_from_disk("train_dataset")

def add_cosine_sim(sample):
    emb = np.array(sample["wavlm_embedding"], dtype=np.float32)
    grp = np.array(sample["grouped_sp_emb"], dtype=np.float32)
    # Both are L2-normalized, so dot product == cosine similarity
    sim = float(np.dot(emb, grp))
    return {"cosine_sim": sim}

dataset = dataset.map(add_cosine_sim, num_proc=10, desc="Computing cosine similarity")

# Inspect the distribution
sims = np.array(dataset["cosine_sim"])
print(f"Mean:    {sims.mean():.3f}")
print(f"Median:  {np.median(sims):.3f}")
print(f"< 0.0:   {(sims < 0.0).sum()} samples  ({(sims < 0.0).mean()*100:.1f}%)")
print(f"< 0.3:   {(sims < 0.3).sum()} samples  ({(sims < 0.3).mean()*100:.1f}%)")
print(f"< 0.6:   {(sims < 0.6).sum()} samples  ({(sims < 0.6).mean()*100:.1f}%)")
```

If a large fraction of samples has cosine similarity near 0 or negative, that dataset has serious quality problems and cleaning is essential.

### Filtering by threshold

```python
THRESHOLD = 0.6

filtered = dataset.filter(
    lambda s: s["cosine_sim"] >= THRESHOLD,
    num_proc=10,
    desc=f"Filtering cosine_sim < {THRESHOLD}",
)

print(f"Before: {len(dataset)} samples")
print(f"After:  {len(filtered)} samples  ({len(filtered)/len(dataset)*100:.1f}% retained)")

# Save the filtered dataset
filtered.save_to_disk("train_dataset_filtered")
```

### Regrouping after filtering

After removing bad samples, the previously computed `grouped_sp_emb` vectors are stale — they were averaged over the unfiltered set. You must recompute them. Use the standalone script:

```bash
venv/bin/python regroup_embeddings.py \
    --dataset train_dataset_filtered \
    --embedding-col wavlm_embedding \
    --group-col label \
    --output-col grouped_sp_emb \
    --output train_dataset_clean
```

To also push the result to HuggingFace:

```bash
venv/bin/python regroup_embeddings.py \
    --dataset train_dataset_filtered \
    --embedding-col wavlm_embedding \
    --group-col label \
    --output-col grouped_sp_emb \
    --output train_dataset_clean \
    --hub your-username/dataset-clean
```

All arguments:

| Argument | Default | Description |
|----------|---------|-------------|
| `--dataset` | required | Path to filtered dataset (Arrow format from `save_to_disk`) |
| `--embedding-col` | `wavlm_embedding` | Per-utterance embedding column |
| `--group-col` | `label` | Speaker grouping column (cluster ID or speaker name) |
| `--output-col` | `grouped_sp_emb` | Output column name for the recomputed averages |
| `--output` | required | Local path to save the result |
| `--hub` | none | HuggingFace repo to push to (optional) |

### Recommended workflow

```
1. make run                          → full pipeline (embed + cluster + group)
2. compute cosine_sim per sample     → inspect distribution
3. filter by threshold (≥ 0.6)       → remove bad audio
4. regroup_embeddings.py             → recompute grouped_sp_emb on clean data
5. use train_dataset_clean for training
```

If after regrouping you still see many low-similarity samples, consider tightening the threshold (0.65–0.7) or reviewing your clustering parameters — poor clusters lead to noisy group averages.

---

## Need Help?

- Discord: [![](https://dcbadge.limes.pink/api/server/https://discord.gg/NzP3rjB4SB?style=flat)](https://discord.gg/NzP3rjB4SB)
- Found a bug? Open an issue on the repository.

## License

Apache 2.0. See `LICENSE` for details.

WavLM model ([Orange/Speaker-wavLM-tbr](https://huggingface.co/Orange/Speaker-wavLM-tbr)): CC-BY-SA-3.0 (Orange SA).
