## What forced alignment is (and isn’t) in this setup

Forced alignment is a **validator + locator**: given *(audio, transcript)* it finds the best timing path and can output **word/phoneme timestamps** plus **alignment scores**. It is extremely good at telling you “this text fits the audio well” vs “this text is off,” but it won’t magically correct a bad transcript unless you use it inside a **repair loop**.

Tools that fit your language set:

* **Montreal Forced Aligner (MFA)**: Kaldi-based, trainable, supports G2P via Pynini; best when you can provide/learn lexicons per language. ([montreal-forced-aligner.readthedocs.io][1])
* **WhisperX**: runs Whisper, then refines timestamps using **wav2vec2 forced alignment** (word-level timestamps + “unaligned word” signals). ([GitHub][2])
* **MMS CTC forced aligner (158 languages)**: practical for Indic because it supports **ISO-639-3 language codes** and can **romanize** during preprocessing (useful across scripts). ([huggingface.co][3])
* **torchaudio CTC forced alignment API**: good if you want to build your own scoring/diagnostics loop (CTC segmentation + high-level bundles). ([docs.pytorch.org][4])

## The “tight” strategy: Alignment-weighted system combination + targeted repair

You said you have **3 transcripts per utterance** (Whisper-large-v3, Indic Conformer(s), LLM-ASR, etc.). The highest-yield approach is:

### 0) Normalize aggressively (or your voting is fake)

Do this *before* any comparison/alignment:

* Script + punctuation normalization (Indic punctuation, danda, quotes)
* Numerals (e.g., “10”, “పది”, “दस”) → choose one canonical form per language
* Common expansions (abbrev, currency, dates) in a consistent style
* Remove obvious ASR junk tokens (“[MUSIC]”, “uh”, repeated words) only if you plan to remove them everywhere

If normalization is inconsistent, ROVER/consensus will “disagree” for dumb reasons.
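A minimal sketch of such a normalizer (the mapping tables and junk-token patterns here are illustrative placeholders, not complete per-language rules):

```python
import re
import unicodedata

# Illustrative tables only -- real per-language rules will be larger.
DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")
PUNCT_MAP = str.maketrans({
    "।": ".",   # danda -> period: pick ONE convention and apply it everywhere
    "॥": ".",
    "“": '"', "”": '"', "‘": "'", "’": "'",
})

def normalize(text: str) -> str:
    """Canonicalize a transcript before any comparison/voting/alignment."""
    text = unicodedata.normalize("NFC", text)          # one Unicode form
    text = text.translate(DEVANAGARI_DIGITS)           # one numeral convention
    text = text.translate(PUNCT_MAP)                   # one punctuation convention
    text = re.sub(r"\[(?:MUSIC|NOISE)\]", " ", text)   # drop junk tokens consistently
    text = re.sub(r"\s+", " ", text).strip()
    return text
```

Whatever the normalizer does, run the *same* one on every system's output for a given language, so voting disagreements reflect real recognition differences.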

---

### 1) Score each candidate transcript with an **independent forced aligner**

For each candidate transcript `T_j` for the same audio:

1. Run a forced aligner (the **MMS CTC aligner** is a good universal baseline; optionally add MFA for languages where you can build strong lexicons). ([huggingface.co][3])
2. Extract these metrics (you’ll use them as weights + filters):

* **Coverage**: % of words/chars that receive a valid alignment (unaligned words are a huge red flag; WhisperX exposes this concept too). ([huggingface.co][5])
* **Mean alignment score** (CTC path scores / emission-based score; MMS/torchaudio-style aligners give you per-token/segment scores) ([huggingface.co][3])
* **Timing sanity**: too many ultra-short word durations, or big “gaps” in the middle of speech
* **Stability check**: re-run alignment with ±200–400ms padding on segment boundaries; good transcripts stay stable

**Hard filter**: if coverage is below X% or the mean score falls below a threshold, drop that transcript *for that utterance*.
This alone often beats naive voting because it nukes hallucinations and language-mismatch outputs.
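Assuming the aligner gives you per-word spans and scores (MMS/torchaudio-style output), the metrics and hard filter might look like this sketch (all thresholds are placeholders to calibrate on a dev set):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlignedWord:
    word: str
    start: Optional[float]   # seconds; None if the aligner failed on this word
    end: Optional[float]
    score: Optional[float]   # per-word alignment score in [0, 1]

def transcript_metrics(words: list[AlignedWord]) -> dict:
    """Turn per-word alignment output into the filter metrics above."""
    aligned = [w for w in words if w.score is not None]
    coverage = len(aligned) / max(len(words), 1)
    mean_score = sum(w.score for w in aligned) / max(len(aligned), 1)
    # Timing sanity: count implausibly short word durations (30 ms is a guess)
    short_words = sum(1 for w in aligned if (w.end - w.start) < 0.03)
    return {"coverage": coverage, "mean_score": mean_score, "short_words": short_words}

def passes_hard_filter(m: dict, min_coverage=0.95, min_score=0.6) -> bool:
    return m["coverage"] >= min_coverage and m["mean_score"] >= min_score
```

The stability check is the same computation run twice (original vs. padded segment boundaries), comparing the resulting metrics.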

---

### 2) Do **weighted ROVER / confusion-network voting** (not naive majority)

ROVER is the classic system-combination method for ASR 1-best outputs. ([NIST][6])
Even better: build a **confusion network** and choose words with the best expected WER properties (consensus decoding). ([arXiv][7])

How to make it “insane” instead of average:

* Compute a **weight per system per utterance**:
  `w_j = softmax(λ · alignScore(T_j) − μ · unalignedRate(T_j))`
* In each confusion slot, choose the token maximizing `Σ_j w_j · 1[token_j = token]`
* If you have per-word confidences from ASR decoders, multiply them in, but **don’t trust them alone** (alignment is your external auditor).

This is directly in the spirit of ROVER (combine multiple ASR outputs to reduce WER) ([NIST][6]) and confusion-network consensus decoding for WER minimization. ([arXiv][7])
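A sketch of the weighting and the per-slot vote, assuming you already have the confusion network (the multiple-sequence alignment that builds it is the hard part and isn't shown); `lam` and `mu` are hyperparameters to calibrate:

```python
import math
from collections import defaultdict

def system_weights(align_scores, unaligned_rates, lam=5.0, mu=5.0):
    """w_j = softmax(lam * alignScore_j - mu * unalignedRate_j), over systems j."""
    logits = [lam * s - mu * u for s, u in zip(align_scores, unaligned_rates)]
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def vote_slot(tokens, weights):
    """Pick the token with the highest total system weight in one confusion slot."""
    totals = defaultdict(float)
    for tok, w in zip(tokens, weights):
        totals[tok] += w
    return max(totals, key=totals.get)
```

A poorly aligned system still votes, but with near-zero weight, so it only breaks genuine ties.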

---

### 3) Alignment-guided “repair loop” on only the bad spans

After you produce the consensus transcript `T^*`:

1. Forced-align `T^*` again.
2. Identify **low-confidence regions**:

* low alignment scores
* clusters of unaligned words
* high disagreement entropy in the confusion network slot

3. For just those spans, re-decode with “expensive mode”:

* shorter audio window (e.g., 2–8s), more context overlap
* higher beam / more temperatures
* for Whisper: prompt with left/right context from `T^*` so it stops freelancing

4. Re-run weighted voting only for that span, splice back into `T^*`, then re-align.

This is how you spend compute where it actually buys WER.
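The span-selection part of the loop can be sketched as follows (the score threshold and padding are placeholders; the entropy function assumes the per-system weights from the voting step):

```python
import math
from collections import defaultdict

def disagreement_entropy(slot_tokens, weights):
    """Weighted entropy of one confusion slot; high = systems disagree there."""
    p = defaultdict(float)
    for tok, w in zip(slot_tokens, weights):
        p[tok] += w
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def bad_spans(word_scores, threshold=0.5, pad=0):
    """Merge runs of low-score word indices into (start, end) index spans."""
    spans = []
    for i, s in enumerate(word_scores):
        if s >= threshold:
            continue
        lo, hi = max(0, i - pad), min(len(word_scores) - 1, i + pad)
        if spans and lo <= spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], hi)     # extend the previous span
        else:
            spans.append((lo, hi))
    return spans
```

Each span's word indices map back to audio times via the alignment, which gives you the short (e.g., 2–8 s) windows to re-decode in expensive mode.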

---

### 4) The cheat code if you truly need **<1% WER**: allow rejection

If you’re building training data, the easiest way to get **label WER near 0–1%** is not “perfect correction,” it’s **precision-first filtering**:

* Accept only utterances where:

  * consensus alignment coverage is extremely high
  * alignment score is above a strict threshold
  * systems strongly agree (low confusion entropy)

Everything else gets:

* reprocessed (repair loop), or
* routed to human spot-check, or
* dropped

With unlimited data, dropping 20–60% of segments is often cheaper than trying to “fix” them.
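The triage decision reduces to a small gate like this (all thresholds are hypothetical; calibrate them on a hand-labeled dev set per language):

```python
def triage(coverage, mean_score, slot_entropies,
           cov_min=0.98, score_min=0.8, entropy_max=0.2):
    """Precision-first gate: accept only near-certain utterances, route the rest."""
    max_entropy = max(slot_entropies, default=0.0)
    if coverage >= cov_min and mean_score >= score_min and max_entropy <= entropy_max:
        return "accept"
    if mean_score >= 0.5:              # plausibly fixable -> repair loop
        return "repair"
    return "drop_or_human_review"      # hallucination / language-mismatch territory
```

Tightening `cov_min` and `entropy_max` trades yield for label quality; with unlimited raw data, err on the side of rejection.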

---

## Practical recommendations for your language set

* Start with **MMS forced aligner** as the universal validator (Indic scripts + romanization + broad coverage). ([huggingface.co][3])
* Use **WhisperX** if you already run Whisper and want alignment diagnostics + word timestamps quickly (but still keep MMS as independent validator if Whisper is one of your generators). ([GitHub][2])
* Use **MFA** when you’re willing to invest in per-language lexicon/G2P quality (it can be very strong, but the lexicon work matters). ([montreal-forced-aligner.readthedocs.io][1])
* Implement voting as **weighted confusion-network/ROVER**, weighted by alignment quality, not equal votes. ([NIST][6])

## One reality check (important)

At scale, **<1% WER across all utterances** is usually not achievable *purely automatically* unless you (a) constrain the domain hard, and/or (b) **reject** a meaningful portion of data, and/or (c) add some targeted human verification. The pipeline above gets you as close as you can get with compute, and makes it obvious exactly which utterances need escalation.

If you want, I can translate the above into a concrete spec: scoring functions, thresholds to calibrate (with a 1–2 hour hand-labeled dev set per language), and an exact repair-loop schedule.

[1]: https://montreal-forced-aligner.readthedocs.io/en/stable/user_guide/index.html?utm_source=chatgpt.com "User Guide — Montreal Forced Aligner 3.X documentation"
[2]: https://github.com/m-bain/whisperX?utm_source=chatgpt.com "m-bain/whisperX"
[3]: https://huggingface.co/MahmoudAshraf/mms-300m-1130-forced-aligner "MahmoudAshraf/mms-300m-1130-forced-aligner · Hugging Face"
[4]: https://docs.pytorch.org/audio/stable/tutorials/forced_alignment_tutorial.html?utm_source=chatgpt.com "Forced Alignment with Wav2Vec2"
[5]: https://huggingface.co/spaces/Anioji/WhisperX/blob/9c2d684d1a778e559b6b13bb90112712dbc20568/whisperx/alignment.py?utm_source=chatgpt.com "whisperx/alignment.py"
[6]: https://www.nist.gov/publications/post-processing-system-yield-reduced-word-error-rates-recognizer-output-voting-error?utm_source=chatgpt.com "Recognizer Output Voting Error Reduction [ROVER]"
[7]: https://arxiv.org/abs/cs/0010012?utm_source=chatgpt.com "Finding consensus in speech recognition: word error minimization and other applications of confusion networks"
