# Validation Pipeline TODOs

## Current State (What We Have)
- ✅ CTC frame-level probabilities from Wav2Vec2 (Telugu model)
- ✅ Basic "alignment score" from max frame probs (~0.91 mean)
- ✅ **NEW: Real CTC Forced Alignment** (`ctc_forced_aligner.py`)
  - Per-word alignment scores
  - Identifies low-confidence words
  - Mean score: ~0.79 (more rigorous than frame confidence)
- ❌ ASR round-trip not implemented yet (caveat for Indic: both models are imperfect, so a mismatch flags a sample for review rather than proving an error)
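
The per-word scoring listed above can be sketched as follows. This is a minimal illustration, not the code in `ctc_forced_aligner.py`: the function name `score_words`, the input shapes, and the `0.5` threshold are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def score_words(
    words: Sequence[Tuple[str, int, int]],  # (word, first_frame, last_frame), inclusive
    frame_probs: Sequence[float],           # probability of the aligned token at each frame
    threshold: float = 0.5,                 # hypothetical cut-off, tune on real data
) -> List[Dict]:
    """Average the aligned-frame probabilities per word and flag weak words."""
    results = []
    for word, first, last in words:
        probs = frame_probs[first:last + 1]
        # An empty span gets score 0.0 so it is always flagged.
        score = sum(probs) / len(probs) if probs else 0.0
        results.append(
            {"word": word, "score": score, "low_confidence": score < threshold}
        )
    return results
```

Averaging the per-word scores gives an utterance-level number comparable to the ~0.79 mean quoted above.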

---

## Priority 1: ASR Round-Trip Validation (HIGH VALUE)

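**Why**: Without reference transcripts, the strongest available check on Gemini's transcription is to have an independent ASR model transcribe the same audio and compare the two outputs. Agreement is evidence the text is right; disagreement marks the sample for review.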

### Implementation
```python
# 1. Transcribe with Gemini → "gemini_text"
# 2. Transcribe with Whisper → "whisper_text" 
# 3. Compute CER(gemini_text, whisper_text)
# 4. If CER < 15% → High confidence
#    If CER 15-30% → Medium (review)
#    If CER > 30% → Low (regenerate/discard)
```
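
Steps 3 and 4 can be sketched with the standard library alone. The normalization rules (NFC, dropping punctuation and zero-width joiners) are an assumption about what should not count as an error for Telugu text; the helper names are hypothetical.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize, drop punctuation and zero-width joiners, collapse spaces."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u200c", "").replace("\u200d", "")
    # Keep letters (L*), combining marks (M*) and digits (N*); blank out the rest.
    kept = [ch if unicodedata.category(ch)[0] in "LMN" else " " for ch in text]
    return re.sub(r"\s+", " ", "".join(kept)).strip()


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over code points, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cer(hyp: str, ref: str) -> float:
    """Character error rate of hyp against ref, after normalization."""
    hyp, ref = normalize(hyp), normalize(ref)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(hyp, ref) / len(ref)


def confidence_band(cer_value: float) -> str:
    """Map CER to the bands listed above."""
    if cer_value < 0.15:
        return "high"
    if cer_value <= 0.30:
        return "medium"
    return "low"
```

Note that CER is not symmetric: it divides by the length of the second argument, so the Whisper text acts as the reference here even though neither transcript is ground truth.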

### Models to Use
- **Whisper Large V3** (multilingual, covers Telugu; confirm quality on our own data)
- **IndicConformer** (already have this!)
- OR: Google USM/Chirp if available

### Action Items
- [ ] Add `whisper` or `faster-whisper` to requirements
- [ ] Create `asr_validator.py` that does round-trip validation
- [ ] Compute CER between Gemini output and Whisper output
- [ ] Add ASR confidence (log-prob) as secondary signal
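
For the log-prob signal in the last item, one option is a duration-weighted mean of per-segment average log-probabilities, exponentiated into a 0-1 value. A minimal sketch, assuming each decoded segment carries `start`, `end`, and `avg_logprob` (the fields faster-whisper reports per segment); the `Segment` class below is a stand-in, not a library type.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Hypothetical stand-in for a decoded ASR segment."""
    start: float        # seconds
    end: float          # seconds
    avg_logprob: float  # mean token log-probability for the segment


def asr_confidence(segments: Iterable[Segment]) -> float:
    """Duration-weighted mean log-prob, mapped to (0, 1] via exp."""
    total = 0.0
    weighted = 0.0
    for seg in segments:
        duration = max(seg.end - seg.start, 0.0)
        total += duration
        weighted += duration * seg.avg_logprob
    if total == 0.0:
        return 0.0  # no speech decoded, treat as zero confidence
    return math.exp(weighted / total)
```

Weighting by duration keeps a short, confident segment from masking a long, uncertain one.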

---

## Priority 2: Proper CTC-Segmentation (MEDIUM VALUE)

**Why**: The alignment in `alignment_scorer.py` is only proportional (word boundaries are split by character count), not a true alignment. Real CTC segmentation uses dynamic programming to find the optimal path through the emissions.

### Implementation
Use `torchaudio.functional.forced_align` (torchaudio 2.1+) or the `ctc-segmentation` library.

```python
# Instead of proportional alignment:
# 1. Get CTC emissions from Wav2Vec2
# 2. Build trellis matrix for transcript characters
# 3. Find optimal path via Viterbi/DP
# 4. Extract word boundaries from path
```
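
The manual fallback for steps 2 to 4 can be sketched in pure Python. This is an illustrative Viterbi over a blank-interleaved trellis, not torchaudio's implementation; the function name and return shape are assumptions. It returns token-level spans, and word boundaries follow by grouping the tokens of each word.

```python
import math
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def ctc_forced_align(
    log_probs: Sequence[Sequence[float]],  # [frames][vocab] log-probabilities
    tokens: Sequence[int],                 # transcript as vocab ids, no blanks
    blank: int = 0,
) -> List[Tuple[int, int, int, float]]:
    """Return (token_id, first_frame, last_frame, mean_prob) per transcript token."""
    if not tokens or not log_probs:
        raise ValueError("need at least one token and one frame")
    # Interleave blanks: [blank, t1, blank, t2, ..., tn, blank]
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)

    # score[t][s]: best log-prob of any path that is in state s at frame t.
    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][blank]
    score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            prev = [s] + ([s - 1] if s >= 1 else [])
            # Skipping the blank between two tokens is allowed only if they differ.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                prev.append(s - 2)
            best = max(prev, key=lambda p: score[t - 1][p])
            if score[t - 1][best] > NEG_INF:
                score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
                back[t][s] = best

    # A valid path ends on the last token or on the trailing blank.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == NEG_INF:
        raise ValueError("too few frames for this transcript")
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]

    # Token k lives at odd state 2k+1; every valid path visits each one.
    spans = []
    for k, tok in enumerate(tokens):
        frames = [t for t, st in enumerate(path) if st == 2 * k + 1]
        mean_p = sum(math.exp(log_probs[t][tok]) for t in frames) / len(frames)
        spans.append((tok, frames[0], frames[-1], mean_p))
    return spans
```

Frame indices convert to seconds by multiplying with the model's frame stride (about 20 ms for Wav2Vec2 at 16 kHz).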

### Action Items
- [ ] Implement proper CTC-segmentation in `alignment_scorer.py`
- [ ] Use torchaudio's forced_align if available
- [ ] Fall back to manual trellis if not

---

## Priority 3: Install Real IndicMFA (OPTIONAL)

**Why**: Gives precise phone-level alignment and MFA's quality metrics.

### Blockers
- Requires `conda install -c conda-forge montreal-forced-aligner`
- Requires downloading AI4Bharat's acoustic models
- Complex setup, may not be worth it

### If We Do It
- [ ] Install MFA via conda
- [ ] Download IndicMFA models from AI4Bharat releases
- [ ] Extract `alignment_analysis.csv` metrics:
  - Speech log-likelihood
  - Phone duration deviation
  - OOV rate
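
If we go this route, filtering on the exported metrics could look like the sketch below. The column names `file`, `speech_log_likelihood`, and `phone_duration_deviation` are assumptions about the CSV layout and should be checked against the actual file; both thresholds are placeholders. OOV rate is left out because it may not be a column in this file.

```python
import csv
from typing import Dict, List, TextIO


def flag_mfa_utterances(
    csv_file: TextIO,
    min_speech_ll: float = -50.0,   # hypothetical threshold
    max_duration_dev: float = 3.0,  # hypothetical threshold
) -> List[Dict]:
    """Return rows whose alignment metrics fall outside the given limits."""
    flagged = []
    for row in csv.DictReader(csv_file):
        try:
            ll = float(row["speech_log_likelihood"])
            dev = float(row["phone_duration_deviation"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric metrics
        if ll < min_speech_ll or dev > max_duration_dev:
            flagged.append(
                {
                    "file": row.get("file", ""),
                    "speech_log_likelihood": ll,
                    "phone_duration_deviation": dev,
                }
            )
    return flagged
```

Open the file with `open(path, newline="", encoding="utf-8")` and pass the handle in.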

---

## Scoring Recipe (Combined)

```python
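def compute_cer(hyp: str, ref: str) -> float:
    """Character Error Rate: Levenshtein distance divided by len(ref)."""
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = cur
    return prev[-1] / len(ref)


def compute_quality_score(
    gemini_text: str,
    whisper_text: str,
    alignment_score: float,
    audio_duration: float,
) -> dict:
    """
    Compute quality score for transcription validation.

    Returns:
        {
            "cer": 0.12,  # Character Error Rate vs Whisper
            "alignment_score": 0.91,  # CTC frame confidence
            "combined": 0.89,  # weighted blend of the two signals
            "quality": "high",  # high/medium/low/failed
            "action": "keep"  # keep/review/regenerate/discard
        }
    """
    # audio_duration is accepted for future use; it does not affect the score yet.

    # 1. CER between Gemini and Whisper (Whisper text is the reference)
    cer = compute_cer(gemini_text, whisper_text)

    # 2. Combined score; CER can exceed 1.0, so clamp it before blending
    combined = 0.6 * (1 - min(cer, 1.0)) + 0.4 * alignment_score

    # 3. Decision, most severe case first. Combinations the threshold table
    #    does not cover (e.g. low CER with weak alignment) fall through to review.
    if cer > 0.50 and alignment_score < 0.6:
        quality, action = "failed", "discard"
    elif cer < 0.15 and alignment_score > 0.8:
        quality, action = "high", "keep"
    elif cer < 0.30:
        quality, action = "medium", "review"
    else:
        quality, action = "low", "regenerate"

    return {
        "cer": cer,
        "alignment_score": alignment_score,
        "combined": combined,
        "quality": quality,
        "action": action,
    }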
```

---

## Decision Thresholds

| CER | Alignment | Quality | Action |
|-----|-----------|---------|--------|
| < 15% | > 0.8 | High | ✅ Keep |
| 15-30% | > 0.7 | Medium | 🔍 Review |
| > 30% | any | Low | 🔄 Regenerate |
| > 50% | < 0.6 | Failed | ❌ Discard |

---

## Next Steps (Ordered)

1. **[NEXT]** Add Whisper round-trip validation
2. Compute CER as primary quality metric
3. Upgrade to proper CTC-segmentation (optional)
4. Consider IndicMFA only if CTC isn't enough

---

## References

- [torchaudio forced alignment tutorial](https://docs.pytorch.org/audio/tutorials/forced_alignment_tutorial.html)
- [AI4Bharat IndicMFA](https://github.com/AI4Bharat/IndicMFA)
- [ctc-segmentation library](https://github.com/lumaku/ctc-segmentation)
- [MFA alignment analysis docs](https://montreal-forced-aligner.readthedocs.io/en/v3.1.1/user_guide/implementations/alignment_analysis.html)