# Transcript Pipeline Plan V2 (Merged After Claude + GPT Review)

## 1) Executive Summary
- Keep authoritative transcript as code-mixed native scripts.
- Use `expected_language_hint` from Supabase as a weak prior only.
- Keep punctuation from audible prosody only.
- Run aggressive boundary salvage before transcription.
- Split outputs into 3 dataset lanes early: `asr_core`, `tts_clean`, `tts_expressive`.
- Run Gemini online inference as the primary lane for SLA control; use Vertex Batch as overflow/backlog lane; use OpenRouter as emergency overflow with strict spend cap.
- Ship fast validation first (V1) and run heavy validation on low-score tail (V2) without blocking full-run throughput.

## 2) Merged Decisions (What Changed vs Prior Plan)

### 2.1 Adopted from Claude/GPT plans
- Remove full JSON schema from prompt body and enforce schema via API parameters only.
- Standardize `detected_language` as codes: `hi mr te ta kn ml gu pa bn as or en`, plus `no_speech`, `other`.
- Remove model self-confidence field from schema; rely on external validators.
- Explicitly define `tagged` as strict derivation of `transcription` (copy + event tags only).
- Add explicit no-speech behavior:
  - `transcription="[NO_SPEECH]"`
  - `tagged="[NO_SPEECH]"`
  - `detected_language="no_speech"`
  - `speaker={"emotion":"neutral","speaking_style":"conversational","pace":"normal","accent":""}`
- Process work by `videoID.tar` unit (download/extract once, process all segments inside).
- Keep transliteration/romanization as a later text-only pass, not in this audio call.
- Add strict versioning fields: prompt/schema/trimmer/validator/model/provider versions.

### 2.2 Conflicts resolved
- Thinking level:
  - Prior plan used `minimal`.
  - Updated: run canary with `low` and `minimal`; lock in whichever wins on quality at acceptable latency/cost.
- Batch strategy:
  - Prior plan leaned batch-heavy.
  - Updated: online inference is primary; batch is a secondary overflow/backlog lane due to async turnaround uncertainty.
- Event tags:
  - Prior plan included `[speech_overlap]`.
  - Updated: remove `[speech_overlap]` from inline event tags; track overlap as a separate quality flag.

## 3) Final Output Contract (Schema V2)

```json
{
  "type": "object",
  "properties": {
    "transcription": { "type": "string" },
    "tagged": { "type": "string" },
    "speaker": {
      "type": "object",
      "properties": {
        "emotion": {
          "type": "string",
          "enum": ["neutral", "happy", "sad", "angry", "excited", "surprised"]
        },
        "speaking_style": {
          "type": "string",
          "enum": ["conversational", "narrative", "energetic", "calm", "emphatic", "sarcastic", "formal"]
        },
        "pace": {
          "type": "string",
          "enum": ["slow", "normal", "fast"]
        },
        "accent": { "type": "string" }
      },
      "required": ["emotion", "speaking_style", "pace", "accent"],
      "additionalProperties": false
    },
    "detected_language": {
      "type": "string",
      "enum": ["hi", "mr", "te", "ta", "kn", "ml", "gu", "pa", "bn", "as", "or", "en", "no_speech", "other"]
    }
  },
  "required": ["transcription", "tagged", "speaker", "detected_language"],
  "additionalProperties": false
}
```
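The schema is enforced via API response-schema parameters, but V1 validation still needs a local conformance check on parsed outputs. A minimal sketch of that check, mirroring Schema V2 without pulling in a full JSON Schema validator (function name is illustrative):

```python
# Structural check mirroring Schema V2: required keys plus enum membership.
REQUIRED_TOP = ("transcription", "tagged", "speaker", "detected_language")
REQUIRED_SPEAKER = ("emotion", "speaking_style", "pace", "accent")
LANGS = {"hi", "mr", "te", "ta", "kn", "ml", "gu", "pa", "bn", "as", "or",
         "en", "no_speech", "other"}
EMOTIONS = {"neutral", "happy", "sad", "angry", "excited", "surprised"}
STYLES = {"conversational", "narrative", "energetic", "calm", "emphatic",
          "sarcastic", "formal"}
PACES = {"slow", "normal", "fast"}

def schema_v2_errors(obj: dict) -> list:
    """Return a list of violations; an empty list means the object conforms."""
    errs = [f"missing:{k}" for k in REQUIRED_TOP if k not in obj]
    if errs:
        return errs
    spk = obj["speaker"]
    errs += [f"missing:speaker.{k}" for k in REQUIRED_SPEAKER if k not in spk]
    if obj["detected_language"] not in LANGS:
        errs.append("bad:detected_language")
    if not errs:
        if spk["emotion"] not in EMOTIONS:
            errs.append("bad:emotion")
        if spk["speaking_style"] not in STYLES:
            errs.append("bad:speaking_style")
        if spk["pace"] not in PACES:
            errs.append("bad:pace")
    return errs
```

Note this matches the no-speech contract: the `[NO_SPEECH]` record from §2.1 passes with zero errors.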

## 4) Prompt Spec V2 (Behavior Rules)
- `EXPECTED_LANGUAGE_HINT` is a weak prior from metadata.
- Trust audio over hint when mismatch exists.
- `transcription` is authoritative code-mixed native script.
- Preserve verbatim fillers, stammers, repetitions, false starts, and abrupt cutoffs.
- Punctuation limited to `. , ? !` from audible prosody only.
- Numbers and symbols:
  - preserve spoken form (do not normalize to digits/symbols unless literally spoken as such).
- `tagged` must be character-identical to `transcription` except for inserted event tags.
- Event tags allowed set:
  - `[laugh] [cough] [sigh] [breath] [throat_clear] [sniff] [music] [applause] [noise] [singing]`
- Tag insertion rule: only when prominent and unambiguous; if uncertain, omit.
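The "character-identical except for inserted event tags" rule is mechanically checkable. A sketch of the derivation check, assuming tags are inserted with surrounding whitespace (the whitespace normalization here is an assumption, not part of the contract):

```python
import re

# Allowed inline event tags from the prompt spec.
EVENT_TAGS = ("laugh", "cough", "sigh", "breath", "throat_clear",
              "sniff", "music", "applause", "noise", "singing")
TAG_RE = re.compile(r"\[(?:" + "|".join(EVENT_TAGS) + r")\]")

def is_strict_derivation(transcription: str, tagged: str) -> bool:
    """True iff `tagged` is `transcription` plus allowed event tags only."""
    stripped = TAG_RE.sub("", tagged)
    norm = lambda s: " ".join(s.split())
    return norm(stripped) == norm(transcription)
```

Because only the allowed set is stripped, a disallowed tag such as `[speech_overlap]` survives stripping and fails the check, which matches the decision in §2.2 to exclude it.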

## 5) Audio Preparation and Segmentation
- Input source: `videoID.tar` from R2 containing `metadata.json` + `segments/*.flac`.
- For each segment:
  - measure start/end boundary quality with short-window energy + VAD.
  - trim to nearest usable low-energy valley if boundary is clipped.
  - if both edges are unsalvageable hard cuts, drop the segment from the TTS lanes; keep it for ASR only if its later quality score passes.
- Add `150ms` silence padding on both ends after trim.
- Duration policy:
  - global reject: `< 2.0s`
  - target window: `2.0s–10.0s`
  - extension for split search: up to `15.0s`

## 6) Data Lanes and Acceptance Logic
- `asr_core`:
  - high-confidence transcripts; may include some clipped/partial segments if the quality score passes.
- `tts_clean`:
  - strict boundaries, no overlap suspicion, no clipped edges, no no-speech.
- `tts_expressive`:
  - strict boundaries plus approved event-tag presence.
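Since the storage contract assigns each segment exactly one lane, the acceptance logic needs a precedence order. A sketch under assumed precedence (expressive over clean when both apply; quarantine for quality failures and no-speech, which the plan does not spell out):

```python
def assign_lane(*, quality_ok: bool, strict_boundaries: bool,
                overlap_suspect: bool, clipped_edges: bool,
                no_speech: bool, has_event_tags: bool) -> str:
    """Route one segment to a single dataset lane."""
    if not quality_ok or no_speech:
        return "quarantine"
    if strict_boundaries and not overlap_suspect and not clipped_edges:
        # Strict-boundary segments split on approved event-tag presence.
        return "tts_expressive" if has_event_tags else "tts_clean"
    # Clipped/partial but quality-passing segments stay usable for ASR.
    return "asr_core"
```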

## 7) Validation Plan

### V1 (all segments, cheap, non-blocking)
- schema validity and parse success
- `tagged`-derivation integrity check
- special token density (`[UNK]`, `[INAUDIBLE]`, `[NO_SPEECH]`)
- language mismatch flag (hint vs detected)
- duration vs text-length plausibility
- script plausibility checks
- event-tag count sanity
- overlap suspicion signal
- provisional quality score
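One of the cheaper V1 signals, special-token density, can be sketched as a pure whitespace-token ratio (the tokenization choice is an assumption; a script-aware tokenizer may be preferable for Indic text):

```python
# Placeholder tokens whose density feeds the V1 quality signal.
SPECIAL_TOKENS = ("[UNK]", "[INAUDIBLE]", "[NO_SPEECH]")

def special_token_density(text: str) -> float:
    """Fraction of whitespace-separated tokens that are special placeholders."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in SPECIAL_TOKENS)
    return hits / len(tokens)
```

A high density would lower the provisional quality score and push the segment toward the V2 tail.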

### V2 (low-score tail only)
- consensus ASR comparison with Indic multilingual ASR backbone
- optional forced-alignment confidence scoring
- optional premium-model recheck for bottom tail
- final quality score and retry/quarantine action
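The consensus ASR comparison in V2 reduces to a distance between the primary transcript and the backbone's output. A character error rate sketch via standard edit-distance DP (CER is an assumed metric choice; the plan does not fix one, and CER is more robust than WER for unspaced or code-mixed Indic text):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / max(m, 1)
```

Segments whose CER against the backbone exceeds a tuned threshold would get the retry/quarantine action.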

## 8) Throughput and Provider Routing
- Target: 80M segments in 100 hours.
- Primary lane: Gemini online inference (AI Studio/Vertex online) with strict token-bucket controls.
- Secondary lane: Vertex Batch for backlog/overflow.
- Tertiary lane: OpenRouter overflow only when ETA risk breaches target and Google lanes are saturated.
- OpenRouter policy: strict budget cap and explicit routing requirements.
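The "strict token-bucket controls" on the primary lane can be sketched as a per-provider bucket; the rate and burst numbers in the test are placeholders, not the plan's real quotas:

```python
import time

class TokenBucket:
    """Per-provider request pacing: refill continuously, spend per request."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s       # sustained requests per second
        self.capacity = burst        # maximum stored allowance
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        """Spend n tokens if available; non-blocking."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A rejected acquire is where adaptive routing would divert work to the Vertex Batch lane, and only past that to OpenRouter under the spend cap.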

## 9) Storage Contract (Per Segment)
- identity: `video_id`, `segment_id`, `speaker_id`
- offsets: original + trimmed boundaries
- audio prep metadata: boundary flags, padding, salvage status
- inference metadata: model/provider/prompt/schema versions and request settings
- output fields: transcription/tagged/speaker/detected_language
- validation metrics: provisional/final scores + flags
- dataset lane assignment: `asr_core` | `tts_clean` | `tts_expressive` | `quarantine`
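The contract above can be pinned down as a typed record. A sketch in which the identity fields follow the bullets exactly, while the remaining field names and types are assumptions to be finalized with the storage team:

```python
from dataclasses import dataclass, field

@dataclass
class SegmentRecord:
    """One per-segment row of the storage contract."""
    # identity
    video_id: str
    segment_id: str
    speaker_id: str
    # offsets: original + trimmed boundaries (seconds)
    orig_start_s: float
    orig_end_s: float
    trim_start_s: float
    trim_end_s: float
    # audio prep metadata
    salvage_status: str
    # inference metadata (versioning fields from section 2.1)
    model_version: str
    prompt_version: str
    schema_version: str
    # output fields
    transcription: str
    tagged: str
    speaker: dict
    detected_language: str
    # validation metrics
    provisional_score: float
    flags: list = field(default_factory=list)
    # lane assignment; quarantine until validation promotes the segment
    lane: str = "quarantine"
```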

## 10) Rollout Phases
- Phase 0: lock prompt/schema/trimmer specs and run 10k stratified canary.
- Phase 1: choose thinking level (`low` vs `minimal`) from canary quality/cost results.
- Phase 2: run full throughput with V1 validation and adaptive routing.
- Phase 3: process low-score tail with V2 validators and targeted retries.
- Phase 4: publish lane-specific manifests for training pipelines.

## 11) Open Decisions (Require User Confirmation)
- Event tag set: keep `[singing]` or replace with another conversational tag.
- Number rendering policy: fully spoken form only vs mixed spoken+digit exceptions.
- Premium recheck policy for low-score tail: Gemini 3.1 Pro vs no premium recheck.
