---
name: Pipeline and Prompt Finalization
overview: Wire AudioPolisher into pipeline, restructure the transcription schema from 4 partially redundant outputs to 3 transcription formats plus 1 tagged format and speaker metadata, fix two existing bugs, and finalize the prompt for production.
todos:
  - id: wire-polisher
    content: Wire AudioPolisher into pipeline.py between Step 3 and Step 4. Add polish_audio flag to PipelineConfig.
    status: pending
  - id: fix-bugs
    content: Fix english_ratio bug in simple_validator.py line 221. Fix temperature default in gemini_transcriber.py line 29.
    status: pending
  - id: restructure-schema
    content: "Rewrite TranscriptionOutput in transcription_schema.py: drop native_transcription, add tagged (code_switch + audio tags), add speaker metadata object with emotion/style/accent/energy/quality fields."
    status: pending
  - id: rewrite-prompt
    content: Rewrite get_transcription_prompt() - shorter, boundary-aware, defines allowed audio tags and speaker metadata enums. Update get_user_prompt() to match.
    status: pending
  - id: update-json-schema
    content: Update TRANSCRIPTION_JSON_SCHEMA to match new fields including nested speaker object.
    status: pending
  - id: update-validator
    content: Update simple_validator.py to use transcription field (strip punctuation for CTC). Fix the english_ratio reference.
    status: pending
  - id: update-transcriber
    content: Update gemini_transcriber.py to parse new schema fields (tagged, speaker) from Gemini response.
    status: pending
  - id: cleanup
    content: Remove test pngs, wavs, polished dirs from project root. Update README.
    status: pending
---

# Pipeline and Prompt Finalization

## Part A: Pipeline Wiring (carry-over from previous plan)

### A1. Wire AudioPolisher into pipeline.py

Insert between Step 3 (AudioProcessor) and Step 4 (Transcribe) in [pipeline.py](pipeline.py) `run()` method (~line 180). Add `polish_audio: bool = True` to `PipelineConfig`. ~10 lines.
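A minimal sketch of the wiring, assuming `AudioPolisher` exposes a `polish(path) -> path` method (the real method name and the other `PipelineConfig` fields live in the repo and may differ):

```python
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    # ... existing pipeline options ...
    polish_audio: bool = True  # new flag: skip the polishing step when False


def maybe_polish(config: PipelineConfig, audio_path: str, polisher) -> str:
    """Step 3.5: runs between AudioProcessor (Step 3) and Transcribe (Step 4)."""
    if not config.polish_audio:
        return audio_path
    return polisher.polish(audio_path)  # hypothetical AudioPolisher API
```

Gating on the flag (rather than conditionally constructing the polisher inside `run()`) keeps the step easy to disable in tests.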

### A2. Fix two bugs

- **simple_validator.py line 221**: references `english_ratio` which `check_characters()` never returns. Remove it.
- **gemini_transcriber.py line 29**: `TranscriptionConfig.temperature` defaults to `1.0` but pipeline uses `0.0`. Change default to `0.0`.
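The temperature fix is a one-line default change; a minimal sketch (the real `TranscriptionConfig` has more fields):

```python
from dataclasses import dataclass


@dataclass
class TranscriptionConfig:
    # was 1.0; the pipeline always passes 0.0, so make deterministic
    # decoding the default rather than relying on callers to override it
    temperature: float = 0.0
```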

---

## Part B: Schema Restructure

### B1. New output schema (replaces current 4 fields)

Drop `native_transcription`: it is redundant, being `native_with_punctuation` with the punctuation stripped, which the validator can do programmatically (see C3). Replace it with a tagged transcription plus a speaker metadata object.

**New schema in [src/backend/transcription_schema.py](src/backend/transcription_schema.py):**

```
TranscriptionOutput:
  transcription        # Native script + punctuation (was native_with_punctuation)
  code_switch          # Mixed script: native + English in Latin
  romanized            # Full Latin transliteration
  tagged               # Code-switch base + audio event tags (NEW)
  speaker              # Speaker/audio metadata object (NEW)
  confidence           # 0-1 score (keep)
  notes                # Quality notes (keep)

Speaker metadata object:
  emotion              # neutral | happy | sad | angry | excited | surprised
  emotion_intensity    # mild | moderate | strong
  speaking_style       # conversational | narrative | excited | calm | emphatic | sarcastic | formal
  pace                 # slow | normal | fast
  energy               # low | medium | high
  accent               # free text - regional dialect/accent (e.g. "Telangana", "Hyderabadi")
  audio_quality        # clean | moderate_noise | noisy
  background           # silence | music | crowd | other
```
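The outline above, expressed as Python dataclasses (illustrative only; the real `transcription_schema.py` may use pydantic or plain dicts, and defaults shown here are assumptions):

```python
from dataclasses import dataclass


@dataclass
class SpeakerMetadata:
    emotion: str = "neutral"         # neutral | happy | sad | angry | excited | surprised
    emotion_intensity: str = "mild"  # mild | moderate | strong
    speaking_style: str = "conversational"
    pace: str = "normal"             # slow | normal | fast
    energy: str = "medium"           # low | medium | high
    accent: str = ""                 # free text, e.g. "Telangana", "Hyderabadi"
    audio_quality: str = "clean"     # clean | moderate_noise | noisy
    background: str = "silence"      # silence | music | crowd | other


@dataclass
class TranscriptionOutput:
    transcription: str   # native script + punctuation (was native_with_punctuation)
    code_switch: str     # mixed script: native + English in Latin
    romanized: str       # full Latin transliteration
    tagged: str          # code_switch base + audio event tags
    speaker: SpeakerMetadata
    confidence: float    # 0-1 score
    notes: str = ""      # quality notes
```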

### B2. Allowed audio event tags for `tagged` field

These go in the tagged transcription at the position where they occur. Only tag what is clearly audible - no hallucination.

- `[laugh]` - laughter (light or heavy)
- `[cough]` - coughing
- `[sigh]` - audible sigh
- `[breath]` - audible heavy breathing/inhale
- `[singing]` - humming, singing, or any melodic/rhythmic vocalization
- `[noise]` - non-speech noise burst
- `[music]` - background music present
- `[applause]` - clapping/applause

Example: `నాకు కొన్ని ads గుర్తుంటాయి [laugh] like example కచ్చా mango bite`

Example: `అది ఒక పాట [singing] లా లా లా [singing] అని చెప్పాడు`
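Keeping the tag set closed makes it cheap to catch hallucinated tags in validation; a sketch of such a check (function name is hypothetical, not from the repo):

```python
import re

ALLOWED_TAGS = {"laugh", "cough", "sigh", "breath",
                "singing", "noise", "music", "applause"}


def find_invalid_tags(tagged: str) -> list[str]:
    """Return any [tag] tokens in a tagged transcription that are not in
    the allowed set, so the validator can flag or reject them."""
    return [t for t in re.findall(r"\[([a-z_]+)\]", tagged)
            if t not in ALLOWED_TAGS]
```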

---

## Part C: Prompt Rewrite

### C1. Prompt changes in [src/backend/transcription_schema.py](src/backend/transcription_schema.py)

The current `get_transcription_prompt()` is ~50 lines with repetitive per-field rules. Rewrite to be:

- **Shorter**: Gemini 3 with thinking mode doesn't need as much hand-holding
- **Boundary-aware**: Add note that audio may start/end mid-speech from VAD cutting - transcribe only clearly audible content, don't guess incomplete words at boundaries
- **Tag-aware**: Define the 8 allowed audio event tags with brief descriptions
- **Metadata-aware**: Define allowed enum values for each speaker metadata field
- **No redundant field**: Remove native_transcription instructions entirely
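A skeleton of what the shortened prompt could look like (illustrative only; the final wording is the deliverable of this task, and the exact phrasing here is not from the repo):

```python
def get_transcription_prompt(language: str) -> str:
    """Illustrative skeleton of the shortened prompt."""
    return f"""You are transcribing a short {language} audio clip.
Boundaries: the clip may start or end mid-speech (VAD cut). Transcribe only
clearly audible content; never guess incomplete words at the edges.
Audio tags (insert only when clearly audible): [laugh] [cough] [sigh]
[breath] [singing] [noise] [music] [applause].
Speaker metadata enums:
  emotion: neutral|happy|sad|angry|excited|surprised
  emotion_intensity: mild|moderate|strong
  speaking_style: conversational|narrative|excited|calm|emphatic|sarcastic|formal
  pace: slow|normal|fast    energy: low|medium|high
  audio_quality: clean|moderate_noise|noisy
  background: silence|music|crowd|other
Return JSON matching the provided schema."""
```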

### C2. JSON schema update

Update `TRANSCRIPTION_JSON_SCHEMA` dict to match new fields. The `speaker` field becomes a nested object with enum constraints.
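The nested `speaker` object could look like this (a sketch mirroring the enums in B1; which fields end up in `required` is a judgment call, not settled by this plan):

```python
# Fragment to embed under TRANSCRIPTION_JSON_SCHEMA["properties"]["speaker"]
SPEAKER_SCHEMA = {
    "type": "object",
    "properties": {
        "emotion": {"type": "string",
                    "enum": ["neutral", "happy", "sad", "angry", "excited", "surprised"]},
        "emotion_intensity": {"type": "string",
                              "enum": ["mild", "moderate", "strong"]},
        "speaking_style": {"type": "string",
                           "enum": ["conversational", "narrative", "excited",
                                    "calm", "emphatic", "sarcastic", "formal"]},
        "pace": {"type": "string", "enum": ["slow", "normal", "fast"]},
        "energy": {"type": "string", "enum": ["low", "medium", "high"]},
        "accent": {"type": "string"},  # free text, no enum
        "audio_quality": {"type": "string",
                          "enum": ["clean", "moderate_noise", "noisy"]},
        "background": {"type": "string",
                       "enum": ["silence", "music", "crowd", "other"]},
    },
    "required": ["emotion", "speaking_style", "audio_quality"],  # assumption
}
```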

### C3. Validation update

Update [src/validators/simple_validator.py](src/validators/simple_validator.py) to validate `transcription` field (was `native_transcription`). The CTC alignment check uses the `transcription` field with punctuation stripped programmatically.
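Stripping punctuation programmatically can be done with Unicode categories so native-script marks like the danda are covered too; a sketch (the helper name is hypothetical):

```python
import re
import unicodedata


def strip_punctuation(text: str) -> str:
    """Remove all Unicode punctuation (category P*, which includes the
    danda) before CTC alignment, then collapse whitespace. Native-script
    letters and combining vowel signs are untouched."""
    no_punct = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return re.sub(r"\s+", " ", no_punct).strip()
```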

---

## Part D: Cleanup

- Remove test artifacts from project root (png, wav files, polished_test dirs)
- Update README.md with new schema

---

## What this does NOT change (deferred)

- Language-specific prompt variations (same prompt + language name works for now)
- Speaker embedding / MFCC analysis (dropped - Gemini handles boundary content)
- De-clipping (not worth it - transcriptions are fine despite 87% clipping)
- Resample to 16kHz (optional optimization, can add later if API latency is an issue)