# Maya3 Transcription Pipeline

Audio transcription pipeline for Indian languages using Google Gemini AI.

## Production Configuration

```
Model: gemini-3-flash-preview
Temperature: 1.0 (Gemini 3 default; values below 1.0 can cause output looping)
Thinking: low (fast, prevents loops)
Validation: Native script + CTC alignment
```

## Quick Start

```bash
# Setup
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run pipeline
python pipeline.py VIDEO_ID --language Telugu --max-segments 10
```

## Pipeline Usage

```bash
# Basic usage
python pipeline.py pF_BQpHaIdU

# With options
python pipeline.py pF_BQpHaIdU \
    --language Telugu \
    --model gemini-3-flash-preview \
    --thinking low \
    --temperature 1.0 \
    --max-segments 20 \
    --output ./transcriptions

# Skip validation (faster but no quality check)
python pipeline.py pF_BQpHaIdU --no-validate
```

## Python API

```python
from pipeline import run_pipeline

# Run full pipeline
result = run_pipeline(
    video_id="pF_BQpHaIdU",
    language="Telugu",
    model="gemini-3-flash-preview",
    thinking_level="low",
    temperature=1.0,
    max_segments=20,
    validate=True
)

print(f"Processed: {result.segments_processed} segments")
print(f"Output: {result.output_file}")
```

## Validation

Transcriptions are validated for:

1. **Character Check** (instant)
   - Only valid native script characters
   - Numbers and punctuation allowed
   - No English/foreign characters

2. **Audio Match** (~0.1 s per segment)
   - CTC-based alignment score
   - Detects if text doesn't match audio

```python
from src.validators import validate_transcription, quick_validate

# Quick check (character only)
result = quick_validate("నాకు కొన్ని గుర్తుంటాయి", language="te")
print(result["valid"])  # True

# Full validation (with audio)
result = validate_transcription("audio.flac", "నాకు కొన్ని", language="te")
print(result.status)  # accept / review / reject
```
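The character check can be approximated with Unicode block ranges. A minimal stdlib sketch, not the validator's actual implementation (the real ranges and punctuation policy may differ):

```python
import unicodedata

# Unicode block ranges for a few supported scripts (assumption: the real
# validator may use different ranges or per-language punctuation rules).
SCRIPT_RANGES = {
    "te": (0x0C00, 0x0C7F),  # Telugu
    "hi": (0x0900, 0x097F),  # Devanagari
    "ta": (0x0B80, 0x0BFF),  # Tamil
}

def chars_valid(text: str, language: str) -> bool:
    """True if every character is native script, a digit, punctuation,
    or whitespace -- mirroring the character check described above."""
    lo, hi = SCRIPT_RANGES[language]
    for ch in text:
        if lo <= ord(ch) <= hi or ch.isspace() or ch.isdigit():
            continue
        if unicodedata.category(ch).startswith("P"):  # any punctuation
            continue
        return False  # e.g. Latin letters -> invalid
    return True

print(chars_valid("నాకు 123.", "te"))      # True
print(chars_valid("నాకు some ads", "te"))  # False
```

Note that the `code_switch` output intentionally contains Latin text, so a check like this applies only to the native-script `transcription` field.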

## Output Format

Each transcription includes three text formats (native script, code-switched, romanized), a tagged format, and speaker metadata:

```json
{
  "segment_id": "SPEAKER_00_0000_0.03-2.61.flac",
  "transcription": {
    "transcription": "నాకు కొన్ని యాడ్స్ గుర్తుంటాయి.",
    "code_switch": "నాకు కొన్ని ads గుర్తుంటాయి.",
    "romanized": "naku konni ads gurtuntayi.",
    "tagged": "నాకు కొన్ని ads గుర్తుంటాయి [laugh] like example కచ్చా mango bite.",
    "speaker": {
      "emotion": "happy",
      "emotion_intensity": "mild",
      "speaking_style": "conversational",
      "pace": "normal",
      "energy": "medium",
      "accent": "Telangana",
      "audio_quality": "clean",
      "background": "silence"
    },
    "confidence": 0.95
  },
  "validation_status": "accept",
  "validation_score": 0.798
}
```
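Downstream consumers can filter segments on the validation fields shown above. A sketch, assuming one record per segment with exactly these field names (the overall file layout, JSON array vs. JSON Lines, is not specified here):

```python
# Hypothetical record mirroring the output format above.
record = {
    "segment_id": "SPEAKER_00_0000_0.03-2.61.flac",
    "transcription": {"transcription": "నాకు కొన్ని యాడ్స్ గుర్తుంటాయి.", "confidence": 0.95},
    "validation_status": "accept",
    "validation_score": 0.798,
}

def keep(rec: dict, min_score: float = 0.7) -> bool:
    """Keep only segments that validation accepted with a sufficient score."""
    return rec["validation_status"] == "accept" and rec["validation_score"] >= min_score

accepted = [r for r in [record] if keep(r)]
print(len(accepted))  # 1
```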

## Project Structure

```
maya3_transcribe/
├── pipeline.py              # Main pipeline orchestrator
├── src/
│   ├── backend/             # Core modules
│   │   ├── config.py
│   │   ├── r2_storage.py
│   │   ├── supabase_client.py
│   │   ├── audio_processor.py
│   │   ├── audio_polisher.py     # NEW: VAD boundary cleanup
│   │   ├── gemini_transcriber.py
│   │   └── transcription_schema.py
│   └── validators/          # Validation module
│       ├── __init__.py
│       ├── simple_validator.py   # Main validator
│       └── ctc_forced_aligner.py # CTC alignment
├── models/                  # Downloaded models (gitignored)
├── analysis_results/        # Analysis output (gitignored)
├── transcriptions/          # Pipeline output (gitignored)
└── docs/                    # Documentation
```

## Supported Languages

| Language | Code | Script |
|----------|------|--------|
| Telugu | te | Telugu |
| Hindi | hi | Devanagari |
| Tamil | ta | Tamil |
| Kannada | kn | Kannada |
| Malayalam | ml | Malayalam |
| Bengali | bn | Bengali |
| Marathi | mr | Devanagari |
| Gujarati | gu | Gujarati |
| Punjabi | pa | Gurmukhi |
| Odia | or | Odia |
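The table can double as a lookup when wiring the CLI's `--language` names to the validators' ISO codes. A sketch of the table as data (how the pipeline maps names internally is an assumption):

```python
# Language name -> (ISO 639-1 code, script), mirroring the table above.
LANGUAGES = {
    "Telugu": ("te", "Telugu"),
    "Hindi": ("hi", "Devanagari"),
    "Tamil": ("ta", "Tamil"),
    "Kannada": ("kn", "Kannada"),
    "Malayalam": ("ml", "Malayalam"),
    "Bengali": ("bn", "Bengali"),
    "Marathi": ("mr", "Devanagari"),
    "Gujarati": ("gu", "Gujarati"),
    "Punjabi": ("pa", "Gurmukhi"),
    "Odia": ("or", "Odia"),
}

def language_code(name: str) -> str:
    """Map a CLI-style language name to the ISO code the validators use."""
    code, _script = LANGUAGES[name]
    return code

print(language_code("Telugu"))  # te
```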

## Audio Polisher (Pre-processing)

A lightweight signal-processing pass that cleans VAD-generated segment boundaries
before transcription. It uses no ML models and runs in roughly 30 ms per segment.

**What it does:**
- Detects and trims leading cut artifacts (partial phonemes from imprecise VAD)
- Detects trailing speech bursts (leaked from next segment)
- Preserves silence at boundaries (silence = good padding, never trimmed)
- Measures SNR, conditionally normalizes volume for quiet segments
- Integrated into pipeline as Step 3.5 (automatic, `polish_audio=True` by default)

**Conservative by design** - only trims cut speech artifacts, never silence.
Already-clean segments pass through untouched (~76% skip rate on test data).

```python
from src.backend.audio_polisher import AudioPolisher

polisher = AudioPolisher()
result = polisher.polish("segment.flac", output_dir="./polished/")
print(result.summary())
# POLISHED | 2582ms -> 2557ms | start -25ms (artifact_trimmed)
```
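The leading-artifact detection can be sketched as a frame-energy scan: a loud burst confined to the very first frame, followed by much quieter audio, suggests a phoneme cut off by imprecise VAD. A stdlib-only sketch on raw mono samples; the frame size and threshold are illustrative assumptions, not the polisher's actual values:

```python
import math

def frame_rms(samples, frame_len=480):  # ~30 ms frames at 16 kHz (assumed rate)
    """RMS energy per non-overlapping frame of a mono sample list."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def leading_artifact(samples, burst_db=20.0):
    """True if the first frame is much louder than the second, i.e. a
    short burst at the segment start followed by near-silence."""
    rms = frame_rms(samples)
    if len(rms) < 2 or rms[1] <= 0:
        return False
    ratio_db = 20 * math.log10(rms[0] / max(rms[1], 1e-9))
    return ratio_db > burst_db

# Loud 30 ms click followed by near-silence -> flagged as a cut artifact
click = [0.5] * 480 + [0.001] * 4320
print(leading_artifact(click))  # True
```

Trimming only when such a burst is found, and never trimming silence, matches the conservative design described above.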

## Configuration

Environment variables (`.env`):

```
GEMINI_API_KEY=your_key
R2_ACCESS_KEY_ID=your_key
R2_SECRET_ACCESS_KEY=your_secret
R2_ENDPOINT_URL=your_endpoint
R2_BUCKET_NAME=your_bucket
SUPABASE_URL=your_url
SUPABASE_KEY=your_key
```
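It is worth failing fast on incomplete configuration before the pipeline starts. A stdlib sketch that checks the variables above are present (the project presumably loads `.env` via its own `config.py` or python-dotenv; that part is an assumption):

```python
import os

# Required variables from the .env template above.
REQUIRED = [
    "GEMINI_API_KEY",
    "R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY",
    "R2_ENDPOINT_URL", "R2_BUCKET_NAME",
    "SUPABASE_URL", "SUPABASE_KEY",
]

def missing_vars(env=os.environ):
    """Return the required variables that are unset or empty."""
    return [k for k in REQUIRED if not env.get(k)]

# Example with a deliberately incomplete environment:
print(missing_vars({"GEMINI_API_KEY": "x"}))  # the six R2/Supabase variables
```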
