# Maya3 Transcription Pipeline

Audio transcription pipeline for Indian languages using Google Gemini AI models.

## Features

- **R2 Cloud Storage Integration**: Downloads audio tar files from Cloudflare R2
- **Supabase Integration**: Fetches video language metadata from database
- **Audio Processing**: Handles segment splitting with configurable duration limits
- **Multi-Model Support**: Works with the Gemini 3 (Pro, Flash) and Gemini 2.5 model families
- **Structured Output**: Four transcription formats per segment
- **Validation**: IndicMFA and IndicConformer for quality validation

## Setup

```bash
# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Ensure .env file has required credentials:
# - R2_ENDPOINT_URL, R2_BUCKET, R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY
# - GEMINI_KEY
# - URL, SUPABASE_ADMIN (Supabase credentials)
```
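The real pipeline reads these credentials via `src/backend/config.py`; a minimal fail-fast check (a standard-library sketch, the helper name is illustrative) looks like:

```python
import os

REQUIRED_VARS = [
    "R2_ENDPOINT_URL", "R2_BUCKET", "R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY",
    "GEMINI_KEY", "URL", "SUPABASE_ADMIN",
]

def load_config() -> dict:
    """Collect required credentials from the environment, failing fast if any are missing."""
    missing = [v for v in REQUIRED_VARS if not os.environ.get(v)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {v: os.environ[v] for v in REQUIRED_VARS}
```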

## Quick Start

```bash
# Basic usage - process a video
python pipeline.py pF_BQpHaIdU --language Telugu

# Limit segments for testing
python pipeline.py pF_BQpHaIdU -n 5 --language Telugu

# Use different model
python pipeline.py pF_BQpHaIdU -m gemini-3-pro-preview -t high -n 3
```

---

## Transcription Utility (`transcription_utils.py`)

Standalone utility for transcribing a single audio file with the correct per-model configuration.

### Python API

```python
from transcription_utils import transcribe_audio

# Simple usage (Gemini 2.5)
result = transcribe_audio(
    audio_path="audio.flac",
    model="gemini-2.5-flash",
    language="Telugu",
    temperature=0.0
)

# Gemini 3 with temperature=0 (use thinking_budget to prevent loops!)
result = transcribe_audio(
    audio_path="audio.flac",
    model="gemini-3-flash-preview",
    language="Telugu",
    temperature=0.0,
    thinking_budget=300  # IMPORTANT: Prevents thinking loops
)

# Access results
print(result['native_transcription'])
print(result['romanized'])
print(f"Time: {result['_metadata']['processing_time_sec']}s")
```
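Transient API errors are common over long batches of segments; a generic retry wrapper (a sketch, not part of `transcription_utils`) can guard `transcribe_audio` calls:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure retry with exponential backoff, re-raising the last error."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Usage (assuming transcribe_audio as above):
# result = with_retries(lambda: transcribe_audio(audio_path="audio.flac",
#                                                model="gemini-2.5-flash",
#                                                language="Telugu"))
```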

### CLI Usage

```bash
# Basic
python transcription_utils.py audio.flac

# With specific model
python transcription_utils.py audio.flac --model gemini-3-flash-preview

# Gemini 3 with temp=0 (must use thinking_budget)
python transcription_utils.py audio.flac \
    --model gemini-3-pro-preview \
    --temperature 0 \
    --thinking-budget 300
```

### Key Settings

| Model Family | Temperature=0 | Notes |
|--------------|---------------|-------|
| **Gemini 3** | Use `thinking_budget=300` | Prevents thinking loops |
| **Gemini 2.5** | Works directly | No special config needed |
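The table above can be encoded as a small helper so callers never hit the deterministic-thinking trap by accident (a sketch; the model-name prefix check is an assumption, not part of the shipped API):

```python
def thinking_kwargs(model: str, temperature: float) -> dict:
    """Extra kwargs for transcribe_audio based on model family and temperature.

    Gemini 3 at temperature=0 needs a bounded thinking budget to avoid
    thinking loops; Gemini 2.5 needs no special configuration.
    """
    if model.startswith("gemini-3") and temperature == 0:
        return {"thinking_budget": 300}
    return {}
```

Usage: `transcribe_audio(audio_path, model=m, temperature=t, **thinking_kwargs(m, t))`.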

---

## Model Analysis Results

Analysis of 5 Gemini models on 20 Telugu audio segments.

### Speed Ranking (temp=0, thinking=low)

| Rank | Model | Avg Time | Notes |
|------|-------|----------|-------|
| 1 | gemini-3-flash-preview | 2.8s | Fast + quality |
| 2 | gemini-2.5-flash | 3.7s | Reliable |
| 3 | gemini-3-pro-preview | 6.8s | Best quality |
| 4 | gemini-2.5-flash-lite | 8.1s | Budget option |
| 5 | gemini-2.5-pro | 8.4s | Premium |

### Recommendations

| Use Case | Model | Settings |
|----------|-------|----------|
| **High Volume** | gemini-2.5-flash-lite | temp=0 |
| **Balanced** | gemini-2.5-flash | temp=0 |
| **Quality** | gemini-3-pro-preview | temp=0, thinking_budget=300 |

### Critical Finding

**⚠️ AVOID**: `temperature=0` + `thinking_level=high` on Gemini 3 models causes 300+ second delays (thinking loops).

**✅ FIX**: Use `thinking_budget=300` instead of `thinking_level=high` when `temperature=0`.

---

## Output Format

Each segment produces four transcription formats:

```json
{
  "native_transcription": "నాకు కొన్ని యాడ్స్ గుర్తుంటాయి...",
  "native_with_punctuation": "నాకు కొన్ని యాడ్స్ గుర్తుంటాయి,...",
  "code_switch": "నాకు కొన్ని ads గుర్తుంటాయి like example...",
  "romanized": "naaku konni ads gurtuntaayi like example..."
}
```
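The actual schema is enforced in `src/backend/transcription_schema.py`; a dependency-free sanity check of the four required keys looks roughly like:

```python
REQUIRED_FORMATS = (
    "native_transcription",
    "native_with_punctuation",
    "code_switch",
    "romanized",
)

def check_segment_output(segment: dict) -> list:
    """Return the names of missing or empty transcription formats."""
    return [k for k in REQUIRED_FORMATS
            if not isinstance(segment.get(k), str) or not segment[k].strip()]
```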

---

## Validation Module (`src/validators/`)

Post-transcription validation using AI4Bharat models.

### Validators

| Validator | Type | Purpose | Speed |
|-----------|------|---------|-------|
| **IndicMFA** | Forced Alignment | Word timestamps + confidence | ~0.1-0.3s |
| **IndicConformer** | ASR | Independent transcription | ~2-6s |

### Quick Usage

```python
from src.validators import ValidatorRunner

runner = ValidatorRunner(
    enable_indicmfa=True,
    enable_indic_conformer=True,
    language="te"
)

result = runner.validate(
    audio_path="segment.flac",
    reference_text="reference text",  # Required for MFA
    language="te"
)

for name, vr in result.results.items():
    print(f"{name}: {vr.transcription}")

runner.cleanup()
```
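One way to use the IndicConformer output is as an independent cross-check against the Gemini transcription, e.g. via character error rate (a self-contained sketch, not part of `src/validators`):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute, each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis against reference."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

A high CER between the validator's transcription and the Gemini output flags segments for manual review.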

---

## Project Structure

```
maya3_transcribe/
├── pipeline.py              # Main transcription pipeline
├── transcription_utils.py   # Standalone transcription utility
├── requirements.txt
├── README.md
├── .env                     # Credentials
│
├── src/
│   ├── backend/             # Core pipeline modules
│   │   ├── config.py        # Environment config
│   │   ├── r2_storage.py    # R2 storage client
│   │   ├── supabase_client.py
│   │   ├── audio_processor.py
│   │   ├── transcription_schema.py
│   │   └── gemini_transcriber.py
│   │
│   └── validators/          # Validation modules
│       ├── base.py
│       ├── runner.py
│       ├── indicmfa_validator.py
│       └── indic_conformer_validator.py
│
├── analysis_results/
│   └── final_analysis.json  # Model comparison results
│
└── docs/                    # API documentation
```

---

## Analysis File Structure

`analysis_results/final_analysis.json`:

```json
{
  "metadata": {
    "video_id": "pF_BQpHaIdU",
    "language": "Telugu",
    "total_segments": 20,
    "models_tested": ["gemini-3-pro-preview", "gemini-3-flash-preview", ...],
    "configs_tested": ["temp1_high", "temp0_low"]
  },
  "summary": {
    "speed_ranking": [...],
    "recommendations": {...}
  },
  "segments": [
    {
      "segment_id": "SPEAKER_00_0000_0.03-2.61.flac",
      "duration_sec": 2.58,
      "models": {
        "gemini-3-pro-preview": {
          "temp1_high": {"native": "...", "time_sec": 20.0},
          "temp0_low": {"native": "...", "time_sec": 6.0}
        },
        ...
      },
      "validation": {
        "indicmfa": {"transcription": "...", "confidence": 0.84, "word_count": 8},
        "indic_conformer": {"transcription": "...", "confidence": null}
      }
    }
  ]
}
```
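The per-segment timings in this file can be aggregated to reproduce the speed ranking above; a sketch over the parsed JSON (assumes the structure shown, with a `time_sec` field per model/config):

```python
import json
from collections import defaultdict

def average_times(analysis: dict, config: str = "temp0_low") -> dict:
    """Mean time_sec per model for one config, averaged over all segments."""
    totals, counts = defaultdict(float), defaultdict(int)
    for segment in analysis["segments"]:
        for model, configs in segment["models"].items():
            if config in configs:
                totals[model] += configs[config]["time_sec"]
                counts[model] += 1
    return {m: totals[m] / counts[m] for m in totals}

# Usage:
# with open("analysis_results/final_analysis.json") as f:
#     print(average_times(json.load(f)))
```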

---

## Dependencies

```bash
# Core
pip install boto3 google-genai pydantic supabase pydub

# Validators
pip install torch torchaudio transformers soundfile onnx onnxruntime
```
