# Maya3 Transcription Pipeline

An audio transcription pipeline for Indian languages, built on Google Gemini.

## Production Configuration

```
Model: gemini-3-flash-preview
Temperature: 0.0 (deterministic)
Thinking: low (fast, prevents loops)
Validation: Native script + CTC alignment
```

## Quick Start

```bash
# Setup
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run pipeline
python pipeline.py VIDEO_ID --language Telugu --max-segments 10
```

## Pipeline Usage

```bash
# Basic usage
python pipeline.py pF_BQpHaIdU

# With options
python pipeline.py pF_BQpHaIdU \
    --language Telugu \
    --model gemini-3-flash-preview \
    --thinking low \
    --temperature 0.0 \
    --max-segments 20 \
    --output ./transcriptions

# Skip validation (faster but no quality check)
python pipeline.py pF_BQpHaIdU --no-validate
```

## Python API

```python
from pipeline import run_pipeline

# Run full pipeline
result = run_pipeline(
    video_id="pF_BQpHaIdU",
    language="Telugu",
    model="gemini-3-flash-preview",
    thinking_level="low",
    temperature=0.0,
    max_segments=20,
    validate=True
)

print(f"Processed: {result.segments_processed} segments")
print(f"Output: {result.output_file}")
```

## Validation

Each transcription is checked in two ways:

1. **Character Check** (instant)
   - Only valid native script characters
   - Numbers and punctuation allowed
   - No English/foreign characters

2. **Audio Match** (about 0.1 s per segment)
   - CTC-based alignment score
   - Flags text that does not match the audio

```python
from src.validators import validate_transcription, quick_validate

# Quick check (character only)
result = quick_validate("నాకు కొన్ని గుర్తుంటాయి", language="te")
print(result["valid"])  # True

# Full validation (with audio)
result = validate_transcription("audio.flac", "నాకు కొన్ని", language="te")
print(result.status)  # accept / review / reject
```
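
To illustrate the idea behind the character check, here is a minimal standalone sketch for Telugu. It is not the code in `simple_validator.py`. The function name `is_native_script` and the constant `TELUGU_RANGE` are hypothetical, and the rule (Telugu Unicode block, plus digits, punctuation, and whitespace) is an assumption based on the list above.

```python
import unicodedata
from typing import Tuple

# Unicode block for Telugu (U+0C00 to U+0C7F).
# Other scripts would need their own ranges; this constant is illustrative only.
TELUGU_RANGE = (0x0C00, 0x0C7F)


def is_native_script(text: str, script_range: Tuple[int, int] = TELUGU_RANGE) -> bool:
    """Return True if every character is in the script range,
    or is a digit, punctuation mark, or whitespace."""
    lo, hi = script_range
    for ch in text:
        if lo <= ord(ch) <= hi:
            continue  # native script character
        if ch.isspace() or ch.isdigit():
            continue  # spaces and numbers are allowed
        if unicodedata.category(ch).startswith("P"):
            continue  # any Unicode punctuation category
        return False  # e.g. a Latin letter
    return True
```

Under these assumptions, a purely Telugu sentence passes, while a code-switched string containing a Latin word such as `ads` fails.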

## Output Format

Each transcribed segment is written as a record like this:

```json
{
  "segment_id": "SPEAKER_00_0000_0.03-2.61.flac",
  "transcription": {
    "native_transcription": "నాకు కొన్ని యాడ్స్ గుర్తుంటాయి",
    "native_with_punctuation": "నాకు కొన్ని యాడ్స్ గుర్తుంటాయి.",
    "code_switch": "నాకు కొన్ని ads గుర్తుంటాయి",
    "romanized": "naaku konni ads gurtuntaayi"
  },
  "validation_status": "accept",
  "validation_score": 0.798
}
```
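
Records in this shape are easy to filter downstream. The sketch below keeps only segments marked `accept`, optionally above a score cutoff. The helper name `accepted_texts` and the `min_score` parameter are hypothetical. It works on records already loaded into memory, since how the pipeline lays out its output file (one JSON array, one record per line, and so on) is not specified here.

```python
from typing import Dict, List


def accepted_texts(records: List[Dict], min_score: float = 0.0) -> List[str]:
    """Return the native transcription of every record whose
    validation_status is 'accept' and whose score is at least min_score."""
    return [
        rec["transcription"]["native_transcription"]
        for rec in records
        if rec.get("validation_status") == "accept"
        and rec.get("validation_score", 0.0) >= min_score
    ]
```

Records with status `review` or `reject` are dropped; raising `min_score` gives a stricter subset of the accepted segments.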

## Project Structure

```
maya3_transcribe/
├── pipeline.py              # Main pipeline orchestrator
├── src/
│   ├── backend/             # Core modules
│   │   ├── config.py
│   │   ├── r2_storage.py
│   │   ├── supabase_client.py
│   │   ├── audio_processor.py
│   │   ├── gemini_transcriber.py
│   │   └── transcription_schema.py
│   └── validators/          # Validation module
│       ├── __init__.py
│       ├── simple_validator.py   # Main validator
│       └── ctc_forced_aligner.py # CTC alignment
├── models/                  # Downloaded models (gitignored)
├── analysis_results/        # Analysis output (gitignored)
├── transcriptions/          # Pipeline output (gitignored)
└── docs/                    # Documentation
```

## Supported Languages

| Language | Code | Script |
|----------|------|--------|
| Telugu | te | Telugu |
| Hindi | hi | Devanagari |
| Tamil | ta | Tamil |
| Kannada | kn | Kannada |
| Malayalam | ml | Malayalam |
| Bengali | bn | Bengali |
| Marathi | mr | Devanagari |
| Gujarati | gu | Gujarati |
| Punjabi | pa | Gurmukhi |
| Odia | or | Odia |
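
The pipeline examples pass full language names (`--language Telugu`), while the validator examples pass codes (`language="te"`). A small lookup built from the table above can bridge the two. This is a sketch only; `LANGUAGE_CODES` and `language_code` are hypothetical names, not part of the project's API.

```python
# Language name -> code, copied from the table above.
LANGUAGE_CODES = {
    "Telugu": "te",
    "Hindi": "hi",
    "Tamil": "ta",
    "Kannada": "kn",
    "Malayalam": "ml",
    "Bengali": "bn",
    "Marathi": "mr",
    "Gujarati": "gu",
    "Punjabi": "pa",
    "Odia": "or",
}


def language_code(name: str) -> str:
    """Map a language name (case-insensitive) to its two-letter code."""
    try:
        return LANGUAGE_CODES[name.strip().title()]
    except KeyError:
        raise ValueError(f"Unsupported language: {name!r}") from None
```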

## Configuration

Environment variables (`.env`):

```
GEMINI_API_KEY=your_key
R2_ACCESS_KEY_ID=your_key
R2_SECRET_ACCESS_KEY=your_secret
R2_ENDPOINT_URL=your_endpoint
R2_BUCKET_NAME=your_bucket
SUPABASE_URL=your_url
SUPABASE_KEY=your_key
```
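
Before a long run it can help to confirm that all of these variables are set. The following sketch assumes they have already been loaded into the process environment (for example by a tool such as python-dotenv; whether the project uses it is not stated here). The names `REQUIRED_VARS` and `missing_env_vars` are hypothetical.

```python
import os
from typing import List, Mapping, Optional

# Variable names copied from the .env listing above.
REQUIRED_VARS = [
    "GEMINI_API_KEY",
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
    "R2_ENDPOINT_URL",
    "R2_BUCKET_NAME",
    "SUPABASE_URL",
    "SUPABASE_KEY",
]


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```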