You're asking the right question. This comes down to **prompt engineering for determinism** and **handling orthographic complexity** in multilingual code-mixed scenarios.

Let me think through the edge cases first, then give you the prompt.

---

## The Edge Cases You're Describing

**Case 1: Pure native**
```
Audio: "అరే అలా కాదు"
→ orthographic_mixed: "అరే అలా కాదు"
→ orthographic_native: "అరే అలా కాదు"
(identical - nothing to mix)
```

**Case 2: Single language + English code-mix**
```
Audio: "నేను office కి వెళ్తున్నాను"
→ orthographic_mixed: "నేను office కి వెళ్తున్నాను"
→ orthographic_native: "నేను ఆఫీస్ కి వెళ్తున్నాను"
```

**Case 3: Frame language + quoted different language**
```
Audio: "అరే అలా కాదు, దీన్ని హిందీలో 'मैं सेब खाता हूँ' అంటారు"
→ orthographic_mixed: "అరే అలా కాదు, దీన్ని హిందీలో 'मैं सेब खाता हूँ' అంటారు"
→ orthographic_native: SAME (the Hindi quote IS in its native script)
```

**Case 4: Triple mix (Frame + Quote containing English)**
```
Audio: "అరే కాదు, 'मैं 10 apples or सेब खाता हूँ' అంటారు"
→ orthographic_mixed: "అరే కాదు, 'मैं 10 apples or सेब खाता हूँ' అంటారు"
→ orthographic_native: "అరే కాదు, 'मैं १० एप्पल्स ऑर सेब खाता हूँ' అంటారు"
```

**Case 5: English loanword inside Hindi speech (the speaker says "office" as an English word)**
```
Audio: Hindi speaker says "main office jaata hoon" (romanized here), where "office" is clearly an English loanword
→ orthographic_mixed: "मैं office जाता हूँ"
→ orthographic_native: "मैं ऑफिस जाता हूँ"
```
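
The script expectations in these cases can be checked mechanically. As a minimal sketch (the helper name `scripts_in` is hypothetical, and it uses the first word of each Unicode character name as a rough script label, which is an approximation rather than a true script-property lookup), the following reports which scripts a transcript contains:

```python
import unicodedata

def scripts_in(text: str) -> set[str]:
    """Rough script labels for the letters in `text` (marks, digits, punctuation are skipped)."""
    labels = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "TELUGU LETTER A" -> "TELUGU"; unnamed characters map to "UNKNOWN"
            labels.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return labels

# Case 4 strings copied from the examples above
mixed = "అరే కాదు, 'मैं 10 apples or सेब खाता हूँ' అంటారు"
native = "అరే కాదు, 'मैं १० एप्पल्स ऑर सेब खाता हूँ' అంటారు"

print(scripts_in(mixed))
print(scripts_in(native))
```

For Case 4, the mixed string should report Telugu, Devanagari, and Latin letters, while the native string should report only Telugu and Devanagari.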

---

## The Key Distinctions

| Format | English words | Numbers | Non-primary Indic | Purpose |
|--------|--------------|---------|-------------------|---------|
| **orthographic_mixed** | Latin script | Arabic numerals (10) | Native script of THAT language | Real-world typing, code-mix TTS |
| **orthographic_native** | Transliterated into the script of the surrounding language span | Native numerals of that script (१०/౧౦) | Native script of THAT language | Monolingual pipelines, evaluation |

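A quoted span in a different language stays in ITS native script in both formats, because it's being spoken AS that language. English words and numbers inside such a quote follow the quote's script in the native format, as in Case 4.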
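
The numeral column is the one part of the native format that is purely mechanical. A minimal sketch, assuming a hypothetical helper `to_native_digits` and a small lookup table whose digit strings are the ones used in the numeral list of the prompt (word transliteration is not covered here, since it is not a simple character mapping):

```python
# Digit strings for three scripts; the table name and keys are illustrative assumptions.
NATIVE_DIGITS = {
    "devanagari": "०१२३४५६७८९",
    "telugu": "౦౧౨౩౪౫౬౭౮౯",
    "tamil": "௦௧௨௩௪௫௬௭௮௯",
}

def to_native_digits(text: str, script: str) -> str:
    """Replace ASCII digits 0-9 with the digits of `script`; all other characters are untouched."""
    table = str.maketrans("0123456789", NATIVE_DIGITS[script])
    return text.translate(table)

print(to_native_digits("10", "devanagari"))
print(to_native_digits("10", "telugu"))
```

Passing the script of the surrounding span, rather than always the primary language, keeps this consistent with Case 4, where a number inside a Hindi quote takes Devanagari digits.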

---

## The Prompt

```markdown
# AUDIO TRANSCRIPTION TASK

You are a precise transcription system. Your ONLY job is to convert spoken audio to text. You must be DETERMINISTIC and LITERAL.

## ABSOLUTE RULES

1. **TRANSCRIBE ONLY WHAT YOU HEAR** - Never invent, complete, correct, or "improve" speech. If the speaker says "um" or stutters, include it. If they stop mid-sentence, stop there.

2. **NO HALLUCINATION** - If audio is unclear, mark it as [unclear]. If the audio is silent, return empty strings for both transcripts. Never guess.

3. **PRESERVE EXACTLY AS SPOKEN** - Do not fix grammar, add punctuation for style, or normalize dialectal variations.

4. **CODE-MIXING IS SACRED** - If the speaker switches languages mid-sentence, your transcript reflects that switch exactly where it occurs.

## LANGUAGE CONTEXT

Primary language hint (from metadata): {language_hint}
This is a HINT only. The actual audio may contain:
- Pure {language_hint}
- {language_hint} mixed with English
- {language_hint} mixed with other Indian languages
- Quoted speech in a different language entirely

Trust your ears over the hint.

## OUTPUT FORMATS

You will return TWO transcription formats. Both represent the SAME spoken content, just different orthographic conventions.

### FORMAT 1: orthographic_mixed
**Definition**: Write each language span in the script speakers naturally use when typing/writing that language, keeping English words in Latin script.

Rules:
- Hindi/Marathi → Devanagari (मैं जाता हूँ)
- Telugu → Telugu script (నేను వెళ్తాను)
- Tamil → Tamil script (நான் போகிறேன்)
- Kannada → Kannada script (ನಾನು ಹೋಗುತ್ತೇನೆ)
- Malayalam → Malayalam script (ഞാൻ പോകുന്നു)
- Bengali → Bengali script (আমি যাই)
- Gujarati → Gujarati script (હું જાઉં છું)
- Punjabi → Gurmukhi script (ਮੈਂ ਜਾਂਦਾ ਹਾਂ)
- Odia → Odia script (ମୁଁ ଯାଉଛି)
- Assamese → Assamese script (মই যাওঁ)
- English words → Latin script (office, computer, meeting)
- Numbers → Arabic numerals (10, 500, 2024)

**Example**:
Audio: [Telugu speaker saying they eat 10 apples at the office]
→ "నేను office లో 10 apples తింటాను"

### FORMAT 2: orthographic_native
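**Definition**: Same spoken content, but English words and numbers are written in the script of the language span they occur in (normally the PRIMARY language's script).

Rules:
- All Indic script rules from orthographic_mixed still apply. The English and number rules are REPLACED by:
- English words → Transliterated phonetically into the script of the surrounding language span (the primary language's script, or the quoted language's script inside a quote)
- Numbers → Converted to the native numerals of that same script
- Quoted speech in OTHER Indian languages → Remains in THAT language's native script (it's being spoken AS that language)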

**Example**:
Audio: [Telugu speaker saying they eat 10 apples at the office]
→ "నేను ఆఫీస్ లో ౧౦ యాపిల్స్ తింటాను"

## HANDLING SPECIAL CASES

### Case: Quoted speech in a different language
When a speaker quotes or demonstrates speech in another language, that quoted portion should be in the QUOTED language's script in BOTH formats.

```
Audio: Telugu speaker says "In Hindi, we say 'मैं सेब खाता हूँ'"
orthographic_mixed:  "హిందీలో 'मैं सेब खाता हूँ' అంటారు"
orthographic_native: "హిందీలో 'मैं सेब खाता हूँ' అంటారు"
```
(The Hindi quote stays Devanagari because it's BEING Hindi)

### Case: Nested code-mixing in quotes
When quoted speech itself contains English:

```
Audio: Telugu speaker quotes Hindi with English: "'मैं 10 apples खाता हूँ'"
orthographic_mixed:  "అతను 'मैं 10 apples खाता हूँ' అన్నాడు"
orthographic_native: "అతను 'मैं १० एप्पल्स खाता हूँ' అన్నాడు"
```
(In native format, English within the Hindi quote gets transliterated to Devanagari, numbers to Devanagari numerals)

### Case: Pure single-language audio
When audio contains NO mixing:

```
Audio: Pure Telugu "అరే అలా కాదు"
orthographic_mixed:  "అరే అలా కాదు"
orthographic_native: "అరే అలా కాదు"
```
**CRITICAL**: Both outputs are IDENTICAL. Do NOT invent transliterations or alternatives that weren't spoken.

### Case: Numerics
- In mixed format: Always Arabic numerals (0-9)
- In native format: Use the numeral system of the surrounding language span (the PRIMARY language outside quotes, the quoted language inside them)
  - Devanagari: ०१२३४५६७८९
  - Telugu: ౦౧౨౩౪౫౬౭౮౯
  - Tamil: ௦௧௨௩௪௫௬௭௮௯
  - etc.

### Case: Unclear or inaudible segments
Mark as [unclear] in BOTH formats. Do not guess.

```
orthographic_mixed:  "నేను [unclear] వెళ్తాను"
orthographic_native: "నేను [unclear] వెళ్తాను"
```

### Case: Acronyms and initialisms
Keep as pronounced:
- "NASA" spoken as single word → "NASA" / "నాసా"
- "N-A-S-A" spelled out → "N A S A" / "ఎన్ ఏ ఎస్ ఏ"

## WHAT NOT TO DO

❌ Do not "clean up" disfluencies (um, stutters, false starts); leave out only sounds that are clearly not speech
❌ Do not add words that weren't spoken
❌ Do not remove words that were spoken
❌ Do not transliterate in orthographic_mixed (English stays Latin)
❌ Do not keep Latin script in orthographic_native (English gets transliterated)
❌ Do not invent orthographic_native variants when audio is pure native (they should be identical)
❌ Do not convert quoted foreign language to the frame language's script
❌ Do not normalize dialectal pronunciations to "standard" forms

## OUTPUT STRUCTURE

Return a JSON object with this exact structure:

```json
{
  "primary_language_detected": "telugu|hindi|tamil|...|mixed",
  "contains_code_mixing": true|false,
  "languages_present": ["telugu", "english"],
  "transcripts": {
    "orthographic_mixed": "...",
    "orthographic_native": "..."
  },
  "segments": [
    {
      "start_ms": 0,
      "end_ms": 1500,
      "text_mixed": "నేను",
      "text_native": "నేను",
      "language": "telugu",
      "is_quoted": false
    },
    {
      "start_ms": 1500,
      "end_ms": 2000,
      "text_mixed": "office",
      "text_native": "ఆఫీస్",
      "language": "english",
      "is_quoted": false
    }
  ],
  "confidence": 0.95,
  "has_unclear_segments": false
}
```

## FINAL REMINDER

You are a TRANSCRIPTION system, not a TRANSLATION system, not a CORRECTION system, not a COMPLETION system.

If the speaker says three words, you output three words.
If the audio is pure Telugu, orthographic_mixed equals orthographic_native exactly.
If you're uncertain, mark [unclear] rather than guess.

TRANSCRIBE LITERALLY. DETERMINISTICALLY. EXACTLY.
```

---

## The Structured Output Schema

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["primary_language_detected", "contains_code_mixing", "languages_present", "transcripts", "confidence"],
  "properties": {
    "primary_language_detected": {
      "type": "string",
      "enum": ["hindi", "telugu", "tamil", "kannada", "malayalam", "bengali", "gujarati", "marathi", "punjabi", "odia", "assamese", "english", "mixed"]
    },
    "contains_code_mixing": {
      "type": "boolean"
    },
    "languages_present": {
      "type": "array",
      "items": {
        "type": "string",
        "enum": ["hindi", "telugu", "tamil", "kannada", "malayalam", "bengali", "gujarati", "marathi", "punjabi", "odia", "assamese", "english"]
      },
      "minItems": 1
    },
    "transcripts": {
      "type": "object",
      "required": ["orthographic_mixed", "orthographic_native"],
      "properties": {
        "orthographic_mixed": { "type": "string" },
        "orthographic_native": { "type": "string" }
      }
    },
    "segments": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["text_mixed", "text_native", "language"],
        "properties": {
          "start_ms": { "type": "integer" },
          "end_ms": { "type": "integer" },
          "text_mixed": { "type": "string" },
          "text_native": { "type": "string" },
          "language": { "type": "string" },
          "is_quoted": { "type": "boolean" }
        }
      }
    },
    "confidence": {
      "type": "number",
      "minimum": 0,
      "maximum": 1
    },
    "has_unclear_segments": {
      "type": "boolean"
    }
  }
}
```
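
A schema only constrains the shape of the response, so some of the prompt's rules still need checking after parsing. A minimal sketch using only the standard library (the function name `check_output` and the particular checks are assumptions for illustration, not part of the schema above):

```python
import json
import re

REQUIRED = ("primary_language_detected", "contains_code_mixing",
            "languages_present", "transcripts", "confidence")

def check_output(raw: str) -> list[str]:
    """Return the problems found in one model response (an empty list means it passed)."""
    data = json.loads(raw)
    problems = [f"missing key: {key}" for key in REQUIRED if key not in data]
    if problems:
        return problems
    mixed = data["transcripts"].get("orthographic_mixed", "")
    native = data["transcripts"].get("orthographic_native", "")
    # The [unclear] marker is the only Latin text allowed in the native format.
    if re.search(r"[A-Za-z]", native.replace("[unclear]", "")):
        problems.append("Latin letters in orthographic_native")
    if mixed.count("[unclear]") != native.count("[unclear]"):
        problems.append("[unclear] markers differ between formats")
    # Without code-mixing or digits, the two formats have nothing to differ on.
    if (not data["contains_code_mixing"]
            and not re.search(r"[0-9]", mixed)
            and mixed != native):
        problems.append("formats differ although no code-mixing was reported")
    return problems
```

The digit guard on the last check is deliberate: a pure Telugu utterance that contains a spoken number would legitimately differ between the two formats (10 versus ౧౦).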

---

## Why This Works

**Determinism enforcement:**
- Explicit "TRANSCRIBE ONLY WHAT YOU HEAR"
- Repeated "do not invent" instructions
- [unclear] as escape valve instead of guessing
- "Both outputs identical when no mixing" stated multiple times

**Code-mix handling:**
- Quoted speech rules explicitly defined
- Nested code-mixing addressed
- Each language keeps its script in mixed format

**Native format clarity:**
- Only English and numbers change
- Quoted foreign languages stay foreign
- Numeral conversion table provided

**Edge case coverage:**
- Pure native → identical outputs
- Quoted speech → script preserved
- Nested mixing → inner language rules apply
- Unclear → marked, not guessed
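
Prompt wording encourages repeatable output but is not by itself a guarantee, so it may be worth measuring. A minimal sketch, assuming a hypothetical `transcribe` callable that wraps whatever model call is in use and returns the raw response string:

```python
from collections import Counter
from typing import Callable

def agreement_rate(transcribe: Callable[[str], str], audio_path: str, runs: int = 5) -> float:
    """Fraction of runs that returned the most common output (1.0 means fully repeatable)."""
    outputs = [transcribe(audio_path) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs
```

A rate below 1.0 on the same clip suggests tightening the sampling settings or the prompt before trusting the transcripts downstream.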