o
    5¤˜ií.  ã                   @   s|  U d Z ddlmZmZ ddlmZmZ ddddœddd	dœd
dddœddddœddddœddddœddddœddddœddddœddd dœd!d"ddœd#d$d%dœd&œZeeeeef f e	d'< G d(d)„ d)eƒZ
G d*d+„ d+eƒZG d,d-„ d-eƒZd.ed/efd0d1„Zd/efd2d3„Zd4d5d6d7œd5d8d7œd4d9d5g d:¢d;œd5g d<¢d;œd5g d=¢d;œd5d>d7œd?œg d@¢dAdBœd5dCd7œdDœg dD¢dAdEœZdFS )Ga}  
Pydantic schemas for structured transcription output.

Architecture change (v4): Gemini outputs fewer fields, code derives the rest.
  Gemini outputs:
    1. transcription   - Native script with punctuation (PRIMARY, authoritative)
    2. tagged          - Code-mixed + audio event tags [laugh] etc.
    3. speaker         - Metadata: emotion, style, pace, accent
    4. detected_language

  Code derives (deterministic, not Gemini):
    5. romanized       - uroman-based Latin transliteration (from transcription)
    6. code_switch     - no longer a Gemini output (tagged subsumes it); derived from tagged by stripping event tags, or left empty

Prompt v4 design (evolved from v1 -> v2 strict -> v3 field derivation -> v4 simplified):
  - Reduced Gemini output from 4 text fields to 2 (transcription + tagged)
  - Less cognitive load = better adherence, more deterministic output
  - romanized derived deterministically via uroman = stable, reproducible MMS alignment
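
A minimal sketch of the "code derives" idea for code_switch, assuming it is obtained by removing the audio event tags from tagged. The helper name `strip_event_tags` and the constant `EVENT_TAGS` are hypothetical and not part of this module; the tag list is the one named in the module's prompt. Uncertainty markers such as [UNK] are deliberately left in place.

```python
import re

# Event tags named in this module's prompt. Other bracketed markers
# ([UNK], [INAUDIBLE], [NO_SPEECH]) are not event tags and are kept.
EVENT_TAGS = ("laugh", "cough", "sigh", "breath",
              "singing", "noise", "music", "applause")

# Match one event tag together with any whitespace directly before it.
_TAG_RE = re.compile(r"\s*\[(?:" + "|".join(EVENT_TAGS) + r")\]")


def strip_event_tags(tagged: str) -> str:
    """Hypothetical helper: drop event tags, then normalise whitespace."""
    without_tags = _TAG_RE.sub("", tagged)
    return re.sub(r"\s+", " ", without_tags).strip()


if __name__ == "__main__":
    print(strip_event_tags("haan [laugh] okay, [UNK] fine [sigh]"))
```

Because the derivation is a pure string operation, the same tagged input always yields the same code_switch text, which is the reproducibility the v4 notes aim for.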
é    )ÚOptionalÚDict)Ú	BaseModelÚFieldzhi-INÚ
Devanagariz'Preserve Nukta when clearly pronounced.)Úbcp47ÚscriptÚscript_ruleszmr-INz&Preserve retroflex lateral accurately.zte-INÚTeluguzADon't over-split words. Preserve Sandhi/combined forms as spoken.zta-INÚTamilz-Distinguish short and long vowels accurately.zkn-INÚKannadaz/Preserve agglutinated/combined forms as spoken.zml-INÚ	Malayalamz8Don't split agglutinated words. Preserve chillu letters.zgu-INÚGujaratiÚ zpa-INÚGurmukhizbn-INÚBengaliz4Preserve Chandrabindu for nasalization where spoken.zas-INÚAssamesez:Use Assamese-specific characters, NOT Bengali equivalents.zor-INÚOdiazen-INÚLatinzBStandard English spelling. Don't phonetically approximate accents.)ÚHindiÚMarathir
   r   r   r   r   ÚPunjabir   r   r   ÚEnglishÚLANGUAGE_CONFIGSc                   @   sb   e Zd ZU dZedddZeed< edddZeed< ed	d
dZ	eed< edddZ
eed< dS )ÚSpeakerMetaz"Speaker metadata for TTS training.Úneutralz.neutral, happy, sad, angry, excited, surprised©ÚdefaultÚdescriptionÚemotionÚconversationalzEconversational, narrative, excited, calm, emphatic, sarcastic, formalÚspeaking_styleÚnormalzslow, normal, fastÚpacer   z>Regional accent/dialect if detectable, empty string if unknownÚaccentN)Ú__name__Ú
__module__Ú__qualname__Ú__doc__r   r   ÚstrÚ__annotations__r!   r#   r$   © r+   r+   úA/home/ubuntu/maya3_transcribe/src/backend/transcription_schema.pyr   X   s$   
 þþþþr   c                   @   sŠ   e Zd ZU dZeddZeed< edddZeed< edd	dZ	eed
< eddZ
eed< edddZee ed< edddZeed< dS )ÚTranscriptionOutputz>Structured output: 2 text fields from Gemini + derived fields.zhCode-mixed transcription with punctuation. Each language in its original script (English stays English).©r   Útranscriptionr   zcMixed script: native + English in Latin. v4: derived from tagged by stripping event tags, or empty.r   Úcode_switchzfFull Roman/Latin script transliteration. v4: derived via uroman from transcription, not Gemini output.Ú	romanizedz¬Code-mixed transcription with audio event tags. Native script for primary language, Latin for English words, plus [laugh] [cough] [sigh] etc. at positions where they occur.ÚtaggedNz/Speaker metadata: emotion, style, pace, accent.Úspeakerz*The language actually spoken in the audio.Údetected_language)r%   r&   r'   r(   r   r/   r)   r*   r0   r1   r2   r3   r   r   r4   r+   r+   r+   r,   r-   l   s0   
 ÿþþÿþþr-   c                   @   s  e Zd ZU dZeddZeed< edddZe	ed< ed	d
dZ
e	ed< eddZeed< eddZeed< eddZeed< eddZeed< edddZee ed< edddZee ed< edddZee ed< edddZee ed< edefdd„ƒZdS ) ÚTranscriptionResultz=Complete result for a transcribed segment including metadata.z Identifier for the audio segmentr.   Ú
segment_idr   z Chunk index if segment was splitr   Úchunk_indexé   zTotal chunks for this segmentÚtotal_chunkszDuration of audio in secondsÚduration_seczPrimary language of the audioÚlanguagezThe transcription outputsr/   z#Gemini model used for transcriptionÚ
model_usedNzThinking level usedÚthinking_levelzAPI call timeÚprocessing_time_seczaccept/review/rejectÚvalidation_statuszAlignment score 0-1Úvalidation_scoreÚreturnc                 C   s   | j j S )z)Shortcut to primary native transcription.)r/   )Úselfr+   r+   r,   Únative£   s   zTranscriptionResult.native)r%   r&   r'   r(   r   r6   r)   r*   r7   Úintr9   r:   Úfloatr;   r/   r-   r<   r=   r   r>   r?   r@   ÚpropertyrC   r+   r+   r+   r,   r5   ‘   s   
 r5   r;   rA   c           	      C   sÎ   t  | i ¡}| d| › d¡}| dd¡}| dd¡}d}|r*d|  ¡ › d|› d}|  ¡ d	kr5d
}d}nd| › d|› d|› d}d}|rM| › d|› dn| }d|› d| › d| › d| › d|› d|› d|› dS )zs
    System instruction for Gemini transcription.
    v4: 2 text fields (transcription + tagged) instead of 4.
    r   z nativer   r   r	   z
SCRIPT RULES FOR z:
Ú
Úenglishz#Write in standard English spelling.z?Same as transcription with audio event tags at their positions.zWrite z
 words in a   script.
   Keep English words in English (Latin script) exactly as spoken.
   Keep Hindi words in Devanagari, Tamil words in Tamil script, etc.
   Each language stays in its original script. Do NOT transliterate.
   Example: speaker says 'salt biscuits manchidi' -> salt biscuits z
(manchidi)z›Same text as transcription with audio event tags inserted at their positions.
   Do NOT change any words or scripts - just add the tags where events occur.z (ú)zYou are a verbatim speech-to-text transcription system. You are NOT a conversational assistant. Your output must precisely match the audio content.

TARGET: zµ

CRITICAL RULES (violations cause rejection):
1. NEVER TRANSLATE. This is transcription, not translation. If the speaker says English words, those are English. If the speaker says z words, those are a*  . Write what you HEAR, not what you think it means in another language.
2. VERBATIM FIDELITY: Every repetition, filler, stammer, false start, hesitation - exactly as spoken.
3. NO CORRECTION: Do not fix grammar, pronunciation, dialect, or word choice.
4. NO HALLUCINATION: Never add words or phrases not in the audio. If audio cuts off mid-sentence, STOP where the audio stops. Do not complete anything. Output ONLY the JSON.
5. UNCERTAINTY: If a word is unclear, write [UNK]. Use [INAUDIBLE] for unintelligible speech. Use [NO_SPEECH] for no speech (silence, noise, music only).
6. BOUNDARY HANDLING: Audio is VAD-cut and may start/end mid-speech. Transcribe everything you can confidently hear. Only omit what is truly inaudible.
7. LANGUAGE MISMATCH: Trust what you hear. If audio is clearly different from zá, transcribe in that language's script and set detected_language accordingly.

PUNCTUATION (prosody-based, not grammar):
- Only: comma, period, ? and !
- Insert from audible pauses/intonation only. No pause = no punctuation.
ar  
FIELD DERIVATION:
"transcription" is the PRIMARY authoritative output. It IS code-mixed: each language in its own script.
"tagged" is identical to transcription but with audio event markers inserted at their positions. Do NOT re-interpret the audio for tagged - just copy transcription and add tags.

OUTPUT FIELDS:

1. transcription (AUTHORITATIVE - native script)
   z
   Punctuation: period, comma, ? and ! only, from audible prosodic cues.

2. tagged (derived from transcription - code-mixed + event tags)
   a.  
   ONLY these tags, ONLY if clearly and prominently audible:
   [laugh] [cough] [sigh] [breath] [singing] [noise] [music] [applause]

3. speaker (metadata from audio prosody)
   emotion: neutral | happy | sad | angry | excited | surprised
   speaking_style: conversational | narrative | excited | calm | emphatic | sarcastic | formal
   pace: slow | normal | fast
   accent: regional dialect/accent if confidently detectable, empty string otherwise.

4. detected_language
   The language you actually hear spoken. If code-mixed, write the dominant language.)r   ÚgetÚupperÚlower)	r;   Úlang_configÚscript_namer   r	   Úscript_sectionÚnative_field_ruleÚtagged_ruleÚ
lang_labelr+   r+   r,   Úget_transcription_prompt©   s@   üÿÿþûûõðèärS   c                   C   s   dS )z#User prompt to accompany the audio.zuTranscribe this audio segment following the system instructions. Return a valid JSON object with all required fields.r+   r+   r+   r+   r,   Úget_user_promptõ   s   rT   ÚobjectÚstringz4Native script transcription with minimal punctuation)Útyper   z.Code-mixed transcription with audio event tagszSpeaker metadata)r   ÚhappyÚsadÚangryÚexcitedÚ	surprised)rW   Úenum)r    Ú	narrativer[   ÚcalmÚemphaticÚ	sarcasticÚformal)Úslowr"   Úfastz'Regional accent/dialect or empty string)r   r!   r#   r$   )r   r!   r#   F)rW   r   Ú
propertiesÚrequiredÚadditionalPropertiesz%Language actually spoken in the audio)r/   r2   r3   r4   )rW   re   rf   rg   N)r(   Útypingr   r   Úpydanticr   r   r   r)   r*   r   r-   r5   rS   rT   ÚTRANSCRIPTION_JSON_SCHEMAr+   r+   r+   r,   Ú<module>   s²    ýýýýýýýýýýýýÈ@%LþþþþþþíäþÙ,
Ñ
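
A minimal sketch of how a model response could be checked against the constraints in TRANSCRIPTION_JSON_SCHEMA using only the standard library. The helper `check_response` and the constants below are hypothetical; the field names, required lists, and enum values are copied from the schema in this module. It is not a full JSON Schema validator.

```python
import json
from typing import List

# Top-level fields the schema marks as required (additionalProperties is False).
REQUIRED_TOP = ("transcription", "tagged", "speaker", "detected_language")

# Required speaker fields and their allowed values; "accent" is optional free text.
SPEAKER_ENUMS = {
    "emotion": {"neutral", "happy", "sad", "angry", "excited", "surprised"},
    "speaking_style": {"conversational", "narrative", "excited", "calm",
                       "emphatic", "sarcastic", "formal"},
    "pace": {"slow", "normal", "fast"},
}


def check_response(raw: str) -> List[str]:
    """Hypothetical helper: return a list of problems; empty means no issue found."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level is not an object"]

    problems = [f"missing field: {k}" for k in REQUIRED_TOP if k not in data]
    problems += [f"unexpected field: {k}" for k in data if k not in REQUIRED_TOP]

    speaker = data.get("speaker")
    if isinstance(speaker, dict):
        for key, allowed in SPEAKER_ENUMS.items():
            if key not in speaker:
                problems.append(f"speaker.{key} missing")
            elif not isinstance(speaker[key], str) or speaker[key] not in allowed:
                problems.append(f"speaker.{key} not in enum: {speaker[key]!r}")
    elif "speaker" in data:
        problems.append("speaker is not an object")
    return problems
```

Returning a list of problems rather than raising lets a caller decide whether to retry the request, route the segment to review, or reject it.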