Context on what we are doing

We need to create a pipeline to generate transcripts using Google's gemini-3-flash model. I've crafted a prompt in @prompt.txt, which is a high-level overview of what I want and how I want the model to act or behave when generating transcripts. Below I'm giving you my whole context on how, where, and when to do everything, so we can plan accordingly from the start.


About sourcing the data and how it was pre-cleaned

To start with, my data is in-the-wild YouTube data. I've downloaded YouTube videos and uploaded them to R2, where each video is stored as videoID.webm. My goal is to use these videos as an audio source to create some insanely high-quality ASR and TTS models. I've already done basic preprocessing to split these videos into single-speaker segments based on VAD activity and speaker diarization. The issue is that this process is not 100% accurate: since the videos are YouTube podcasts, there are always many backchannel acknowledgements like "haa", "yaa", "ohh", and so on, and the segmentation and speaker-similarity models are not precise enough to catch all of them, so some slipped through.


R2 format, the real issues, and accepting minor error


The data in R2 is in the format videoID.tar; unpacking it gives metadata.json and a segments/ folder containing all the segments as .flac files, named by speaker number and timestamps. These segments are not 100% single-speaker, but mostly are, so we can assume a few errors; at this scale, I'm treating that as acceptable. One more issue: the segments don't cleanly start and end in silence the way a TTS system expects. Some were cut off abruptly in the earlier pre-cleanup phase and may start in the middle of a speaker's sentence.
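To make the archive layout concrete, here is a minimal sketch of reading one videoID.tar and pulling out the metadata plus the segment list. The exact segment filename pattern is an assumption (the notes only say speaker number + timestamps); only the metadata.json-at-root and segments/*.flac layout comes from the description above.

```python
import json
import tarfile


def list_segments(tar_path):
    """Read a videoID.tar and return (metadata dict, sorted segment names).

    Assumed layout, per the notes: metadata.json at the archive root and
    segments/<speaker>_<start>_<end>.flac for the audio pieces (the exact
    filename pattern is hypothetical).
    """
    metadata = None
    segments = []
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if member.name.endswith("metadata.json"):
                metadata = json.load(tar.extractfile(member))
            elif member.name.endswith(".flac"):
                segments.append(member.name)
    return metadata, sorted(segments)
```

A worker in the transcription pipeline would call this once per downloaded tar, then stream each .flac to the model.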


That's everything about my data. I'm now supporting 11 Indian languages plus English, so 12 in total: Hindi, Marathi, Telugu, Tamil, Kannada, Malayalam, Gujarati, Punjabi, Bengali, Assamese, Odia, English.
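Since the language will likely be passed into the prompt as an id, a simple name-to-code mapping covers all 12. This is a sketch using ISO 639-1 codes; whether the pipeline actually uses codes or full names depends on how the prompt and the Supabase language tags are written.

```python
# ISO 639-1 codes for the 12 supported languages. Whether the prompt wants
# the code or the full name is a pipeline decision; this map gives one
# canonical place to convert.
LANGUAGE_CODES = {
    "Hindi": "hi", "Marathi": "mr", "Telugu": "te", "Tamil": "ta",
    "Kannada": "kn", "Malayalam": "ml", "Gujarati": "gu", "Punjabi": "pa",
    "Bengali": "bn", "Assamese": "as", "Odia": "or", "English": "en",
}
```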

All of these videoIDs have simple metadata stored in Supabase, with the language tagged along with other details like the number of segments. For this transcription task, though, we mostly just need the language id per videoID, since we'll likely need to pass it in the prompt.
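The lookup-then-prompt step could look like the sketch below. The Supabase table and column names ("videos", "video_id", "language") are assumptions to be matched to the real schema, and the "{language}" placeholder is hypothetical — @prompt.txt may mark it differently.

```python
def fetch_language(supabase_url: str, supabase_key: str, video_id: str) -> str:
    """Look up the language tag for one videoID in Supabase.

    Table/column names here ("videos", "video_id", "language") are
    assumptions; adjust to the actual schema.
    """
    from supabase import create_client  # pip install supabase

    client = create_client(supabase_url, supabase_key)
    res = (
        client.table("videos")
        .select("language")
        .eq("video_id", video_id)
        .single()
        .execute()
    )
    return res.data["language"]


def build_prompt(template: str, language: str) -> str:
    """Fill a {language} placeholder in the prompt text (placeholder assumed)."""
    return template.replace("{language}", language)
```

In the pipeline, the language would be fetched once per videoID and reused for every segment in that video's tar.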

Now, coming to the prompt.

I've prepared a pretty decent prompt, although I need you to