● I think your inference agent's analysis is solid. The diminishing returns plus the English/Hindi regression is a classic sign of temperature sampling doing its job early but hurting in the later phase. Let me check what's actually configured in our training pipeline for the data mixing:

  Searched for 2 patterns, read 1 file (ctrl+o to expand)

● Bash(cat /root/gemini-asr/lf_asr/artifacts/phase2/buckets/bucket_config.json 2>/dev/null | python3 -c "
      import json, sys…)
  ⎿ Error: Exit code 1
     Traceback (most recent call last):
       File "", line 8, in
     KeyError: 'id'
     Total hours: 148167h
     Total samples: 75003423

● Bash(python3 -c "
      import json…)
  ⎿ {
       "bucket_id": "b_0_3",
       "range": "0.0-3.0s",
     … +27 lines (ctrl+o to expand)

● Here's the key insight: the data mixing isn't language-aware at all — it's bucket (duration) weighted. There's no temperature sampling by language in the current pipeline. The buckets are weighted by audio hours, so the model sees languages proportional to their representation in the dataset.

  The "temperature mixing to prioritize low resource languages" must have been applied upstream when building the training data/parquets, not in our training code. The training pipeline just sees 75M samples across 7 duration buckets and samples proportionally to hours.

  So the question is: what's the actual per-language distribution in the training data?

● Bash(python3 -c "
      import pyarrow.parquet as pq…)
  ⎿ en: 59113h ( 39.9%)  27,006,047 samples
     hi: 18459h ( 12.5%)  10,423,575 samples
     te: 15558h ( 10.5%)   8,743,426 samples
     … +10 lines (ctrl+o to expand)
  ⎿ (timeout 5m)

● This is very revealing.
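  For contrast, language-aware temperature sampling — the thing the current pipeline does *not* do — would reweight languages as p_i ∝ hours_i^(1/T), flattening the distribution toward uniform as T grows. A minimal sketch (the function name `temperature_weights` and the T values are mine, not from the pipeline; hours are from the per-language breakdown):

  ```python
  # Sketch of language-temperature sampling, which the current training
  # pipeline does NOT do (it samples proportionally to bucket hours).
  # With temperature T > 1, low-resource languages are upsampled:
  #     p_i ∝ hours_i ** (1 / T)
  # T = 1 recovers the natural, hours-proportional distribution.

  def temperature_weights(hours_by_lang: dict[str, float], T: float) -> dict[str, float]:
      scaled = {lang: h ** (1.0 / T) for lang, h in hours_by_lang.items()}
      total = sum(scaled.values())
      return {lang: w / total for lang, w in scaled.items()}

  # Subset of the per-language hours from the breakdown.
  hours = {"en": 59113, "hi": 18459, "te": 15558, "as": 1716}

  natural = temperature_weights(hours, T=1.0)   # hours-proportional
  smoothed = temperature_weights(hours, T=5.0)  # flattened toward uniform

  for lang in hours:
      print(f"{lang}: natural={natural[lang]:.3f}  T=5 -> {smoothed[lang]:.3f}")
  ```

  With T=5, English's share drops from ~0.62 of this subset to ~0.33 while Assamese's rises — which is the kind of upstream reweighting the dataset builders presumably applied before writing the parquets.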
  The data distribution is not temperature-sampled at the training pipeline level:

  ┌───────────┬─────────┬───────┬───────────────────┐
  │ Language  │ Hours   │ %     │ Status at 100k    │
  ├───────────┼─────────┼───────┼───────────────────┤
  │ English   │ 59,113h │ 39.9% │ Regressing (+0.9) │
  │ Hindi     │ 18,459h │ 12.5% │ Regressing (+0.1) │
  │ Telugu    │ 15,558h │ 10.5% │ Plateaued (-0.4)  │
  │ Malayalam │  9,895h │  6.7% │ Improving (-1.1)  │
  │ Tamil     │  9,336h │  6.3% │ Improving (-1.6)  │
  │ Punjabi   │  9,063h │  6.1% │ Improving (-1.3)  │
  │ Gujarati  │  6,269h │  4.2% │ Plateaued (-0.1)  │
  │ Bengali   │  5,816h │  3.9% │ Improving (-1.3)  │
  │ Kannada   │  5,562h │  3.8% │ Plateaued (-0.4)  │
  │ Marathi   │  4,919h │  3.3% │ Regressing (+0.6) │
  │ Odia      │  2,461h │  1.7% │ Regressing (+0.2) │
  │ Assamese  │  1,716h │  1.2% │ Improving (-2.5)  │
  └───────────┴─────────┴───────┴───────────────────┘

  English is 40% of the data and still regressing. That's not temperature sampling causing it — it's something else:

  1. The cosine LR decay — at step 100k/612k we're still near peak LR (1.886e-5 vs peak 2e-5). The model is still learning aggressively, which causes some forgetting of already-converged patterns.
  2. The batch skipping — we were skipping 20.9% of batches at seq_len>500, which disproportionately affected longer-transcript languages. Now fixed to 700.
  3. English was already strong in the base model — small regressions are normal as the model adapts to Indic scripts.

  The good news: since the LR schedule is cosine over 612k steps, the LR will naturally decay. By step 200-300k it'll be significantly lower, which should stabilize English/Hindi while the Indic languages continue improving.

  I wouldn't change anything right now — the fix to max_batch_seq_len=700 means more data is being used, and the natural cosine decay will handle the rest. Let's check again at the 150k benchmark. If the English/Hindi regression accelerates, then we'd consider early stopping or freezing the encoder.
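  To sanity-check the "decay will handle it" argument, here is a sketch of the LR trajectory assuming a plain cosine decay from peak to zero with no warmup term (the actual schedule config isn't shown here, and the logged 1.886e-5 at 100k sits slightly above this form, which a short warmup would explain — treat the numbers as approximate):

  ```python
  import math

  # Hedged sketch: plain cosine decay from PEAK_LR to 0 over TOTAL_STEPS.
  # The real schedule likely adds warmup, so these values are approximate.
  PEAK_LR = 2e-5
  TOTAL_STEPS = 612_000

  def cosine_lr(step: int, peak: float = PEAK_LR, total: int = TOTAL_STEPS) -> float:
      # Standard cosine annealing: peak at step 0, zero at step `total`.
      return 0.5 * peak * (1.0 + math.cos(math.pi * step / total))

  for step in (100_000, 200_000, 300_000):
      print(f"step {step:>7}: lr = {cosine_lr(step):.3e}")
  ```

  Under this form the LR is about 1.87e-5 at 100k (close to the logged 1.886e-5), falling to roughly 1.5e-5 at 200k and about half of peak by 300k — consistent with expecting English/Hindi to stabilize in that range without intervention.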