Is this looking okay?

W0404 23:22:37.924000 1113384 torch/distributed/run.py:792]
W0404 23:22:37.924000 1113384 torch/distributed/run.py:792] *****************************************
W0404 23:22:37.924000 1113384 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0404 23:22:37.924000 1113384 torch/distributed/run.py:792] *****************************************
2026-04-04 23:22:48 [INFO] World size: 8, Device: cuda:0
2026-04-04 23:22:48 [INFO] Loading model: /workspace/training/tokenizer_extension/extended_model
Loading weights: 0%| | 0/2150 [00:00<
<|startoftranscript|><|emo:undefined|><|hi|><|hi|><|pnc|><|noitn|><|notimestamp|><|nodiarize|> → 9 tokens
2026-04-04 23:25:45 [INFO] DataLoader: 8 workers, max_batch_mel_frames=800000, max_batch_utterances=64
2026-04-04 23:25:46 [INFO] Optimizer: AdamW, base_lr=0.0002, LLRD=0.9, max_encoder_depth=47
wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY.
wandb: Currently logged in as: bharathkumar1922001 (maya-research-maya-research) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: setting up run 85dzyawa
wandb: Tracking run with wandb version 0.25.1
wandb: Run data is saved locally in /workspace/training/wandb/run-20260404_232549-85dzyawa
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run indic-lang-extension-v1
wandb: ⭐️ View project at https://wandb.ai/maya-research-maya-research/maya-asr-cohere-transcribe
wandb: 🚀 View run at https://wandb.ai/maya-research-maya-research/maya-asr-cohere-transcribe/runs/85dzyawa
2026-04-04 23:25:58 [INFO] Starting training: 2 epochs, ~142574 estimated steps
2026-04-04 23:25:58 [INFO] Warmup: 2851 steps (2%)
2026-04-04 23:25:58 [INFO] Strategy: ddp, BF16: True
2026-04-04 23:25:58 [INFO] Gradient accumulation: 2
2026-04-04 23:25:58 [INFO] === Epoch 1/2 ===
2026-04-04 23:28:47 [INFO] Step 50 | loss=15.4257 | lr=3.44e-06 | grad=78.50 | throughput=181.6x RT
2026-04-04 23:31:02 [INFO] Step 100 | loss=13.5723 | lr=6.94e-06 | grad=40.25 | throughput=150.6x RT
2026-04-04 23:33:39 [INFO] Step 150 | loss=9.6411 | lr=1.05e-05 | grad=4.12 | throughput=89.8x RT
2026-04-04 23:36:34 [INFO] Step 200 | loss=8.9768 | lr=1.40e-05 | grad=1.89 | throughput=82.1x RT
2026-04-04 23:39:48 [INFO] Step 250 | loss=8.3982 | lr=1.75e-05 | grad=2.12 | throughput=81.1x RT
2026-04-04 23:42:42 [INFO] Step 300 | loss=7.8483 | lr=2.10e-05 | grad=1.70 | throughput=35.9x RT
2026-04-04 23:46:01 [INFO] Step 350 | loss=7.5591 | lr=2.45e-05 | grad=1.73 | throughput=52.8x RT
2026-04-04 23:49:09 [INFO] Step 400 | loss=7.2173 | lr=2.80e-05 | grad=1.81 | throughput=40.0x RT
2026-04-04 23:52:19 [INFO] Step 450 | loss=7.0046 | lr=3.15e-05 | grad=2.11 | throughput=31.1x RT
2026-04-04 23:56:07 [INFO] Step 500 | loss=6.9517 | lr=3.50e-05 | grad=3.59 | throughput=41.1x RT
2026-04-04 23:59:29 [INFO] Step 550 | loss=6.7215 | lr=3.85e-05 | grad=3.20 | throughput=20.0x RT
2026-04-05 00:03:12 [INFO] Step 600 | loss=6.6965 | lr=4.20e-05 | grad=1.27 | throughput=27.8x RT
2026-04-05 00:07:01 [INFO] Step 650 | loss=6.5202 | lr=4.55e-05 | grad=6.28 | throughput=26.9x RT
2026-04-05 00:10:26 [INFO] Step 700 | loss=6.3383 | lr=4.90e-05 | grad=2.41 | throughput=16.7x RT
2026-04-05 00:14:20 [INFO] Step 750 | loss=6.3453 | lr=5.25e-05 | grad=5.12 | throughput=26.4x RT
2026-04-05 00:18:05 [INFO] Step 800 | loss=6.1364 | lr=5.61e-05 | grad=5.62 | throughput=15.5x RT
2026-04-05 00:21:43 [INFO] Step 850 | loss=6.1014 | lr=5.96e-05 | grad=2.44 | throughput=16.7x RT

I don't want to disturb the run. What could explain the throughput decline pattern (from ~181x RT at step 50 down to ~16x RT by step 850)?
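For context, the logged lr values look consistent with plain linear warmup from 0 to base_lr over the 2851 warmup steps. A minimal sketch (the `(step - 1)` indexing is inferred from the log values, not confirmed from the training code):

```python
# Sketch of the lr schedule implied by the log: linear warmup from 0
# to base_lr over 2851 steps. The (step - 1) offset is an inference
# from matching the logged values, not taken from the actual code.
base_lr = 2e-4
warmup_steps = 2851

def warmup_lr(step: int) -> float:
    """Learning rate at a given optimizer step during warmup."""
    return base_lr * min((step - 1) / warmup_steps, 1.0)

print(f"step 50:  {warmup_lr(50):.2e}")   # 3.44e-06, matches the log
print(f"step 500: {warmup_lr(500):.2e}")  # 3.50e-05, matches the log
```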
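The optimizer line (base_lr=0.0002, LLRD=0.9, max_encoder_depth=47) suggests layer-wise learning rate decay, where each layer below the top trains at 0.9x the rate of the one above it. A hypothetical sketch of what that implies (the function name and top-down indexing are my assumptions):

```python
# Hypothetical sketch of layer-wise LR decay (LLRD) with the logged
# settings: base_lr=2e-4, decay factor 0.9, 47 encoder layers.
# `layer_lr` and the depth convention are assumptions for illustration.
base_lr = 2e-4
decay = 0.9
max_encoder_depth = 47

def layer_lr(depth: int) -> float:
    """LR for a layer `depth` levels below the top (top layer = depth 0)."""
    return base_lr * decay ** depth

# The top layer trains at the full base LR, while the deepest layer gets
# a far smaller one, keeping pretrained low-level features nearly frozen.
print(f"top layer:     {layer_lr(0):.2e}")                  # 2.00e-04
print(f"deepest layer: {layer_lr(max_encoder_depth):.2e}")  # 1.41e-06
```

The ~140x spread between top and bottom is a design choice: higher layers adapt quickly to the new Indic-language tokens while the acoustic front end stays stable.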
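On the throughput figure itself: an "Nx RT" number in ASR training is usually audio seconds processed per wall-clock second, so a decline can come either from slower steps or from shorter audio per batch. A sketch under that assumption (this logger's exact definition is unknown to me):

```python
# Assumed definition of the "Nx RT" metric: seconds of audio consumed
# divided by wall-clock seconds spent. If the logger uses this, a drop
# can mean slower steps OR less audio per batch (e.g. shorter clips
# from a length-bucketed sampler), not necessarily slower GPUs.
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall time."""
    return audio_seconds / wall_seconds

# e.g. a logging window covering 400 s of audio in 2.2 s of wall time
print(f"{realtime_factor(400.0, 2.2):.1f}x RT")  # 181.8x RT
```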