{
    "blueprint_version": 1,
    "agent_context_to_download": "Context: We downloaded the full `finalsftdata` dataset locally to `/root/sft_data`.\n\nSource/index strategy:\n- We did not rely on live recursive LIST calls during the bulk transfer.\n- We used the prefetched metadata/indexes already in repo:\n  - `finalsftdata.json` for dataset blueprint, shard/layout info, expected totals\n  - `s5cmd_prebuilt_batch.txt` as the precomputed object/shard index\n- We parsed the prebuilt batch to recover shard directories, then synthesized the full object list per shard:\n  - `audio.tar`\n  - `audio_index.parquet`\n  - `manifest.json`\n  - `metadata.parquet`\n  - `xcodec2_tokens.parquet`\n- `indicvoices` is expected to miss `xcodec2_tokens.parquet`, so total downloaded files are `27,500` vs blueprint `27,550`.\n\nDownload strategy:\n- We benchmarked `boto3`, `s5cmd`, `rclone`, and `aria2c`.\n- Best bulk downloader was `s5cmd` in batch mode.\n- Final production run used extremely high parallelism with dual `s5cmd` streams:\n  - audio stream: `384` workers for `audio.tar`\n  - metadata stream: `128` workers for parquet/json files\n  - total: `512` workers\n  - per-file multipart concurrency: `10`\n  - part size: `64 MiB`\n- This avoided repeated list overhead and maximized transfer concurrency from the prefetched index.\n\nResult:\n- Download completed successfully to `/root/sft_data`\n- Total files: `27,500`\n- Total size on disk: about `16 TiB`\n- Zero download errors\n- Sustained throughput was about `592-593 MiB/s`\n- Total elapsed time was about `8.6h`\n\nUseful files:\n- `finalsftdata.json`\n- `s5cmd_prebuilt_batch.txt`\n- `full_download.py`\n- `launch_download.sh`\n- `check_download.sh`\n- `Storage.md`\n\nCredentials:\n- R2 credentials were loaded from `.env`\n- Endpoint is Cloudflare R2",
    "generated_at": "2026-03-18T04:39:45.998969+00:00",
    "bucket": "finalsftdata",
    "excluded_prefixes": [
      {
        "requested": "v1",
        "matched_prefix": "v1"
      },
      {
        "requested": "nemotron-ckpts",
        "matched_prefix": "nemotron_chkpts"
      }
    ],
    "overall": {
      "prefix_count": 12,
      "prefixes": [
        "ears",
        "expresso",
        "final-export",
        "globe",
        "hifitts2",
        "indicvoices",
        "indicvoices-r",
        "josh",
        "joshdelivery",
        "librittsr",
        "ljspeech",
        "vctk"
      ],
      "unique_language_count": 12,
      "unique_languages": [
        "as",
        "bn",
        "en",
        "gu",
        "hi",
        "kn",
        "ml",
        "mr",
        "or",
        "pa",
        "ta",
        "te"
      ],
      "total_shards": 5559,
      "total_objects": 27550,
      "total_bytes": 17494928445210,
      "total_gib": 16293.422
    },
    "schema_notes": {
      "standard_components": [
        "audio.tar",
        "audio_index.parquet",
        "manifest.json",
        "metadata.parquet"
      ],
      "optional_components": [
        "xcodec2_tokens.parquet"
      ],
      "layout_patterns": [
        {
          "pattern": "final-export/production/shards/lang=<lang>/<shard_id>/<component>",
          "prefixes": [
            "final-export"
          ]
        },
        {
          "pattern": "<prefix>/lang=<lang>/<shard_id>/<component>",
          "prefixes": [
            "ears",
            "expresso",
            "globe",
            "hifitts2",
            "indicvoices-r",
            "josh",
            "joshdelivery",
            "librittsr",
            "ljspeech",
            "vctk"
          ]
        },
        {
          "pattern": "indicvoices/<lang>/<shard_id>/<component>",
          "prefixes": [
            "indicvoices"
          ]
        }
      ],
      "component_exceptions": [
        "indicvoices shards do not include `xcodec2_tokens.parquet` in the current bucket snapshot."
      ]
    },
    "prefixes": {
      "ears": {
        "path_pattern": "ears/lang=<lang>/<shard_id>/<component>",
        "language_count": 1,
        "languages": {
          "en": {
            "shard_count": 2,
            "object_count": 10,
            "total_bytes": 4647699587,
            "total_gib": 4.329,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "ears/lang=en/en_ears_shard_1773388452_000001"
          }
        },
        "total_shards": 2,
        "total_objects": 10,
        "total_bytes": 4647699587,
        "total_gib": 4.329,
        "components": [
          "audio.tar",
          "audio_index.parquet",
          "manifest.json",
          "metadata.parquet",
          "xcodec2_tokens.parquet"
        ],
        "sample_shard": "ears/lang=en/en_ears_shard_1773388452_000001"
      },
      "expresso": {
        "path_pattern": "expresso/lang=<lang>/<shard_id>/<component>",
        "language_count": 1,
        "languages": {
          "en": {
            "shard_count": 1,
            "object_count": 5,
            "total_bytes": 705435962,
            "total_gib": 0.657,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "expresso/lang=en/en_expresso_shard_1773387888_000001"
          }
        },
        "total_shards": 1,
        "total_objects": 5,
        "total_bytes": 705435962,
        "total_gib": 0.657,
        "components": [
          "audio.tar",
          "audio_index.parquet",
          "manifest.json",
          "metadata.parquet",
          "xcodec2_tokens.parquet"
        ],
        "sample_shard": "expresso/lang=en/en_expresso_shard_1773387888_000001"
      },
      "final-export": {
        "path_pattern": "final-export/production/shards/lang=<lang>/<shard_id>/<component>",
        "language_count": 12,
        "languages": {
          "as": {
            "shard_count": 39,
            "object_count": 195,
            "total_bytes": 118549042906,
            "total_gib": 110.407,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "final-export/production/shards/lang=as/as_shard_1773353123057162142_30709382_000034_bf9155ea"
          },
          "bn": {
            "shard_count": 158,
            "object_count": 790,
            "total_bytes": 529257072033,
            "total_gib": 492.909,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "final-export/production/shards/lang=bn/bn_shard_1773352123394248078_20536452_000023_7ea7a19a"
          },
          "en": {
            "shard_count": 1231,
            "object_count": 6155,
            "total_bytes": 4248410378055,
            "total_gib": 3956.64,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "final-export/production/shards/lang=en/en_shard_1773348768261126927_32749165_000001_afbdce0d"
          },
          "gu": {
            "shard_count": 186,
            "object_count": 930,
            "total_bytes": 600391092106,
            "total_gib": 559.158,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "final-export/production/shards/lang=gu/gu_shard_1773352335632268023_32728031_000025_1a85fe52"
          },
          "hi": {
            "shard_count": 698,
            "object_count": 3490,
            "total_bytes": 2264034395178,
            "total_gib": 2108.546,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "final-export/production/shards/lang=hi/hi_shard_1773349701664996730_30905386_000001_31c46e6c"
          },
          "kn": {
            "shard_count": 200,
            "object_count": 1000,
            "total_bytes": 747900180201,
            "total_gib": 696.536,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "final-export/production/shards/lang=kn/kn_shard_1773351637502146730_12128576_000020_d148347e"
          },
          "ml": {
            "shard_count": 372,
            "object_count": 1860,
            "total_bytes": 1352594222347,
            "total_gib": 1259.702,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "final-export/production/shards/lang=ml/ml_shard_1773351448264269505_32301437_000021_508a5ec8"
          },
          "mr": {
            "shard_count": 148,
            "object_count": 740,
            "total_bytes": 482143065280,
            "total_gib": 449.031,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "final-export/production/shards/lang=mr/mr_shard_1773352042928966654_18944001_000043_84213297"
          },
          "or": {
            "shard_count": 78,
            "object_count": 390,
            "total_bytes": 245780080043,
            "total_gib": 228.901,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "final-export/production/shards/lang=or/or_shard_1773352597455653389_28018611_000023_6f990836"
          },
          "pa": {
            "shard_count": 343,
            "object_count": 1715,
            "total_bytes": 1194838283265,
            "total_gib": 1112.78,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "final-export/production/shards/lang=pa/pa_shard_1773351592026650801_25999748_000019_12f64d7a"
          },
          "ta": {
            "shard_count": 312,
            "object_count": 1560,
            "total_bytes": 1089424618362,
            "total_gib": 1014.606,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "final-export/production/shards/lang=ta/ta_shard_1773351323809562164_13351895_000038_9c498343"
          },
          "te": {
            "shard_count": 585,
            "object_count": 2925,
            "total_bytes": 1869669073691,
            "total_gib": 1741.265,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "final-export/production/shards/lang=te/te_shard_1773350043788876691_30690797_000011_dad6dfc1"
          }
        },
        "total_shards": 4350,
        "total_objects": 21750,
        "total_bytes": 14742991503467,
        "total_gib": 13730.481,
        "components": [
          "audio.tar",
          "audio_index.parquet",
          "manifest.json",
          "metadata.parquet",
          "xcodec2_tokens.parquet"
        ],
        "sample_shard": "final-export/production/shards/lang=as/as_shard_1773353123057162142_30709382_000034_bf9155ea"
      },
      "globe": {
        "path_pattern": "globe/lang=<lang>/<shard_id>/<component>",
        "language_count": 1,
        "languages": {
          "en": {
            "shard_count": 39,
            "object_count": 195,
            "total_bytes": 35471454519,
            "total_gib": 33.035,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "globe/lang=en/en_globe_shard_1773387480_000001"
          }
        },
        "total_shards": 39,
        "total_objects": 195,
        "total_bytes": 35471454519,
        "total_gib": 33.035,
        "components": [
          "audio.tar",
          "audio_index.parquet",
          "manifest.json",
          "metadata.parquet",
          "xcodec2_tokens.parquet"
        ],
        "sample_shard": "globe/lang=en/en_globe_shard_1773387480_000001"
      },
      "hifitts2": {
        "path_pattern": "hifitts2/lang=<lang>/<shard_id>/<component>",
        "language_count": 1,
        "languages": {
          "en": {
            "shard_count": 622,
            "object_count": 3110,
            "total_bytes": 1660373535487,
            "total_gib": 1546.343,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "hifitts2/lang=en/en_hifitts2_shard_1773346451_000001"
          }
        },
        "total_shards": 622,
        "total_objects": 3110,
        "total_bytes": 1660373535487,
        "total_gib": 1546.343,
        "components": [
          "audio.tar",
          "audio_index.parquet",
          "manifest.json",
          "metadata.parquet",
          "xcodec2_tokens.parquet"
        ],
        "sample_shard": "hifitts2/lang=en/en_hifitts2_shard_1773346451_000001"
      },
      "indicvoices": {
        "path_pattern": "indicvoices/<lang>/<shard_id>/<component>",
        "language_count": 11,
        "languages": {
          "as": {
            "shard_count": 32,
            "object_count": 128,
            "total_bytes": 50094264856,
            "total_gib": 46.654,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet"
            ],
            "sample_shard": "indicvoices/as/as_shard_1773307606_000001"
          },
          "bn": {
            "shard_count": 23,
            "object_count": 92,
            "total_bytes": 40172702580,
            "total_gib": 37.414,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet"
            ],
            "sample_shard": "indicvoices/bn/bn_shard_1773308098_000001"
          },
          "gu": {
            "shard_count": 7,
            "object_count": 28,
            "total_bytes": 13906988213,
            "total_gib": 12.952,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet"
            ],
            "sample_shard": "indicvoices/gu/gu_shard_1773308457_000001"
          },
          "hi": {
            "shard_count": 26,
            "object_count": 104,
            "total_bytes": 40753619495,
            "total_gib": 37.955,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet"
            ],
            "sample_shard": "indicvoices/hi/hi_shard_1773308742_000001"
          },
          "kn": {
            "shard_count": 19,
            "object_count": 76,
            "total_bytes": 33063993358,
            "total_gib": 30.793,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet"
            ],
            "sample_shard": "indicvoices/kn/kn_shard_1773309188_000001"
          },
          "ml": {
            "shard_count": 23,
            "object_count": 92,
            "total_bytes": 41660837528,
            "total_gib": 38.8,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet"
            ],
            "sample_shard": "indicvoices/ml/ml_shard_1773309580_000001"
          },
          "mr": {
            "shard_count": 20,
            "object_count": 80,
            "total_bytes": 36545343094,
            "total_gib": 34.036,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet"
            ],
            "sample_shard": "indicvoices/mr/mr_shard_1773310041_000001"
          },
          "or": {
            "shard_count": 27,
            "object_count": 108,
            "total_bytes": 42936883774,
            "total_gib": 39.988,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet"
            ],
            "sample_shard": "indicvoices/or/or_shard_1773310460_000001"
          },
          "pa": {
            "shard_count": 20,
            "object_count": 80,
            "total_bytes": 38832222889,
            "total_gib": 36.165,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet"
            ],
            "sample_shard": "indicvoices/pa/pa_shard_1773310927_000001"
          },
          "ta": {
            "shard_count": 29,
            "object_count": 116,
            "total_bytes": 50284045672,
            "total_gib": 46.831,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet"
            ],
            "sample_shard": "indicvoices/ta/ta_shard_1773311415_000001"
          },
          "te": {
            "shard_count": 19,
            "object_count": 76,
            "total_bytes": 33822906664,
            "total_gib": 31.5,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet"
            ],
            "sample_shard": "indicvoices/te/te_shard_1773311937_000001"
          }
        },
        "total_shards": 245,
        "total_objects": 980,
        "total_bytes": 422073808123,
        "total_gib": 393.087,
        "components": [
          "audio.tar",
          "audio_index.parquet",
          "manifest.json",
          "metadata.parquet"
        ],
        "sample_shard": "indicvoices/as/as_shard_1773307606_000001"
      },
      "indicvoices-r": {
        "path_pattern": "indicvoices-r/lang=<lang>/<shard_id>/<component>",
        "language_count": 11,
        "languages": {
          "as": {
            "shard_count": 5,
            "object_count": 25,
            "total_bytes": 11095434853,
            "total_gib": 10.333,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "indicvoices-r/lang=as/as_ivr_shard_1773374538_000001"
          },
          "bn": {
            "shard_count": 3,
            "object_count": 15,
            "total_bytes": 7868647025,
            "total_gib": 7.328,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "indicvoices-r/lang=bn/bn_ivr_shard_1773374655_000001"
          },
          "gu": {
            "shard_count": 1,
            "object_count": 5,
            "total_bytes": 308478590,
            "total_gib": 0.287,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "indicvoices-r/lang=gu/gu_ivr_shard_1773373689_000001"
          },
          "hi": {
            "shard_count": 2,
            "object_count": 10,
            "total_bytes": 5051774177,
            "total_gib": 4.705,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "indicvoices-r/lang=hi/hi_ivr_shard_1773378138_000001"
          },
          "kn": {
            "shard_count": 2,
            "object_count": 10,
            "total_bytes": 2902232788,
            "total_gib": 2.703,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "indicvoices-r/lang=kn/kn_ivr_shard_1773374752_000001"
          },
          "ml": {
            "shard_count": 2,
            "object_count": 10,
            "total_bytes": 5467558235,
            "total_gib": 5.092,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "indicvoices-r/lang=ml/ml_ivr_shard_1773380146_000001"
          },
          "mr": {
            "shard_count": 2,
            "object_count": 10,
            "total_bytes": 3377003296,
            "total_gib": 3.145,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "indicvoices-r/lang=mr/mr_ivr_shard_1773376021_000001"
          },
          "or": {
            "shard_count": 2,
            "object_count": 10,
            "total_bytes": 4807745956,
            "total_gib": 4.478,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "indicvoices-r/lang=or/or_ivr_shard_1773380835_000001"
          },
          "pa": {
            "shard_count": 2,
            "object_count": 10,
            "total_bytes": 4899688929,
            "total_gib": 4.563,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "indicvoices-r/lang=pa/pa_ivr_shard_1773383202_000001"
          },
          "ta": {
            "shard_count": 3,
            "object_count": 15,
            "total_bytes": 6539727035,
            "total_gib": 6.091,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "indicvoices-r/lang=ta/ta_ivr_shard_1773377630_000001"
          },
          "te": {
            "shard_count": 3,
            "object_count": 15,
            "total_bytes": 8977651617,
            "total_gib": 8.361,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "indicvoices-r/lang=te/te_ivr_shard_1773386036_000001"
          }
        },
        "total_shards": 27,
        "total_objects": 135,
        "total_bytes": 61295942501,
        "total_gib": 57.086,
        "components": [
          "audio.tar",
          "audio_index.parquet",
          "manifest.json",
          "metadata.parquet",
          "xcodec2_tokens.parquet"
        ],
        "sample_shard": "indicvoices-r/lang=as/as_ivr_shard_1773374538_000001"
      },
      "josh": {
        "path_pattern": "josh/lang=<lang>/<shard_id>/<component>",
        "language_count": 7,
        "languages": {
          "bn": {
            "shard_count": 19,
            "object_count": 95,
            "total_bytes": 43358688335,
            "total_gib": 40.381,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "josh/lang=bn/bn_josh_shard_1773314769_000001"
          },
          "en": {
            "shard_count": 21,
            "object_count": 105,
            "total_bytes": 46157687063,
            "total_gib": 42.988,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "josh/lang=en/en_josh_shard_1773315418_000001"
          },
          "gu": {
            "shard_count": 21,
            "object_count": 105,
            "total_bytes": 40772518964,
            "total_gib": 37.972,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "josh/lang=gu/gu_josh_shard_1773326191_000001"
          },
          "hi": {
            "shard_count": 20,
            "object_count": 100,
            "total_bytes": 44620984988,
            "total_gib": 41.557,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "josh/lang=hi/hi_josh_shard_1773326828_000001"
          },
          "mr": {
            "shard_count": 19,
            "object_count": 95,
            "total_bytes": 41920907302,
            "total_gib": 39.042,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "josh/lang=mr/mr_josh_shard_1773327599_000001"
          },
          "ta": {
            "shard_count": 20,
            "object_count": 100,
            "total_bytes": 43641769912,
            "total_gib": 40.645,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "josh/lang=ta/ta_josh_shard_1773328227_000001"
          },
          "te": {
            "shard_count": 20,
            "object_count": 100,
            "total_bytes": 42424032003,
            "total_gib": 39.51,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "josh/lang=te/te_josh_shard_1773328991_000001"
          }
        },
        "total_shards": 140,
        "total_objects": 700,
        "total_bytes": 302896588567,
        "total_gib": 282.094,
        "components": [
          "audio.tar",
          "audio_index.parquet",
          "manifest.json",
          "metadata.parquet",
          "xcodec2_tokens.parquet"
        ],
        "sample_shard": "josh/lang=bn/bn_josh_shard_1773314769_000001"
      },
      "joshdelivery": {
        "path_pattern": "joshdelivery/lang=<lang>/<shard_id>/<component>",
        "language_count": 5,
        "languages": {
          "bn": {
            "shard_count": 18,
            "object_count": 90,
            "total_bytes": 41328827101,
            "total_gib": 38.49,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "joshdelivery/lang=bn/bn_josh_shard_1773330150_000001"
          },
          "en": {
            "shard_count": 22,
            "object_count": 110,
            "total_bytes": 47944813324,
            "total_gib": 44.652,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "joshdelivery/lang=en/en_josh_shard_1773330527_000001"
          },
          "gu": {
            "shard_count": 24,
            "object_count": 120,
            "total_bytes": 44841378712,
            "total_gib": 41.762,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "joshdelivery/lang=gu/gu_josh_shard_1773330996_000001"
          },
          "hi": {
            "shard_count": 21,
            "object_count": 105,
            "total_bytes": 45157265588,
            "total_gib": 42.056,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "joshdelivery/lang=hi/hi_josh_shard_1773331478_000001"
          },
          "te": {
            "shard_count": 20,
            "object_count": 100,
            "total_bytes": 41108543088,
            "total_gib": 38.285,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "joshdelivery/lang=te/te_josh_shard_1773332471_000001"
          }
        },
        "total_shards": 105,
        "total_objects": 525,
        "total_bytes": 220380827813,
        "total_gib": 205.246,
        "components": [
          "audio.tar",
          "audio_index.parquet",
          "manifest.json",
          "metadata.parquet",
          "xcodec2_tokens.parquet"
        ],
        "sample_shard": "joshdelivery/lang=bn/bn_josh_shard_1773330150_000001"
      },
      "librittsr": {
        "path_pattern": "librittsr/lang=<lang>/<shard_id>/<component>",
        "language_count": 1,
        "languages": {
          "en": {
            "shard_count": 24,
            "object_count": 120,
            "total_bytes": 39977559007,
            "total_gib": 37.232,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "librittsr/lang=en/en_librittsr_shard_1773391719_000001"
          }
        },
        "total_shards": 24,
        "total_objects": 120,
        "total_bytes": 39977559007,
        "total_gib": 37.232,
        "components": [
          "audio.tar",
          "audio_index.parquet",
          "manifest.json",
          "metadata.parquet",
          "xcodec2_tokens.parquet"
        ],
        "sample_shard": "librittsr/lang=en/en_librittsr_shard_1773391719_000001"
      },
      "ljspeech": {
        "path_pattern": "ljspeech/lang=<lang>/<shard_id>/<component>",
        "language_count": 1,
        "languages": {
          "en": {
            "shard_count": 1,
            "object_count": 5,
            "total_bytes": 1675796351,
            "total_gib": 1.561,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "ljspeech/lang=en/en_ljspeech_shard_1773388129_000001"
          }
        },
        "total_shards": 1,
        "total_objects": 5,
        "total_bytes": 1675796351,
        "total_gib": 1.561,
        "components": [
          "audio.tar",
          "audio_index.parquet",
          "manifest.json",
          "metadata.parquet",
          "xcodec2_tokens.parquet"
        ],
        "sample_shard": "ljspeech/lang=en/en_ljspeech_shard_1773388129_000001"
      },
      "vctk": {
        "path_pattern": "vctk/lang=<lang>/<shard_id>/<component>",
        "language_count": 1,
        "languages": {
          "en": {
            "shard_count": 3,
            "object_count": 15,
            "total_bytes": 2438293826,
            "total_gib": 2.271,
            "components": [
              "audio.tar",
              "audio_index.parquet",
              "manifest.json",
              "metadata.parquet",
              "xcodec2_tokens.parquet"
            ],
            "sample_shard": "vctk/lang=en/en_vctk_shard_1773389808_000001"
          }
        },
        "total_shards": 3,
        "total_objects": 15,
        "total_bytes": 2438293826,
        "total_gib": 2.271,
        "components": [
          "audio.tar",
          "audio_index.parquet",
          "manifest.json",
          "metadata.parquet",
          "xcodec2_tokens.parquet"
        ],
        "sample_shard": "vctk/lang=en/en_vctk_shard_1773389808_000001"
      }
    },
    "sample_shard": {
      "selected_remote_prefix": "final-export/production/shards/lang=en/en_shard_1773349950649851143_31538120_000016_cdd33d34",
      "local_directory": "finetune/finalsftdata_sample_shard/en_shard_1773349950649851143_31538120_000016_cdd33d34",
      "reason_selected": "Small but representative final-export shard: 80 segments, 3 source videos, all 5 standard components present.",
      "component_sizes_bytes": {
        "audio.tar": 12062720,
        "audio_index.parquet": 8225,
        "manifest.json": 985,
        "metadata.parquet": 57388,
        "xcodec2_tokens.parquet": 39201
      },
      "manifest": {
        "audio_index_row_count": 80,
        "audio_index_sha256": "118bba4aed32b9c16d2d8da9a3fad2fa34a85cdbbc5ad2d5aeba3a60c47abfe1",
        "audio_index_size_bytes": 8225,
        "audio_tar_member_count": 80,
        "audio_tar_sha256": "a4867973f33a6acb543500b638d471ff21883cd741cfb6af24c63dce19fdf297",
        "audio_tar_size_bytes": 12062720,
        "created_at": "2026-03-12T21:12:30.656452+00:00",
        "language": "en",
        "metadata_row_count": 80,
        "metadata_sha256": "18e6377f4c6ffd62c8a0e12d1c02c5da3c940be4ed1f0fe63ae0994cf09975d5",
        "metadata_size_bytes": 57388,
        "run_id": "production-20260312",
        "segment_count": 80,
        "segment_id_set_sha256": "4f73a725923451721e207f4b9550916d80578548e0072b1f6f1377ae78956fd2",
        "shard_id": "en_shard_1773349950649851143_31538120_000016_cdd33d34",
        "source_microshard_count": 3,
        "source_video_ids_sample": [
          "XwZY2pj0svM",
          "XzBZIm2lJKQ",
          "xZDc7CIR_do"
        ],
        "sum_flac_bytes": 11996300,
        "video_count": 3,
        "worker_id": "final-export-compact-31538120"
      },
      "join_keys": {
        "recommended_join_key": "normalized_segment_id",
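        "normalized_segment_id_rule": "Strip one trailing `.flac` suffix from `segment_id` in `audio_index.parquet` and `metadata.parquet` before joining with `xcodec2_tokens.parquet`. Split-part IDs such as `SPEAKER_00_0009_502.86-532.40.flac_split0`, where `.flac` is not the trailing suffix, are left unchanged and already match in the raw join.",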
        "raw_join_coverage": {
          "audio_to_metadata_matches": 80,
          "audio_to_xcodec_matches": 6,
          "metadata_to_xcodec_matches": 6
        },
        "normalized_join_coverage": {
          "audio_to_metadata_matches": 80,
          "audio_to_xcodec_matches": 80,
          "metadata_to_xcodec_matches": 80
        }
      },
      "audio_index": {
        "row_count": 80,
        "columns": [
          "segment_id",
          "video_id",
          "tar_member_name",
          "flac_size_bytes",
          "flac_sha256",
          "audio_duration_s"
        ],
        "duration_s": {
          "min": 2.308125,
          "max": 14.630958,
          "mean": 4.5397039375,
          "sum": 363.176315
        },
        "flac_size_bytes": {
          "min": 64753,
          "max": 522969,
          "mean": 149953.75,
          "sum": 11996300
        },
        "sample_rows": [
          {
            "segment_id": "SPEAKER_00_0008_483.51-486.20.flac",
            "video_id": "XzBZIm2lJKQ",
            "tar_member_name": "SPEAKER_00_0008_483.51-486.20.flac",
            "flac_size_bytes": 102603,
            "audio_duration_s": 2.988437
          },
          {
            "segment_id": "SPEAKER_00_0009_502.86-532.40.flac_split0",
            "video_id": "XzBZIm2lJKQ",
            "tar_member_name": "SPEAKER_00_0009_502.86-532.40.flac_split0.flac",
            "flac_size_bytes": 277228,
            "audio_duration_s": 7.3
          },
          {
            "segment_id": "SPEAKER_00_0009_502.86-532.40.flac_split1",
            "video_id": "XzBZIm2lJKQ",
            "tar_member_name": "SPEAKER_00_0009_502.86-532.40.flac_split1.flac",
            "flac_size_bytes": 258386,
            "audio_duration_s": 7.265
          }
        ]
      },
      "metadata": {
        "row_count": 80,
        "columns": [
          "video_id",
          "segment_id",
          "parent_segment_file",
          "is_split_part",
          "split_index",
          "segment_language",
          "video_language",
          "youtube_audio_language",
          "youtube_default_language",
          "transcription_native",
          "transcription_romanized",
          "transcription_mixed",
          "transcription_tagged",
          "has_audio_tag",
          "duration_s",
          "sample_rate_hz",
          "rms_dbfs",
          "peak_dbfs",
          "zero_crossing_rate",
          "silence_fraction",
          "chars_per_sec",
          "words_per_sec",
          "tx_quality_score",
          "final_bucket",
          "audio_sha256",
          "flac_size_bytes",
          "meta_information"
        ],
        "duration_s": {
          "min": 2.308125,
          "max": 14.630958,
          "mean": 4.5397039375,
          "sum": 363.176315
        },
        "unique_video_ids": 3,
        "is_split_part_true": 6,
        "final_bucket_counts": {
          "dispose": 72,
          "redo": 5,
          "golden": 3
        },
        "segment_language_counts": {
          "en": 80
        },
        "sample_rate_hz_counts": {
          "48000": 80
        },
        "tx_quality_score": {
          "min": 0.75,
          "max": 1.0,
          "mean": 0.994375
        },
        "meta_information_top_level_keys": [
          "export_provenance",
          "language_evidence",
          "replay_provenance",
          "source_row_provenance",
          "transcript_provenance",
          "validation_provenance",
          "variant_provenance",
          "video_metadata"
        ],
        "meta_information_subkeys": {
          "language_evidence": [
            "corrected_language",
            "gemini_lang",
            "queue_language",
            "segment_language",
            "tx_detected_language",
            "youtube_audio_language",
            "youtube_default_language"
          ],
          "replay_provenance": [
            "is_split_segment",
            "leading_pad_ms",
            "original_end_ms",
            "original_start_ms",
            "parent_segment_file",
            "segment_file",
            "split_index_from_id",
            "trailing_pad_ms",
            "trimmed_end_ms",
            "trimmed_start_ms"
          ],
          "validation_provenance": [
            "conformer_multi_ctc_normalized",
            "consensus_lang",
            "final_bucket",
            "final_has_validation",
            "final_validation_source",
            "lid_agree_count",
            "lid_consensus",
            "mms_confidence"
          ],
          "variant_provenance": [
            "input_script_profile",
            "native_script_text",
            "processing_route",
            "romanized_text",
            "validation_errors"
          ],
          "video_metadata": [
            "channel_id",
            "channel_title",
            "description",
            "tags",
            "title",
            "youtube_audio_language",
            "youtube_default_language"
          ]
        },
        "sample_rows": [
          {
            "segment_id": "SPEAKER_00_0008_483.51-486.20.flac",
            "video_id": "XzBZIm2lJKQ",
            "transcription_native": "game, I lean under, I lean Florida",
            "duration_s": 2.988437,
            "final_bucket": "golden",
            "tx_quality_score": 1.0
          },
          {
            "segment_id": "SPEAKER_00_0009_502.86-532.40.flac_split0",
            "video_id": "XzBZIm2lJKQ",
            "transcription_native": "routine, I don't know why, it just feels comfortable to me. I like the three bets a night, it's sort of how I've kind of, sometimes it's four.",
            "duration_s": 7.3,
            "final_bucket": "dispose",
            "tx_quality_score": 1.0
          },
          {
            "segment_id": "SPEAKER_00_0009_502.86-532.40.flac_split1",
            "video_id": "XzBZIm2lJKQ",
            "transcription_native": "but most often it's three. I think that's a good opportunity for you guys. Some of you are parlaying it, I don't like it when you guys parlay my plays, but some of you",
            "duration_s": 7.265,
            "final_bucket": "dispose",
            "tx_quality_score": 1.0
          }
        ]
      },
      "xcodec2_tokens": {
        "row_count": 80,
        "columns": [
          "segment_id",
          "xcodec2_tokens",
          "token_count"
        ],
        "token_count": {
          "min": 115,
          "max": 731,
          "mean": 225.7375,
          "sum": 18059
        },
        "binary_length_bytes": {
          "min": 230,
          "max": 1462,
          "mean": 451.475
        },
        "sample_rows": [
          {
            "segment_id": "SPEAKER_00_0008_483.51-486.20",
            "token_count": 149
          },
          {
            "segment_id": "SPEAKER_00_0009_502.86-532.40.flac_split0",
            "token_count": 365
          },
          {
            "segment_id": "SPEAKER_00_0009_502.86-532.40.flac_split1",
            "token_count": 363
          }
        ]
      },
      "audio_tar": {
        "member_count": 80,
        "first_members": [
          "SPEAKER_00_0008_483.51-486.20.flac",
          "SPEAKER_00_0009_502.86-532.40.flac_split0.flac",
          "SPEAKER_00_0009_502.86-532.40.flac_split1.flac",
          "SPEAKER_00_0009_502.86-532.40.flac_split2.flac",
          "SPEAKER_00_0009_502.86-532.40.flac_split3.flac",
          "SPEAKER_00_0010_558.17-566.33.flac",
          "SPEAKER_00_0011_570.24-584.57.flac",
          "SPEAKER_00_0311_1884.19-1886.38.flac",
          "SPEAKER_00_0330_2008.35-2011.46.flac",
          "SPEAKER_00_0333_2019.66-2022.80.flac"
        ]
      }
    }
  }