---
name: R2 Upload + HF Dataset
overview: Upload all audio segments (~94GB) to Cloudflare R2, then create a HuggingFace dataset with audio, transcription, speaker info, R2 URLs, and content-based dedup IDs.
todos:
  - id: r2-upload
    content: Upload all ~94GB of audio files to Cloudflare R2 using boto3 with parallel workers
    status: in_progress
  - id: build-dataset
    content: "Build HF dataset with schema: id, audio, text, speaker_id, source, language, duration_s, r2_url, sample_rate"
    status: in_progress
  - id: push-hf
    content: Push dataset to saichranreddy/internal_indic_h (private) with Audio feature enabled
    status: pending
isProject: false
---

# R2 Audio Upload + HuggingFace Dataset

## Dataset Schema

Each row in the HuggingFace dataset will have:


| Column | Description |
| ------ | ----------- |
| `id` | `SHA-256(audio_bytes + transcript_text)[:16]`; deterministic and dedup-friendly |
| `audio` | HF Audio feature (playable/streamable on HF) |
| `text` | Transcription (cleaned, without the "Speaker 0:" prefix) |
| `speaker_id` | Voice name extracted from the source path (e.g., `aoede`, `kajal`, `modi`, `IISc_Hindi_Male_Spk001`) |
| `source` | Origin dataset (e.g., `google_tts`, `polly`, `sarvam`, `iisc_syspin`, `rasa_hindi`, `modi`) |
| `language` | `hi` (Hindi) for all rows |
| `duration_s` | Audio duration in seconds |
| `r2_url` | Full R2 URL of the audio file |
| `sample_rate` | Audio sample rate in Hz |
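
The `id` formula in the schema can be sketched as follows. This is a minimal illustration, not a fixed implementation: it assumes `[:16]` means the first 16 hex characters of the digest and that the transcript is encoded as UTF-8. The function name `content_id` is hypothetical.

```python
import hashlib


def content_id(audio_bytes: bytes, text: str) -> str:
    """Return a 16-character content hash of the audio bytes plus the transcript.

    Assumptions: hex digest, UTF-8 text encoding (neither is stated in the plan).
    """
    digest = hashlib.sha256(audio_bytes + text.encode("utf-8")).hexdigest()
    return digest[:16]
```

Because the hash depends only on content, the same audio and transcript pair gets the same `id` on every run, while any change to either input yields a different one.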

## Step 1: Upload Audio to R2

- Use `boto3` with S3-compatible API pointing to the R2 endpoint
- User will provide: R2 endpoint URL, access key, secret key, bucket name
- Upload structure: `audio/{source}/{speaker_id}/{filename}.wav`
- Use multiprocessing (8-16 workers) for parallel uploads
- Skip files that already exist in R2 (check by key) for resumability
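
A possible shape for the resumable upload is sketched below. It is a sketch under stated assumptions: all function names are hypothetical, `region_name="auto"` follows the convention Cloudflare documents for R2, and a thread pool is swapped in for multiprocessing because uploads are I/O-bound (a process pool would need one client per worker). The existence check uses a one-item `list_objects_v2` prefix listing and compares the returned key exactly.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def make_r2_client(endpoint_url: str, access_key: str, secret_key: str):
    """Create an S3-compatible client for R2 (boto3 is imported lazily)."""
    import boto3

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name="auto",  # assumption: R2's documented region value
    )


def object_key(source: str, speaker_id: str, filename: str) -> str:
    """Build the key audio/{source}/{speaker_id}/{filename}.wav."""
    return f"audio/{source}/{speaker_id}/{Path(filename).stem}.wav"


def key_exists(client, bucket: str, key: str) -> bool:
    """Return True only if an object with exactly this key is listed."""
    resp = client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    return any(obj["Key"] == key for obj in resp.get("Contents", []))


def upload_one(client, bucket: str, local_path: str, key: str) -> bool:
    """Upload one file unless its key is already present; True if uploaded."""
    if key_exists(client, bucket, key):
        return False
    client.upload_file(local_path, bucket, key)
    return True


def upload_all(client, bucket: str, jobs, workers: int = 8) -> int:
    """Upload (local_path, key) pairs in parallel; return how many were sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda job: upload_one(client, bucket, *job), jobs)
        return sum(results)
```

Re-running `upload_all` with the same job list skips every key already in the bucket, so an interrupted run can simply be restarted.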

## Step 2: Build HuggingFace Dataset

- Read `dataset_combined.jsonl` (190,022 entries)
- For each entry:
  - Compute `id = sha256(audio_bytes + text.encode())[:16]`
  - Extract `speaker_id` and `source` from the audio path
  - Get `duration_s` and `sample_rate` from the wav file
  - Build `r2_url` from the upload path
  - Strip "Speaker 0: " prefix from text
- Save as HF Dataset with Audio feature using `datasets` library
- Push to `saichranreddy/internal_indic_h` (private) with audio included
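
The per-entry work above might look like the sketch below. Several details are assumptions rather than part of the plan: local files are laid out as `.../{source}/{speaker_id}/{filename}.wav`, the files are PCM WAV readable by the standard `wave` module, the `id` is hashed over the cleaned text (the plan does not say whether the prefix is stripped before hashing), and `r2_base_url` is a hypothetical public base URL. All function names are illustrative.

```python
import hashlib
import wave
from pathlib import Path

PREFIX = "Speaker 0: "


def clean_text(text: str) -> str:
    """Remove a leading "Speaker 0: " prefix if present."""
    return text[len(PREFIX):] if text.startswith(PREFIX) else text


def wav_info(path: str) -> tuple[float, int]:
    """Return (duration in seconds, sample rate in Hz) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return wf.getnframes() / rate, rate


def build_row(path: str, raw_text: str, r2_base_url: str) -> dict:
    """Assemble one dataset row from a local WAV path and its raw transcript."""
    p = Path(path)
    # Assumed layout: .../{source}/{speaker_id}/{filename}.wav
    speaker_id, source = p.parent.name, p.parent.parent.name
    text = clean_text(raw_text)
    duration_s, sample_rate = wav_info(path)
    digest = hashlib.sha256(p.read_bytes() + text.encode("utf-8")).hexdigest()
    return {
        "id": digest[:16],
        "audio": str(p),  # cast to the Audio feature in push_rows below
        "text": text,
        "speaker_id": speaker_id,
        "source": source,
        "language": "hi",
        "duration_s": duration_s,
        "r2_url": f"{r2_base_url.rstrip('/')}/audio/{source}/{speaker_id}/{p.name}",
        "sample_rate": sample_rate,
    }


def push_rows(rows: list[dict], repo_id: str) -> None:
    """Cast the audio column and push privately (datasets is imported lazily)."""
    from datasets import Audio, Dataset

    ds = Dataset.from_list(rows).cast_column("audio", Audio())
    ds.push_to_hub(repo_id, private=True)
```

With ~94GB of audio, building the full row list in memory and pushing in one call may not be practical; sharding the rows or using a generator-based constructor is a design choice this sketch leaves open.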

## Step 3: Dedup Check

- Before inserting a row, check whether its `id` already exists in the dataset
- On re-runs, only new (unseen) rows get appended
- The content-hash ID means identical audio+text pairs are automatically detected
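
The dedup check can be sketched as a simple set-membership filter. This is an illustration only: `select_new_rows` is a hypothetical helper, and it assumes the IDs already in the dataset have been loaded into an iterable beforehand (for example from the existing `id` column).

```python
def select_new_rows(rows, existing_ids):
    """Return rows whose id is neither in existing_ids nor repeated within rows."""
    seen = set(existing_ids)
    fresh = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])  # also guards against duplicates inside this batch
            fresh.append(row)
    return fresh
```

On a re-run, passing the dataset's current IDs as `existing_ids` leaves only unseen rows to append.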

## Data Breakdown (190,022 rows, ~94GB)

- IISc SYSPIN: 46,748 entries (multiple speakers)
- Polly Kajal: 43,826 entries
- Sarvam: 15,846 entries (multiple speakers)
- Rasa Hindi: 13,302 entries
- Google TTS: ~56,000 entries (17 voices)
- Modi: 3,325 entries

