Main Speaker Extraction — Pipeline 2 (karthyani)

YouTube clip 2:35–46:08 (43:33) · pyannote.ai precision-2 · 3 speakers · target = SPEAKER_00 · SoloSpeech + approach (d) post-mute

Original full clip (43:33)

clip.wav · 16 kHz mono · 2613 s
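The clip length follows directly from the YouTube offsets given above. A minimal sketch of that arithmetic; the helper name `to_seconds` is hypothetical and not part of the pipeline:

```python
def to_seconds(stamp: str) -> int:
    """Convert an 'M:SS' or 'H:MM:SS' timestamp to whole seconds."""
    total = 0
    for part in stamp.split(":"):
        total = total * 60 + int(part)
    return total

# Offsets taken from the header line: YouTube clip 2:35 to 46:08.
start = to_seconds("2:35")
end = to_seconds("46:08")

duration = end - start                  # clip length in seconds
minutes, seconds = divmod(duration, 60)
print(duration, f"{minutes}:{seconds:02d}")
```

This reproduces both figures quoted for the clip: 2613 s and 43:33.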

Clean SPEAKER_00 (target — concat reference only)

speaker_00_clean.wav · 12:03 · 231 segments concatenated

Clean SPEAKER_02 (secondary)

speaker_02_clean.wav · 2:07 · 62 segments

Clean SPEAKER_01 (likely noise)

speaker_01_clean.wav · 0:07 · 20 segments

★ DATASET — individual files (≥4s, no concat)

2 columns: clean SPEAKER_00 turns (7:09, 69 files) · overlap SoloSpeech (19:00, 120 files)
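The "≥4s, no concat" rule can be sketched as a filter over diarization turns. This is only an illustration under assumptions: the `Turn` class, the `select_dataset_turns` function, and the sample turns are hypothetical, and the real pipeline's segment format is not shown on this page.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    start: float  # seconds from clip start
    end: float    # seconds from clip start

    @property
    def duration(self) -> float:
        return self.end - self.start

def select_dataset_turns(turns, target="SPEAKER_00", min_dur=4.0):
    """Keep each target-speaker turn of at least min_dur seconds as its own item."""
    return [t for t in turns if t.speaker == target and t.duration >= min_dur]

# Illustrative turns only, not taken from the actual diarization output.
turns = [
    Turn("SPEAKER_00", 0.0, 6.5),    # kept: 6.5 s
    Turn("SPEAKER_02", 6.5, 12.0),   # dropped: not the target speaker
    Turn("SPEAKER_00", 12.0, 14.0),  # dropped: shorter than 4 s
    Turn("SPEAKER_00", 20.0, 24.0),  # kept: exactly 4 s
]
kept = select_dataset_turns(turns)
print([(t.start, t.end) for t in kept])
```

Each kept turn would be written out as a separate file rather than joined into one track, which is what distinguishes the dataset from the concatenated reference audio above.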

Reference enrollment (used by SoloSpeech)

main_speaker_ref.wav · 14.40 s · longest clean SPEAKER_00 turn @ 893.26 s
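Choosing the longest clean turn as the enrollment reference might look like the sketch below. It assumes "clean" means the turn does not overlap any other speaker's turn; that definition, the function names, and the sample turns are assumptions, not details confirmed by this page.

```python
def overlaps(a, b):
    """True if two (speaker, start, end) turns share any stretch of time."""
    return a[1] < b[2] and b[1] < a[2]

def longest_clean_turn(turns, target="SPEAKER_00"):
    """Return the longest target turn that overlaps no other speaker, or None."""
    others = [t for t in turns if t[0] != target]
    clean = [
        t for t in turns
        if t[0] == target and not any(overlaps(t, o) for o in others)
    ]
    return max(clean, key=lambda t: t[2] - t[1], default=None)

# Illustrative (speaker, start_s, end_s) turns, not the real diarization.
turns = [
    ("SPEAKER_00", 0.0, 20.0),   # longest overall, but overlapped below
    ("SPEAKER_02", 15.0, 18.0),
    ("SPEAKER_00", 30.0, 42.0),  # longest clean turn
    ("SPEAKER_00", 50.0, 55.0),
]
ref = longest_clean_turn(turns)
print(ref)
```

The selected span would then be cut from the clip and saved as the enrollment file passed to SoloSpeech.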

Overlap regions — original vs SoloSpeech extracted
