Main Speaker Extraction — Pipeline 2 (karthyani)
YouTube clip 2:35–46:08 (43:33) · pyannote.ai precision-2 · 3 speakers · target = SPEAKER_00 · SoloSpeech + approach (d) post-mute
Original full clip (43:33)
clip.wav · 16 kHz mono · 2613 s
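As a quick consistency check on the header, the clip boundaries 2:35 and 46:08 give 43:33, which is 2613 s. At 16 kHz mono that is about 41.8 M samples. The helper names below are illustrative, not part of the pipeline:

```python
def mmss_to_seconds(ts: str) -> int:
    """Convert an 'M:SS' or 'H:MM:SS' timestamp to whole seconds."""
    total = 0
    for part in ts.split(":"):
        total = total * 60 + int(part)
    return total


def clip_stats(start: str, end: str, sample_rate: int = 16_000) -> tuple[int, int]:
    """Return (duration in seconds, expected mono sample count) for a clip."""
    duration = mmss_to_seconds(end) - mmss_to_seconds(start)
    return duration, duration * sample_rate


duration, n_samples = clip_stats("2:35", "46:08")
print(duration, divmod(duration, 60), n_samples)  # 2613 (43, 33) 41808000
```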
Clean SPEAKER_00 (target — concat reference only)
speaker_00_clean.wav · 12:03 · 231 segments concatenated
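A minimal sketch of how a per-speaker "clean" file like this could be built from diarization output. It assumes (not confirmed by the report) that "clean" means a turn no other speaker overlaps, and that overlapped turns are dropped whole rather than trimmed. `Segment`, `is_clean` and `concat_clean` are hypothetical names. Audio is a plain list of samples here; with NumPy arrays you would collect the slices and call `np.concatenate` instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float    # seconds
    speaker: str  # diarization label, e.g. "SPEAKER_00"


def is_clean(seg: Segment, segments: list[Segment]) -> bool:
    """True if no other speaker's segment intersects `seg` (assumed meaning of 'clean')."""
    return not any(
        other.speaker != seg.speaker and other.start < seg.end and seg.start < other.end
        for other in segments
    )


def concat_clean(audio: list[float], segments: list[Segment], speaker: str,
                 sample_rate: int = 16_000) -> tuple[list[float], int]:
    """Concatenate the clean turns of `speaker` in time order.

    Returns the concatenated samples and the number of turns used.
    """
    out: list[float] = []
    used = 0
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.speaker != speaker or not is_clean(seg, segments):
            continue  # other speaker, or overlapped turn: skip it entirely
        lo = round(seg.start * sample_rate)
        hi = round(seg.end * sample_rate)
        out.extend(audio[lo:hi])
        used += 1
    return out, used
```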
Clean SPEAKER_02 (secondary)
speaker_02_clean.wav · 2:07 · 62 segments
Clean SPEAKER_01 (likely noise)
speaker_01_clean.wav · 0:07 · 20 segments
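The per-speaker totals above suggest why SPEAKER_01 reads as noise. Its 20 segments average only 0.35 s, against roughly 3.1 s for SPEAKER_00 and 2.0 s for SPEAKER_02. The report does not say how the label was assigned, so the sketch below is a hypothetical heuristic: the 30 s total and 0.5 s mean thresholds are illustrative assumptions. So is choosing the target as the speaker with the most clean speech.

```python
def speaker_summary(totals: dict[str, tuple[float, int]],
                    min_total_s: float = 30.0,
                    min_mean_s: float = 0.5) -> tuple[str, dict[str, tuple[float, bool]]]:
    """Return (target speaker, {speaker: (mean turn length in s, likely_noise)}).

    Thresholds are illustrative assumptions, not values from the report.
    """
    summary = {}
    for spk, (total_s, n_segments) in totals.items():
        mean_s = total_s / n_segments
        likely_noise = total_s < min_total_s or mean_s < min_mean_s
        summary[spk] = (round(mean_s, 2), likely_noise)
    # Hypothesis: the target is simply the speaker with the most clean speech.
    target = max(totals, key=lambda spk: totals[spk][0])
    return target, summary


# Totals from the report: (seconds of clean speech, number of segments).
reported = {
    "SPEAKER_00": (12 * 60 + 3, 231),
    "SPEAKER_02": (2 * 60 + 7, 62),
    "SPEAKER_01": (7, 20),
}
target, summary = speaker_summary(reported)
print(target)  # SPEAKER_00
print(summary)
```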
★ DATASET: individual files (≥ 4 s each, no concatenation)
Two columns: clean SPEAKER_00 turns (7:09 total, 69 files) · SoloSpeech-extracted overlap regions (19:00 total, 120 files)
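Here is a sketch of the dataset rule as stated: keep each turn of at least 4 s as its own file and never concatenate. This filter would be consistent with the drop from 231 concatenated SPEAKER_00 segments (12:03) to 69 dataset files (7:09), though the report does not show the exact selection code. The output naming scheme and the 16-bit PCM format are assumptions. Only the standard-library `wave` and `array` modules are used.

```python
import array
import sys
import wave
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float    # seconds
    speaker: str


MIN_DURATION_S = 4.0  # the ">= 4 s" threshold stated in the report


def dataset_items(segments: list[Segment], speaker: str,
                  min_duration: float = MIN_DURATION_S) -> list[Segment]:
    """Each qualifying turn becomes its own dataset item; nothing is concatenated."""
    keep = [s for s in segments
            if s.speaker == speaker and s.end - s.start >= min_duration]
    return sorted(keep, key=lambda s: s.start)


def write_pcm16(path: str, samples: list[float], sample_rate: int = 16_000) -> None:
    """Write float samples in [-1, 1] as a 16-bit mono WAV file."""
    pcm = array.array("h", (int(max(-1.0, min(1.0, x)) * 32767) for x in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(pcm.tobytes())


def export(audio: list[float], segments: list[Segment], speaker: str,
           out_dir: str, sample_rate: int = 16_000) -> list[str]:
    """Write one WAV per qualifying turn; the file naming here is hypothetical."""
    paths = []
    for i, seg in enumerate(dataset_items(segments, speaker)):
        lo, hi = round(seg.start * sample_rate), round(seg.end * sample_rate)
        path = f"{out_dir}/{speaker.lower()}_{i:03d}.wav"
        write_pcm16(path, audio[lo:hi], sample_rate)
        paths.append(path)
    return paths
```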
Reference enrollment (used by SoloSpeech)
main_speaker_ref.wav · 14.40 s · longest clean SPEAKER_00 turn, at 893.26 s
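The enrollment choice reduces to "the longest target turn that no other speaker overlaps". A minimal sketch follows, assuming the same notion of "clean" as above; `pick_reference` is a hypothetical name. If 893.26 s is the turn's start, the reference window would run from 893.26 s to 907.66 s.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float    # seconds
    speaker: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def is_clean(seg: Segment, segments: list[Segment]) -> bool:
    """True if no other speaker's segment intersects `seg`."""
    return not any(
        o.speaker != seg.speaker and o.start < seg.end and seg.start < o.end
        for o in segments
    )


def pick_reference(segments: list[Segment], speaker: str) -> Segment:
    """Longest turn of `speaker` that no other speaker overlaps."""
    candidates = [s for s in segments if s.speaker == speaker and is_clean(s, segments)]
    if not candidates:
        raise ValueError(f"no clean turn for {speaker}")
    return max(candidates, key=lambda s: s.duration)
```

The longest raw turn is deliberately not preferred when it overlaps someone else, since overlapped audio would contaminate the enrollment.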
Overlap regions — original vs SoloSpeech extracted
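The overlap regions compared here are presumably the spans where the target speaker and at least one other speaker are active at once. This sketch computes them by intersecting diarization intervals and merging any that touch. It is an illustration under that assumption, not the pipeline's own code. The SoloSpeech call that would take each region plus main_speaker_ref.wav is not shown, because its interface is not given in the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float    # seconds
    speaker: str


def overlap_regions(segments: list[Segment], target: str) -> list[tuple[float, float]]:
    """Merged time spans where `target` overlaps any other speaker."""
    tgt = [(s.start, s.end) for s in segments if s.speaker == target]
    oth = [(s.start, s.end) for s in segments if s.speaker != target]
    hits = []
    for ts, te in tgt:
        for os_, oe in oth:
            lo, hi = max(ts, os_), min(te, oe)
            if lo < hi:  # non-empty intersection
                hits.append((lo, hi))
    # Merge intersections that overlap or touch into single regions.
    merged: list[tuple[float, float]] = []
    for lo, hi in sorted(hits):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged
```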