o
    Sin                     @   s  d Z ddlZddlZddlZddlm  mZ ddlZddl	m
Z
 ddlmZ ddlmZmZmZmZmZmZ ddlmZ ddlmZ ddlmZmZmZ dd	lmZ dd
lmZ ddl m!Z! ddl"m#Z#m$Z$m%Z% ddl&m'Z'm(Z(m)Z)m*Z* g dddgg ddZ+g ddgg dg dZ,			d?dedee- dee. dee. ddf
dd Z/	!				"	d@de'd#ee' d$ee' dee- dee. dee. defd%d&Z0G d'd( d(eZ1	)dAd$e'd*e.deeee.e.e.f ee$ f ee.ee.e2f f f fd+d,Z3			dBd-ee' d.ee.ee.e2f f d/e-d0e'def
d1d2Z4		dCd-ee' d/e-d0e'defd3d4Z5d5ed6ee.ee1 f d.ee.ee.e2f f de%fd7d8Z6d5ed6ee.ee1 f de%fd9d:Z7				;	dDd#e'd$ee' d0ee' dee. d<e.d/e-dee.ee.eee%f f f fd=d>Z8dS )Eav  
The data preparation recipe for the ICSI Meeting Corpus. It follows the Kaldi recipe
by Pawel Swietojanski:
https://git.informatik.fh-nuernberg.de/poppto72658/kaldi/-/commit/d5815d3255bb62eacf2fba6314f194fe09966453

ICSI data comprises around 72 hours of natural, meeting-style overlapped English speech
recorded at the International Computer Science Institute (ICSI), Berkeley.
Speech is captured using a set of parallel microphones, including close-talk headsets,
and several independent distant microphones (i.e. mics that do not form any explicitly
known geometry, see below for an example layout). Recordings are sampled at 16kHz.

The corresponding paper describing the ICSI corpus is [1].

[1] A Janin, D Baron, J Edwards, D Ellis, D Gelbart, N Morgan, B Peskin,
    T Pfau, E Shriberg, A Stolcke, and C Wooters, The ICSI meeting corpus.
    in Proc IEEE ICASSP, 2003, pp. 364-367


ICSI data did not come with any pre-defined splits for train/valid/eval sets as it was
mostly used as training material for NIST RT evaluations. Some portions of the unreleased ICSI
data (as a part of this corpus) can be found in, for example, the NIST RT04 and RT05 evaluation sets.

This recipe, however, to be self-contained, factors out training (67.5 hours), development (2.2 hours)
and evaluation (2.8 hours) sets in a way that minimises speaker overlap between the partitions
and avoids known issues with the available recordings during evaluation. This recipe follows [2], where
the dev and eval sets use the {Bmr021, Bns001} and {Bmr013, Bmr018, Bro021} meetings, respectively.

[2] S Renals and P Swietojanski, Neural networks for distant speech recognition.
    in Proc IEEE HSCMA 2014 pp. 172-176. DOI:10.1109/HSCMA.2014.6843274
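
As an illustration of the split described above, the following is a minimal sketch rather than
the recipe's own code. The helper name `partition_of` is hypothetical, and treating every other
meeting as training data is an assumption made for the example. The label 'test' is used for the
evaluation set because the manifests are keyed by 'train', 'dev' and 'test'.

```python
# Dev and eval meetings as listed in the text above (following [2]).
DEV_MEETINGS = {"Bmr021", "Bns001"}
EVAL_MEETINGS = {"Bmr013", "Bmr018", "Bro021"}


def partition_of(meeting_id: str) -> str:
    """Return the partition name for a meeting id (hypothetical helper).

    Assumption: any meeting that is not in the dev or eval set is
    treated as training data.
    """
    if meeting_id in DEV_MEETINGS:
        return "dev"
    if meeting_id in EVAL_MEETINGS:
        return "test"
    return "train"
```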

The description below is (mostly) copied from the ICSI documentation for convenience.
=================================================================================

Simple diagram of the seating arrangement in the ICSI meeting room.

The ordering of seat numbers is as specified below, but their
alignment with microphones may not always be as precise as indicated
here. Also, the seat number only indicates where the participant
started the meeting. Since most of the microphones are wireless, they
were able to move around.

   Door


          1         2            3           4
     -----------------------------------------------------------------------
     |                      |                       |                      |   S
     |                      |                       |                      |   c
     |                      |                       |                      |   r
    9|   D1        D2       |   D3  PDA     D4      |                      |   e
     |                      |                       |                      |   e
     |                      |                       |                      |   n
     |                      |                       |                      |
     -----------------------------------------------------------------------
          8         7            6           5



D1, D2, D3, D4  - Desktop PZM microphones
PDA - The mockup PDA with two cheap microphones

The following are the TYPICAL channel assignments, although a handful
of meetings (including Bmr003, Btr001, Btr002) differed in assignment.

The mapping from the above to the actual waveform channels in the corpus
(and, for the single distant mic case, to the names used in this recipe) is:

D1 - chanE - (this recipe: sdm3)
D2 - chanF - (this recipe: sdm4)
D3 - chan6 - (this recipe: sdm1)
D4 - chan7 - (this recipe: sdm2)
PDA left - chanC
PDA right - chanD
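
The mapping above can be restated as a small lookup, shown here as a minimal sketch. The
dictionary and function names are hypothetical and are not part of this recipe; only the
microphone, channel and sdm names come from the list above.

```python
# Desktop PZM microphone -> waveform channel, as listed above.
DESKTOP_MIC_TO_CHANNEL = {
    "D1": "chanE",
    "D2": "chanF",
    "D3": "chan6",
    "D4": "chan7",
}

# Single-distant-mic name used in this recipe -> desktop microphone.
SDM_NAME_TO_DESKTOP_MIC = {
    "sdm1": "D3",
    "sdm2": "D4",
    "sdm3": "D1",
    "sdm4": "D2",
}


def sdm_channel(sdm_name: str) -> str:
    """Return the waveform channel for an sdm name (hypothetical helper)."""
    return DESKTOP_MIC_TO_CHANNEL[SDM_NAME_TO_DESKTOP_MIC[sdm_name]]
```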

-----------
Note (Pawel): The mapping for headsets is extracted from the mrt files.
In cases where IHM channels are missing for some speakers in some meetings,
this recipe either backs off to a distant channel (typically D2, the default)
or (optionally) skips this speaker's segments entirely.
This is not the case for the eval set, where all the channels come with the
expected recordings, and the split is the same for all conditions (thus allowing
for direct comparisons between IHM, SDM and MDM settings).

NOTE on data: The ICSI data is freely available from the website (see `download` below)
and also as LDC corpora. The annotations that we download below are the same as
LDC2004T04, but there are some differences in the audio data, specifically in the
session names. Some sessions (Bns...) are named (bns...) in the LDC corpus, and the
Mix-Headset wav files are not available from the LDC corpus. We therefore recommend downloading
the public version even if you have an LDC subscription. The public data also includes
annotations of roles, dialog, summary etc., but we have not included them in this recipe.
    N)defaultdict)Path)DictList
NamedTupleOptionalTupleUnion)tqdm)$validate_recordings_and_supervisions)AudioSource	RecordingRecordingSet)read_sph)fix_manifests)normalize_text_ami)AlignmentItemSupervisionSegmentSupervisionSet)PathlikeSecondsadd_durationsresumable_download)FBdb001Bed002Bed003Bed004Bed005Bed006Bed008Bed009Bed010Bed011Bed012Bed013Bed014Bed015Bed016Bed017Bmr001Bmr002Bmr003Bmr005Bmr006Bmr007Bmr008Bmr009Bmr010Bmr011Bmr012Bmr014Bmr015Bmr016Bmr019Bmr020Bmr022Bmr023Bmr024Bmr025Bmr026Bmr027Bmr028Bmr029Bmr030Bmr031Bns002Bns003Bro003Bro004Bro005Bro007Bro008Bro010Bro011Bro012Bro013Bro014Bro015Bro016Bro017Bro018Bro019Bro022Bro023Bro024Bro025Bro026Bro027Bro028Bsr001Btr001Btr002Buw001Bmr021Bns001)Bmr013Bmr018Bro021traindevtest)0123456789ABrn   )EFrn   ro   )ihmsdmmdmihm-mixF&http://https://groups.inf.ed.ac.uk/amirv   
target_dirforce_downloadurlmicreturnc           
      C   s   t tjt ddD ]o}|dv r]t| D ]E}| d| d| d}| | }|jddd |d	| d }z	t|||d
 W q t	j
jy[ }	 ztd|  W Y d }	~	qd }	~	ww q| d| d}| | }|jddd |d }t|||d
 qd S )NzDownloading ICSI meetingsdesc)rv   rw   rx   z/ICSIsignals/SPH/z/chanz.sphTparentsexist_okchanfilenamer|   zSkipping failed download from z/ICSIsignals/NXT/z.interaction.wavzMix-Headset.wav)r
   	itertoolschainfrom_iterable
PARTITIONSvaluesMIC_TO_CHANNELSmkdirr   urlliberror	HTTPErrorloggingwarning)
r{   r|   r}   r~   itemchannelwav_urlwav_dirwav_pathe r   G/home/ubuntu/.local/lib/python3.10/site-packages/lhotse/recipes/icsi.pydownload_audio   s6   

r   .http://groups.inf.ed.ac.uk/ami	audio_dirtranscripts_dirc           	      C   s.  t | } |r
t |n| d }|rt |n| d }t|||| td | r4|s4td|  | S | d}| d}t|| d |d t|| d	 |d t| d	 }||  |rht | d 	| W d
   n1 srw   Y  t| d }|
d| W d
   | S 1 sw   Y  | S )a  
    Download ICSI audio and annotations for provided microphone setting.
    :param target_dir: Pathlike, the path in which audio and transcripts dir are created by default.
    :param audio_dir: Pathlike (default = '<target_dir>/audio'), the path to store the audio data.
    :param transcripts_dir: Pathlike (default = '<target_dir>/transcripts'), path to store the transcripts data
    :param force_download: bool (default = False), if True, download even if file is present.
    :param url: str (default = 'http://groups.inf.ed.ac.uk/ami'), download URL.
    :param mic: str {'ihm','ihm-mix','sdm','mdm'}, type of mic setting.
    :return: the path to downloaded and extracted directory with data.
    speechtranscriptszDownloading ICSI annotationsz/Skip downloading transcripts as they exist in: z4/ICSICorpusAnnotations/ICSI_original_transcripts.zipz(/ICSICorpusAnnotations/ICSI_core_NXT.zipzICSI_original_transcripts.zipr   zICSI_core_NXT.zipNztranscripts/preambles.mrt)r   r   r   infoexistsr   zipfileZipFile
extractallrenameextract)	r{   r   r   r|   r}   r~   annotations_url_mrtannotations_url_nxtzr   r   r   download_icsi   sF   



	
r   c                   @   sB   e Zd ZU eed< eed< eed< eed< eed< ee ed< dS )IcsiSegmentAnnotationtextspeakergender
start_timeend_timewordsN)__name__
__module____qualname__str__annotations__r   r   r   r   r   r   r   r      s   
 r   upper	normalizec           )         s  t t}t t}t t}t| d [}t| }|D ]I}|jdkre|jd }|D ]:}	|	jdkrd|	D ]0}
|
jdkrFdd t	|
D ||< q3|
jdkrc|
D ]}d	|jv rY|jd	 nd
|| |jd < qMq3q*qW d    n1 spw   Y  i }| d 
dD ]t}|jd\}}}g }d }t|=}t|}| D ]+}|jdkrq|d u rd|jv r|jd }t|jd }t|jd }|||f qW d    n1 sw   Y  |d u st|dkrq~||f}|| | }|||f||< q~i }| d 
dD ]}|jd\}}}||f}||vrq|| \}}}g }d}t|T}t|}t	| D ]@\}}|jdksRd|jvsR|jd dksRd|jvsR|jd dkrTq,t|jd }t|jd }||||jf q,W d    n	1 sxw   Y  |||f||< qt t}| D ]\}\}}}|| \}}}|d ||f}|D ]\ tt fdd|}t|dkrq|d d } |d d }!g }"|D ]Z}#t| t|#d dd}$t|!t|#d dd}%t|%|$ dd}&t|#d  |d!}'t|'dkrq|&dkrtd"|d  d| d| d#|  d$|! d% q|"t|$|&|'d& qd'd(d) |"D }(|| t|(||d | |!|"d* qq||fS )+Nzpreambles.mrtMeetingSessionPreambleChannelsc                 S   s   i | ]
\}}|j d  |qS )Name)attrib).0idxr   r   r   r   
<dictcomp>
  s    
z*parse_icsi_annotations.<locals>.<dictcomp>ParticipantsChannelchan6r   Segmentsz*.xmlr   segmentparticipant	starttimeendtimer   WordsFw c                    s   | d ko| d  kS )Nr      r   )r   seg_end	seg_startr   r   <lambda>X  s    z(parse_icsi_annotations.<locals>.<lambda>r      )ndigitsi>  )sampling_rate   r   Segment z	 at time -z5 has a word with zero or negative duration. Skipping.)startdurationsymbol c                 s   s    | ]}|j V  qd S N)r   )r   r   r   r   r   	<genexpr>o  s    z)parse_icsi_annotations.<locals>.<genexpr>)r   r   r   r   r   r   )r   listdictopenETparsegetroottagr   	enumerateglobstemsplitfloatappendlenr   itemsfiltermaxroundminr   r   r   r   r   joinr   ))r   r   annotationschannel_to_idx_mapspk_to_channel_mapfrootchild
meeting_id
grandchildgreatgrandchildr   segmentsfilemeet_idlocal_id_spk_segmentsspk_idtreesegr   r   keyr   r   	seg_wordscombine_with_nextiword	spk_wordsnew_keyr   endword_alignmentsr   w_startw_endw_durw_symbolr   r   r   r   parse_icsi_annotations   s   















&$r  audio_pathsr   save_to_wav
output_dirc              
      s0  dd l }ddlm} |dd | } d u rtt g }t| ddD ]o\} vr8dd tg d	D  < t|d \}	}
|rtt	|d
  }|j
ddd t|D ]\}}t|\}}||j d }|||j|
 |||< qU|t fddt|D |
|	jd |	jd |
 d q#t|S )Nr   )groupbyc                 S   s
   | j d S )N)parts)pr   r   r   r     s   
 z'prepare_audio_grouped.<locals>.<lambda>Preparing audior   c                 S   s   i | ]\}}||qS r   r   )r   r   cr   r   r   r     s    
z)prepare_audio_grouped.<locals>.<dictcomp>)chanEchanFr   chan7wavsTr   .wavc                    s8   g | ]}|j   v rtd   |j  gt|dqS )r  typechannelssource)r   r   r   )r   
audio_pathr   session_namer   r   
<listcomp>  s    z)prepare_audio_grouped.<locals>.<listcomp>r   idsourcesr   num_samplesr   )	soundfilecytoolzr  r   r   r
   r   r   r   r   r   r   writeTr   r   sortedshaper   from_recordings)r  r   r  r  sfr  channel_wavs
recordingschannel_pathsaudio_sf
sampleratesession_dirr  r)  audior  r   r   r*  r   prepare_audio_grouped  sD   


	
r@  c              
   C   s   dd l }g }t| ddD ]c}|jd }|jdkr'||}|j}|j}	|j}
n.t|\}}
|j	\}	}|rUt
|d | }|jddd ||j d }|||j|
 |}|t|td	tt|	t|d
g|
|||
 d qt|S )Nr   r  r   r  r$  r#  Tr   r  r%  r-  )r1  r
   r  suffix	SoundFileframesr'  r=  r   r6  r   r   r   r3  r4  r   r   r   r   ranger   r   r7  )r  r  r  r8  r:  r)  r+  r<  
num_framesnum_channelsr=  r>  r   r   r   r   prepare_audio_single  s@   





rG  r?  r   c                    s    fdd D }g }t | ddD ]e}|jD ]_}|j\}||j|f}|d u r*qt|D ]G\}	}
|
j|
j }|
j|jkrOt	
d|j d| d|	 d q.|dkru|t|j d| d|	 |j|
j||d	|
j|
j|
jd
|
jid
 q.qqt|S )Nc                    s.   i | ]}|d  |d   |d  f | qS )r   r   r   )r   r	  r   r   r   r   r     s     z+prepare_supervision_ihm.<locals>.<dictcomp>Preparing supervisionr   r   r   z8 exceeds recording duration. Not adding to supervisions.r   Englishr  
r.  recording_idr   r   r   languager   r   r   	alignment)r
   r/  r'  getr.  r   r   r   r   r   r   r   r   r   r   r   r   r   from_segments)r?  r   r   annotation_by_id_and_channelr   	recordingr(  r   
annotationseg_idxseg_infor   r   rH  r   prepare_supervision_ihm  sH   

"rV  c                 C   s  t t}| D ]\}}||d  | qg }t| ddD ]`}||j}|jd }|d u r9t	d|j  qt
|jdkrKt	d|j d qt|D ].\}	}
|
j|
j }|dkr}|t|j d|	 |j|
j||jd	|
j|
j|
jd
|
jid
 qOqt|S )Nr   rI  r   z"No annotation found for recording r   z"More than 1 channels in recording z. Skipping this recording.r   rJ  r  rK  )r   r   r   extendr
   rO  r.  r/  r   r   r   r'  r   r   r   r   r   channel_idsr   r   r   r   r   rP  )r?  r   annotation_by_idr	  valuer   rR  rS  r(  rT  rU  r   r   r   r   prepare_supervision_other   sF   

r[  kaldinormalize_textc              	      s   t | } |durt |n| d }|  sJ d|  | s&J d| |t v s4J d| d|r>|dus>J d|durMt |}|jddd td	 t||d
\}}td dt| }|dksn|dkr| 	d| d}	t
t|	|dkr|nd||}
n"|dks|dkrt|r| 	d| dn| 	d}	tt|	||}
td |dkrt|
||nt|
|}tt}dD ]H |
 fdd}| fdd}t||\}}t|| |dur||d| d  d  ||d| d  d  ||d| < qt|S )a  
    Returns the manifests which consist of the Recordings and Supervisions
    :param audio_dir: Pathlike, the path which holds the audio data
    :param transcripts_dir: Pathlike, the path which holds the transcripts data
    :param output_dir: Pathlike, the path where to write the manifests - `None` means manifests aren't stored on disk.
    :param mic: str {'ihm','ihm-mix','sdm','mdm'}, type of mic to use.
    :param normalize_text: str {'none', 'upper', 'kaldi'} normalization of text
    :param save_to_wav: bool, whether to save the sph audio to wav format
    :return: a Dict whose key is ('train', 'dev', 'test'), and the values are dicts of manifests under keys
        'recordings' and 'supervisions'.
    Nr   zNo such directory: zMic z not supportedz/output_dir must be specified when saving to wavTr   zParsing ICSI transcriptsr   zPreparing recording manifestsr   rv   rx   zchan[z].sphrw   ry   z*.wavzPreparing supervision manifestsrd   c                       | j t  v S r   )r.  r   xpartr   r   r         zprepare_icsi.<locals>.<lambda>c                    r^  r   )rL  r   r_  ra  r   r   r     rc  zicsi-_recordings_z	.jsonl.gz_supervisions_)r:  supervisions)r   is_dirr   keysr   r   r   r  r   rglobr@  r   r   rG  rV  r[  r   r   r   r   r   to_file)r   r   r  r~   r]  r  r   r   r'  r  r?  supervision	manifests
audio_partsupervision_partr   ra  r   prepare_icsiK  sj   






ro  )Frz   rv   )r   NNFr   rv   )r   )NFN)FN)NNrv   r\  F)9__doc__r   r   r   xml.etree.ElementTreeetreeElementTreer   r   collectionsr   pathlibr   typingr   r   r   r   r   r	   	tqdm.autor
   lhotser   lhotse.audior   r   r   lhotse.audio.backendr   	lhotse.qar   lhotse.recipes.utilsr   lhotse.supervisionr   r   r   lhotse.utilsr   r   r   r   r   r   boolr   r   r   r   intr  r@  rG  rV  r[  ro  r   r   r   r   <module>   s   \ 
"
B
,
 
<
1
4
-