"""
N-gram counting, discounting, interpolation, and backoff

Authors
 * Aku Rouhe 2020
"""

import itertools


def pad_ends(
    sequence, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>"
):
    """
    Pad sentence ends with start- and end-of-sentence tokens.

    In speech recognition, it is important to predict the end of sentence
    and use the start of sentence to condition predictions. Typically this
    is done by adding special tokens (usually <s> and </s>) at the ends of
    each sentence. The <s> token should not be predicted, so some special
    care needs to be taken for unigrams.

    Arguments
    ---------
    sequence : iterator
        The sequence (any iterable type) to pad.
    pad_left : bool
        Whether to pad on the left side as well. True by default.
    left_pad_symbol : any
        The token to use for left side padding. "<s>" by default.
    right_pad_symbol : any
        The token to use for right side padding. "</s>" by default.

    Returns
    -------
    generator
        A generator that yields the padded sequence.

    Example
    -------
    >>> for token in pad_ends(["Speech", "Brain"]):
    ...     print(token)
    <s>
    Speech
    Brain
    </s>

    """
    if pad_left:
        return itertools.chain(
            (left_pad_symbol,), tuple(sequence), (right_pad_symbol,)
        )
    else:
        return itertools.chain(tuple(sequence), (right_pad_symbol,))


def ngrams(sequence, n):
    """
    Produce all Nth order N-grams from the sequence.

    This will generally be used in an N-gram counting pipeline.

    Arguments
    ---------
    sequence : iterator
        The sequence from which to produce N-grams.
    n : int
        The order of N-grams to produce.

    Yields
    ------
    tuple
        Yields each ngram as a tuple.

    Returns
    -------
    None

    Example
    -------
    >>> for ngram in ngrams("Brain", 3):
    ...     print(ngram)
    ('B', 'r', 'a')
    ('r', 'a', 'i')
    ('a', 'i', 'n')

    """
    if n <= 0:
        raise ValueError("N must be >=1")
    # Unigrams need no history, so handle them separately.
    if n == 1:
        for token in sequence:
            yield (token,)
        return
    iterator = iter(sequence)
    history = []
    # Collect the first n - 1 tokens as the initial history.
    for hist_length, token in enumerate(iterator, start=1):
        history.append(token)
        if hist_length == n - 1:
            break
    else:
        # The sequence has fewer than n - 1 tokens: nothing to yield.
        return
    for token in iterator:
        yield tuple(history) + (token,)
        history.append(token)
        del history[0]


def ngrams_for_evaluation(sequence, max_n, predict_first=False):
    """
    Produce each token with the appropriate context.

    The function produces N-grams that are as large as possible, growing
    from unigrams/bigrams up to max_n.

    E.g., when your model is a trigram model, you'll still only have one token
    of context (the start of sentence) for the first token.

    In general this is useful when evaluating an N-gram model.

    Arguments
    ---------
    sequence : iterator
        The sequence to produce tokens and context from.
    max_n : int
        The maximum N-gram length to produce.
    predict_first : bool
        Whether to yield the first token in the sequence as a token to
        predict (without context). Essentially this should be False when the
        start-of-sentence symbol is the first token in the sequence.

    Yields
    ------
    Any
        The token to predict
    tuple
        The context to predict conditional on.

    Example
    -------
    >>> for token, context in ngrams_for_evaluation("Brain", 3, True):
    ...     print(f"p( {token} |{' ' if context else ''}{' '.join(context)} )")
    p( B | )
    p( r | B )
    p( a | B r )
    p( i | r a )
    p( n | a i )
    """
    if max_n <= 0:
        raise ValueError("Max N must be >=1")
    iterator = iter(sequence)
    history = []
    if not predict_first:
        # The first token serves only as context and is never predicted.
        history.append(next(iterator))
    for token in iterator:
        # Drop the oldest token once the history has grown to max_n tokens.
        if len(history) == max_n:
            del history[0]
        yield token, tuple(history)
        history.append(token)