Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning

Ludovic Tuncay¹, Etienne Labbe¹, Emmanouil Benetos², Thomas Pellegrini¹
ludovic.tuncay@irit.fr, etienne.labbe@irit.fr, emmanouil.benetos@qmul.ac.uk, thomas.pellegrini@irit.fr
¹ IRIT, Université de Toulouse, CNRS, Toulouse INP, Toulouse, France
² School of Electronic Engineering and Computer Science, Queen Mary University of London, UK

Abstract—Building on the Joint-Embedding Predictive Architecture (JEPA) paradigm, a recent self-supervised learning framework that predicts latent representations of masked regions in high-level feature spaces, we propose Audio-JEPA (Audio Joint-Embedding Predictive Architecture), tailored specifically for audio data. Audio-JEPA uses a simple Vision Transformer backbone to predict latent representations of masked spectrogram patches rather than reconstructing raw audio. We pre-train on unlabeled AudioSet clips (10 s, 32 kHz) with random patch masking on Mel-spectrograms. We evaluate on the X-ARES suite covering speech, music, and environmental-sound tasks. Although our implementation is a straightforward translation of the original model to audio, the results still show comparable performance to wav2vec 2.0 and data2vec while using less than one-fifth of their training data and with no hyper-parameter tuning. All code and pretrained checkpoints will be released on GitHub¹.

Index Terms—Self-supervised learning, audio representation, joint-embedding predictive architecture, Audio-JEPA, AudioSet

I. Introduction

Self-Supervised Learning (SSL) has revolutionized representation learning for speech and audio, enabling models to learn from unlabeled data and excel in diverse downstream tasks [1, 2, 3, 4]. Early SSL approaches for audio, such as contrastive predictive coding and wav2vec 2.0, learned latent speech representations by masking the input and solving a contrastive task over latent codes [5].
Follow-up methods like HuBERT [1] introduced offline clustering to generate pseudo-labels for masked audio segments, and WavLM [6] applied data augmentation and denoising to improve robustness in speech representation learning. More recently, latent-prediction approaches have gained traction: data2vec [7] and its efficient successor data2vec 2.0 [8] employ a teacher–student framework to predict contextualized latent representations of the input, achieving strong results across vision, speech, and language tasks. In the audio domain, Niizumi et al. introduced Masked Modeling Duo (M2D) [4], which uses two networks (an online and a momentum encoder) to predict masked patch embeddings and attained state-of-the-art results on numerous audio benchmarks.

In computer vision, a new paradigm called Joint-Embedding Predictive Architecture (JEPA) [9, 10, 11] has been proposed to predict hidden content in a high-level latent space instead of pixel space. Notably, the image-based I-JEPA [10] model demonstrated that predicting representations of masked image regions can yield powerful visual features. The JEPA approach differs from prior masked-reconstruction methods by focusing on semantic latent prediction rather than lower-level signal reconstruction. Inspired by these advances, an audio version of JEPA (termed A-JEPA) [12] was recently described by Fei et al. Their A-JEPA encodes a spectrogram “context” part and predicts the latent representations of masked “target” regions using a momentum-updated target encoder. During pre-training, they anneal the mask from fully random toward a SpecAugment-style structured scheme. In contrast, our Audio-JEPA keeps purely random masking throughout, for simplicity and maximal generality. At the time of writing, no official implementation or checkpoints for A-JEPA are available, motivating our from-scratch development.

¹ https://github.com/LudovicTuncay/Audio-JEPA
In the musical domain, Stem-JEPA [13] adapts the JEPA paradigm to multi-track recordings by jointly training an encoder and predictor to forecast embeddings of compatible instrument stems. While it builds on the JEPA backbone, Stem-JEPA differs methodologically in that it masks entire instrument stems instead of individual spectrogram patches.

Our work targets the ICME 2025 Audio Encoder Capability Challenge, where the goal is to learn general audio representations that perform well across a broad suite of tasks. We present our Audio-JEPA implementation, developed from scratch following the I-JEPA paradigm, and benchmark it on the challenge’s eXtensive Audio Representation and Evaluation Suite (X-ARES). Our contributions include: (1) adapting the JEPA masked latent prediction framework to audio spectrogram inputs using a Vision Transformer (ViT) backbone; (2) an extensive evaluation against prior self-supervised audio models on standard downstream tasks, assessing Audio-JEPA via both linear probing and k-nearest-neighbor evaluation.

II. Related Work

A. Self-Supervised Audio Representation Learning

Early work in SSL for audio focused on predicting future or missing parts of the waveform. wav2vec 2.0 [5] pioneered masking in the latent speech representations and training the model to identify the true quantized latent of a masked segment among distractors (a contrastive loss). This approach enabled models to learn rich speech features and achieved remarkable results on speech recognition with limited labeled data. Building on this, HuBERT (Hidden-Unit BERT) [1] introduced a BERT-like masked-prediction task where the model predicts cluster assignments of masked audio frames.

Fig. 1. Audio-JEPA architecture. Mel-spectrogram patches are split into visible and masked sets. A context encoder embeds visible patches, a lightweight predictor reconstructs masked-patch embeddings, and a momentum-updated target encoder provides targets. Training minimizes the average L2 (Euclidean) distance.
The dashed arrow denotes a stop-gradient.

HuBERT uses offline k-means on acoustic features to provide target labels and, by iteratively refining these labels, learns high-level speech units, matching wav2vec 2.0 performance on ASR benchmarks. Facebook’s data2vec presented a modality-general SSL approach: instead of contrastive or classification targets, data2vec [7] trains a student network to regress the contextualized embeddings produced by a teacher network (an Exponential Moving Average (EMA) of the student) for masked portions of the input. data2vec 2.0 [8] improved the efficiency of this method by not encoding masked tokens and using a lightweight decoder, achieving similar accuracies to Masked Autoencoders [14] in a fraction of the training time and matching wav2vec 2.0 on speech tasks with over a 10× speedup. These latent-regression approaches eliminate the need for discrete targets and have set strong baselines in audio.

B. Masked Prediction with Dual Networks

The use of two networks (online/target) for masked prediction has also been explored in specialized audio SSL methods. M2D (Masked Modeling Duo) [4] employs an online network that sees the unmasked patches and a momentum target network that encodes only the masked patches. The online network predicts the target network’s representation of the masked region, encouraging both networks to effectively model the input. This design, inspired by Masked Autoencoders but working in representation space, led M2D to state-of-the-art results on a range of audio classification tasks (environmental sound, speaker ID, music genre, etc.). Notably, M2D achieved top performance on datasets like UrbanSound8K [15], VoxCeleb1 [16], GTZAN [17], and SpeechCommands [18] with a single universal model. Such results highlight the power of using latent prediction instead of raw signal reconstruction for learning transferable audio features.
Other contemporary models include WavLM [6], which extended HuBERT with simulated noisy inputs and achieved strong results on both speech recognition and classification tasks.

C. Joint-Embedding Predictive Architectures

Rather than predicting low-level details of masked inputs, JEPA methods aim to predict higher-level representations. The image-based I-JEPA [10] demonstrated that a ViT can learn excellent representations by predicting the latent representations of masked image patches, as opposed to generating pixels. By operating in the feature space, I-JEPA forces the model to capture abstract semantic information and ignore minute pixel-level differences. The concept has since been extended to other modalities and combinations (e.g., TIJEPA [19] for text–image, GeoJEPA [20] for geospatial data), showing JEPA’s flexibility.

For audio, Fei et al. recently proposed A-JEPA [12], applying the same principle to spectrogram inputs. While A-JEPA and M2D both adopt a dual-network masked-prediction framework, they differ in how the target encoder is applied: M2D’s target encoder processes only the masked spectrogram patches, whereas A-JEPA’s processes the entire spectrogram (context + masked). This richer context enables more detailed representations of the masked regions. Their design uses a context encoder to process the unmasked spectrogram patches and a target encoder (the EMA of the context network) to encode the masked regions, with a lightweight predictor network aligning the two in latent space. Our work is directly inspired by this approach; in contrast to M2D, JEPA requires no data augmentation, and the whole spectrogram is seen by the target encoder. We evaluate Audio-JEPA in our experiments, underlining how it bridges the gap between vision-style masked modeling and audio understanding.

III. Proposed Method: JEPA for Audio

In this section, we describe our adaptation of the JEPA paradigm to the audio domain, which we call Audio-JEPA.
We first present the overall architecture, then detail the self-supervised training objective, and finally highlight the audio-specific design choices that make Audio-JEPA effective on diverse sound data.

A. Overall architecture

As shown in Fig. 1, our Audio-JEPA model consists of three main modules:
1) Context encoder: processes the “visible” subset of Mel-spectrogram patches;
2) Target encoder: provides stable target embeddings via an Exponential Moving Average (EMA) of the context encoder’s parameters;
3) Lightweight predictor network: takes context embeddings and predicts latent representations for the masked (“target”) patches.

After converting an input waveform to a Mel-spectrogram and partitioning it into non-overlapping time–frequency patches, we randomly mask a proportion of the patches. The context encoder embeds the remaining visible patches, producing a context representation. The lightweight predictor network then reconstructs embeddings for the masked patches. In parallel, the target encoder (updated by EMA rather than gradient descent) encodes the true masked patches. Training minimizes the average L2 distance between the predictor’s outputs and the target encoder’s embeddings, with stop-gradient applied between the predictor and the target encoder. This implementation is a direct adaptation of I-JEPA to the audio domain, treating the spectrogram as a single-channel, possibly non-square, image.

B. Training objective

We train Audio-JEPA using the average L2 distance between the predicted patch-level representations and the target patch-level representations in the masked parts. Formally, let

c_i = f_ctx(x\M)_i,  ĉ_j = g_pred(c)_j,  t_j = f_tgt(x)_j  (1)

where x are the Mel-spectrogram patches, x\M are the visible patches, f_ctx and f_tgt are the context and target encoders respectively, and g_pred the lightweight predictor.
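The per-step computation described above (random mask sampling, context encoder on visible patches, predictor, full-input target encoder with stop-gradient, L2 loss, EMA update) can be sketched in PyTorch. This is a minimal illustration, not our released code: the tiny linear “encoders” and the pooled-context predictor stand in for the actual ViT modules, and all names and dimensions are ours.

```python
# Minimal sketch of one Audio-JEPA training step. Linear "encoders" and a
# pooled-context predictor stand in for the ViT modules; shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledPredictor(nn.Module):
    """Stand-in predictor: conditions a learned positional query for each
    masked patch on mean-pooled context embeddings."""
    def __init__(self, dim, n_patches):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(n_patches, dim) * 0.02)
        self.net = nn.Linear(2 * dim, dim)

    def forward(self, c, masked_idx):
        ctx = c.mean(dim=1, keepdim=True).expand(-1, len(masked_idx), -1)
        q = self.pos[masked_idx].unsqueeze(0).expand(c.size(0), -1, -1)
        return self.net(torch.cat([ctx, q], dim=-1))

def jepa_step(patches, ctx_enc, tgt_enc, pred, opt, tau=0.996):
    """patches: (B, N, D_patch) flattened Mel-spectrogram patches."""
    N = patches.size(1)
    ratio = torch.empty(1).uniform_(0.4, 0.6).item()   # mask 40-60% of patches
    perm = torch.randperm(N)
    masked, visible = perm[:int(N * ratio)], perm[int(N * ratio):]

    c = ctx_enc(patches[:, visible])                   # context embeddings
    c_hat = pred(c, masked)                            # predictions for masked patches
    with torch.no_grad():                              # stop-gradient: target branch
        t = tgt_enc(patches)[:, masked]                # targets see the full input

    loss = F.mse_loss(c_hat, t)                        # mean squared L2 distance
    opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                              # EMA update of the target encoder
        for p_t, p_c in zip(tgt_enc.parameters(), ctx_enc.parameters()):
            p_t.mul_(tau).add_((1 - tau) * p_c)
    return loss.item()

# Toy usage: 16 patches of dimension 32, embedding dimension 64.
ctx_enc, tgt_enc = nn.Linear(32, 64), nn.Linear(32, 64)
tgt_enc.load_state_dict(ctx_enc.state_dict())
for p in tgt_enc.parameters():
    p.requires_grad_(False)
pred = PooledPredictor(64, 16)
opt = torch.optim.AdamW(list(ctx_enc.parameters()) + list(pred.parameters()), lr=3e-4)
loss = jepa_step(torch.randn(4, 16, 32), ctx_enc, tgt_enc, pred, opt)
```

In the full model, τ additionally follows a BYOL-style schedule rather than staying fixed at the default shown here.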
The loss is then

ℒ = (1 / |M|) Σ_{j∈M} ‖ĉ_j − t_j‖²₂  (2)

We update the parameters of f_ctx and g_pred via backpropagation, while the parameters of f_tgt are updated as

θ_tgt ← τ θ_tgt + (1 − τ) θ_ctx  (3)

with τ the EMA decay factor. This design stabilizes the target representations and prevents collapse.

C. Evaluation

Our assessment follows the eXtensive Audio Representation and Evaluation Suite (X-ARES)², which brings together 21 publicly available audio datasets spanning a variety of tasks and domains. Using the frozen target encoder for evaluation, we employ two complementary evaluation strategies drawn from X-ARES:

a) Linear Probing (MLP): For each downstream task, we freeze the pre-trained encoder and attach a single linear layer. This classifier is trained on the task’s labeled data using a fixed set of hyperparameters. By holding the original model weights constant, this procedure reveals how readily the learned representations can be linearly separated and adapted to new tasks.

b) k-Nearest Neighbors (kNN): Without any additional training, we directly apply a kNN classifier to the frozen embeddings. This non-parametric evaluation highlights the raw discriminative power of the representations. Although it may underperform more sophisticated fine-tuning methods, kNN offers a strict baseline for the intrinsic quality of the learned features.

Due to the architecture and loss, the model’s outputs are not guaranteed to be linearly separable, as explained in the V-JEPA paper [11]. We therefore do not expect strong results under linear probing, but we do expect decent performance under kNN.

² https://github.com/jimbozhang/xares

IV. Implementation Details

In this section we summarize the key implementation choices and hyperparameters used to train Audio-JEPA. Table I collects the most important settings; for clarity we defer dataset and hardware details to Section V.

A.
Data processing

We work with 1,921,982 AudioSet clips resampled to 32 kHz and 10 s duration, totaling 5,338 hours of audio. Each waveform is converted to a 128-band Mel-spectrogram with 256 time bins, with the frame size and hop chosen such that the frame is 2.5 times the hop (for 10 s at 32 kHz, this corresponds to a hop of 1,250 samples and a frame of 3,125 samples). Per example, we randomly sample 40%–60% of the patch indices to mask (the exact ratio is sampled uniformly per batch). Each batch contains 256 audio clips. Preliminary experiments showed that the block-masking strategy from I-JEPA yielded lower performance than random masking.

B. Model Architecture

Table I lists the exact ViT hyperparameters used for each module. The context and target encoders share the same ViT configuration, with 16×16 patches, a 768-dimensional embedding, 12 layers, 12 attention heads, and an MLP ratio of 4.0. The target encoder is architecturally identical to the context encoder and is updated via EMA, where the parameter τ is set in the same way as in BYOL [21]. The predictor uses an embedding size of 384, 12 attention heads, and 6 layers; its MLP ratio is likewise 4.0. After its ViT blocks, the predictor projects the embeddings back to 768 dimensions so that its outputs can be compared with those of the target encoder. The total number of trainable parameters during training is 96.7M, with 85.4M parameters used at inference since the predictor is not needed.

C. Optimization and Scheduling

We train using AdamW [22] (β₁ = 0.9, β₂ = 0.95, weight decay = 0.05) with an initial learning rate of 3 × 10⁻⁴. A warmup-cosine scheduler ramps from 1 × 10⁻⁶ to 3 × 10⁻⁴ over 1,000 steps, then anneals to zero.

TABLE I
Encoders and predictor architecture hyperparameters. The total number of trainable parameters during training is 96.7M, with 85.4M used at inference since the predictor is not needed.
Hyperparameter    Context / Target Encoders    Predictor
Patch size        16 × 16                      -
Embedding dim     768                          384
Depth             12                           6
Heads             12                           12
MLP ratio         4.0                          4.0
# Parameters      85.4M (each)                 11.3M

TABLE II
Linear-probing evaluation results on the X-ARES evaluation suite, comparing our Audio-JEPA implementation with wav2vec 2.0 and data2vec. Scores are reported as given by X-ARES; bold indicates the best performance, and underline indicates the second-best.

Dataset                          Audio-JEPA (ours)   Wav2Vec2 [5]   Data2Vec [7]
ASV2015 [23]                     0.898               0.924          0.937
Clotho [24]                      0.014               0.014          0.008
CREMA-D [25, 26]                 0.427               0.541          0.523
DESED [27, 28]                   0.306               0.313          0.136
ESC-50 [29]                      0.338               0.510          0.229
Fluent Speech Commands [18]      0.025               0.468          0.978
Free Music Archive small [30]    0.553               0.469          0.334
FSD18-Kaggle [31]                0.212               0.241          0.153
FSD50k [32]                      0.151               0.166          0.085
GTZAN Genre [17]                 0.628               0.630          0.448
LibriCount [33]                  0.471               0.583          0.492
LibriSpeech-MF [34]              0.883               0.948          0.752
NSynth-Instruments [35]          0.404               0.443          0.336
RAVDESS [36]                     0.303               0.442          0.467
Speech Commands V1 [18]          0.152               0.714          0.927
UrbanSound 8k [15]               0.585               0.659          0.426
Vocal Imitation [37]             0.056               0.147          0.128
VocalSound [38]                  0.526               0.768          0.803
VoxCeleb1 [16]                   0.041               0.340          0.105
VoxLingua33 [39]                 0.093               0.553          0.620

V. Experimental Setup

We train on 4 NVIDIA V100 GPUs with a total batch size of 256 clips; each batch therefore contains roughly 42.7 minutes of audio. Training ran for 100,000 steps (~13 epochs) and took 14 hours to complete. This required significantly fewer resources than wav2vec 2.0 Base [5] and data2vec [7], which each trained for 400k steps with larger batches of 1.6 hours and 63 minutes of audio, respectively.

VI. Results

A. Linear-probe (MLP) performance

We first probe Audio-JEPA’s representations with a small MLP head to assess their linear separability across tasks. Table II reports Audio-JEPA’s performance on 20 of the 21 X-ARES datasets (all except LibriSpeech-100h [34]).
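The linear-probing protocol behind these scores can be sketched as follows. This is a schematic illustration only, not the exact X-ARES pipeline: the stand-in linear “encoder”, the class count, and all dimensions are made up for the example.

```python
# Schematic linear probe: freeze the pre-trained encoder and train a single
# linear layer on its pooled embeddings. A linear stand-in replaces the ViT.
import torch
import torch.nn as nn

encoder = nn.Linear(32, 64)                 # frozen pre-trained encoder (stand-in)
for p in encoder.parameters():
    p.requires_grad_(False)
probe = nn.Linear(64, 10)                   # single linear layer, 10 toy classes
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

x = torch.randn(128, 16, 32)                # toy batch of patch sequences
y = torch.randint(0, 10, (128,))
for _ in range(5):                          # a few probe-only training steps
    with torch.no_grad():
        z = encoder(x).mean(dim=1)          # pooled frozen embeddings
    loss = ce(probe(z), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

The kNN evaluation replaces the trained probe with a non-parametric nearest-neighbor vote over the same frozen embeddings z.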
The model reaches first or second place on several benchmarks but falls to last on roughly half of them. In particular, Audio-JEPA underperforms significantly on Fluent Speech Commands and Speech Commands V1. As noted in Section III.C, Audio-JEPA is at a disadvantage under linear-probe evaluation, as its embedding space is not guaranteed to be linearly separable: the training objective favors embedding cohesion rather than linear separability. As shown in [11], attentive pooling might help in this situation.

B. kNN performance

We next evaluate pure embedding quality with k-nearest-neighbor classification on the 16 X-ARES tasks compatible with this probe. Table III shows that Audio-JEPA achieves first place on 3 datasets (ESC-50, FMA-small, GTZAN) and second place on 7 more, outperforming wav2vec 2.0 and data2vec despite using far less pre-training data and compute. Conversely, it ranks last on about a third of the tasks, underperforming substantially on the same datasets as under linear probing. Because kNN probes frozen embeddings directly, without extra classifier capacity, these results may better reflect the true representational power of Audio-JEPA across the music, environmental-sound, and speech domains.

C. Closing summary

Overall, Audio-JEPA demonstrates that a simple mask-prediction objective can yield high-quality audio embeddings with far less pre-training data. Under kNN evaluation, Audio-JEPA often matches or surpasses baselines on music and general-sound tasks, even though those baselines underwent far more extensive pre-training, confirming the strength of latent-space prediction. Linear-probe results expose the limitations of a single-layer head for Audio-JEPA, but also point to clear remedies (attentive pooling or small multilayer probes). Across both evaluation methods, Audio-JEPA is weakest on tasks requiring fine-grained speech discrimination (e.g.
speaker verification and keyword spotting), indicating that specialized data could improve these cases. These results validate Audio-JEPA as a data-efficient foundation for audio representation learning and set the stage for the architectural and tuning improvements detailed in the Conclusion.

TABLE III
kNN evaluation results on the X-ARES evaluation suite, directly comparing frozen embeddings from Audio-JEPA against wav2vec 2.0 and data2vec. Scores are reported as given by X-ARES; bold indicates the best performance, and underline indicates the second-best.

Dataset                          Audio-JEPA (ours)   Wav2Vec2 [5]   Data2Vec [7]
ASV2015 [23]                     0.927               0.858          0.942
CREMA-D [25]                     0.267               0.221          0.351
ESC-50 [29]                      0.140               0.081          0.040
Fluent Speech Commands [18]      0.009               0.017          0.630
Free Music Archive small [30]    0.449               0.251          0.106
GTZAN Genre [17]                 0.452               0.303          0.108
LibriCount [33]                  0.307               0.311          0.176
LibriSpeech-MF [34]              0.545               0.606          0.724
NSynth-Instruments [35]          0.170               0.251          0.179
RAVDESS [36]                     0.215               0.169          0.313
Speech Commands V1 [18]          0.044               0.208          0.852
UrbanSound 8k [15]               0.303               0.339          0.156
Vocal Imitation [37]             0.017               0.010          0.018
VocalSound [38]                  0.256               0.269          0.308
VoxCeleb1 [16]                   0.002               0.003          0.033
VoxLingua33 [39]                 0.057               0.034          0.058

VII. Conclusion

We introduced Audio-JEPA, the first open-source, from-scratch adaptation of the Joint-Embedding Predictive Architecture to audio. Pre-trained on AudioSet with off-the-shelf hyper-parameters, Audio-JEPA delivers competitive performance on X-ARES. The method compares especially well under kNN evaluation, confirming that predicting masked latent targets, rather than the waveform or discrete labels, is a useful inductive bias for general-purpose audio representation learning.

Looking forward, we identify three straightforward upgrades:
1) Attention-pooling head. Replacing the single-frame MLP used in our linear-probe evaluation with a lightweight attention-pooling block, as proposed for V-JEPA [11], could yield fairer comparisons and narrow the current linear-probe gap.
2) Modern backbones and positional encodings. Swapping the vanilla ViT for recent audio transformers (e.g. ConvFormer [40] or CAFormer [41]) and testing rotary or conditional sine-cosine encodings should improve modelling of long-range temporal cues.
3) Hyper-parameter tuning. A systematic sweep of mask ratio, EMA decay and optimizer settings, currently untouched, is likely to uncover additional headroom.

By open-sourcing our code and checkpoints, we hope to establish Audio-JEPA as a solid starting point for the community to explore these directions and to further unify JEPA research across vision and now audio.

Acknowledgement

This work was granted access to the HPC resources of IDRIS under the allocation AD011014754 made by GENCI. Support from the ANR-3IA Artificial and Natural Intelligence Toulouse Institute ANITI (ANR-19-PI3A-0004) is gratefully acknowledged.

References

[1] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units.” [Online]. Available: http://arxiv.org/abs/2106.07447
[2] H. Dinkel, Y. Wang, Z. Yan, J. Zhang, and Y. Wang, “CED: Consistent ensemble distillation for audio tagging.” [Online]. Available: http://arxiv.org/abs/2308.11957
[3] H. Dinkel, Z. Yan, Y. Wang, J. Zhang, Y. Wang, and B. Wang, “Scaling up masked audio encoder learning for general audio classification.” [Online]. Available: http://arxiv.org/abs/2406.06992
[4] D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, and K. Kashino, “Masked Modeling Duo: Towards a Universal Audio Pre-training Framework.” [Online]. Available: http://arxiv.org/abs/2404.06095
[5] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” [Online]. Available: http://arxiv.org/abs/2006.11477
[6] S.
Chen et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, Oct. 2022, doi: 10.1109/JSTSP.2022.3188113.
[7] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language.” [Online]. Available: http://arxiv.org/abs/2202.03555
[8] A. Baevski, A. Babu, W.-N. Hsu, and M. Auli, “Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language.” [Online]. Available: http://arxiv.org/abs/2212.07525
[9] Y. LeCun, “A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27,” Jun. 2022.
[10] M. Assran et al., “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture.” [Online]. Available: http://arxiv.org/abs/2301.08243
[11] A. Bardes et al., “Revisiting Feature Prediction for Learning Visual Representations from Video.” [Online]. Available: http://arxiv.org/abs/2404.08471
[12] Z. Fei, M. Fan, and J. Huang, “A-JEPA: Joint-Embedding Predictive Architecture Can Listen.” [Online]. Available: http://arxiv.org/abs/2311.15830
[13] A. Riou, S. Lattner, G. Hadjeres, M. Anslow, and G. Peeters, “Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation.” [Online]. Available: http://arxiv.org/abs/2408.02514
[14] P.-Y. Huang et al., “Masked Autoencoders that Listen.” [Online]. Available: http://arxiv.org/abs/2207.06405
[15] J. Salamon, C. Jacoby, and J. P. Bello, “A Dataset and Taxonomy for Urban Sound Research,” in Proceedings of the 22nd ACM International Conference on Multimedia, MM '14. New York, NY, USA: Association for Computing Machinery, Nov. 2014, pp. 1041–1044, doi: 10.1145/2647868.2655045.
[16] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in Interspeech 2017, Aug. 2017, pp. 2616–2620.
doi: 10.21437/Interspeech.2017-950.
[17] B. L. Sturm, “The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use,” Journal of New Music Research, vol. 43, no. 2, pp. 147–172, Apr. 2014, doi: 10.1080/09298215.2014.894533.
[18] P. Warden, “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition.” [Online]. Available: http://arxiv.org/abs/1804.03209
[19] K. H. N. Vo, D. P. T. Nguyen, T. Nguyen, and T. T. Quan, “TIJEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems.” [Online]. Available: http://arxiv.org/abs/2503.06380
[20] T. Lundqvist and L. Delvret, “GeoJEPA: Towards Eliminating Augmentation- and Sampling Bias in Multimodal Geospatial Learning.” [Online]. Available: http://arxiv.org/abs/2503.05774
[21] J.-B. Grill et al., “Bootstrap your own latent: A new approach to self-supervised learning.” [Online]. Available: http://arxiv.org/abs/2006.07733
[22] I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization.” [Online]. Available: http://arxiv.org/abs/1711.05101
[23] Z. Wu, T. Kinnunen, N. Evans, and J. Yamagishi, “Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) Database.” [Online]. Available: https://datashare.ed.ac.uk/handle/10283/853
[24] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An Audio Captioning Dataset.” [Online]. Available: http://arxiv.org/abs/1910.09387
[25] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014, doi: 10.1109/TAFFC.2014.2336244.
[26] M. K. Keutmann, S. L. Moore, A. Savitt, and R. C. Gur, “Generating an item pool for translational social cognition research: methodology and initial validation,” Behavior Research Methods, vol. 47, no. 1, pp. 228–234, Mar. 2015, doi: 10.3758/s13428-014-0464-0.
[27] N. Turpault, R. Serizel, A. P. Shah, and J.
Salamon, “Sound event detection in domestic environments with weakly labeled data and soundscape synthesis,” presented at the Workshop on Detection and Classification of Acoustic Scenes and Events, 2019. Accessed: May 07, 2025. [Online]. Available: https://inria.hal.science/hal-02160855
[28] R. Serizel, N. Turpault, A. Shah, and J. Salamon, “Sound event detection in synthetic domestic environments,” presented at the ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing, May 2020. Accessed: May 07, 2025. [Online]. Available: https://inria.hal.science/hal-02355573
[29] K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in Proceedings of the 23rd ACM International Conference on Multimedia, MM '15. New York, NY, USA: Association for Computing Machinery, Oct. 2015, pp. 1015–1018, doi: 10.1145/2733373.2806390.
[30] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A Dataset For Music Analysis.” [Online]. Available: http://arxiv.org/abs/1612.01840
[31] E. Fonseca et al., “General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline.” [Online]. Available: http://arxiv.org/abs/1807.09902
[32] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: An Open Dataset of Human-Labeled Sound Events.” [Online]. Available: http://arxiv.org/abs/2010.00475
[33] F.-R. Stöter, S. Chakrabarty, E. Habets, and B. Edler, “LibriCount, a dataset for speaker count estimation.” [Online]. Available: https://zenodo.org/records/1216072
[34] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 5206–5210, doi: 10.1109/ICASSP.2015.7178964.
[35] J. Engel et al., “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders.” [Online]. Available: http://arxiv.org/abs/1704.01279
[36] S. R. Livingstone and F.
A. Russo, “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS).” [Online]. Available: https://zenodo.org/records/1188976
[37] “Vocal Imitation Set v1.1.3: Thousands of vocal imitations of hundreds of sounds from the AudioSet ontology.” [Online]. Available: https://zenodo.org/records/1340763
[38] Y. Gong, J. Yu, and J. Glass, “Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022, pp. 151–155, doi: 10.1109/ICASSP43922.2022.9746828.
[39] N. Yadong, “voxlingua33 in WebDataset Format.” [Online]. Available: https://zenodo.org/records/14723799
[40] X. Lin, Z. Yan, X. Deng, C. Zheng, and L. Yu, “ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation.” [Online]. Available: http://arxiv.org/abs/2309.05674
[41] W. Yu et al., “MetaFormer Baselines for Vision.” [Online]. Available: http://arxiv.org/abs/2210.13452